Anonymization of LBL FTP Traces

Here we describe in detail the anonymization scheme used for the LBNL-FTP traces. The anonymization was done in the context of the framework presented in "A High-level Programming Environment for Packet Trace Anonymization and Transformation", by Ruoming Pang and Vern Paxson, Proc. ACM SIGCOMM 2003. The paper (in particular, the section on trace anonymization) discusses the underlying principles, while here we delve into the particulars.

Anonymizing with HMAC-MD5

In anonymization we frequently apply HMAC-MD5 to hash a data element. HMAC-MD5 takes a 128-bit secret key, which is randomly generated (i.e., read from /dev/random) for each trace. Because MD5 collisions are extremely rare, HMAC-MD5 in practice establishes a one-to-one mapping between input and output values. Hash-anonymized values can therefore be compared for equality, which is why we often anonymize identifiers with HMAC-MD5. On the other hand, assuming both HMAC and MD5 are secure, one can neither derive the hash input from its output nor compute a hash value without knowing the key, which makes it difficult to guess the original value of hashed data. However, as discussed in the paper, the hash input must be chosen very carefully to prevent indirect exposure. Below we always specify the hash input explicitly when a data type is anonymized with HMAC-MD5.
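As an illustration of the keyed hashing described above, here is a minimal Python sketch. It is not the code used to produce the traces: the function names, the use of os.urandom in place of reading /dev/random directly, and the length-prefixed encoding of tuple fields are all assumptions made for the example.

```python
import hashlib
import hmac
import os


def new_trace_key() -> bytes:
    """Return a fresh 128-bit key, one per trace (os.urandom is an assumed stand-in)."""
    return os.urandom(16)


def hmac_md5_hex(key: bytes, *fields: str) -> str:
    """Hash a tuple of string fields with HMAC-MD5 and return the hex digest.

    Each field is length-prefixed (an assumed encoding) so that the tuples
    ("ab", "c") and ("a", "bc") do not produce the same hash input.
    """
    msg = b""
    for field in fields:
        raw = field.encode("utf-8")
        msg += len(raw).to_bytes(4, "big") + raw
    return hmac.new(key, msg, hashlib.md5).hexdigest()
```

With the same key, equal inputs map to equal outputs, so anonymized identifiers can still be compared; with a different key the outputs are unrelated.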

Anonymizing different types of data

Some data types are anonymized independently of the context (PCAP/IP/TCP headers, FTP requests/replies) in which the values appear.

Timestamp/date: left in the clear.
Remote client IP address: randomly re-numbered (with a one-to-one mapping).
Server IP address: addresses of a selected set of public servers are left in the clear. Addresses of other servers are randomly re-numbered (with a one-to-one mapping).
Port number: left in the clear.
User ID: left in the clear if: 1) it is "anonymous", "guest", or "ftp"; or 2) the login failed and it is one of: "backdoor", "bomb", "diag", "gdm", "issadmin", "msql", "netfrack", "netphrack", "own", "r00t", "root", "ruut", "smtp", "sundiag", "sync", "sys", "sysadm", "sysdiag", "sysop", "sysoper", "system", "toor", "tour", "y0uar3ownd"; or 3) it is in a trace-specific white list of user IDs, which may include, for example, "annonymous".

Otherwise the user ID is anonymized. (Note: The traces are preprocessed to filter out connections with successful non-anonymous logins.)

A user ID is anonymized with HMAC-MD5. The hash input is the 3-tuple <ID, server-IP, whether-the-login-was-successful> rather than just the ID, which prevents shared-text matching and known-text matching inference attacks (see section 4.3 of the paper).
Password: replaced with "<password>".
File/directory name: For files on the selected set of public servers (see discussion of Server IP Address above), file names are left in the clear. Other file names are anonymized with HMAC-MD5 of <absolute-path-name, server-IP-address> as hash input.
File size: left in the clear.
Server software version/configuration: left in the clear.
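The user ID and file name rules above can be sketched as follows. This is a hedged illustration only: the function names, the NUL-separated tuple encoding, the exact (case-sensitive) string comparison, and the abbreviated failed-login list are assumptions for the example, not details taken from the actual anonymizer.

```python
import hashlib
import hmac

ALWAYS_CLEAR = {"anonymous", "guest", "ftp"}
# Abbreviated subset of the failed-login list above, for illustration only.
CLEAR_IF_FAILED = {"root", "r00t", "toor", "sysop", "system"}


def _hash(key: bytes, *fields: str) -> str:
    # Joining fields with NUL is an assumed encoding of the hash-input tuple.
    return hmac.new(key, "\x00".join(fields).encode("utf-8"), hashlib.md5).hexdigest()


def anonymize_user_id(key, user_id, server_ip, login_ok, whitelist=frozenset()):
    """Apply rules 1) to 3); otherwise hash <ID, server-IP, login-success>."""
    if user_id in ALWAYS_CLEAR:
        return user_id
    if not login_ok and user_id in CLEAR_IF_FAILED:
        return user_id
    if user_id in whitelist:
        return user_id
    return _hash(key, user_id, server_ip, str(login_ok))


def anonymize_path(key, abs_path, server_ip, public_servers):
    """Keep names on the selected public servers; hash <path, server-IP> elsewhere."""
    if server_ip in public_servers:
        return abs_path
    return _hash(key, abs_path, server_ip)
```

Because the server address is part of the hash input, the same user ID or path seen on two different servers yields two unrelated hash values.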

TCP/IP Header

IP addresses and TCP ports are processed as described above.
The transformed traces do not contain any IP fragmentation. IP fragments in the original traces have been reassembled.
TCP flags are preserved, except that the FIN can be moved later in the trace if data was sent after the FIN initially appeared, or if new data was added at the end of the trace.
IP options are stripped out.
The following TCP options are left in the clear: maximum segment size, window scaling, SACK option negotiation (but not SACK blocks, due to the ambiguity of the location of the SACK'd data in the transformed stream), and timestamps; other options are replaced with NOP.
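The TCP option rule above can be sketched as a pass over the raw option bytes. The option kind numbers are the standard ones (2 = maximum segment size, 3 = window scale, 4 = SACK permitted, 5 = SACK blocks, 8 = timestamps). The function name, the handling of malformed options, and the zero-filling after an end-of-option-list byte are assumptions of this sketch, not documented behavior of the anonymizer.

```python
EOL, NOP = 0, 1
# MSS, window scale, SACK-permitted, timestamps; SACK blocks (kind 5) are not kept.
KEEP = {2, 3, 4, 8}


def scrub_tcp_options(options: bytes) -> bytes:
    """Return options of the same length with every non-kept option turned into NOPs."""
    out = bytearray()
    i, n = 0, len(options)
    while i < n:
        kind = options[i]
        if kind == EOL:
            # Assumption: emit EOL and zero-fill whatever padding follows it.
            out += bytes(n - i)
            break
        if kind == NOP:
            out.append(NOP)
            i += 1
            continue
        # Every other option kind carries a length byte covering kind+length+data.
        if i + 1 >= n or options[i + 1] < 2 or i + options[i + 1] > n:
            # Assumption: a malformed tail is replaced with NOPs.
            out += bytes([NOP]) * (n - i)
            break
        length = options[i + 1]
        if kind in KEEP:
            out += options[i:i + length]
        else:
            out += bytes([NOP]) * length
        i += length
    return bytes(out)
```

Replacing a dropped option with the same number of NOP bytes keeps the TCP header length unchanged.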

FTP Request

An FTP request contains two parts: <command> <argument>.

The <command> part is left in the clear if it is a known FTP command, i.e., it is either a "legal" FTP command, or it appears in a white list. (The white list can be trace-specific to preserve commands with typos, such as "UUSER".)

The <argument> part is anonymized according to its data type, if it is one of the data types discussed above (e.g. file name, user ID). An empty argument is always left unchanged. We discuss other arguments below with their associated commands.

AUTH: if the server rejects the request and the argument is a well-known mechanism (e.g. "GSSAPI", "KERBEROS_V4"), the argument is left in the clear; otherwise, it is replaced by "<auth>".
Commands with no argument (PWD, PASV, CDUP, etc.): if the argument is in fact empty, it is left empty; otherwise, the argument string is anonymized with HMAC-MD5 of input <command, argument>.
Commands with pre-defined argument sets (TYPE, STRU, etc.): for example, RFC 959 defines that the argument for TYPE should match the regular expression /([AE]( [NTC])?)|I|(L [0-9]+)/. The anonymizer leaves the argument in the clear if it matches the predefined argument set; otherwise it anonymizes the argument string with HMAC-MD5 of input <command, argument>.
HELP: the argument is left in the clear if it is empty or one of the known FTP commands; otherwise, it is anonymized with HMAC-MD5 of input <command, argument>.
PORT: the anonymizer tries to parse the argument into a <host, port> pair. If the parsing succeeds, it anonymizes the host address but leaves the port number in the clear, and transforms the result back to the comma-separated format. If the parsing fails, the argument string is anonymized with HMAC-MD5 of input <command, argument>.
SITE: The SITE command is associated with a number of attacks, and is not often used legitimately. The argument of a SITE command itself contains a command and an argument. The command is left in the clear if it is one of the well-known SITE commands (e.g. EXEC, CHMOD, HELP), and the argument is left in the clear if it is empty or in a white list. Otherwise, the argument is anonymized with HMAC-MD5 of input <command, argument>.
Unknown commands: arguments left in the clear if white-listed; otherwise, replaced with "<arg>".
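To make a few of the per-command rules concrete, the sketch below covers three cases from the list above: commands that take no argument, TYPE (using the regular expression quoted from RFC 959), and PORT. It is an illustration under assumptions: the function names are invented, the command is assumed to be already upper-cased, and map_ip is a hypothetical callable standing in for the one-to-one IP re-numbering.

```python
import hashlib
import hmac
import re

TYPE_ARG = re.compile(r"([AE]( [NTC])?)|I|(L [0-9]+)")
PORT_ARG = re.compile(r"[0-9]{1,3}(,[0-9]{1,3}){5}")
NO_ARG_COMMANDS = {"PWD", "PASV", "CDUP"}


def _hash(key: bytes, command: str, argument: str) -> str:
    # NUL-joined <command, argument> is an assumed encoding of the hash input.
    msg = (command + "\x00" + argument).encode("utf-8")
    return hmac.new(key, msg, hashlib.md5).hexdigest()


def anonymize_port_arg(key, argument, map_ip):
    """Re-number the host part of h1,h2,h3,h4,p1,p2 and keep the port bytes."""
    parts = argument.split(",")
    if not PORT_ARG.fullmatch(argument) or any(int(p) > 255 for p in parts):
        return _hash(key, "PORT", argument)  # parsing failed
    new_host = map_ip(".".join(parts[:4]))
    return ",".join(new_host.split(".") + parts[4:])


def anonymize_argument(key, command, argument, map_ip):
    if argument == "":
        return ""  # an empty argument is always left unchanged
    if command in NO_ARG_COMMANDS:
        return _hash(key, command, argument)
    if command == "TYPE":
        return argument if TYPE_ARG.fullmatch(argument) else _hash(key, command, argument)
    if command == "PORT":
        return anonymize_port_arg(key, argument, map_ip)
    raise NotImplementedError("command not covered by this sketch")
```

For example, with a map_ip that returns "10.0.0.1", the argument "192,168,1,2,7,138" becomes "10,0,0,1,7,138": the host is re-numbered while the two port bytes are untouched.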

FTP Reply

Each line of an FTP reply message consists of a reply code and a text message. The reply code is left in the clear as it does not reveal any private information. The text message is matched against a set of message templates. If there is a match, the message is split into fields, each field being either a constant word or a variable with a known data type (e.g. a file name, IP address). We then process each field accordingly (see below) and anonymize parts that may reveal private information. If the message does not match any template, the whole message is replaced with "<message stripped out>".

Here is the list of data types of variable fields in reply messages: (IP addresses, port numbers, and file/directory names below are processed as introduced earlier)

cmd: the command part of the corresponding FTP request (preserved or anonymized as in the request)
arg: the argument part of the corresponding FTP request (preserved or anonymized as in the request)
num: a decimal number (preserved or replaced by "num" as specified in the message template)
port: six decimal numbers separated by commas, representing an <IP, port> pair
IP: an IP address in dotted format (xxx.xxx.xxx.xxx)
domain: a domain name (replaced by "<domain>")
time: a time string with format hh:mm[:ss][am,pm] (left in the clear)
url: an HTTP URL (replaced by "<url>")
email: an email address (replaced by "<email>")
path: a file/directory name
version: a version string (left in the clear)
file mode: a file mode string, such as "dr--w--x--" (replaced by "<file-mode>")
*: the wildcard type, e.g. "Bob" in "Welcome to Bob's FTP server" (replaced by "<*>")
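The template matching for reply lines can be sketched as below. The actual template set and its syntax are not reproduced in this note, so the "{type}" placeholder syntax, the two example templates, the per-type regular expressions, and the handler functions are all hypothetical; only the overall flow (keep the reply code, match the text against templates, process each typed field, strip unmatched messages) follows the description above.

```python
import re

# Assumed per-type patterns; they must not contain capturing groups themselves.
FIELD_RE = {"num": r"[0-9]+", "path": r"\S+", "email": r"\S+@\S+"}


def compile_template(template):
    """Turn 'text {type} text' into (regex, tokens); odd-indexed tokens are field types."""
    tokens = re.split(r"\{(\w+)\}", template)
    pattern = "".join(
        re.escape(tok) if i % 2 == 0 else "(" + FIELD_RE[tok] + ")"
        for i, tok in enumerate(tokens)
    )
    return re.compile(pattern), tokens


def anonymize_reply_line(line, templates, handlers):
    """Keep the 3-digit reply code; rewrite the text via the first matching template."""
    code, sep, text = line[:3], line[3:4], line[4:]
    for regex, tokens in templates:
        match = regex.fullmatch(text)
        if match is None:
            continue
        values = iter(match.groups())
        fields = [
            tok if i % 2 == 0 else handlers[tok](next(values))
            for i, tok in enumerate(tokens)
        ]
        return code + sep + "".join(fields)
    return code + sep + "<message stripped out>"
```

A caller would supply one handler per data type, for instance a function returning "<email>" for the email type and the file name anonymizer for the path type; a line such as "230 Welcome to Bob's FTP server" that matches no template in this hypothetical set comes out as "230 <message stripped out>".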

Send questions or comments to Ruoming Pang (rpang@cs.princeton.edu). Thanks!