Here we describe in detail the anonymization scheme used for the LBNL-FTP traces. The anonymization was done in the context of the framework presented in "A High-level Programming Environment for Packet Trace Anonymization and Transformation", by Ruoming Pang and Vern Paxson, Proc. ACM SIGCOMM 2003. The paper (in particular, the section on trace anonymization) discusses the underlying principles, while here we delve into the particulars.



Anonymizing with HMAC-MD5

In anonymization we frequently apply HMAC-MD5 to hash a data element. HMAC-MD5 is keyed with a 128-bit secret, which is randomly generated (i.e., read from /dev/random) for each trace. Because MD5 collisions are extremely rare, HMAC-MD5 almost always yields a one-to-one mapping between input and output values, so hash-anonymized values can still be compared for equality. For this reason, we often anonymize identifiers with HMAC-MD5. On the other hand, assuming both HMAC and MD5 are secure, one can neither derive the hash input from its output nor compute a hash value without knowing the key, which makes it difficult to guess the original value of hashed data. However, as discussed in the paper, the hash input must be chosen very carefully to prevent indirect exposure. Below we always specify the hash input explicitly when a data type is anonymized with HMAC-MD5.
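The keyed hashing described above can be sketched with Python's standard hmac module. This is only an illustration: the function names new_trace_key and anonymize, the hex output encoding, and the use of os.urandom in place of reading /dev/random directly are assumptions, not the actual implementation used for the traces.

```python
import hashlib
import hmac
import os


def new_trace_key() -> bytes:
    """Generate a fresh 128-bit secret key (one per trace)."""
    return os.urandom(16)


def anonymize(key: bytes, value: bytes) -> str:
    """Return the HMAC-MD5 of value under key, as a hex string."""
    return hmac.new(key, value, hashlib.md5).hexdigest()


key = new_trace_key()
a1 = anonymize(key, b"alice")
a2 = anonymize(key, b"alice")
b1 = anonymize(key, b"bob")

# Within one trace (same key), equal inputs give equal outputs,
# so anonymized identifiers can still be compared for equality.
print(a1 == a2, a1 == b1)
```

Because the key differs per trace, the same input hashes to unrelated values in different traces, so anonymized identifiers are only comparable within a single trace.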


Anonymizing different types of data

Some data types are anonymized independently of the context (PCAP/IP/TCP headers, FTP requests/replies) in which the values appear.



TCP/IP Header



FTP Request

An FTP request contains two parts: <command> <argument>.

The <command> part is left in the clear if it is a known FTP command, i.e., it is either a "legal" FTP command, or it appears in a white list. (The white list can be trace-specific to preserve commands with typos, such as "UUSER".)

The <argument> part is anonymized according to its data type, if it is one of the data types discussed above (e.g. file name, user ID). An empty argument is always left unchanged. We discuss other arguments below with their associated commands.
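The request handling above can be sketched as follows. The command sets, the case-insensitive lookup, the "<cmd>" placeholder for unknown commands, and the anonymize_argument hook are all hypothetical; this document does not specify how an unknown command is rewritten, so that branch in particular is an assumption.

```python
# Hypothetical command sets; the real lists are not given in this document.
LEGAL_COMMANDS = {"USER", "PASS", "CWD", "RETR", "STOR", "LIST", "PORT", "QUIT"}
WHITE_LIST = {"UUSER"}  # trace-specific, e.g. commands with typos


def anonymize_request(line, anonymize_argument):
    """Anonymize one FTP request line of the form '<command> <argument>'.

    anonymize_argument(command, argument) is a caller-supplied hook that
    applies the data-type-specific rule for that command's argument.
    """
    command, sep, argument = line.rstrip("\r\n").partition(" ")
    # Case folding is an assumption made for this sketch.
    known = command.upper() in LEGAL_COMMANDS | WHITE_LIST
    # The "<cmd>" placeholder for unknown commands is an assumption.
    out_command = command if known else "<cmd>"
    # An empty argument is always left unchanged.
    out_argument = anonymize_argument(command, argument) if argument else argument
    return out_command + sep + out_argument


print(anonymize_request("UUSER bob", lambda cmd, arg: "<arg>"))
```

Passing the argument rule in as a hook keeps the command check separate from the per-data-type anonymization.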



FTP Reply

Each line of an FTP reply message consists of a reply code and a text message. The reply code is left in the clear as it does not reveal any private information. The text message is matched against a set of message templates. If there is a match, the message is split into fields, each field being either a constant word or a variable with a known data type (e.g. a file name, IP address). We then process each field accordingly (see below) and anonymize parts that may reveal private information. If the message does not match any template, the whole message is replaced with "<message stripped out>".
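A minimal sketch of the template matching described above. The "{type}" template syntax, the two example templates, the handler table, and the restriction that a variable field contains no whitespace are all assumptions made for illustration; the actual template format is not given here.

```python
import re

STRIPPED = "<message stripped out>"
FIELD = re.compile(r"\{(\w+)\}")         # hypothetical "{type}" marker
REPLY = re.compile(r"(\d{3})([ -])(.*)")  # reply code, separator, text


def compile_template(template):
    """Turn 'text {type} text' into (regex, parts).

    parts alternates constant text (even index) and field type (odd index).
    """
    parts = FIELD.split(template)
    pattern = "".join(
        re.escape(p) if i % 2 == 0 else r"(\S+)" for i, p in enumerate(parts)
    )
    return re.compile(pattern), parts


def anonymize_reply_line(line, templates, handlers):
    """Keep the reply code; rewrite the text via the first matching template."""
    m = REPLY.fullmatch(line)
    code, sep, text = m.groups() if m else ("", "", line)
    for regex, parts in templates:
        match = regex.fullmatch(text)
        if match is None:
            continue
        values = iter(match.groups())
        out = [
            p if i % 2 == 0 else handlers[p](next(values))
            for i, p in enumerate(parts)
        ]
        return code + sep + "".join(out)
    return code + sep + STRIPPED


templates = [
    compile_template("User {arg} logged in."),
    compile_template("Opening BINARY mode data connection for {path} ({num} bytes)."),
]
handlers = {
    "arg": lambda v: "<arg>",    # stand-in for the request-argument rule
    "path": lambda v: "<path>",  # stand-in for file name anonymization
    "num": lambda v: "num",
}
print(anonymize_reply_line("230 User alice logged in.", templates, handlers))
print(anonymize_reply_line("220 Welcome to Bob's FTP server", templates, handlers))
```

Constant words survive only when they come from a template, so free-form text that matches nothing is stripped entirely rather than partially cleaned.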

The data types of variable fields in reply messages are listed below. IP addresses, port numbers, and file/directory names are processed as introduced earlier.

  1. cmd: the command part of the corresponding FTP request (preserved or anonymized as in the request)

  2. arg: the argument part of the corresponding FTP request (preserved or anonymized as in the request)

  3. num: a decimal number (preserved or replaced by "num" as specified in the message template)

  4. port: six decimal numbers separated by commas, representing an <IP, port> pair

  5. IP: an IP address in dotted format (xxx.xxx.xxx.xxx)

  6. domain: a domain name (replaced by "<domain>")

  7. time: a time string with format hh:mm[:ss][am,pm] (left in the clear)

  8. url: an HTTP URL (replaced by "<url>")

  9. email: an email address (replaced by "<email>")

  10. path: a file/directory name

  11. version: a version string (left in the clear)

  12. file mode: a file mode string, such as "dr--w--x--" (replaced by "<file-mode>")

  13. *: the wildcard type, e.g. "Bob" in "Welcome to Bob's FTP server" (replaced by "<*>")
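As an illustration of the "port" type in item 4, the sketch below decodes the six comma-separated numbers into an <IP, port> pair using the standard FTP encoding (four address bytes followed by the high and low bytes of the port). The function name parse_port_field is hypothetical, and the sketch only decodes; the resulting IP address and port number would then be anonymized by their own rules.

```python
def parse_port_field(field: str):
    """Decode 'h1,h2,h3,h4,p1,p2' into (dotted IP string, port number)."""
    numbers = [int(n) for n in field.split(",")]
    if len(numbers) != 6 or any(not 0 <= n <= 255 for n in numbers):
        raise ValueError("expected six comma-separated values in 0..255")
    ip = ".".join(str(n) for n in numbers[:4])
    port = numbers[4] * 256 + numbers[5]  # high byte, then low byte
    return ip, port


print(parse_port_field("192,0,2,1,7,138"))  # ('192.0.2.1', 1930)
```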



Send questions or comments to Ruoming Pang (rpang@cs.princeton.edu). Thanks!