Here we describe in detail the anonymization scheme used for the
LBNL-FTP traces.
The anonymization was done in the context of the framework
presented in "A High-level Programming Environment for Packet Trace
Anonymization and Transformation", by Ruoming Pang and Vern Paxson,
Proc. ACM SIGCOMM 2003.
The paper (in particular, the section on trace anonymization)
discusses the underlying principles, while here we delve into the particulars.
Anonymizing with HMAC-MD5
In anonymization we frequently apply HMAC-MD5 to hash a data
element. HMAC-MD5 takes a 128-bit secret key, which is randomly
generated (i.e., read from /dev/random) for each trace. Because MD5
collisions are extremely rare, HMAC-MD5 almost always yields a
one-to-one mapping between input and output values. Hash-anonymized
values can therefore be compared for equality, and for this
reason we often anonymize identifiers with HMAC-MD5. On the other
hand, assuming both HMAC and MD5 are secure, one can neither derive the
hash input from its output nor compute a hash value without knowing
the key, which makes it difficult to guess the original value of
hashed data. However, as discussed in the paper, the hash input must
be very carefully chosen to prevent indirect exposure. Below we will
always explicitly specify the hash input when a data type is
anonymized with HMAC-MD5.
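As a minimal sketch of the keyed hashing described above, the following Python uses the standard hmac and hashlib modules. The function names (new_trace_key, hmac_md5_hex), the use of os.urandom in place of reading /dev/random directly, and the length-prefixed encoding of tuple fields are all assumptions made for illustration; the original tool's input encoding is not specified here.

```python
import hashlib
import hmac
import os


def new_trace_key() -> bytes:
    """Return a fresh 128-bit key (os.urandom stands in for /dev/random)."""
    return os.urandom(16)


def hmac_md5_hex(key: bytes, *fields: str) -> str:
    """Hash a tuple of string fields with HMAC-MD5 and return the hex digest.

    Assumption: each field is length-prefixed so that the tuples
    <"ab", "c"> and <"a", "bc"> produce different hash inputs.
    """
    msg = b"".join(
        len(f.encode()).to_bytes(4, "big") + f.encode() for f in fields
    )
    return hmac.new(key, msg, hashlib.md5).hexdigest()
```

With a fixed key, equal inputs give equal outputs, so anonymized identifiers remain comparable within one trace, while a different key (a different trace) gives unrelated outputs.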
Anonymizing different types of data
Some data types are anonymized independent of the context
(PCAP/IP/TCP headers, FTP requests/replies) in which
the values appear.
- Timestamp/date: left in the clear.
- Remote client IP address: randomly re-numbered (with a
one-to-one mapping).
- Server IP address: addresses of a selected set of public
servers are left in the clear. Addresses of other servers are randomly
re-numbered (with a one-to-one mapping).
- Port number: left in the clear.
- User ID: left in the clear if: 1) it is "anonymous", "guest", or
"ftp"; or 2) the login failed and it is one of: "backdoor", "bomb",
"diag", "gdm", "issadmin", "msql", "netfrack", "netphrack", "own",
"r00t", "root", "ruut", "smtp", "sundiag", "sync", "sys", "sysadm",
"sysdiag", "sysop", "sysoper", "system", "toor", "tour", "y0uar3ownd";
or 3) it is in a trace-specific white list of user IDs, which may
include, for example, "annonymous".
Otherwise the user ID is anonymized. (Note: The traces are
preprocessed to filter out connections with successful non-anonymous
logins.)
A user ID is anonymized with HMAC-MD5. The hash input is the
3-tuple <ID, server-IP, whether-the-login-was-successful> rather than
just the ID, in order to prevent shared-text matching and known-text
matching inference attacks (see section 4.3 of the paper).
- Password: replaced with "<password>".
- File/directory name: for files on the selected set of public servers (see
the discussion of server IP addresses above), file names are left in the clear.
Other file names are anonymized with HMAC-MD5, with
<absolute-path-name, server-IP-address> as hash input.
- File size: left in the clear.
- Server software version/configuration: left in the clear.
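The user ID rules in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original implementation: the function name anonymize_user_id, exact (case-sensitive) string matching, and the NUL-separated encoding of the 3-tuple are all hypothetical choices.

```python
import hashlib
import hmac

# IDs that are always left in the clear.
ALWAYS_CLEAR = {"anonymous", "guest", "ftp"}

# IDs left in the clear only when the login failed.
CLEAR_IF_FAILED = {
    "backdoor", "bomb", "diag", "gdm", "issadmin", "msql", "netfrack",
    "netphrack", "own", "r00t", "root", "ruut", "smtp", "sundiag", "sync",
    "sys", "sysadm", "sysdiag", "sysop", "sysoper", "system", "toor",
    "tour", "y0uar3ownd",
}


def anonymize_user_id(key, user_id, server_ip, login_ok, whitelist=frozenset()):
    """Return the user ID in the clear or its HMAC-MD5 hex digest."""
    if user_id in ALWAYS_CLEAR:
        return user_id
    if not login_ok and user_id in CLEAR_IF_FAILED:
        return user_id
    if user_id in whitelist:  # trace-specific white list
        return user_id
    # Hash the 3-tuple <ID, server-IP, login-success>, not the bare ID.
    # Assumption: fields are joined with NUL bytes.
    msg = "\0".join((user_id, server_ip, "1" if login_ok else "0")).encode()
    return hmac.new(key, msg, hashlib.md5).hexdigest()
```

Because the server address is part of the hash input, the same user ID seen on two different servers maps to two unrelated digests.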
TCP/IP Header
- IP addresses and TCP ports are processed as described above.
- The transformed traces do not contain any IP fragmentation. IP
fragments in the original traces have been reassembled.
- TCP flags are preserved, except that a FIN may be moved
later in the trace if data was sent after the FIN initially
appeared, or if new data was added at the end of the trace.
- IP options are stripped out.
- The following TCP options are left in the clear: maximum segment
size, window scaling, SACK option negotiation (but not SACK
blocks, due to the ambiguity of the location of the SACK'd data in the
transformed stream), and timestamps; other options are replaced with
NOP.
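The TCP option filtering in the last item above might look like the sketch below, which walks the raw option bytes of a TCP header. The option kind numbers are the standard assignments (2 = maximum segment size, 3 = window scale, 4 = SACK permitted, 5 = SACK blocks, 8 = timestamps). The function name scrub_tcp_options, the choice to overwrite a removed option with the same number of NOP bytes so the header length is unchanged, and the handling of malformed options are assumptions.

```python
EOL, NOP = 0, 1
# Kinds left in the clear: MSS, window scale, SACK permitted, timestamps.
KEEP = {2, 3, 4, 8}


def scrub_tcp_options(opts: bytes) -> bytes:
    """Return option bytes with every non-kept option replaced by NOPs."""
    out = bytearray()
    i = 0
    while i < len(opts):
        kind = opts[i]
        if kind == EOL:
            # End of option list: zero out whatever padding follows.
            out += bytes(len(opts) - i)
            break
        if kind == NOP:
            out.append(NOP)
            i += 1
            continue
        # All other options carry a length byte covering kind+length+data.
        if i + 1 >= len(opts) or opts[i + 1] < 2 or i + opts[i + 1] > len(opts):
            # Malformed option (assumption): overwrite the rest with NOPs.
            out += bytes([NOP]) * (len(opts) - i)
            break
        length = opts[i + 1]
        if kind in KEEP:
            out += opts[i:i + length]
        else:
            out += bytes([NOP]) * length
        i += length
    return bytes(out)
```

SACK blocks (kind 5) are not in KEEP, so they are overwritten, while the SACK-permitted negotiation option (kind 4) passes through unchanged.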
FTP Request
An FTP request contains two parts: <command> <argument>.
The <command> part is left in the clear if it is a known FTP command,
i.e., it is either a "legal" FTP command, or it appears in a white
list. (The white list can be trace-specific to preserve commands with
typos, such as "UUSER".)
The <argument> part is anonymized according to its data type, if it is
one of the data types discussed above (e.g. file name, user ID). An
empty argument is always left unchanged. We discuss other arguments
below with their associated commands.
- AUTH: if the server rejects the request
and the argument is a well-known mechanism
(e.g. "GSSAPI", "KERBEROS_V4"), the
argument is left in the clear; otherwise, it is replaced by
"<auth>".
- Commands that take no argument (PWD, PASV, CDUP, etc.):
if the argument is in fact empty, it is left empty;
otherwise, the argument string is anonymized with HMAC-MD5
of input <command, argument>.
- Commands with pre-defined argument sets (TYPE, STRU, etc.):
for example, RFC 959 defines that the argument for TYPE should
match the regular expression /([AE]( [NTC])?)|I|(L [0-9]+)/.
The anonymizer leaves the argument in the clear if it matches
the predefined argument set; otherwise it anonymizes the argument
string with HMAC-MD5 of input <command, argument>.
- HELP: the argument is left in the clear if it is empty or
one of the known FTP commands; otherwise, it is anonymized with
HMAC-MD5 of input <command, argument>.
- PORT: the anonymizer tries to parse the argument to a <host,
port> pair. If the parsing succeeds, it anonymizes the host
address but leaves the port number in the clear,
and transforms the result back to the comma separated format.
If the parsing fails, the argument string is anonymized with
HMAC-MD5 of input <command, argument>.
- SITE: The SITE command is associated with a number
of attacks, and is not often used legitimately. The argument of a
SITE command itself contains a command and an argument. The
command is left in the clear if it is one of the well-known
SITE commands (e.g. EXEC, CHMOD,
HELP), and the argument is left in the clear if it is empty
or in a white list. Otherwise, the argument is anonymized with HMAC-MD5 of input
<command, argument>.
- Unknown commands: arguments left in the clear if white-listed;
otherwise, replaced with "<arg>".
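Two of the per-command rules above, TYPE and PORT, can be sketched as follows. This is an illustrative sketch only: the function names, the NUL-separated encoding of <command, argument>, and the renumber callback (a hypothetical stand-in for the one-to-one IP address mapping) are assumptions and not part of the original tool.

```python
import hashlib
import hmac
import re

# TYPE arguments as given by the regular expression quoted in the text.
TYPE_ARG = re.compile(r"([AE]( [NTC])?)|I|(L [0-9]+)")
# Six comma-separated decimal numbers: h1,h2,h3,h4,p1,p2.
PORT_ARG = re.compile(r"[0-9]{1,3}(,[0-9]{1,3}){5}")


def hash_arg(key, command, argument):
    """HMAC-MD5 of <command, argument> (assumed NUL-separated)."""
    msg = f"{command}\0{argument}".encode()
    return hmac.new(key, msg, hashlib.md5).hexdigest()


def anonymize_type_arg(key, argument):
    """Leave a well-formed (or empty) TYPE argument in the clear."""
    # fullmatch matters: a prefix match would accept "IX" via the "I" branch.
    if argument == "" or TYPE_ARG.fullmatch(argument):
        return argument
    return hash_arg(key, "TYPE", argument)


def anonymize_port_arg(key, argument, renumber):
    """Anonymize the host part of a PORT argument, keep the port bytes."""
    if PORT_ARG.fullmatch(argument):
        parts = argument.split(",")
        if all(int(p) <= 255 for p in parts):
            new_host = renumber(".".join(parts[:4]))
            # Back to the comma-separated format, port bytes untouched.
            return ",".join(new_host.split(".") + parts[4:])
    return hash_arg(key, "PORT", argument)
```

For example, with a renumber function that maps 192.168.1.2 to 10.0.0.7, the argument "192,168,1,2,19,137" becomes "10,0,0,7,19,137", whereas an argument that does not parse as six byte values is hashed as a whole.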
FTP Reply
Each line of an FTP reply message consists of a reply code and a text message.
The reply code is left in the clear as it does not reveal any private information.
The text message is matched against a set of message templates. If
there is a match, the message is split into fields, each
field being either a constant word or a variable with a known data type
(e.g. a file name, IP address). We then process each field accordingly
(see below) and anonymize parts that may reveal private
information. If the message does not match any template, the whole
message is replaced with "<message stripped out>".
Here is the list of data types of variable fields in reply messages
(IP addresses, port numbers, and file/directory names below are
processed as described earlier):
- cmd: the command part of the corresponding FTP
request (preserved or anonymized as in the request)
- arg: the argument part of the corresponding FTP
request (preserved or anonymized as in the request)
- num: a decimal number (preserved or replaced by
"num" as specified in the message template)
- port: six decimal numbers separated by commas, representing an <IP, port> pair
- IP: an IP address in dotted format (xxx.xxx.xxx.xxx)
- domain: a domain name (replaced by "<domain>")
- time: a time string with format
hh:mm[:ss][am,pm] (left in the clear)
- url: an HTTP URL (replaced by "<url>")
- email: an email address (replaced by "<email>")
- path: a file/directory name
- version: a version string (left in the clear)
- file mode: a file mode string, such as
"dr--w--x--" (replaced by "<file-mode>")
- *: the wildcard type, e.g. "Bob" in "Welcome to
Bob's FTP server" (replaced by "<*>")
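To illustrate the template matching of reply text described in this section, here is a small sketch. The three templates, the regular expressions used to recognize each field type, and all names (FIELD_TYPES, TEMPLATES, anonymize_reply_text) are hypothetical; the actual template set is not listed in this document. Only a few field types are covered, and the reply code is assumed to have been split off already.

```python
import re

# Field type -> (pattern without capturing groups, replacement function).
FIELD_TYPES = {
    "wild": (r".+?", lambda s: "<*>"),
    "email": (r"[^\s@]+@[^\s@]+", lambda s: "<email>"),
    "version": (r"[0-9][0-9A-Za-z.]*", lambda s: s),  # left in the clear
}

# Hypothetical templates: sequences of constant words and typed variables.
TEMPLATES = [
    [("const", "Welcome to "), ("wild", None), ("const", "'s FTP server")],
    [("const", "Please report problems to "), ("email", None), ("const", ".")],
    [("const", "Version "), ("version", None)],
]


def compile_template(template):
    """Build a regex with one capturing group per variable field."""
    parts = []
    for kind, text in template:
        if kind == "const":
            parts.append(re.escape(text))
        else:
            parts.append("(" + FIELD_TYPES[kind][0] + ")")
    return re.compile("".join(parts))


def anonymize_reply_text(text):
    """Anonymize the text part of one reply line."""
    for template in TEMPLATES:
        m = compile_template(template).fullmatch(text)
        if m is None:
            continue
        values = iter(m.groups())
        out = []
        for kind, literal in template:
            if kind == "const":
                out.append(literal)
            else:
                out.append(FIELD_TYPES[kind][1](next(values)))
        return "".join(out)
    # No template matched: drop the whole message.
    return "<message stripped out>"
```

Stripping any message that matches no template is the conservative default here: unrecognized text is never passed through.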
Send questions or comments to Ruoming Pang (rpang@cs.princeton.edu). Thanks!