[Originally posted on September 25, 2012, a day after I informed IEEE and they restricted access to the log files.]
Using the data to gain insights into the engineering and scientific community
IEEE suffered a data breach which I discovered on September 18 (UPDATE: the breach is now confirmed). For a few days I was uncertain what to do with the information and the data. On September 24, I let them know, and they fixed (at least partially) the problem. The usernames and passwords kept in plaintext were publicly available on their FTP server for at least one month prior to my discovery. Among the almost 100,000 compromised users are Apple, Google, IBM, Oracle and Samsung employees, as well as researchers from NASA, Stanford and many other places. I did not and will not make the raw data available to anyone else.
IEEE and the log story
IEEE (Institute of Electrical and Electronics Engineers) is renowned as one of the world's leading organizations in standards development and in the promotion of scientific and educational development within Electrical, Electronics, Communications, Computer Engineering, Computer Science and related fields. The organization has more than 415,000 members all over the world, almost half of them in the United States.
By the nature of the organization, IEEE members are highly specialised individuals, many of them working on critical industry, governmental and military projects. Furthermore, it would be reasonable to assume that an organization publishing leading security-focused publications would value the privacy of its members and be proactive in keeping their data secure.
Due to several undoubtedly grave mistakes, the ieee.org usernames and plaintext passwords of around 100,000 IEEE members were publicly available on the IEEE FTP server for at least one month. Furthermore, all the actions these users performed on the ieee.org website were also available. Separately, spectrum.ieee.org visitor activity was also publicly available.
The simplest and most important mistake on the part of the IEEE web administrators was that they failed to restrict access to their web server logs for both ieee.org and spectrum.ieee.org, allowing these to be viewed by anyone going to the address ftp://ftp.ieee.org/uploads/akamai/ (closed on September 24 around 13:00 UTC, after I reported it). In these logs, as is the norm, every web request was recorded (more than 376 million HTTP requests in total). Web server logs should never be publicly available, since they usually contain information that can be used to identify users (sometimes even after the log was anonymized, as in the "AOL incident"). However, this case is much worse, since 411,308 of the log entries contain both usernames and passwords. Out of these, there seem to be 99,979 unique usernames.
While leaving an FTP directory containing 100 GB of logs publicly open could be a simple mistake in setting access permissions, keeping both usernames and passwords in plaintext is much more troubling. Storing only a salted cryptographic hash of each password is considered best practice, since it would mitigate exactly such an access permission mistake. Also, keeping passwords in logs is inherently insecure, especially plaintext passwords, since any employee with access to the logs (for the purpose of analysis, monitoring or intrusion detection) could pose a threat to the privacy of users.
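To make the two points above concrete, here is a minimal Python sketch of what salted password hashing and log redaction could look like. It is not how IEEE's systems work or should work in detail. The iteration count, the choice of PBKDF2-HMAC-SHA256, the parameter names in `SENSITIVE_PARAMS` and the helper names are all my own assumptions for illustration.

```python
import hashlib
import hmac
import os
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed work factor; a real deployment would tune this for its hardware.
PBKDF2_ITERATIONS = 600_000
# Hypothetical query parameter names that should never reach a log file.
SENSITIVE_PARAMS = {"password", "passwd", "pwd"}


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); only these are stored, never the password."""
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS
    )
    return hmac.compare_digest(candidate, expected)


def redact_url(url: str) -> str:
    """Replace the values of sensitive query parameters before logging a URL."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    cleaned = [
        (key, "REDACTED" if key.lower() in SENSITIVE_PARAMS else value)
        for key, value in pairs
    ]
    return urlunsplit(parts._replace(query=urlencode(cleaned)))
```

With something along these lines, a leaked credential store would expose only salts and digests, and a leaked access log would show `password=REDACTED` instead of the password itself. Sending credentials in the request body over HTTPS, rather than in the URL, would avoid the logging problem in the first place.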
It is certainly unfortunate that this information leaked, and there is no telling who obtained it before access was closed. Maybe there are access logs for the FTP server, so that the damage can be assessed. In any case, the affected users will probably have to be informed, since it is my understanding that the law requires it (UPDATE: IEEE informs members). In Europe there is Article 4 of the Directive on privacy and electronic communications (Directive 2002/58/EC) and its amendment (Directive 2009/136/EC). In the US, 46 states seem to have similar requirements.
While the cause of the data breach has been solved, one must point out the value of this dataset from a research perspective. It is rare that researchers gain access to such rich datasets. Various ethical and privacy-related considerations must be evaluated before such datasets can be publicly released. Deciding how to anonymize the data is not easy. Simply excluding any information that makes users directly identifiable is not enough, as past dataset releases have shown that some users can still be pinpointed. This resulted in lawsuits in the case of Netflix and AOL, or the withdrawal of the data, as in the recent Wikipedia case. For this reason, companies such as Google prefer to keep such data for study by internal researchers and do not release it to the public. Furthermore, some companies release data to a trusted researcher on the condition that the company remain anonymous, as an unnamed European mobile phone operator did for Albert-László Barabási. This means academic researchers working in fields such as Information Retrieval have limited access to fresh real-world data, which puts them at a disadvantage compared to their industrial counterparts.
For these reasons, I cannot resist the urge to perform a basic analysis of this serendipitously acquired data, although I acknowledge this might be ethically dubious. However, I did not, and do not plan to, release the raw log data to anyone else.
- Log data time span: 01/Aug/2012:20:46:28 +0000 to 18/Sep/2012:08:47:17 +0000
- Total number of log entries: 376,021,496
- Log entries for ieee.org: 301,319,566
- Log entries for spectrum.ieee.org: 74,701,930
- Log entries with password details: 411,308 (of which 17,157 are password reset requests and have no username field)
The following analysis is based only on the 411,308 log entries with password details, accounting for 99,979 distinct username values.
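For readers curious how counts like the ones above can be derived, here is a rough Python sketch of such a tally. It is not the script I used, and the log format is an assumption: a Common Log Format style request line with hypothetical `username` and `password` query parameters. The sample lines are synthetic and contain no real data.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Assumed request-line layout: "METHOD URL PROTOCOL", quoted as in Common Log Format.
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

# Synthetic sample lines (documentation IP range, made-up credentials).
SAMPLE = [
    '192.0.2.1 - - [01/Aug/2012:20:46:28 +0000] "GET /index.html HTTP/1.1" 200 512',
    '192.0.2.2 - - [01/Aug/2012:20:46:29 +0000] "POST /login?username=alice&password=x HTTP/1.1" 302 0',
    '192.0.2.2 - - [01/Aug/2012:20:47:00 +0000] "POST /login?username=alice&password=y HTTP/1.1" 302 0',
    '192.0.2.3 - - [01/Aug/2012:20:48:00 +0000] "GET /reset?password=z HTTP/1.1" 200 10',
]


def summarize(lines):
    """Count all entries, entries carrying a password, and distinct usernames."""
    total = 0
    with_password = 0
    usernames = Counter()
    for line in lines:
        total += 1
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # malformed or non-request line
        params = parse_qs(urlsplit(match.group(1)).query, keep_blank_values=True)
        if "password" not in params:  # hypothetical parameter name
            continue
        with_password += 1
        # Password reset requests may carry no username at all.
        for name in params.get("username", []):
            usernames[name] += 1
    return {
        "total": total,
        "with_password": with_password,
        "distinct_usernames": len(usernames),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Because `summarize` accepts any iterable of lines, it can be fed an open file object directly, so a log of many gigabytes never has to be loaded into memory at once.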
2. Top Journals in Security & Privacy – Microsoft Academic Search
3. Researchers yearn to use AOL logs, but they hesitate, Katie Hafner – New York Times, 2006
5. Robust De-anonymization of Large Sparse Datasets [PDF], Arvind Narayanan and Vitaly Shmatikov – The University of Texas at Austin, 2007
6. A Face Is Exposed for AOL Searcher No. 4417749, Michael Barbaro and Tom Zeller Jr. – New York Times, 2006
7. What are readers looking for? Wikipedia search data now available – Wikimedia Blog, August 2012
8. Cellphone Tracking Study Shows We’re Creatures of Habit, John Schwartz – New York Times, 2008