AOL Releases Search Logs from 500,000 Users
[1]Adam D'Angelo - 8/5/2006

AOL just released the logs of all searches done by 500,000 of their users over the course of three months earlier this year. That means that if you happened to be randomly chosen as one of these users, everything you searched for from March to May (2006) is now public information on the internet.

This was not a leak - it was intentional. In their desperation to gain recognition from the research community, AOL decided they would compromise their integrity to provide a data set that might become often-cited in research papers: "Please reference the following publication when using this collection..." is the message before the download.

This is a blatant violation of users' privacy. The data is "anonymized", which to AOL means that each screenname was replaced with a unique number. "It is still a research question how much information needs to be anonymized to protect users," [9]says Abdur from AOL.

Here are some examples of what you can find in the data:

* User 491577 searches for "florida cna pca lakeland tampa", "emt school training florida", "low calorie meals", "infant seat", and "fisher price roller blades".
* Among user 39509's hundreds of searches are: "ford 352", "oklahoma disciplined pastors", "oklahoma disciplined doctors", "home loans", and some other personally identifying and illegal stuff I'm going to leave out of here.
* Among user 545605's searches are "shore hills park mays landing nj", "frank william sindoni md", "ceramic ashtrays", "transfer money to china", and "capital gains on sale of house".

Compared to some of the data, these examples are on the safe side. I'm leaving out the worst of it - searches for names of specific people, addresses, telephone numbers, illegal drugs, and more. There is no question that law enforcement, employers, or friends could figure out who some of these people are.
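To illustrate why swapping each screenname for a number offers so little protection, here is a minimal Python sketch. It is not AOL's actual procedure, which the post does not describe: the sequential numbering, the function names `pseudonymize` and `profiles`, and the screen names `screenname_a` and `screenname_b` are all hypothetical. The queries are ones quoted above. The point is only that a stable per-user number keeps every query from the same person linked together.

```python
from collections import defaultdict
from itertools import count


def pseudonymize(rows):
    """Replace each screen name with a stable number (assumed scheme).

    rows is an iterable of (screenname, query) pairs. The same screen
    name always maps to the same number, so linkage is preserved.
    """
    ids = {}
    next_id = count(1)
    released = []
    for screenname, query in rows:
        if screenname not in ids:
            ids[screenname] = next(next_id)
        released.append((ids[screenname], query))
    return released


def profiles(rows):
    """Group released (user_number, query) pairs back into per-user lists."""
    by_user = defaultdict(list)
    for user_number, query in rows:
        by_user[user_number].append(query)
    return dict(by_user)


# Hypothetical screen names; the queries are quoted from the post.
raw = [
    ("screenname_a", "florida cna pca lakeland tampa"),
    ("screenname_b", "ford 352"),
    ("screenname_a", "emt school training florida"),
    ("screenname_a", "infant seat"),
    ("screenname_b", "home loans"),
]

released = pseudonymize(raw)

if __name__ == "__main__":
    for user_number, queries in profiles(released).items():
        print(user_number, queries)
```

The names are gone from `released`, yet anyone holding it can rebuild each user's full search history with a single pass, and it is the combination of queries (a town, a job, a baby on the way) that identifies a person.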
I hope others can find more examples in the data, which is up for [10]download over here. The data set is very large when uncompressed, which makes it pretty hard to work with, but someone should set up a web interface so people can browse it (or even 10% of it) without having to download the 400 MB file. If you make a mirror or better interface to the data, or find other examples, let me know and I'll put a link up here.

This is the same data that the DOJ wanted from Google back in March. [11]This ruling allowed Google to keep all query logs secret. Now any government can just go download the data from AOL. It's unclear if this is the type of data AOL released to the government [12]back when Google refused to comply.

If nothing else, this should be a good example of why search history needs strong privacy protection.

Thanks to Greg Linden for pointing this out [13]here.

Update 2: The md5 of the file AOL posted (and now removed) is 31cd27ce12c3a3f2df62a38050ce4c0a. I'm posting it so you can make sure you have a valid copy, but so far none of the copies I've seen are fake.

Update: Seems like AOL took it down. There are some mirrors of the data in the comments of the digg story, linked below. I estimate about 1000 people have the file, so it's definitely going to be circulated around.

The [2]main AOL research page is still up, with some other data collections. The [3]google cache of the download page is still up, but you can't get the data.

Here's discussion at other sites:

* [4]siliconbeat
* [5]techcrunch
* [6]digg
* [7]reddit
* [8]zoli's blog

References

1. http://www.ugcs.caltech.edu/~dangelo/
2. http://research.aol.com/pmwiki/pmwiki.php?n=Main.Home
3. http://72.14.207.104/search?q=cache:2Qvd2z9VbuIJ:research.aol.com/pmwiki/pmwiki.php%3Fn%3DResearch.500kUserQueriesSampledOver3Months+&hl=en&gl=us&ct=clnk&cd=1
4. http://www.siliconbeat.com/entries/2006/08/06/aol_research_exposes_data_weve_got_a_little_sick_feeling.html
5. http://www.techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/
6. http://digg.com/tech_news/AOL_Releases_Search_Logs_from_500_000_Users
7. http://reddit.com/info/cfvt/comments
8. http://www.zoliblog.com/blog/_archives/2006/8/6/2204969.html
9. http://research.aol.com/pmwiki/pmwiki.php?n=Research.500kUserQueriesSampledOver3Months
10. http://research.aol.com/pmwiki/pmwiki.php?n=Research.500kUserQueriesSampledOver3Months
11. http://googleblog.blogspot.com/2006/03/judge-tells-doj-no-on-search-queries.html
12. http://www.boingboing.net/2006/01/20/aol_we_did_not_compl.html
13. http://glinden.blogspot.com/2006/08/chance-to-play-with-big-data.html

--
Seth Finkelstein  Consulting Programmer  http://sethf.com
Infothought blog - http://sethf.com/infothought/blog/
Interview: http://sethf.com/essays/major/greplaw-interview.php

----- End forwarded message -----
--
Eugen* Leitl http://leitl.org