Data Mining 101: Finding Subversives with Amazon Wishlists
(as boingboing correctly stated, the right way to do it is to use the Amazon API) http://www.applefritter.com/bannedbooks Data Mining 101: Finding Subversives with Amazon Wishlists Submitted by Tom Owad on January 4, 2006 - 7:37pm. Security Vast deposits of personal information sit in databases across the internet. Terms used in phone conversations have become the grounds for federal investigation. Reputable organizations like the Catholic Worker, Greenpeace, and the Vegan Community Project, have come under scrutiny by FBI "counterterrorism" agents. "Data mining" of all that information and communication is at the heart of the furor over the recent disclosure of government snooping. "U.S. President George W. Bush and his aides have said his executive order allowing eavesdropping without warrants was limited to monitoring international phone and e-mail communications linked to people with connections to al-Qaeda. What has not been acknowledged, according to the Times, is that NSA technicians combed large amounts of phone and Internet traffic seeking patterns pointing to terrorism suspects. "Some officials described the program as a large data mining operation, the Times said, and described it as much larger than the White House has acknowledged." (Reuters) Combining a data mining operation with the Patriot Act's power to access information makes it all too easy for the federal government to violate the Constitution's prohibition against unreasonable search. Ars Technica has an article, The new technology at the root of the NSA wiretap scandal, that describes the ease with which widespread wiretapping can now be implemented. It quotes Philip Zimmermann, the creator of the PGP encryption software: "A year after the CALEA [Communications Assistance for Law Enforcement Act] passed [in 1994], the FBI disclosed plans to require the phone companies to build into their infrastructure the capacity to simultaneously wiretap 1 percent of all phone calls in all major U.S. cities. This would represent more than a thousandfold increase over previous levels in the number of phones that could be wiretapped. In previous years, there were only about a thousand court-ordered wiretaps in the United States per year, at the federal, state, and local levels combined. It's hard to see how the government could even employ enough judges to sign enough wiretap orders to wiretap 1 percent of all our phone calls, much less hire enough federal agents to sit and listen to all that traffic in real time. The only plausible way of processing that amount of traffic is a massive Orwellian application of automated voice recognition technology to sift through it all, searching for interesting keywords or searching for a particular speaker's voice. If the government doesn't find the target in the first 1 percent sample, the wiretaps can be shifted over to a different 1 percent until the target is found, or until everyone's phone line has been checked for subversive traffic. The FBI said they need this capacity to plan for the future. This plan sparked such outrage that it was defeated in Congress. But the mere fact that the FBI even asked for these broad powers is revealing of their agenda." It used to be you had to get a warrant to monitor a person or a group of people. Today, it is increasingly easy to monitor ideas. And then track them back to people. Most of us don't have access to the databases, software, or computing power of the NSA, FBI, and other government agencies. But an individual with access to the internet can still develop a fairly sophisticated profile of hundreds of thousands of U.S. citizens using free and publicly available resources. Here's an example. There are many websites and databases that could be used for this project, but few things tell you as much about a person as the books he chooses to read. Isn't that why the Patriot Act specifically requires libraries to release information on who's reading what? For this reason, I chose to focus on the information contained in the popular Amazon wishlists. Amazon wishlists lets anyone bookmark books for later purchase. By default these lists are public and available to anybody who searches by name. If the wishlist creator specifies a shipping address, someone else can even purchase the book on Amazon and have it shipped directly as a gift. The wishlist creator's city and state are made public on the wishlist, but the street address remains private. Amazon's popularity has created a vast database of wishlists. No index of all wishlists is available, but it remains possible to view all wishlists by people of a particular first name. A recent search for people named Mark returned 124,887 publicly viewable wishlists. For an all inclusive search by name, you could compile a comprehensive list of first names and nicknames from the baby names databases available on the internet. Armed with this list, and by recording the search results for each first name, it is possible for you to retrieve the vast majority of public wishlists on Amazon. For the purposes of this exercise, only a single name was chosen . a common male name that returned over 260,000 wishlists. I'm not going to divulge what name was actually used. Let's pretend it was "Edgar," in honor of former FBI director J. Edgar Hoover. Before writing a script to download all the 260,000 "Edgar" wishlists, I confirmed that my actions would not violate Amazon's Conditions of Use. I also checked the robots.txt file which contains a list of directories Amazon requests not be traversed by scripts. User wishlists are not in this list, nor did the actions to be taken violate the conditions of use. I started by doing a wishlist search for people named "Edgar" and got back a page linking to the wishlists of the first 25 matches. The url looked something like this: http://www.amazon.com/gp/registry/search.html/?encoding=UTF8&type=wishlist&fi eld-name=edgar&page=1 Two variables extracted from the above url are of particular note: * field-name=edgar * page=1 Changing "edgar" to "george", would generate the first page of matches for people named George. Change '1' to '2' and you'd get matches 26 through 50 instead of 1 through 25. Using a simple 6-line shell script and the popular wget command line tool, I configured two computers on two different DSL connections to begin downloading all 260,000 wishlists in increments of 25,000. Each group of 25,000 wishlists took about four hours to download, for a total download time of less than one day. Each wishlist is located at an address like this: http://www.amazon.com/gp/registry/registry.html/?encoding=UTF8&type=wishlist& id=1DBHU3OCV72ZW 1DBHU3OCV72ZW is the wishlist owner's unique Amazon identification number. I made up the one you see here. By directing wget only to download pages at urls similar to this one, and by incrementing the search page from 1 to 10,400, it is possible to download all 260,000 wishlists without user intervention. Using a pair of 5-year-old computers, two home DSL connections, 42 hours of computer time, and 5 man hours, I now had documents describing the reading preferences of 260,000 U.S. citizens. I downloaded all the files to an external 120 GB Firewire drive in UFS format. The raw data occupied little more than 5 GB. I initially wanted to move all the files into a single directory to facilitate searching, but as the directory contents exceeded 100,000 items, the speed became glacially slow, so I kept the data divided into chunks of 25,000 wishlists. Next comes the fun part . what books are most dangerous? So many to choose from. Here's a sample of the list I made. Feel free to make up your own list if you decide to try some data mining. Send it to the FBI. I'm sure they'll appreciate your help in fighting terrorism. * On Liberty by Stuart Mill. First sentence: "The subject of this essay is not the so-called 'liberty of the will', so unfortunately opposed to the misnamed doctrine of philosophical necessity; but civil, or social liberty: the nature and limits of the power which can be legitimately exercised by society over the individual." What more do you need? * Slaughterhouse-Five by Kurt Vonnegut. The classic anti-war novel. * Fahrenheit 451 by Ray Bradbury. Dystopian. * Brave New World by Aldous Huxley. More dystopian. * 1984 by George Orwell. Most dystopian. * Critical Thinking by Alec Fisher. Can't have any of that. * Build Your Own Laser, Phaser, Ion Ray Gun and Other Working Space Age Projects by Robert Iannini. Obviously. * Apple I Replica Creation by Tom Owad. Building your own computer should be illegal. (ok, it's also here because I wrote it.) * The Catholic Worker Movement: Intellectual And Spiritual Origins by Mark & Louise Zwick. Keywords * Michael Moore. The fringe left. * Rush Limbaugh. The fringe right. * Ralph Nader. * Greenpeace. Because frankly, we all know there's only one sort of person who would want a "Greenpeace: Standing Up for the Earth" 2006 Calendar. * Torah. * Quran & Koran. Like the Catholic Worker and Greenpeace, the American-Arab Anti-Discrimination Committee has also been the subject of FBI investigations. * Bible. Sure, a lot of books use "Bible" in the title, but I cast a wide net. What harm are a few false positives? My Amazon seller ID is attached to these links. If I get any interesting statistics on how many copies of On Liberty, etc., are sold as a result of this article, I'll post them in a follow-up. If I get a call from the FBI, I'll let you know that, too. To search for specific books, I used ISBN numbers, for the rest, keywords. All the search terms were saved to terms.txt, one term per line, for use with grep: ls -1 | xargs grep -HiFof /Volumes/UFS/terms.txt > /Volumes/UFS/matches.txt This command searches all wishlists in the current directory for the terms in terms.txt, then saves the results to matches.txt. Results are stored one per line, in the format: filename:keyword Now that I have a list of which keywords appear in which wishlists, I can sort them. I created a new folder "results" and within it created subfolders for each search term. The TCL script below creates links (similar to aliases or shortcuts) for each matched file, and stores the links within the new subdirectories: #!/usr/bin/tclsh set fdgrep [open "/Volumes/UFS/matches.txt" "r"] while {![eof $fdgrep]} { gets $fdgrep line set mylist [split $line :] if {[llength $mylist] > 1} { lappend mylist [string toupper [lindex $mylist 1]] if {![file exists "/Volumes/UFS/results/[lindex $mylist 2]/[lindex $mylist 0]"]} { exec ln /Volumes/UFS/wishlists/[lindex $mylist 0] "/Volumes/UFS/results/[lindex $mylist 2]/[lindex $mylist 0]" } } } Now, for example, the folder called "Greenpeace" contains every wishlist with that term. Another folder named "Rush Limbaugh" contains the wishlists of all the those interested in reading Rush. On an aside, if you want to delete all the files beginning with the word "search" in a 25,000-file directory, the correct line is: find . -name 'search*' -print0 | xargs -0 rm This line deletes all the files: find . -print0 -name 'search*' | xargs -0 rm Good thing I had backups. There's also a bug, in grep 2.5.1 that corrupts output when grep is run with both the -i and -o flags. Version 2.5.1-1, available through the Fink project, fixes this problem. One curiousity revealed by this project is that there are quite a few people who show up for multiple books. Reading On Liberty and Build Your Own Laser, Phaser, Ion Ray Gun and Other Working Space Age Projects? We really should have a special list for you. Here are the books, along with the numbers of people interested in reading each: Book # of people interested On Liberty 7 Slaughterhouse-Five 82 Fahrenheit 451 63 Brave New World 1 1984 76 Critical Thinking 7 Build Your Own Laser 2 Apple I Replica Creation 4 The Catholic Worker Movement 1 Rebuilding Labor 2 Michael Moore 232 Rush Limbaugh 42 Ralph Nader 74 Greenpeace 5 Torah 42 Quran & Koran 74 Bible 3,771 The first match for "Bible," ironically, was a wishlist containing The Cannabis Grow Bible: The Definitive Guide to Growing Marijuana for Recreational and Medical Use. Right person. Wrong list. Another match was for The Linux Bible: GNU Testament. With Nader, I foolishly searched for last name alone. Thus, there are quite a few hits for The Lemonader along with the correct results. If some results look suspiciously low, it's probably because in many cases I searched for a specific ISBN while the book is available in multiple formats. Only the first page of each user's wishlist was downloaded. Books are always added to the front of the wishlist which pushes older titles off the first page, so there is also a slight bias in favor of newer books. It is possible for users to associate a shipping address with their wishlists, so that others can order them gifts. Though the full address is hidden, city and state remain visible. I already have first and last name. With this information, I can do a Yahoo People Search to obtain an exact street address and phone number. Viewing the wishlists that contained Apple I Replica Creation, I found that all four provided the user's city and state. Of these four, one was a common name that produced multiple hits in his town, two were unlisted (although one of them was in the Intelius database which I opted not to pay for), and the final individual was present on Yahoo People. So I sent him a signed copy and thanked him for his interest. Thanks to Google Maps (and many similar services) a street address is all we need to get a satellite image of a person's home. Tempted as I was to provide satellite images of the homes of the search subjects, it just seemed a bit extreme even for this article. Instead, I opted only to pinpoint the centers of the towns in which they live. So at least you'll know that there's somebody in your community reading Critical Thinking or some other dangerous text. City and state were extracted using a regular expression to create a file for each book containing the locations of its readers. Locations were stored one per line, in this format: Sunnyvale:CA Salt Lake City:Utah Reston:Virginia South Hadley:MA Nevada City:CA Walnut Creek:CA Eagle Nest:NM Memphis:TN North Hollywood:CA Seattle:WA . Using the free Ontok Geocoder service, I was able to quickly convert city and state to latitude and longitude coordinates. Ontok uses the public domain TIGER/Line data available from the U.S. Census Bureau to perform its conversion. It took less than an hour to convert all locations from city and state to longitude and latitude: -122.035011, 37.369011 -111.903656, 40.696415 -77.341591, 38.968300 -72.574860, 42.259102 -121.013496, 39.262192 -122.063980, 37.906521 -105.263031, 36.555302 -90.045448, 35.148762 -118.377838, 34.173100 -122.329430, 47.605701 . Google has released their Maps API, so a map of these locations can be embedded in this article. The API is simple. Plotting each point requires only three lines of code: var point = new GPoint(-122.035011, 37.369011); var marker = new GMarker(point); map.addOverlay(marker); This plots all of the locations on a satellite image of the United States that can be zoomed in to house level. Here are a few interactive samples: Readers of 1984. Readers of the Torah. You. The map pinpointing you (your local ISP, actually) requires a good bit of on-the-fly processing, so if the server is exceptionally busy it may not load correctly. In the future, I may make more sophisticated maps using additional data. Maybe a map that includes all the books in the 260,000 wishlists? Simply searching for any book would present a map of the United States showing the locations of all the people interested in reading it. All the tools used in this project are standard and free. The services, likewise, are all free. The technical skills required to implement this project are well within the abilities of anybody who has done any programming. The network connection used to download these files was a standard home DSL connection. The computer that processed the data was a 1.5 GHz PowerBook G4. The operating system is Mac OS X 10.4, though everything could have been done just as easily with Linux (and probably with Windows). Not a penny was spent in the writing of this article, just 30 hours of time. This is what's possible with publicly available information, but imagine if one had access to Amazon's entire database - which still contains every sale dating back to 1999 by the way. Under Section 251 of the Patriot Act, the FBI can require Amazon to turn over its records, without probable cause, for an "authorized investigation . . . to protect against international terrorism or clandestine intelligence activities." Amazon is forbidden to disclose that they have turned over any records, so that you would never know that the government is keeping records of your book purchases. And obviously it is quite simple to crossreference this info with data available in other databases. On a final note, the FBI is now hiring computer scientists to implement a project that sounds very similar to what I just did: "Currently, the FBI is strengthening systems engineering in order to tie new systems together architecturally and ensure that standards for custom and packaged applications are enforced, and it needs engineers to accomplish this goal, the agency said. "The FBI is also focusing on data warehousing as well as federated search technology, which allows a single search query to be deployed across a number of databases, regardless of whether those databases belong to the same protocol or platform. "'Warehousing has been very successful, yet enterprise extraction, translation and loading processes must be fine-tuned,. the FBI said. .Data engineers are needed to model legacy databases for federated search and participate in legacy transition planning.'"(Computerworld) This article is the first in a weekly series that will deal with security on the internet and practical steps you can take to protect your privacy. Much thanks goes to Robert Warwick for his help with this project and particularly for writing several of the scripts. Thanks also to Nancy Trump for editing, Michael Fincham for brainstorming and Dr. Bob for bandwidth. Article submissions are welcome. If you'd like to contact me, please do so via email. -- Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE [demime 1.01d removed an attachment of type application/pgp-signature which had a name of signature.asc]
participants (1)
-
Eugen Leitl