
[1]the physics arXiv blog

[2]Anonymizing data without damaging it
Posted: 05 Nov 2008 11:39 PM CST

[3]graph.jpg

If scientists are to study massive datasets such as mobile phone records, search queries and movie ratings, the owners of these datasets need to find a way to anonymize the data before releasing it. The high-profile cracking of datasets such as the [4]Netflix prize dataset and the AOL search query dataset means that people would be wise not to trust these kinds of releases until the anonymization problem has been solved.

The general approach to anonymization is to change the data in some significant but subtle way to ensure that no individual is identifiable as a result. One way of doing this is to ensure that every record in the set is identical to at least one other record.

That's sensible but not always easy, point out Rajeev Motwani and Shubha Nabar at Stanford University in Palo Alto. For example, a set of search queries can be huge, covering the search habits of millions of people over many months. The variety of searches people make over such a period makes it hard to imagine that two entries would be identical. And analyzing and changing such a huge dataset in a reasonable period of time is tricky too.

Motwani and Nabar make a number of suggestions. Why not break the dataset into smaller, more manageable clusters, they ask. And why not widen the criteria for what it means to be identical, so that similar searches can be replaced with identical terms? For example, a search for "organic milk" could be replaced with a search for "dairy product". These ideas seem eminently sensible.

The problem becomes even more difficult when the data is in graph form, as it might be for mobile phone records or web chat statistics. So Nabar suggests a similar anonymizing technique: ensure that every node in the graph shares some number of its neighbors with a certain number of other nodes.
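To make the two ideas above concrete, here is a minimal Python sketch. It is not the algorithm from either paper: the GENERALIZE table, the function names, and the parameters k and shared are all assumptions made for illustration, and only the "organic milk" to "dairy product" pair comes from the post. The first part generalizes search terms and checks that every record matches at least one other; the second checks the neighbor-sharing condition on a small graph.

```python
from collections import Counter

# Hypothetical generalization table (specific query -> broader term).
# Only "organic milk" -> "dairy product" is from the post; the rest is invented.
GENERALIZE = {
    "organic milk": "dairy product",
    "cheddar cheese": "dairy product",
    "trail running shoes": "footwear",
    "leather boots": "footwear",
}

def generalize(record):
    """Replace each query in a record (a set of queries) with its broader term."""
    return frozenset(GENERALIZE.get(query, query) for query in record)

def is_k_anonymous(records, k=2):
    """True if every record is identical to at least k - 1 other records."""
    counts = Counter(records)
    return all(count >= k for count in counts.values())

def is_graph_anonymous(adjacency, k=1, shared=1):
    """True if every node shares at least `shared` neighbors
    with at least `k` other nodes (k and shared are assumed parameters)."""
    for node, neighbors in adjacency.items():
        similar = sum(
            1
            for other, other_neighbors in adjacency.items()
            if other != node and len(neighbors & other_neighbors) >= shared
        )
        if similar < k:
            return False
    return True

# Two users with different raw searches become indistinguishable once generalized.
raw = [
    frozenset({"organic milk", "leather boots"}),
    frozenset({"cheddar cheese", "trail running shoes"}),
]
anonymized = [generalize(record) for record in raw]
print(is_k_anonymous(raw), is_k_anonymous(anonymized))  # False True

# A four-node cycle a-c-b-d-a: a and b share both neighbors, as do c and d.
square = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
}
print(is_graph_anonymous(square, k=1, shared=2))  # True
```

Note that the generalized records are identical only because detail has been thrown away: the anonymized set no longer says who searched for milk and who for cheese.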
The trouble is that the anonymization technique can destroy the very patterns that you are looking for in the data, for example in the [5]way mobile phones are used. And at present, there's no way of knowing what has been lost.

So what these guys need to do next is find some measure of the data loss that their proposed changes cause, to give us a sense of how much damage is being done to the dataset during anonymization.

In the meantime, dataset owners should show some caution over how, why and to whom they release their data.

Ref:
[6]arxiv.org/abs/0810.5582: Anonymizing Unstructured Data
[7]arxiv.org/abs/0810.5578: Anonymizing Graphs

References

1. http://arxivblog.com/
2. http://feeds.feedburner.com/~r/arXivblog/~3/444018253/
3. http://arxivblog.com/wp-content/uploads/2008/11/graph.jpg
4. http://arxivblog.com/?p=142
5. http://arxivblog.com/?p=88
6. http://arxiv.org/abs/0810.5582
7. http://arxiv.org/abs/0810.5578

----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE