the physics arXiv blog

Thu Nov 6 12:08:20 PST 2008


   [2]Anonymizing data without damaging it

   Posted: 05 Nov 2008 11:39 PM CST


   If scientists are to study massive datasets such as mobile phone
   records, search queries and movie ratings, the owners of these
   datasets need to find a way to anonymize the data before releasing it.

   The high-profile cracking of datasets such as the [4]Netflix prize
   dataset and the AOL search query dataset means that people would be
   wise not to trust these kinds of releases until the anonymization
   problem has been solved.

   The general approach to anonymization is to change the data in some
   significant but subtle way to ensure that no individual is
   identifiable as a result. One way of doing this is to ensure that
   every record in the set is identical to at least one other record.
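   The "identical to at least one other record" requirement is
   essentially k-anonymity with k = 2. A minimal sketch of the check in
   Python, using made-up (age band, zip prefix) records in place of real
   data:

```python
from collections import Counter

def is_k_anonymous(records, k=2):
    """True if every record appears at least k times, so no record
    is unique (k=2: each record is identical to at least one other)."""
    counts = Counter(records)
    return all(c >= k for c in counts.values())

# Toy records; the fields are illustrative, not from the paper.
safe = [("20-29", "941"), ("20-29", "941"),
        ("30-39", "100"), ("30-39", "100")]
risky = [("20-29", "941"), ("30-39", "100")]  # every record unique

print(is_k_anonymous(safe))   # each record has a twin
print(is_k_anonymous(risky))  # both records are identifiable
```

   In practice the hard part is not checking the property but
   transforming the data until it holds, which is where the
   suggestions below come in.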

   That's sensible but not always easy, point out Rajeev Motwani and
   Shubha Nabar at Stanford University in Palo Alto. For example, a set
   of search queries can be huge, covering the search habits of millions
   of people over many months. The variety of searches people make over
   such a period makes it hard to imagine that any two entries would be
   identical. And analyzing and changing such a huge dataset in a
   reasonable period of time is tricky too.

   Motwani and Nabar make a number of suggestions. Why not break the data
   set into smaller, more manageable clusters, they ask? And why not
   widen the criteria for what it means to be identical, so that similar
   searches can be replaced with identical terms? For example, a search
   for "organic milk" could be replaced with a search for "dairy
   product". These ideas seem eminently sensible.
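   The generalization idea can be pictured with a toy category map. The
   map below is an illustrative assumption, not the authors' actual
   scheme: specific queries are replaced by a broader term so that
   "similar" searches become identical records.

```python
# Hypothetical generalization map; the entries are made up for
# illustration and are not taken from the paper.
GENERALIZE = {
    "organic milk": "dairy product",
    "cheddar cheese": "dairy product",
    "skimmed milk": "dairy product",
}

def generalize_queries(queries):
    """Map each query to its broader category, leaving unmapped
    queries unchanged."""
    return [GENERALIZE.get(q, q) for q in queries]

print(generalize_queries(["organic milk", "cheddar cheese", "jazz"]))
```

   After this pass, the three distinct dairy searches collapse into one
   repeated record, which makes the k-anonymity condition far easier to
   satisfy.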

   The problem becomes even more difficult when the data is in graph
   form, as it might be for mobile phone records or web chat statistics.
   So Nabar suggests a similar anonymizing technique: ensure that every
   node in the graph shares some number of its neighbors with a certain
   number of other nodes.
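   As a rough sketch of that graph condition, assuming an adjacency-set
   representation (the thresholds m and k below are placeholders, not
   parameters from the paper): a node is safe if at least k other nodes
   share at least m of its neighbors.

```python
def shares_enough_neighbors(adj, m=1, k=1):
    """Check that every node shares at least m neighbors with at
    least k other nodes, so no node's neighborhood is unique.
    adj maps each node to the set of its neighbors."""
    for v, nbrs in adj.items():
        similar = sum(
            1 for u, other in adj.items()
            if u != v and len(nbrs & other) >= m
        )
        if similar < k:
            return False
    return True

# A triangle: every pair of nodes shares the third as a neighbor.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
# A path a-b-c: the middle node's neighborhood {a, c} is unique.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

print(shares_enough_neighbors(triangle))
print(shares_enough_neighbors(path))
```

   On a phone-records graph, a node whose calling pattern is shared by
   nobody else is exactly the kind of node an attacker can re-identify.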

   The trouble is that the anonymization technique can destroy the very
   patterns that you are looking for in the data, for example in the
   [5]way mobile phones are used. And at present, there's no way of
   knowing what has been lost.

   So what these guys need to do next is find some measure of the data
   loss that their proposed changes cause, to give us a sense of how much
   damage is being done to the dataset during anonymization.
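   One naive form such a measure might take, purely as a sketch and not
   anything the authors propose: count what fraction of the distinct
   values in the data the generalization step collapsed away.

```python
def generalization_loss(original, anonymized):
    """A crude utility-loss score: the fraction of distinct values
    destroyed by generalization. 0.0 means no distinctions lost;
    values near 1.0 mean almost all distinctions were erased."""
    before = len(set(original))
    after = len(set(anonymized))
    return 1 - after / before

orig = ["organic milk", "cheddar cheese", "jazz records"]
anon = ["dairy product", "dairy product", "jazz records"]
print(generalization_loss(orig, anon))  # 3 distinct values -> 2
```

   A real measure would have to be tied to the patterns researchers
   actually query for, since a transformation that looks mild by this
   count could still wipe out exactly the structure being studied.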

   In the meantime, dataset owners should show some caution over how, why
   and to whom they release their data.


   [6] Anonymizing Unstructured Data

   [7] Anonymizing Graphs



