the physics arXiv blog
howdy at arxivblog.com
Thu Nov 6 12:08:20 PST 2008
[1]the physics arXiv blog
[2]Anonymizing data without damaging it
Posted: 05 Nov 2008 11:39 PM CST
[3]graph.jpg
If scientists are to study massive datasets such as mobile phone
records, search queries and movie ratings, the owners of these
datasets need to find a way to anonymize the data before releasing it.
The high-profile cracking of datasets such as the [4]Netflix prize
dataset and the AOL search query dataset means that people would be
wise not to trust these kinds of releases until the anonymization
problem has been solved.
The general approach to anonymization is to change the data in some
significant but subtle way to ensure that no individual is
identifiable as a result. One way of doing this is to ensure that
every record in the set is identical to at least one other record.
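To make that condition concrete, here is a minimal sketch in Python of a
check that every record matches at least one other. This idea is commonly
known as k-anonymity (here with k = 2); the function name and the sample
records are hypothetical and are not taken from the papers discussed in
this post.

```python
from collections import Counter

def is_k_anonymous(records, k=2):
    """Return True if every distinct record occurs at least k times."""
    counts = Counter(tuple(r) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical records: (age band, truncated postcode)
records = [
    ("30-39", "941**"),
    ("30-39", "941**"),
    ("40-49", "943**"),  # unique, so this person could be singled out
]

print(is_k_anonymous(records))      # the third record has no twin
print(is_k_anonymous(records[:2]))  # the first two records match each other
```

With k = 2 this is exactly the "identical to at least one other record"
test; a larger k asks for bigger groups of indistinguishable records.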
That's sensible but not always easy, point out Rajeev Motwani and
Shubha Nabar at Stanford University in Palo Alto. For example, a set
of search queries can be huge, covering the search habits of millions
of people over many months. The variety of searches people make over
such a period makes it hard to imagine that two entries would be
identical. And analyzing and changing such a huge dataset in a
reasonable period of time is tricky too.
Motwani and Nabar make a number of suggestions. Why not break the
dataset into smaller, more manageable clusters, they ask. And why not
widen the criteria for what it means to be identical, so that similar
searches can be replaced with identical terms? For example, a search
for "organic milk" could be replaced with a search for "dairy
product". These ideas seem eminently sensible.
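The generalization idea can be sketched in a few lines of Python. Only
the "organic milk" to "dairy product" mapping comes from the example
above; the other table entries, the user names and the function names
are assumptions made for illustration, and this is not the authors'
algorithm.

```python
# Hypothetical generalization table. Only the first entry is from the post.
GENERALIZE = {
    "organic milk": "dairy product",
    "cheddar cheese": "dairy product",
    "greek yogurt": "dairy product",
}

def generalize(queries, table=GENERALIZE):
    """Replace each query with its broader term; leave unknown queries as-is."""
    return [table.get(q, q) for q in queries]

def generalize_log(log):
    """Map each user to the set of generalized terms they searched for."""
    return {user: frozenset(generalize(qs)) for user, qs in log.items()}

# Hypothetical search log
log = {
    "user_a": ["organic milk"],
    "user_b": ["cheddar cheese"],
    "user_c": ["bicycle repair"],
}

anonymized = generalize_log(log)
# user_a and user_b now have identical entries; user_c is still unique
print(anonymized["user_a"] == anonymized["user_b"])
```

After generalization, two users who searched for different dairy items
become indistinguishable, while a user with an unusual query still
stands out and would need further treatment.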
The problem becomes even more difficult when the data is in graph
form, as it might be for mobile phone records or web chat statistics.
So Nabar suggests a similar anonymizing technique: ensure that every
node on the graph shares some number of its neighbors with a
certain number of other nodes.
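One way to read that condition is sketched below in Python. The
parameter names min_shared and min_peers, the adjacency-set
representation and the sample graph are all assumptions; the paper's
exact definition may differ from this reading.

```python
def satisfies_neighbor_sharing(adj, min_shared, min_peers):
    """Check that every node shares at least min_shared neighbors
    with at least min_peers other nodes.

    adj maps each node to the set of its neighbors.
    """
    for u, nbrs in adj.items():
        peers = sum(
            1
            for v, other in adj.items()
            if v != u and len(nbrs & other) >= min_shared
        )
        if peers < min_peers:
            return False
    return True

# Hypothetical graph: a four-node cycle a-b-c-d-a
adj = {
    "a": {"b", "d"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"a", "c"},
}

# Opposite corners of the cycle have the same two neighbors
print(satisfies_neighbor_sharing(adj, min_shared=2, min_peers=1))
```

In the cycle, each node shares both of its neighbors with the node
opposite it, so the check passes for one peer but fails if two peers
are required.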
The trouble is that the anonymization technique can destroy the very
patterns that you are looking for in the data, for example in the
[5]way mobile phones are used. And at present, there's no way of
knowing what has been lost.
So what these guys need to do next is find some kind of measure of
the data loss that their proposed changes cause, to give us a sense of
how much damage is being done to the dataset during anonymization.
In the meantime, dataset owners should show some caution over how, why
and to whom they release their data.
Ref:
[6]arxiv.org/abs/0810.5582: Anonymizing Unstructured Data
[7]arxiv.org/abs/0810.5578: Anonymizing Graphs
References
1. http://arxivblog.com/
2. http://feeds.feedburner.com/~r/arXivblog/~3/444018253/
3. http://arxivblog.com/wp-content/uploads/2008/11/graph.jpg
4. http://arxivblog.com/?p=142
5. http://arxivblog.com/?p=88
6. http://arxiv.org/abs/0810.5582
7. http://arxiv.org/abs/0810.5578
8. http://feeds.feedburner.com/~a/arXivblog?a=hpfrHu
9. http://feeds.feedburner.com/~f/arXivblog?a=RCzNN
10. http://feeds.feedburner.com/~f/arXivblog?a=QPK0N
11. http://feeds.feedburner.com/~f/arXivblog?a=sa0en
12. http://feeds.feedburner.com/~f/arXivblog?a=J17iN
13. http://feeds.feedburner.com/~f/arXivblog?a=6potn
14. http://feeds.feedburner.com/~f/arXivblog?a=uqYGN
15. http://feeds.feedburner.com/~f/arXivblog?a=vse8n
16. http://feeds.feedburner.com/~f/arXivblog?a=V3xyN
17. http://arxivblog.com/
18. http://www.feedburner.com/fb/a/emailunsub?id=8632699&key=kesJ612ZsV
19. http://feeds.feedburner.com/arXivblog
20. http://feeds.feedburner.com/arXivblog
----- End forwarded message -----
--
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE