Statistical analysis of anonymous databases
I ran across an interesting problem on the STAT-L mailing list. I came up with an initial solution, but it didn't fully solve the problem. I will summarize:

In medical research (this particular application - there are others I am sure) it is desirable to have a large database of individual medical histories available to search for correlations, risk factors, etc. The problem, of course, is that many individuals want their medical histories kept private. It is therefore necessary to maintain a database that is not traceable back to individuals. An additional requirement is that people must be able to add information to their records as it becomes available. The researcher who initially posed the question suggested adding random data to "encrypt anonymity".

My first-cut solution was to hash the individual's name (perhaps including some other info or random info to thwart dictionary attacks) and send the records in under the hashed name. If done correctly, this should keep the name itself out of the database. The problem is that, with the volume of data available in a medical record, it is very probable that a person could still be tied to that record from its contents alone.

Does anyone have any insights into this problem?

<disclaimer> This is of purely academic interest to me; I don't know the person who asked the initial question (other than through email). It just sounds like a neat problem. </disclaimer>

Clay

---------------------------------------------------------------------------
Clay Olbon II            | Clay.Olbon@dynetics.com
Systems Engineer         | ph: (810) 589-9930 fax 9934
Dynetics, Inc., Ste 302  | http://www.msen.com/~olbon/olbon.html
550 Stephenson Hwy       | PGP262 public key: on web page
Troy, MI 48083-1109      | pgp print: B97397AD50233C77523FD058BD1BB7C0
                         TANSTAAFL
---------------------------------------------------------------------------
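The hashed-name idea in the post above can be sketched roughly as follows. This is only an illustration, not the poster's actual scheme: it swaps in HMAC-SHA256 as the keyed hash, and the names `make_salt`, `record_key`, `submit`, and the in-memory `database` dict are all hypothetical. The random salt plays the role of the "random info to thwart dictionary attacks"; the person has to keep it in order to add to the record later.

```python
import hashlib
import hmac
import secrets

# Hypothetical in-memory store: pseudonymous key -> list of record entries.
database: dict[str, list[dict]] = {}


def make_salt() -> bytes:
    """Random per-person secret (an assumption of this sketch).

    Without it, an attacker could hash a list of known names and
    compare the results against the keys in the database.
    """
    return secrets.token_bytes(16)


def record_key(name: str, salt: bytes) -> str:
    """Derive a stable pseudonymous key from a name and a salt."""
    # Normalize case and whitespace so the same person always maps
    # to the same key.
    normalized = " ".join(name.lower().split())
    return hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def submit(name: str, salt: bytes, entry: dict) -> str:
    """File an entry under the hashed name; the name itself is never stored."""
    key = record_key(name, salt)
    database.setdefault(key, []).append(entry)
    return key
```

Calling `submit` twice with the same name and salt appends to the same record, which covers the "add information later" requirement. It does nothing about the problem raised in the post: the contents of a detailed record may still identify the person.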
One solution to this is to have a database that 'generalizes' its answers as it provides them. For example, rather than returning

    Clay Olbon, 32, m, left handed, cholesterol 350, bp 200/160, 5'9", 175#

it would return:

    fooblat martin, 25-35, m, left handed, cholest. 3-400, 5.5-6ft, heavy

Researchers could then provide ranges to get answers. Thus, if I'm very concerned about the correlation between age and weight, I could get that information very specifically and nothing else. The generalization filter could be written to only allow N queries of a given level of detail, so that the more detail you wanted in one area, the more you give up in others. There could be a review committee (this is the way hospitals and medical research work) to review requests for more specific data.

Doctors like having names, so you could generate arbitrary names for patients, or use a syllable generator to come up with pronounceable nonsense.

Adam

Clay Olbon II wrote:
| In medical research (this particular application - there are others I am
| sure) it is desirable to have a large database of individual medical
| histories available to search for correlations, risk factors, etc. The
| problem, of course, is that many individuals want their medical histories
| kept private. It is therefore necessary to maintain a database that is not
| traceable back to individuals. An additional requirement is that people
| must be able to add additional information to their records as it becomes
| available. The researcher who initially posed the question suggested
| adding random data to "encrypt anonymity".
|

--
"It is seldom that liberty of any kind is lost all at once."
                                                       -Hume
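A generalization filter along the lines described above might look something like this. It is a minimal sketch under stated assumptions: the field names, the bucket widths in `COARSE` and `FINE`, the 0/1/2 detail levels, and the `budget` parameter are all invented for illustration and are not taken from the post. The budget is one simple way to make detail in one field cost detail in the others. `fake_name` stands in for the syllable generator.

```python
import random

# Assumed bucket widths per field: coarse (level 1) and fine (level 2).
COARSE = {"age": 20, "cholesterol": 100, "weight_lb": 50}
FINE = {"age": 5, "cholesterol": 25, "weight_lb": 10}


def bucket(value: float, width: float) -> str:
    """Replace an exact value with the range of the given width containing it."""
    low = int(value // width) * width
    return f"{low:g}-{low + width:g}"


def generalize(record: dict, detail: dict, budget: int = 3) -> dict:
    """Return ranges instead of exact values.

    `detail` maps field -> 0 (omit), 1 (coarse) or 2 (fine). The levels
    must sum to at most `budget`, so asking for fine detail on one
    field leaves less for the rest.
    """
    if sum(detail.values()) > budget:
        raise ValueError("detail request exceeds budget")
    out = {}
    for field, level in detail.items():
        if level == 0:
            continue
        width = FINE[field] if level == 2 else COARSE[field]
        out[field] = bucket(record[field], width)
    return out


def fake_name(rng: random.Random, syllables: int = 3) -> str:
    """Pronounceable nonsense name built from consonant-vowel syllables."""
    consonants, vowels = "bdfgklmnprstvz", "aeiou"
    return "".join(
        rng.choice(consonants) + rng.choice(vowels) for _ in range(syllables)
    ).capitalize()
```

For a record such as `{"age": 32, "cholesterol": 350, "weight_lb": 175}`, requesting `{"age": 2, "weight_lb": 1}` yields fine-grained age and coarse weight and nothing else; requesting fine detail on two fields at once exceeds the default budget and would be the kind of request left to a review committee.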
I would ask: is there any known medical gain that has resulted from such database correlation?

I do not accept a researcher's own statements as to the utility of the work (s)he's done with someone's funding. Seen too much of it at close quarters... Nor do I accept reports in the lay press - these are nothing more than regurgitated press releases from PR depts of institutions.
The evidence about the dangers of smoking is largely based on huge data sets where large amounts of information were gathered and sifted through to eliminate other correlations, until only cigarettes were left.

Adam

Alan Horowitz wrote:
| I would ask, is there any known medical gain that has resulted from
| such a database correlation.
|
| I do not accept a researcher's own statements as to the utility of the work
| (S)he's done with someone's funding. Seen too much of it at close
| quarters... Nor do I accept reports in the lay press - these are
| nothing more than regurgitated press releases from PR depts of institutions.
|

--
"It is seldom that liberty of any kind is lost all at once."
                                                       -Hume
In article <v01540b02add1fc6e4658@[193.239.225.200]>, Clay Olbon II <Clay.Olbon@dynetics.com> wrote:
In medical research (this particular application - there are others I am sure) it is desirable to have a large database of individual medical histories available to search for correlations, risk factors, etc. The problem, of course, is that many individuals want their medical histories kept private. It is therefore necessary to maintain a database that is not traceable back to individuals. An additional requirement is that people must be able to add additional information to their records as it becomes available.
How about a simple non-technical solution? Each patient picks a random pseudonym; the database is keyed off that pseudonym, and the person's True Name(tm) never appears in the database. Patients should remember their pseudonym (or write it down); then they can add information to the database. Ahh, anonymity. (Hey, I posted about something exportable-- that should fill my quota for the year. :-)
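For what it's worth, the pseudonym scheme above needs very little code. The following is a sketch only: the class `PseudonymousDB` and its methods are hypothetical, and it assumes the pseudonym is a random hex string drawn with Python's `secrets` module. In the scheme as described the patient picks the pseudonym, so in practice it would be generated on the patient's side rather than by the database.

```python
import secrets


class PseudonymousDB:
    """Records keyed only by a random pseudonym; no real name is ever stored."""

    def __init__(self) -> None:
        self._records: dict[str, list[dict]] = {}

    def enroll(self) -> str:
        """Create an empty record and return its pseudonym.

        The patient must remember this value (or write it down);
        nothing in the database links it back to them.
        """
        pseudonym = secrets.token_hex(16)
        self._records[pseudonym] = []
        return pseudonym

    def add(self, pseudonym: str, entry: dict) -> None:
        """Append new information to an existing record."""
        if pseudonym not in self._records:
            raise KeyError("unknown pseudonym")
        self._records[pseudonym].append(entry)

    def history(self, pseudonym: str) -> list[dict]:
        """Return a copy of everything filed under the pseudonym."""
        return list(self._records[pseudonym])
```

A patient would call `enroll()` once, keep the returned string, and pass it to `add()` whenever new information becomes available. A lost pseudonym cannot be recovered, which is the price of never storing the name.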
participants (4)
- Adam Shostack
- Alan Horowitz
- Clay.Olbon@dynetics.com
- daw@cs.berkeley.edu