
Statistical analysis of anonymous databases



I ran across an interesting problem on the STAT-L mailing list.  I came up
with an initial solution, but it didn't fully solve the problem.  I will
summarize:

In medical research (this particular application - there are others I am
sure) it is desirable to have a large database of individual medical
histories available to search for correlations, risk factors, etc.  The
problem, of course, is that many individuals want their medical histories
kept private.  It is therefore necessary to maintain a database that is not
traceable back to individuals.  A further requirement is that people
must be able to add new information to their own records as it becomes
available.  The researcher who initially posed the question suggested
adding random data to "encrypt anonymity".

My first-cut solution was to hash the individual's name (perhaps together
with some other info or a random value, to thwart dictionary attacks) and
send the records in under the hashed name.  If done correctly, this should
keep the name itself from being recovered from the record's label.  The
problem is that a medical record holds so much data that its contents
alone would very probably be enough to tie it back to a specific person.
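
To make the hashed-name idea and its weakness concrete, here is a minimal
Python sketch.  It is an illustration under assumptions that were not part
of the original question: the "random info" is taken to be a per-person
secret that the individual keeps, the hash is a keyed one (HMAC-SHA256)
rather than a bare hash, and the names pseudonym() and unique_fraction()
and the field names in the usage note are hypothetical.

```python
import hashlib
import hmac
import secrets
from collections import Counter


def make_secret() -> bytes:
    """Random per-person secret (assumption: the individual keeps it)."""
    return secrets.token_bytes(32)


def pseudonym(name: str, secret: bytes) -> str:
    """Derive a stable record label from a name and a secret.

    The same (name, secret) pair always yields the same label, so a
    person can add to their record later.  Without the secret, hashing
    a list of candidate names does not reproduce the label.
    """
    # Normalize case and whitespace so trivial variations still match.
    normalized = " ".join(name.lower().split())
    digest = hmac.new(secret, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def unique_fraction(records, fields):
    """Fraction of records whose values for `fields` occur only once.

    A rough measure of the remaining problem: a record that is unique
    on a few ordinary attributes can be matched to a person by anyone
    who knows those attributes, whatever label it is filed under.
    """
    combos = Counter(tuple(r[f] for f in fields) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[f] for f in fields)] == 1)
    return unique / len(records)
```

A submitter would call pseudonym("Jane Doe", secret) each time and file
the new data under the result.  Running unique_fraction(records,
["zip", "birth_year", "sex"]) over the stored records then gives a crude
indication of how many of them could still be singled out by their
contents alone.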

Does anyone have any insights into this problem?  <disclaimer> This is of
purely academic interest to me; I don't know the person who asked the
initial question (other than through email).  It just sounds like a neat
problem. </disclaimer>

        Clay






---------------------------------------------------------------------------
Clay Olbon II            | [email protected]
Systems Engineer         | ph: (810) 589-9930 fax 9934
Dynetics, Inc., Ste 302  | http://www.msen.com/~olbon/olbon.html
550 Stephenson Hwy       | PGP262 public key: on web page
Troy, MI 48083-1109      | pgp print: B97397AD50233C77523FD058BD1BB7C0
                     TANSTAAFL
---------------------------------------------------------------------------