Genetic information stored anonymously in databases does not necessarily stay anonymous, raising concern about how much privacy research participants can expect in the Internet age.

Tension has always existed between the need to share data to advance medical discovery and the wish of many people to keep their health information private. The rising use of genetic sequencing makes this balance even harder to strike, because genetic data reveal information about the individual as well as his or her family.

New research conducted by a team from Whitehead Institute and published in the journal Science has identified nearly 50 people who had given genetic material as participants in genomic studies, using only the Internet and publicly available online resources.

The research team set out to conduct an exercise in “vulnerability research”, a common practice in the field of information security. They showed that, under specific circumstances, the names and identities of genomic research participants can be determined even when their genetic information is held in a database in “de-identified” form.

Led by Whitehead Fellow Yaniv Erlich, the research team examined genetic markers known as short tandem repeats on the Y chromosome (Y-STRs) in men whose genetic material had been collected by the Center for the Study of Human Polymorphisms (CEPH). The participants’ genomes were sequenced and made available to the public in conjunction with the 1000 Genomes Project.

Because Y chromosomes are passed from father to son, as are family last names, there is a strong link between last names and the DNA on the Y chromosome.

Recognizing this relationship, genealogists and genetic genealogy companies have created publicly available databases that store Y-STR data by last name. Through a method called “surname inference”, the investigators were able to identify the family names of the men by submitting their Y-STRs to these databases.
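To make the idea of surname inference concrete, here is a minimal sketch in Python. It is not the researchers' actual method, which the article does not describe in detail. The database contents, surnames, repeat counts, the function names and the mismatch threshold are all hypothetical and chosen only for illustration. The sketch compares a query Y-STR profile against profiles filed under each surname and reports a surname only when one candidate is clearly the closest.

```python
from typing import Dict, List, Optional

Profile = Dict[str, int]  # Y-STR marker name -> number of repeats

# Hypothetical toy database: surname -> Y-STR profiles filed under it.
# All surnames and repeat counts are invented for illustration.
TOY_DATABASE: Dict[str, List[Profile]] = {
    "Smith": [{"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13, "DYS393": 13}],
    "Jones": [{"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS392": 11, "DYS393": 12}],
    "Brown": [{"DYS19": 14, "DYS390": 22, "DYS391": 10, "DYS392": 11, "DYS393": 14}],
}


def count_mismatches(query: Profile, record: Profile) -> int:
    """Count query markers that are missing from, or differ in, the record."""
    return sum(1 for marker, repeats in query.items()
               if record.get(marker) != repeats)


def infer_surname(query: Profile,
                  database: Dict[str, List[Profile]],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the closest surname, or None if the match is weak or ambiguous.

    max_mismatches is an assumed tolerance for small differences
    (for example, mutations between distant paternal relatives).
    """
    # Best (smallest) mismatch count seen for each surname.
    best = {
        surname: min(count_mismatches(query, record) for record in records)
        for surname, records in database.items()
        if records
    }
    if not best:
        return None

    ranked = sorted(best.items(), key=lambda item: item[1])
    top_surname, top_distance = ranked[0]

    if top_distance > max_mismatches:
        return None  # nothing in the database is close enough
    if len(ranked) > 1 and ranked[1][1] == top_distance:
        return None  # two surnames tie, so the inference is ambiguous
    return top_surname


# A profile that differs from the toy "Smith" record at one marker.
query_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS392": 13, "DYS393": 13}
print(infer_surname(query_profile, TOY_DATABASE))
```

Returning nothing for weak or tied matches is a deliberate choice in this sketch: a surname guess is only useful as a lead for further searching if it is reasonably unambiguous.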

Using the last names, the researchers searched other information sources including:

  • Obituaries
  • Genealogical websites
  • Internet search engines
  • Public demographic data from the National Institute of General Medical Sciences (NIGMS) Human Genetic Cell Repository at New Jersey’s Coriell Institute

They identified close to 50 women and men in the U.S. who were CEPH participants.

Earlier research had considered the possibility of genetic identification by matching the DNA of one person, assuming that person’s DNA was filed in two different databases. This work, however, exploits data shared between individuals who are distantly related through their paternal line.

As a result, the research team points out that the posting of genetic data from one individual can expose deep genealogical ties and end up identifying a distantly related person who may not know the person who originally released that genetic data.

Melissa Gymrek, a member of the Erlich lab and first author of the paper, said:

“We show that if, for example, your Uncle Dave submitted his DNA to a genetic genealogy database, you could be identified. In fact, even your fourth cousin Patrick, whom you’ve never met, could identify you if his DNA is in the database, as long as he is paternally related to you.”

Erlich and his team notified officials at the National Human Genome Research Institute (NHGRI) and NIGMS about their findings prior to the release of this paper. In response, NHGRI and NIGMS moved demographic information out of the publicly accessible portion of the repository to decrease the risk of future breaches.

Erlich concluded:

“Our aim is to better illuminate the current status of identifiability of genetic data. More knowledge empowers participants to weigh the risks and benefits and make more informed decisions when considering whether to share their own data. We also hope that this study will eventually result in better security algorithms, better policy guidelines, and better legislation to help mitigate some of the risks described.”

Written by Kelly Fitzgerald