Genetic data donated for research should remain anonymous. However, in the January 18 Science, researchers report that it is surprisingly easy to breach anonymity for some individuals in U.S. genetic research databases simply by using publicly available Internet resources. Led by Yaniv Erlich at the Whitehead Institute for Biomedical Research, Cambridge, Massachusetts, the authors entered Y chromosome sequences into genealogy databases and were able to infer the surnames of the genetic donors in about 12 percent of cases. That is because the Y chromosome and surnames are usually both handed down from father to son. By combining the inferred surnames with other information disclosed in the research and public databases, the authors pinpointed the identity of some male donors. The vulnerability arises from recreational genealogy data, not from the research data themselves.

What effect might this have on genetic research, and what can be done about it? Erlich does not believe people should stop sharing data. “Data sharing plays a crucial role in research,” he said. Instead, he suggests full disclosure of the risks of participating, combined with regulations that set clear guidelines for use of genetic data and penalties for misuse. In general, the commentators Alzforum spoke with agreed with these points.

Researchers have suspected for some time that genetic research data may be susceptible to breaches of privacy (see Lunshof et al., 2008; Gitschier, 2009). The number of public databases anyone can trawl for identifying information concerns some geneticists. For example, numerous recreational genealogy databases, such as Ysearch and Sorenson Molecular Genealogy Foundation (SMGF), pair Y chromosome sequences with surnames to aid people in identifying distant relatives. Erlich does not frown upon such services; he has donated his own DNA to such a database and has enjoyed using it to find cousins. However, he wondered if these resources could expose genetic donor identity. To test the accuracy of surname identification, first author Melissa Gymrek retrieved the Y chromosome data of about 900 men from YBase (now defunct), then entered the sequences in the other two databases. This process produced the correct surname about 12 percent of the time. That number would apply to upper- and middle-class U.S. Caucasians, the group that most frequently participates in genetic studies.
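
The surname lookup described above can be pictured with a small sketch. It is only an illustration, not the authors' method: it assumes Y chromosome data are stored as repeat counts at a handful of markers, and the records, the `infer_surname` function, and the one-mismatch threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical toy genealogy records: (surname, marker -> repeat count).
# The values are made up for illustration.
GENEALOGY_RECORDS = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 12}),
    ("Baker", {"DYS19": 16, "DYS390": 25, "DYS391": 11, "DYS393": 14}),
]


def marker_mismatches(query, record):
    """Count query markers that are missing from, or differ in, the record."""
    return sum(record.get(marker) != count for marker, count in query.items())


def infer_surname(query, records, max_mismatches=1):
    """Return the most common surname among close matches, or None.

    max_mismatches is an assumed tolerance, not a value from the study.
    """
    close = [
        surname
        for surname, haplotype in records
        if marker_mismatches(query, haplotype) <= max_mismatches
    ]
    if not close:
        return None
    return Counter(close).most_common(1)[0][0]


donor = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
print(infer_surname(donor, GENEALOGY_RECORDS))
```

A query that matches no record closely returns `None`, which mirrors the fact that most searches in the study did not yield a surname.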

Could the same type of search be used to identify the surnames of research participants? The National Center for Biotechnology Information (NCBI) archives contain a small number of genomes from named individuals. As a test case, the authors took the Y sequence data from geneticist Craig Venter, entered it in genetic genealogy databases, and retrieved his surname. Combining the name with his age and state of residence, a search of public records returned just two possible matches, one of whom was Venter. Age and state are not protected by privacy regulations laid out in the Health Insurance Portability and Accountability Act (HIPAA), and are often disclosed along with raw genetic data, Erlich told Alzforum.

But would the same search strategy work for unnamed individuals in databases? Performing simulations using U.S. Census data, the authors found that in most cases, the combination of surname, age, and state produces a list of only about a dozen candidates. They further looked at a Utah cohort that participates heavily in genealogy studies, and found they could fully identify half the men in one small group.
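
To see why surname, age, and state together narrow the field so sharply, consider a minimal sketch of the filtering step. The records, the `PublicRecord` type, and the `narrow_candidates` function are hypothetical stand-ins; the authors' simulations used U.S. Census data, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    """A hypothetical public-record entry."""
    full_name: str
    surname: str
    birth_year: int
    state: str


def narrow_candidates(records, surname, age, state, reference_year):
    """Keep records consistent with an inferred surname, an age, and a state."""
    # Someone of a given age during reference_year was born in one of two years.
    possible_birth_years = {reference_year - age, reference_year - age - 1}
    return [
        r
        for r in records
        if r.surname.lower() == surname.lower()
        and r.state == state
        and r.birth_year in possible_birth_years
    ]


# Made-up records for illustration only.
RECORDS = [
    PublicRecord("Alex Example", "Example", 1950, "UT"),
    PublicRecord("Blake Example", "Example", 1950, "CA"),
    PublicRecord("Casey Example", "Example", 1972, "UT"),
    PublicRecord("Drew Sample", "Sample", 1950, "UT"),
]

matches = narrow_candidates(RECORDS, "Example", 62, "UT", 2013)
print([r.full_name for r in matches])
```

Each attribute alone matches several records, but only one record satisfies all three, which is the intuition behind the short candidate lists the authors report.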

How likely are such privacy breaches to happen? At the moment, the risk may be small. Gerard Schellenberg at the University of Pennsylvania, Philadelphia, pointed out that most genetic research databases are “qualified access,” meaning that only researchers who have obtained approval from their institutional review boards can download the data. Users also sign agreements stipulating they will not share the data or use it to identify donors. Breaking this agreement could result in getting fired, or debarred by NIH, Schellenberg added. Some databases such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset have fewer restrictions, but still include a screening process, noted Robert C. Green at Brigham and Women’s Hospital, Boston, Massachusetts. He directs ADNI’s data committee. However, some publicly accessible databases already exist, such as the Personal Genome Project, and more such resources will probably become available in the future, suggesting this problem will grow, Green said.

What would be the consequences of having your genome revealed? Some laws to protect people from possible ill effects are already in place. The Genetic Information Nondiscrimination Act (GINA) forbids health insurers and employers from discriminating against people on the basis of their genetics, but it does not extend to life, disability, and long-term care insurance (see ARF related news story). The act represents a good beginning, but is not enough, Green said. Another possibility is that genetic data could be used to raise questions about someone’s health or mental status, for example, in a contentious divorce case, or to embarrass a public figure such as a presidential candidate (see Green and Annas, 2008). Erlich pointed out that genetic data could reveal paternity. Perhaps most fundamentally, it would allow other people to know details about someone’s medical status or family history that the person might prefer to keep private, Schellenberg told Alzforum. “The issue is invasion of privacy,” he noted.

The scientists agree that people should not stop sharing data. Research databases are crucial for understanding genetic diseases, Erlich believes. “In my lab, we found two genes for devastating neurological disorders, and the only way we could do that was by utilizing these public [research] databases,” he said. Likewise, trying to mask genetic data, for example, by removing Y chromosome sequences, to preserve anonymity would be impractical and would hamper research, Green told Alzforum.

Instead, the researchers suggest making sure that consent forms spell out the potential risks. “We need to alert [participants] to the possibility they could be identified. I don’t think this particular circumstance has been covered in consent forms,” Green said. (Most such forms do specify that anonymity cannot be guaranteed, however.) Such transparency would give people the ability to make an informed decision about donating DNA. “This situation may lead to people being more reluctant to participate, which could become a problem,” Lars Bertram at the Max Planck Institute for Molecular Genetics in Berlin, Germany, wrote to Alzforum. However, the actual risk of having genetic data exposed is probably small, Green pointed out. “We have to distinguish between theoretical dangers and the likelihood of real harm.”

In addition, society must make sure there are clear regulations and penalties for misuse of genetic data, Schellenberg said, expressing a common view. Erlich suggested that scientists encourage public discussion of the issue. “We need better policies and better legislation. We need to hear from patient advocacy groups, patients, legislators, ethicists, scientists, and together think about how we should move forward,” he said. “The pace of this field is so fast; we should start to plan ahead so that decades from now, genetic information will be safe.”––Madolyn Bowman Rogers


  1. This paper shows the astonishing (and sometimes frightening) power of combining today's genetic research with public Internet databases. Unfortunately, there is not a lot that can be done in terms of "additional privacy safeguards" other than not releasing chromosome Y data to begin with; the information is out there and continues to grow. This situation, clearly, may lead to people being more reluctant to participate in research, which could become a problem, as the authors acknowledge. It may be very difficult to resolve this situation to everyone's satisfaction. (Comment by Lars Bertram)


News Citations

  1. GINA No Genie for Alzheimer Disease Patients and Relatives

Paper Citations

  1. Lunshof et al. From genetic privacy to open consent. Nat Rev Genet. 2008 May;9(5):406-11. PubMed.
  2. Gitschier. Inferential genotyping of Y chromosomes in Latter-Day Saints founders and comparison to Utah samples in the HapMap project. Am J Hum Genet. 2009 Feb;84(2):251-8. PubMed.
  3. Green and Annas. The genetic privacy of presidential candidates. N Engl J Med. 2008 Nov 20;359(21):2192-3. PubMed.

Further Reading

Primary Papers

  1. Gymrek et al. Identifying personal genomes by surname inference. Science. 2013 Jan 18;339(6117):321-4. PubMed.