ESOF2014: Genetic Privacy in the Genomic Era?
(August 6th, 2014) Public databases are full of genomic data from mice and worms, but also from human patients. Jan Korbel and Ewan Birney share their opinions on anonymisation and on handling the new wave of big data.
Sequencing technologies keep improving, making it ever easier and cheaper to decipher whole genomes. Almost every day, researchers collect genomic data and share it with their colleagues or the public. This opens up new avenues in research but also bears risks for those who gave their DNA samples. Re-identifying the donor is almost always possible, so caution is needed. During ESOF2014, Lab Times met two experts willing to discuss the challenges and opportunities of big data.
Jan Korbel is a group leader at the Genome Biology Research Unit of the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. He is involved in the International Cancer Genome Consortium and is co-chair of the Structural Variation Analysis group in the 1000 Genomes Project, where he is analysing genomic structural variants in the germ line of several thousand individuals. Therefore, we thought he might be the right person to answer our questions about the challenges in handling anonymised genetic data and the differences between the genomes of healthy people and those of people suffering from cancer.
Lab Times: First of all, did you get your genome sequenced?
No, not yet. I think I will do it at some point in the future. So far, I've only sequenced genomes that we had funding for. And I cannot really apply for funding for my own genome.
LT: What genomes have you sequenced in your lab?
In our lab, we sequence only patient genomes because we focus on cancer research. But we have analysed genomes from many volunteers in the 1000 Genomes Project. In this case, we did not do the sequencing, only the bioinformatics to analyse the genome.
LT: What is the difference between sequencing cancer genomes and normal human genomes?
First of all, the approaches are slightly different. When we sequence a regular genome, we compare it to a standardised reference genome, which everyone uses and which gets updated from time to time. There, we record differences and get lots of variations: approximately one in a thousand bases in the genome is a single nucleotide polymorphism (SNP). When we are looking at cancer, the approach is different. We sequence the genome of a patient twice: once from the germ line and once from the tumor. Then we compare the two, looking for somatic mutations in the tumor, some of which are the mutations that have driven the development of the cancer. The number of mutations depends on the age of the patient and on the disease; tissues also have different mutation rates, and environmental influences are important as well.
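Editor's illustration: the germ line versus tumor comparison described in this answer can be pictured as a set difference. The sketch below is a deliberately simplified toy, not the pipeline used in Korbel's lab. The variant tuples, the coordinates and the function name `somatic_variants` are all invented for illustration, and real somatic calling also has to deal with sequencing errors, coverage and tumor purity.

```python
# Toy illustration only: a variant is a hypothetical
# (chromosome, position, reference base, alternative base) tuple.
Variant = tuple[str, int, str, str]


def somatic_variants(germline: set[Variant], tumor: set[Variant]) -> set[Variant]:
    """Return variants seen in the tumor sample but not in the germ line sample."""
    return tumor - germline


# Invented example data: two inherited variants, plus one found only in the tumor.
germline_calls = {("chr1", 1000, "A", "G"), ("chr2", 5000, "C", "T")}
tumor_calls = germline_calls | {("chr17", 7500, "C", "T")}

print(sorted(somatic_variants(germline_calls, tumor_calls)))
```

Variants inherited by the patient appear in both samples and cancel out, so only changes acquired in the tumor remain.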
LT: Regarding genetic privacy, how can you anonymise genomic data properly?
That is a good question. There are several ways to do this and we are still discussing which are the best. What we are currently doing is the following. There are different possible layers of anonymisation. In the 1000 Genomes Project, we are putting all the genomic data on the Internet and everybody can download the genomes of the individuals who volunteered for the project. There is still a level of anonymisation because we don’t know who these people are; not even age information is given, only gender and ethnic background. In the context of patient genomes, we are doing it differently: we don’t put the genetic data on the Internet but in a secure space. We call this controlled data access; we can still share the data with others but not through a click on the Internet. Still, the data as such is not encrypted, meaning that colleagues can see all the bases in the sequence but get no information on the patient’s identity. In this case, the anonymisation is so strict that even our group does not know which patient we are sequencing. For the future, we are thinking of encrypting genetic data. Then one would need a specific key to read the data. This might allow us to exchange information with others in an even more controlled manner, as we would know whom we had given the key to.
LT: Could you alter the sequence in such a way that re-identification is not possible?
The problem is that only a little information is needed to re-identify a person. There are so many variations in a genome that, if the SNPs are at the right frequency, very few of them are enough to identify a person. So, you would have to throw out almost everything, after which the sequence would not be useful any more. This is true for the germ line genome. If you have only the cancer genome, it is harder to identify a patient because somatic mutations are going to be specific to this particular cancer genome. So in principle, if we were to exchange only the somatic mutations found in a cancer genome, we should not have problems with anonymisation.
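Editor's illustration: a back-of-the-envelope calculation shows why so few SNPs can be enough. The sketch below rests on strong simplifying assumptions that are not from the interview: the SNPs are independent and biallelic, genotypes follow Hardy-Weinberg proportions, the people compared are unrelated, and the population size of 7 billion is a round figure chosen for the example. The function names are hypothetical.

```python
import math


def genotype_match_probability(allele_freq: float) -> float:
    """Chance that two unrelated people share a genotype at one biallelic SNP.

    Assumes Hardy-Weinberg genotype frequencies p^2, 2pq and q^2.
    """
    p, q = allele_freq, 1.0 - allele_freq
    return (p * p) ** 2 + (2 * p * q) ** 2 + (q * q) ** 2


def snps_needed(population: float, allele_freq: float) -> int:
    """Smallest number of independent SNPs for which the expected number of
    other people with an identical profile falls below one."""
    match = genotype_match_probability(allele_freq)
    return math.floor(math.log(population) / math.log(1.0 / match)) + 1


WORLD_POPULATION = 7e9  # assumed round figure, for illustration only

for freq in (0.5, 0.1):
    print(freq, snps_needed(WORLD_POPULATION, freq))
```

Under these assumptions, a couple of dozen common SNPs already single out one person among billions, while rarer variants are individually less informative. This is what "at the right frequency" amounts to in the toy model.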
LT: How was it in the beginning, when the first genomes were sequenced?
Of course, the standards have developed over time. But I think it was clear to most geneticists from the start that we have to be careful with this information. One of the first people to have his own genome sequenced for medical purposes was James Lupski from Baylor College of Medicine and, as I was told, he was required to demonstrate approval from his family members as well, because once you get sequenced, 50% of each first-degree relative’s genome is being sequenced, too. In my opinion, the safest way to publish genomes currently is via controlled access databases but we probably have to get a bit more relaxed about these things in the future.
Ewan Birney shares Korbel’s opinion. He argues that even your shoe-shopping behaviour reveals more about you, your preferences and things you might consider “private” than your genetic sequence does. Birney was involved in annotating the human genome and the genomes of several other organisms, such as mouse and chicken. He is one of the founders of the Ensembl genome browser and led the data analysis component of the ENCODE project. He also works on developing methods for using synthetic DNA to store data and tells us in the following interview why he is a big proponent of open data.
LT: Did you get your genome sequenced?
I have had it genotyped, which is like sequencing on the cheap. It was done by 23andMe. My wife, my father, my father-in-law and I all did it. I started, and it has become a little family project.
LT: You are a big proponent of open data. Do you think it should be open for everyone or just for science?
I think there is a lot of scientific data that is best made openly available for everyone. That includes things like the human reference genome, our understanding of human genes, but also things like the mouse genome, the rabbit genome and many, many other things. And that is what EMBL-EBI does: we provide these totally open datasets. For information that can easily be tied to an individual, these individuals very often sign consents. Sometimes they sign consents saying “I am happy for this information to go out on the Internet, totally open.” Then we comply with this consent and we release the data openly. But most clinical research datasets come with a more restricted consent.
LT: Should researchers also learn more about how to handle this data?
Yes, for sure. This decade is going to involve a huge shift towards a lot more computational work. I think most groups will have some dry and some wet biologists, maybe 50% dry and 50% wet. It is the combination that makes it work. I think that in the future, every lab will need to hire bioinformaticians and computational biologists.
LT: It is also quite complicated to deal with all these databases.
Well, I think it is not that complicated once you know how it works (laughs). This expertise needs to become part of the biological training.
LT: You are also working on DNA as storage for data. Do you think we will be using DNA as a storage device in, say, 50 or 100 years?
Yes, sooner than that. It is much more feasible than we originally thought. So yes, I think it is going to be an incredible technology.
Photos: www.esof2014.org, J. Korbel, E. Birney