When Big Data isn’t Big Enough
(May 30th, 2016) Each day, shedloads of genomics data are generated in labs around the world. DNAdigest wants to make it available to all researchers, free of charge.
During my first years at University, I was taught two lessons, amongst other things, that stayed with me throughout my studies: 1) Try to get as much data as possible, and 2) Science is a collaborative effort. Whereas lesson number one seems to stay with scientists as they progress through their careers, lesson number two proves harder to keep in mind. Now that we have entered the era of Big Data, it is especially important that the scientific community does not forget that second lesson. And that is exactly where data sharing initiatives prove their worth.
Earlier this year, PLoS Biology published an article presenting DNAdigest and its social enterprise spinout, Repositive. Whereas DNAdigest is a registered charity in the UK, working to engage the research community in exploring and solving the problem of human genomic data sharing, Repositive is a self-sustaining business whose mission is aligned with the social mission of the charity: “to facilitate efficient and ethical data sharing for genomics research”.
In short, this entails building an online platform that provides a single point of entry for searching public human genomic data repositories, free of charge. Why is this necessary? Simple: we tend to forget that sharing is essential for science to progress. And to be fair, we are also afraid of exposing potentially sensitive personal information to the world. In the PLoS Biology article, the authors make a clear argument as to why we should break down that barrier: it is against the data donors’ interests and expectations not to use their data in the best possible way within the given consent. Thanks to high-throughput methods, researchers are able to collect much more data than is needed for a specific experiment. As a result, a vast amount of data exists that needs to be stored and made available to be mined for other purposes. However, Repositive does not store the data itself and is therefore not a data repository as such.
Repositive is currently still in beta testing and hence welcomes user feedback. After signing in, you can search through 42,891 human genomic datasets from ten repositories. A very user-friendly portal lets you search for relevant datasets by keyword and filter by assay type, repository and accessibility, since not all the datasets are open access and some are therefore not immediately available.
A set of icons, shown at the bottom of each dataset, indicates features such as its access type, how many times it has been viewed and whether a discussion about it has started (the platform allows users to comment on the content and quality of datasets and to add descriptions). Users can also post a request for data in the hope that another user has some genomes on their hard drive. Of the seven current requests, however, only two have so far received a useful redirection to existing datasets and databases.
Repositive does not stand alone in its battle for data discoverability. It follows initiatives like UniProt (founded in 2002) for protein sequences, and OMIM (online since 1985) for human genes and genetic disorders. Despite the many calls for open data sharing across the sciences (e.g. plant science, wildlife population dynamics and neuroscience), the linking of data curation, data integration and data infrastructure is still in its infancy. Perhaps Repositive is another step towards effective and efficient data sharing.
“DNAdigest will give the individual contributors the power to share the information in their genes, and give the genomics researchers the power to understand,” writes the founder of DNAdigest, Fiona Nielsen, in a testimonial on the website. She seems to have understood that for science to progress, it is absolutely essential to combine the two fundamental lessons I learned at University.