How to Solve Data Merging Problems

(October 30th, 2015) As the bioinformatics field continues to grow and evolve, the amount of data being generated is increasing exponentially. A recent review discusses the problems with sharing and integrating this data, and possible solutions.

The field of bioinformatics is ever expanding. For example, 2005 saw the completion of the first Genome-Wide Association Study (GWAS), identifying a gene associated with age-related macular degeneration. This condition renders millions of people worldwide visually impaired or blind. Just ten years later, we have over 1,200 GWA studies and lots of data! Last month, a review resulting from a collaboration between the UK’s Wellcome Trust Genome Campus, The Genome Analysis Centre (TGAC), and universities in Rome and Corfu outlined the current problems and the steps that need to be taken to ensure that the huge amount of data being generated within bioinformatics is used to its full potential.

In particular, there’s a growing need to develop solutions for data integration – “the computational solution allowing users to fetch data from different sources, combine, manipulate and re-analyse them (...) and create new datasets and share them again”, as the authors define it. They suggest that experimental biologists need to have a greater input and work more closely with computer scientists and bioinformaticians. 

What hinders data integration? Vicky Schneider from The Genome Analysis Centre in Norwich and her co-authors illustrate several key problems. One of them, familiar to almost everyone, is gene naming. Even though gene nomenclature standards were issued in 1979, there’s still “an assortment of alternative names (…) used across the scientific literature and databases, posing a challenge to data integration”. Likewise, the naming of proteins is problematic, with the same protein being referred to by “a variety of names, synonyms and abbreviations”. This means we need to formulate standards and, more importantly, implement them. Examples of standards that have taken hold are the use of the letters T, G, A and C to denote the bases of DNA, and the use of single letters to identify amino acids.
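To make the naming problem concrete, here is a minimal Python sketch of how alternative gene names might be mapped to one agreed symbol before datasets are combined. It is not taken from the review: the alias table is a tiny hand-written sample, and the function names (build_lookup, normalise) are made up for illustration. A real project would draw its aliases from a curated nomenclature resource.

```python
# Illustrative alias table: agreed symbol -> alternative names.
# A tiny hand-written sample for demonstration, not a curated resource.
ALIASES = {
    "TP53": ["p53", "LFS1"],
    "BRCA1": ["RNF53"],
}


def build_lookup(aliases):
    """Build a case-insensitive map from every known name to its agreed symbol."""
    lookup = {}
    for symbol, names in aliases.items():
        for name in [symbol, *names]:
            lookup[name.upper()] = symbol
    return lookup


def normalise(name, lookup):
    """Return the agreed symbol for a name, or None if the name is unknown."""
    return lookup.get(name.strip().upper())


LOOKUP = build_lookup(ALIASES)

# Names as they might appear in two different datasets.
for raw in ["p53", " rnf53 ", "mystery1"]:
    print(repr(raw), "->", normalise(raw, LOOKUP))
```

Unknown names return None rather than being guessed, so they can be flagged for a human to check instead of being merged silently.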

Another problem is that different groups format their data in different ways. “Adoption of commonly agreed formats to represent them in computer readable files is nowadays of utter importance,” the authors say. They suggest that “having a minimum agreed set of fields” and maybe some “optional fields” could help facilitate data integration. On top of that, “the development of converters, translating different formats in a unified form, should be promoted as well”.
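As a rough illustration of what such a converter could look like, the Python sketch below translates records from two differently formatted sources into one unified record with a minimum set of required fields and one optional field. Everything in it is an assumption made for the example: the source names (lab_a, lab_b), the field names and the sample values are hypothetical and do not come from the review.

```python
# Hypothetical unified format: a minimum agreed set of fields plus an optional one.
REQUIRED = ("gene_symbol", "chromosome", "position")
OPTIONAL = ("comment",)

# Hypothetical per-source mappings: source field name -> unified field name.
FIELD_MAPS = {
    "lab_a": {"gene": "gene_symbol", "chrom": "chromosome", "pos": "position"},
    "lab_b": {
        "symbol": "gene_symbol",
        "chromosome": "chromosome",
        "position": "position",
        "notes": "comment",
    },
}


def to_unified(record, source):
    """Translate one source record into the unified format."""
    mapping = FIELD_MAPS[source]
    # Rename the fields we know about; anything unmapped is dropped.
    unified = {mapping[key]: value for key, value in record.items() if key in mapping}
    missing = [field for field in REQUIRED if field not in unified]
    if missing:
        raise ValueError(f"{source} record lacks required fields: {missing}")
    unified["position"] = int(unified["position"])  # store positions as integers
    for field in OPTIONAL:
        unified.setdefault(field, None)  # optional fields may be empty
    return unified


# Placeholder values, for demonstration only.
print(to_unified({"gene": "GENE1", "chrom": "1", "pos": "12345"}, "lab_a"))
print(to_unified({"symbol": "GENE1", "chromosome": "1", "position": 12345,
                  "notes": "replicate 2"}, "lab_b"))
```

Rejecting records that lack a required field, rather than filling in a default, keeps incomplete data from slipping unnoticed into a merged dataset.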

Senior author Vicky Schneider originally studied biology and obtained a PhD on the evolution of sex, before completing postdocs and obtaining an assistant professor position at the University of Bern. She is now the head of the Scientific Training, Education & Learning Programme (361°) division at TGAC (UK). TGAC was set up six years ago and receives strategic funding from the Biotechnology and Biological Sciences Research Council. 361° is used to “symbolise a cycle of monitoring, feedback and reflection we follow in every activity we run with the aim to improve it and make it at least 1° better next time around”.

Vicky “specialises in Bioinformatics training, particularly within the field of next generation sequencing, OMICs and bridging the gap between biologists and computational sciences”.  She became interested in the field of data integration “because it is a key and imperative issue in biological research (...) I daily see when interacting with biological researchers across all levels how we need to define a common nomenclature, set standards and ontologies and speak a common language that allows to store and search information efficiently and timely. We need to become curators and ensure metadata is incorporated early on rather than rely on ‘curators’ and annotation done later on (since it's not scalable). And in instances where we have to update information, we need better interaction and mechanisms to update and create the needed functionalities as the community of users evolves”.

Nicola Hunt

Photo: ariadnerb

Last Changes: 12.09.2015