Bench philosophy: Marrying Data with the Narrative
by Vijay Shankar Balakrishnan, Labtimes 04/2013
Semantic publishing is more than just putting a few web links into an article. Done properly, it enriches a paper with a plethora of valuable extra information.
Scientists are known for hunting and gathering information. In every discipline of science, we start from our sets of questions and look for the answers we expect, adopting customised experimental strategies and methods. On our way to the destination, we ‘simply’ generate an abundance of data, some of it packed with information and some of it left holding secrets for serendipity to unlock.
In the life sciences we have managed to generate wells of data for volumes of scholarly publications. What we have created has long since begun to drown us. But have we realised how much knowledge we have locked up in the data (Attwood et al. Biochemical Journal. 2009; 424:317-33)? What consequences do we face as a result? And what (r)evolution could that cause in the scholarly publishing ‘industry’?
Examples of semantic enhancements made to an original article by Reis et al. David Shotton and his colleagues from the Bioinformatics Research Group in Oxford, UK, added several features to make the data more accessible.
“It is quite depressing to think that we are spending millions in grants for people to perform experiments, produce new knowledge, hide this knowledge in an often badly written text and then spend some more millions trying to second guess what the authors really did and found.” With these eloquent words from the eminent European bioinformatician Amos Bairoch (Nature Proceedings, 2009), Lab Times peeks into the world of semantic publishing, to grasp the fate of hitherto ‘badly written text’ and the efforts that scientists and the publishing industry are making towards potentially ‘better text’ in the future.
Semantic publishing is one of the few established ways of dealing with ‘badly written text’, alongside boosting open access to the scientific literature. New discoveries become difficult to understand if non-specialists cannot tease the information out of the data. From a researcher’s point of view, semantic publishing looks like a promising solution, not only for generating new ideas but also for storing, managing and retrieving information from big data.
There are different kinds of data in a research article, from graphs to sequences and from schematics to molecular structures. In data-driven discovery, especially in the life sciences, we need more than one type of data to corroborate our hypotheses and to support a finding convincingly. For example, we may need data from microscopy, spectroscopy and sequencing techniques to connect the function of a gene sequence to its protein, and that in turn to its phenotype.
A research article is not much different from a database (Bourne, PLoS Computational Biology. 2005; 1(3):e34). If so, wouldn’t it be nice if all the data could be integrated with the text, semantically? Would we reject the idea if we could get related information on a protein structure from a remote database simply by clicking on its 2D image in an article? Would it not be encouraging to dive deep into a topic by reading many articles in parallel, or by accessing databases or simulation programmes, whether for an introduction or for reference? Answering these questions is what the flag bearers of semantic publishing are attempting to do (Berners-Lee et al. Science. 2006; 313(5788):769-71; Berners-Lee and Hendler, Nature. 2001; 410:1023-24).
Digital publishers of scholarly articles do not have to start from zero. They are already trying out different concepts and ideas. Digital publishing already brings a number of advantages, from distribution to the peer-review process, and saves a lot of time, energy, money and shelf space. But the savings only go so far. The challenge has simply shifted: publishers now need computer specialists to handle the digital identification of articles, so that they can be found easily on the Web. And that is not all: readers also face problems in filtering the information (Shotton. Learned Publishing. 2009; 22:85-94).
Even so, publishers are making varied, visible and practicable developments, ranging from the basic computational level to reader-targeted changes in how information is stored, accessed and assimilated on the Web. However, the computational scientists who work with the publishers on these changes are not entirely satisfied with the response from life scientists. What troubles them is the scientists’ slow or inappropriate response to new publishing experiments and their reluctance to change track, even though they are the ones who will benefit in the end (Pettifer et al. Insights. 2012; 25(3):288-293).
The initial challenges in semantic publishing concerned document formats. Most PDF articles, for all their polish, have a ‘frozen style’ that locks data and information away in discrete pieces. The journals then began publishing articles online in HTML or XML (HyperText Markup Language or eXtensible Markup Language) formats, which let readers access information on the Web easily (of course, with the usual caveats of paid or open access).
To make articles accessible in online digital libraries through specific URLs (Uniform Resource Locators), publishers assign them Uniform Resource Identifiers (URIs). These URIs form the central axis of semantic publishing. Digital Object Identifiers (DOIs) of articles are one class of URI, much like the ISBN code for books. But what about identifiers for, say, the nucleic acid or protein sequences mentioned in the articles?
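Before turning to that question, the article-level case can be sketched in a few lines of Python. The snippet below is a minimal illustration, not any publisher's actual tooling: it turns the various ways a DOI is commonly written into a single resolvable URI. The DOI used is a made-up placeholder, and the function name `doi_to_uri` is an assumption introduced for this example; only the public resolver address https://doi.org/ is real.

```python
from urllib.parse import quote

# Public DOI resolver; prepending it to a DOI gives a resolvable URI.
DOI_RESOLVER = "https://doi.org/"

# Spellings under which DOIs commonly appear in reference lists.
KNOWN_PREFIXES = ("doi:", "https://doi.org/", "http://dx.doi.org/")


def doi_to_uri(doi: str) -> str:
    """Normalise a DOI written in one of several styles into one URI."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with "10." and has a prefix/suffix pair.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


# Hypothetical DOI, for illustration only.
EXAMPLE_URI = doi_to_uri("doi:10.1234/example.2013.001")
```

Whichever spelling an author uses, the same article ends up with the same URI, which is what lets software treat two citations as pointing to one object.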
Well, this is the challenge for the computer scientists, they say (Hull et al. PLoS Computational Biology. 2008; 4(10):e1000204). The same entity carries different identifiers in different databases. Humans can, at the cost of some extra time and energy, spot the differences between identifiers and work out exactly what the scientists meant by them. But when computers have to take up this semantic task, computational biologists have to go to the trouble of writing a generic programme to automate the process.
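What such a programme has to do can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the approach of Hull et al.: the database names (`protdb`, `genedb`, `seqdb`), the accession numbers and the cross-reference table are all invented, and the helper names are hypothetical. The idea is simply to group identifiers that cross-reference each other, so that a computer can tell when two differently written identifiers mean the same thing.

```python
# Hypothetical cross-references between invented databases: each pair says
# "these two identifiers describe the same biological entity".
XREFS = [
    ("protdb:P0001", "genedb:G42"),
    ("genedb:G42", "seqdb:NM_0007"),
    ("protdb:P0099", "genedb:G7"),
]


def normalise(identifier: str) -> str:
    """Write identifiers uniformly: lower-case database, upper-case accession."""
    database, _, accession = identifier.strip().partition(":")
    return f"{database.lower()}:{accession.upper()}"


def build_index(xrefs):
    """Map every identifier to one representative of its group (union-find)."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # shorten the chain as we go
            item = parent[item]
        return item

    for left, right in xrefs:
        parent[find(normalise(left))] = find(normalise(right))
    return {item: find(item) for item in list(parent)}


def same_entity(index, first, second) -> bool:
    """True if both identifiers are known to refer to the same entity."""
    a, b = normalise(first), normalise(second)
    if a == b:
        return True
    return a in index and b in index and index[a] == index[b]


INDEX = build_index(XREFS)
LINKED = same_entity(INDEX, "ProtDB:p0001", "seqdb:nm_0007")
UNLINKED = same_entity(INDEX, "protdb:P0001", "protdb:P0099")
```

Here the protein and sequence identifiers are never linked directly, yet they end up in the same group through the shared gene identifier. The tedious part in practice is not this grouping but obtaining a trustworthy cross-reference table in the first place.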
Another problem is the habit of hiding big data away as ‘Supplementary Information’ and then explaining its hidden meaning at length in the text. To address this criticism, GigaScience now publishes the ‘big data’ itself. Despite this daring effort, the information is still not explicit in the data. However, to encourage semantic publishing, Elsevier, for instance, requires authors to reference the appropriate databases, where the data and metadata are maintained with unique identifiers. Several data-publishing and storage systems, such as Dryad and PANGAEA, have technically complemented such moves (Pettifer et al. Insights. 2012; 25(3):288-93). Improvements of this kind are especially well supported for articles published with, for example, microarray data.
If an article draws its conclusions from both sequencing and microscopy data, the Investigation-Study-Assay (ISA) system (http://isacommons.org), for instance, provides ways to record metadata so that data from these methods can be connected and analysed together. PLoS allows readers to download a publication in XML format. In addition, PLoS presents the sub-sections of its articles as scrollable links or tabs.
FEBS Letters, on the other hand, publishes Structured Digital Abstracts (SDAs) in XML format, appended to the conventional abstracts and currently limited to articles on protein-protein interactions. This is a community initiative with BioCreative II.5 and the curators of the MINT database for protein-protein interactions (Seringhaus and Gerstein, BMC Bioinformatics. 2007; 8:17; & http://www.febs.org/?id=601).
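To give a feel for why a structured abstract helps, here is a small sketch of how a programme might read one. The XML layout, element names and identifiers below are invented for illustration and are not the actual FEBS Letters or MINT format; the point is only that, once an interaction is written down in a structured form, a computer can extract it without having to interpret prose.

```python
import xml.etree.ElementTree as ET

# Invented structured abstract: NOT the real SDA schema, just a stand-in.
SDA_XML = """
<abstract>
  <interaction id="INT-0001" method="two hybrid">
    <participant db="protdb" acc="P0001">Protein A</participant>
    <participant db="protdb" acc="P0099">Protein B</participant>
  </interaction>
</abstract>
"""


def extract_interactions(xml_text: str) -> list:
    """Return one record per <interaction> element in the abstract."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for interaction in root.iter("interaction"):
        partners = [
            f"{p.get('db')}:{p.get('acc')}"
            for p in interaction.findall("participant")
        ]
        records.append(
            {
                "id": interaction.get("id"),
                "method": interaction.get("method"),
                "partners": partners,
            }
        )
    return records


INTERACTIONS = extract_interactions(SDA_XML)
```

The same facts buried in a sentence of running text would need text mining, with all its uncertainty, to recover.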
The Semantic Biochemical Journal is another prominent example of semantic publishing of scholarly articles. It was launched by Teresa Attwood and her colleagues from the University of Manchester, along with Portland Press (EMBNet.news. 2010; 15(4):3-6).
With publishers already on the field, it looks as though the team needs to get bigger, with editors and authors having to leave the pavilion and join the semantic publishing squad. For community activities like BioCreative to keep developing, the scholarly publishing industry needs the greatest possible intellectual contribution from the scientists.
After all, the issue is not just a language problem between humans and computers but also one among life scientists, computational biologists and publishers.
Last Changed: 04.07.2013