Bench philosophy: Biocuration
by Steven Buckingham, Labtimes 05/2012
Life scientists are producing huge amounts of data that have to be stored and maintained. Biocurators take care of the ever-growing data collections.
There was a time when biologists had to work hard to get Nature to yield her secrets, like drawing water from a deep well. Then everything changed. When genomics came along, it was as if someone had found how to run water through a pipeline. But if genomics put data on tap, the new biology – proteomics, reactomics and every other kind of omics – will burst the dam. Now we have so much raw data that we quite literally don’t know what to do with it. We used to boast about “data flows”; now we agonise over a “data deluge”. Like new species rising to fill a niche, a new kind of biologist is emerging: the biocurator. But is the emerging biocuration industry up to the job? Or are we in serious danger of a historic failure – drowning in data?
All of a sudden, biocuration is big business. There are annual conferences for biocurators, a PLoS biocurators collection, even an International Society for Biocuration (www.biocurator.org). Who knows, the person sitting next to you may even be one. The surge in interest in biocuration is a response to a problem we all know about. The facts speak for themselves: we are not doing all that well when it comes to keeping up with the data. Take, as an example, the EMBL nucleotide sequence database maintained by the European Bioinformatics Institute. The database is, quite literally, growing exponentially: it has taken only five years to multiply tenfold (source: www.ebi.ac.uk/ena/about/statistics).
Biology curators take care of huge species collections in museums or universities to make them accessible to scientists and to society. Much the same holds true for biocurators. Instead of keeping their collections in drawers and cabinets, however, they store them in data repositories.
Little surprise, then, that the rate of functional annotations is not managing to keep up. As Mr Micawber might say, “Annual incoming sequences, 30 million a year; functional annotations 29 million; result misery.” 29 million functional annotations a year, did I say? But even that rate of annotation is beyond our dreams. In fact, the reality of the situation is pretty worrying. According to the Elixir website (www.elixir-europe.org/news/how-fast-life-science-data-growing), “The storage capacity of computing hardware doubles approximately every 18 months. However, new biological data is doubling every nine months or so – and this rate is increasing dramatically.” Things have been made more challenging because of new deep-sequencing techniques. And even if keeping up with hardware is a problem, what about the annotation? It’s enough to send Mr Micawber into a spiralling depression.
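To make the arithmetic behind these figures concrete, here is a minimal sketch in Python. It assumes clean exponential growth, which real databases only approximate. The function names and the three-year horizon are illustrative choices, not anything taken from the Elixir or EBI sites.

```python
import math

def doubling_time_months(fold_increase, months):
    """Doubling time implied by growing `fold_increase`-fold over `months`,
    assuming steady exponential growth."""
    return months * math.log(2) / math.log(fold_increase)

def growth_factor(doubling_months, months):
    """How many times larger something becomes after `months`
    if it doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)

# Tenfold growth in five years (the EMBL figure quoted earlier).
embl_doubling = doubling_time_months(10, 60)

# The quoted rates: storage doubles every 18 months, data every 9 months.
horizon = 36  # months; an arbitrary horizon chosen for illustration
storage = growth_factor(18, horizon)
data = growth_factor(9, horizon)
gap = data / storage  # how far data has outrun storage over the horizon

print(f"Implied EMBL doubling time: {embl_doubling:.1f} months")
print(f"After {horizon} months: storage x{storage:.0f}, data x{data:.0f}, gap x{gap:.0f}")
```

Under these assumptions, a tenfold rise in five years works out to a doubling time of roughly 18 months. Over three years, data that doubles every nine months grows 16-fold while storage grows only fourfold.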
So, it’s a jolly good thing for us all that so many scientists are choosing a career in biocuration. But why? Haven’t they been called “museum cataloguers of the internet age”? Why would someone leave an exciting career at the bench for what many of us would consider a (dare I say it) rather boring job? Pascale Gaudet knows why. Gaudet is the scientific manager of the neXtProt database at the Swiss Institute of Bioinformatics, as well as Chair of the International Society for Biocuration. “The prestige of a ‘traditional’ career in research is certainly very high,” admits Gaudet, “and contributing processes, not to mention finding targets for treating diseases, is very motivating. But biocuration provides a very different reward.”
According to Gaudet, a biocurator is a gregarious animal. “Biocurators are usually part of a team (small or large), and the credit for the work accomplished gets shared among all members of the team,” says Gaudet, “but this is changing with journals now publishing papers from biocuration projects, in particular the DATABASE journal, published by Oxford University Press.” It is, however, more than just the warm glow of a team huddle that draws scientists into this young career. “Biocuration appeals to ‘generalists’, those who like to work with a wide range of scientific data. The pace is often intense, with so much literature and data to analyse. You need a good deal of analytical ability and there is the chance of contributing to new bioinformatic tools.”
But how do biocurators see themselves? According to the International Society for Biocuration, “Biocuration involves the translation and integration of information relevant to biology into a database or resource that enables integration of the scientific literature as well as large data sets.”
But the fact remains: an army of biocurators the size of Alexander’s wouldn’t be enough. In a Nature Precedings article, Amos Bairoch famously pronounced, “Nobody will ever be able to manually annotate all the macromolecular biological entities that exist on this planet.” So what are we going to do? Can computers do the job? To some extent they already are. A lot of the routine annotation jobs are already automated, and the unnoticed but vital efforts of database managers to harmonise data storage across different platforms have made this possible. But we clearly need more. The efforts of programmers and experts in natural language processing have yielded a steady growth in the power of automated data-mining programmes that extract information from published papers and link it to database entries. Take a look at Textpresso (www.textpresso.org/), for instance. But automated data mining simply isn’t good enough to be relied upon – not now and perhaps never. A human biocurator will still have to check the results.
Can the bench scientist do the job? You are certainly doing your part already – a lot of journals won’t accept your paper unless you link to any sequences you have submitted. We also provide some annotation in our submitted sequences. Now, I hate to say this – and please don’t take it the wrong way – but we are probably not helping that much. In his Nature Precedings article, Amos Bairoch reveals a nasty trade secret: “We often spend more time ‘de-annotating’ what people have reported then (sic) entering their data”.
What about community annotation? There is a lot of overlap between the databases and Wikipedia. Look up “nicotinic acetylcholine receptor” on Google and the Wikipedia entry appears at the top. Wikipedia’s page contains links that take you to the human genetics consortium database entries for nicotinic receptor subtypes. And increasingly, protein and nucleotide databases link out to Wikipedia entries. The NCBI database entries for proteins, for instance, take you to Wikipedia through the LinkOut service, and more recently we have seen the new iPhylo Linkout, which connects NCBI taxon information with corresponding entries in Wikipedia. But although a lot of hopes were once pinned on community annotation, specialised sites like Wikigenes have not lived up to the expectations fostered by Wikipedia.
One thing is clear. We need solutions to this problem if we want to avoid losing out on the full potential of all this data. We certainly need more funding: the money is there for the big databases focussing on model organisms but the biome is bigger than fruit flies and worms. There is an argument that journals need to be more assertive in requiring better annotation from authors. On the bright side, biocurators are getting better equipped. There is new software to make their job easier. Argo (www.nactem.ac.uk/Argo/) is a web-based system that allows biocurators to develop their own markup systems. It is freely available and can be used without subscription – try it out and get a flavour of what a biocurator’s life is like.
There is, however, something else to worry about. We all agree that the exciting stuff in biology is not the bits bodies are made of but how they work together. But until databases have been curated and made useable to the bioinformatically naive, they will never reach their potential. How accessible are these databases to the biologist with no bioinformatics training? Imagine you are a neuroscientist and you are interested in the hippocampus. You have been reading about nicotinic acetylcholine receptors and you have a hunch that the behaviour you have been studying has something to do with nicotinics in the dentate gyrus. You don’t know anything about bioinformatics – as far as you are concerned a SOAP service (http://en.wikipedia.org/wiki/SOAP) is something to do with personal hygiene.
Let’s start with neXtProt, where we can go “Exploring the universe of human proteins”. They have described themselves as the Google of protein database searches. To get the information I am after, I have to put “chrna*” in “name/identifier” and “hippocampus” in “expression”. Now, I am not trying to pick holes in neXtProt – far from it. To someone with even a little bioinformatic knowledge it is extremely useful and one of the best databases of its kind around. But I am setting a high goal here and we’re still nowhere near “Google for databases”.
Now go over to NIF (www.neuinfo.org/about/index.shtm) and search for “nicotinics hippocampus” and you’ll get an easily readable table of results harvested from a range of databases. The query engine does all the thinking for you. But here’s the point – on the right of the search box you’ll see an icon with “NLX” on it. That runs a search not by keywords but by concept. And that is only possible because of biocuration.
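The difference between searching by keyword and searching by concept can be sketched in a few lines of Python. This is a toy illustration only, not how NIF works internally: the concept table, the record titles and both function names are invented for the example. The concept table stands in for the curated vocabulary that a biocurator would maintain.

```python
# Toy concept table: each concept maps to terms a curator has linked to it.
# All entries are invented for illustration.
CONCEPTS = {
    "nicotinic acetylcholine receptor": {"nicotinic", "nicotinics", "nachr", "chrna7"},
    "hippocampus": {"hippocampus", "hippocampal", "dentate gyrus"},
}

# Invented record titles standing in for database entries.
RECORDS = [
    "CHRNA7 expression in the dentate gyrus",
    "nAChR subunits in hippocampal interneurons",
    "Nicotinics and the hippocampus: a review",
    "Muscarinic receptors in the cortex",
]

def keyword_search(query, records):
    """Return records containing every query word literally."""
    words = query.lower().split()
    return [r for r in records if all(w in r.lower() for w in words)]

def concept_search(query, records):
    """Return records matching every query word or any term linked to its concept."""
    groups = []
    for word in query.lower().split():
        # Use the curated term set if the word belongs to a concept,
        # otherwise fall back to the word itself.
        terms = next((t for t in CONCEPTS.values() if word in t), {word})
        groups.append(terms)
    return [
        r for r in records
        if all(any(term in r.lower() for term in group) for group in groups)
    ]

print(keyword_search("nicotinics hippocampus", RECORDS))
print(concept_search("nicotinics hippocampus", RECORDS))
```

With this toy data the keyword search finds only the one title that happens to use both words, while the concept search also picks up the entries filed under a gene symbol or a subregion. The extra hits come entirely from the hand-built term table, which is the part that needs a curator.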
Last Changed: 10.11.2012