Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188.
December 4, 2013
A while ago I posted "BHL to PDF workflow", a sketch of a workflow to generate clean, searchable PDFs from Biodiversity Heritage Library (BHL) content:
I've made some progress on putting this together, as well as expanded the goal somewhat. In fact, there are several goals:
For BioStor articles to be archived in PubMed Central they would need to be marked up using the Journal Archiving and Interchange Tag Suite (JATS, formerly the NLM DTDs). This is the markup used by many publishers, and also the tag suite that TaxPub builds upon.
The idea of having BioStor marked up in JATS is appealing, but on the face of it impossible because all we have is page scans and some pretty ropey OCR. But because the NLM has also been heavily involved in scanning the historical literature they are used to dealing with scanned literature, and JATS can accommodate articles ranging from scans to fully marked up text. For example, take a look at the article "Microsporidian encephalitis of farmed Atlantic salmon (Salmo salar) in British Columbia" which is in PubMed Central (PMC1687123). PMC has basic metadata for the article, scans of the pages, and two images extracted from those pages. This is pretty much what BioStor already has (minus the extracted images).
With this in mind, I dusted off some old code, put it together and created an example of the first baby steps towards BioStor and JATS. The code is in
November 21, 2013
Quick notes on yet another attempt to marry the task of editing a taxonomic classification with versioning it in GitHub.
The idea of dumping the whole GBIF classification into GitHub as a series of nested folders looks untenable. So, maybe there's another way to tackle the problem.
Let's imagine that we dump, say, the GBIF classification down to family-level as a series of nested folders (i.e., we recreate the classification on disk). For each family we then create a bunch of files and store them in that folder. For example, we could have the classification in Darwin Core Archive format (basically, delimited text). Let's also create a graph that corresponds to that classification, using a format for which we have tools available for visualising and editing.
For example, I've created a Graph Modelling Language (GML) file for the Pinnotheridae here. Using software such as yEd I can load this file, display it, and edit it. For example, below is a compact tree layout of the graph:
This image is a bitmap; if you opened the GML file in yEd it would be interactive, and you could zoom in, alter the layout, edit the graph, etc.
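To make the idea of pairing a classification with a graph file concrete, here is a minimal sketch of writing a parent-child classification out as GML. The numeric ids and the function name are invented for illustration; the names are the Pinnotheridae examples discussed in this post, and the output uses only the basic GML node/edge syntax rather than anything specific to yEd.

```python
def classification_to_gml(rows):
    """Turn (taxon_id, name, parent_id) rows into a GML graph.

    parent_id is None for the root. The ids are hypothetical local keys,
    not identifiers from GBIF or any nomenclator.
    """
    rows = list(rows)
    lines = ["graph [", "  directed 1"]
    # One node per taxon, labelled with its name.
    for taxon_id, name, _ in rows:
        label = name.replace('"', "'")
        lines.append(f'  node [ id {taxon_id} label "{label}" ]')
    # One edge from each parent to each of its children.
    for taxon_id, _, parent_id in rows:
        if parent_id is not None:
            lines.append(f"  edge [ source {parent_id} target {taxon_id} ]")
    lines.append("]")
    return "\n".join(lines)


rows = [
    (1, "Pinnotheridae", None),
    (2, "Glassella", 1),
    (3, "Glassellia", 1),
    (4, "Glassellia costaricana", 3),
]
gml = classification_to_gml(rows)
print(gml)
```

Going the other way (reading an edited GML file back into rows) would be the step that regenerates the delimited-text classification.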
Looking at the graph there are a few oddities, such as "orphan" genera that lack any species, and some names that appear very similar. For example, there is an orphan genus Glassella, and a similar genus Glassellia (note the "i") with a single species Glassellia costaricana. A little digging in BioNames shows that Glassellia is a misspelling of Glassella. The original description appears in:
E Campos, M K Wicksten (1997) A New Genus For The Central American Crab Pinnixa costaricana Wicksten, 1982 (Crustacea: Brachyura: Pinnotheridae). Proceedings of the Biological Society of Washington 110(1): 69–73. http://biostor.org/reference/81137
So, we have one genus that appears twice due to a typo. Furthermore, there are nodes in the graph for the taxa Glassellia costaricana and Pinnixa costaricana, but these are the same thing (the names are synonyms, albeit Glassellia costaricana has the genus misspelt). So, we could delete Pinnixa costaricana, delete the misspelling Glassellia, fix the misspelling in Glassellia costaricana, and move it to the correctly spelt Glassella. There are other problems with this classification, but let's leave them for the moment.
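The two kinds of oddity described here (genera with no species, and names that differ by a letter or so) are easy to look for automatically. The following is a rough sketch using Python's standard difflib; the function names and the 0.9 similarity threshold are arbitrary choices for illustration, not part of any existing tool.

```python
from difflib import SequenceMatcher
from itertools import combinations


def orphan_genera(genera, species):
    """Genera that are not the first word of any species name."""
    used = {name.split()[0] for name in species}
    return sorted(g for g in genera if g not in used)


def similar_names(names, threshold=0.9):
    """Pairs of distinct names whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


genera = ["Glassella", "Glassellia", "Pinnixa"]
species = ["Glassellia costaricana", "Pinnixa costaricana"]

print(orphan_genera(genera, species))
print(similar_names(genera))
```

A flagged pair is only a candidate for checking against the literature; similar spelling alone doesn't tell you which name is the typo.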
Now, imagine that after editing I use the graph to regenerate the DWCA file, which now has the edited classification. I then commit the changes to GitHub, and anyone else (including GBIF) could grab the DWCA and, for example, replace their Pinnotheridae classification with the edited version.
We could also go further, and add what I think is a missing component of the GBIF classification, namely a link to the nomenclators. For example, in an ideal world we would have each name in the classification linked to a stable identifier for the name provided by a nomenclator, and that nomenclator would know, for example, that Pinnixa costaricana and Glassella costaricana were objective synonyms. If we had those links then we could automatically detect cases such as this where logically you can have either Pinnixa costaricana or Glassella costaricana in the same classification, but not both.
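As a sketch of what that automatic check might look like, assume (hypothetically) that a nomenclator could hand us sets of names that are objective synonyms of one another. Any set with more than one member present in the classification is then a conflict. The data structure and function name below are invented for illustration.

```python
def conflicting_synonyms(classification_names, synonym_groups):
    """Return groups of objective synonyms that co-occur in a classification.

    synonym_groups is an assumed input: an iterable of sets of names that a
    nomenclator regards as objective synonyms of each other.
    """
    present = set(classification_names)
    conflicts = []
    for group in synonym_groups:
        found = sorted(present & set(group))
        if len(found) > 1:
            conflicts.append(found)
    return conflicts


synonym_groups = [{"Pinnixa costaricana", "Glassella costaricana"}]
names = ["Glassella", "Glassella costaricana", "Pinnixa", "Pinnixa costaricana"]

print(conflicting_synonyms(names, synonym_groups))
```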
There are some wrinkles to figure out, for example it would be nice to compute the difference between the original and edited graphs in terms of graph operations (not simply the difference as text files) so we could do things like list nodes that have been moved or deleted. I did some work on this a while back (Page, R. D., & Valiente, G. (2005). BMC Bioinformatics, 6(1), 208. doi:10.1186/1471-2105-6-208); something like that tool might do the trick.
There is an element here of trying to coerce a problem into a form that existing tools can solve, but in a way that's what makes it attractive. If we can use things that already exist then we can move from talking about it to actually doing it.
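To illustrate the difference between a text diff and a diff expressed as graph operations, here is a deliberately naive sketch. It is not the algorithm from the paper cited above: it assumes every node carries a stable id (made up here) that survives editing, and simply compares the two versions id by id.

```python
def diff_classifications(before, after):
    """Compare two classifications keyed by a stable node id.

    Each value is a (name, parent_id) tuple. The ids are hypothetical;
    without stable ids a rename would look like a delete plus an add.
    """
    shared = sorted(set(before) & set(after))
    return {
        "deleted": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "moved": [(n, before[n][1], after[n][1])
                  for n in shared if before[n][1] != after[n][1]],
        "renamed": [(n, before[n][0], after[n][0])
                    for n in shared if before[n][0] != after[n][0]],
    }


before = {
    1: ("Pinnotheridae", None),
    2: ("Glassella", 1),
    3: ("Glassellia", 1),
    4: ("Glassellia costaricana", 3),
    5: ("Pinnixa", 1),
    6: ("Pinnixa costaricana", 5),
}
# The edits proposed in the text: drop the misspelt genus and the synonym,
# then fix and move the remaining species.
after = {
    1: ("Pinnotheridae", None),
    2: ("Glassella", 1),
    4: ("Glassella costaricana", 2),
    5: ("Pinnixa", 1),
}

changes = diff_classifications(before, after)
print(changes)
```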
November 20, 2013
There is a fairly scathing editorial in Nature [The new zoo. (2013). Nature, 503(7476), 311–312. doi:10.1038/503311b ] that reacts to a recent paper by Dubois et al.:
Dubois, A., Crochet, P.-A., Dickinson, E. C., Nemésio, A., Aescht, E., Bauer, A. M., Blagoderov, V., et al. (2013). Nomenclatural and taxonomic problems related to the electronic publication of new nomina and nomenclatural acts in zoology, with brief comments on optical discs and on the situation in botany. Zootaxa, 3735(1), 1. doi:10.11646/zootaxa.3735.1.1
To quote the editorial:
...there might be more than a disinterested concern for scientific integrity at work here. A typical reader of the Zootaxa paper (not that there are typical readers of a 94-page work on the minutiae of nomenclature protocol) might reasonably conclude that the authors have axes to grind. Exhibits A–E: the high degree of autocitation in the Zootaxa paper; the admission that some of the authors were against the ICZN amendments; that they clearly feel that their opinions regarding the amendments have been disregarded; the ad hominem attacks on ‘wealthy’ publishers as opposed to straitened natural-history societies; and the use of emotive and occasionally intemperate language that one does not associate with the usually dry and legalistic tone of debate on this subject. (The online publisher BioMed Central, based in London, gets a particular pasting, to which it has responded; see http://blogs.biomedcentral.com/bmcblog/2013/11/15/the-devil-may-be-in-the-detail-but-the-longview-is-also-worth-a-look/.)
One of many recommendations made in the diatribe is that journals should routinely have on their review boards those expert in the business of nomenclature — in other words, a cadre of people who are, unlike ordinary mortals, qualified to interpret the mystic strictures of the code. A typical reader is again entitled to ask whom, apart from themselves, the authors think might be suitable candidates.
Ouch! But Dubois et al.'s paper pretty much deserves this reaction - it's a reactionary rant that is breathtaking in its lack of perspective. From the abstract:
As shown by several examples discussed here, an electronic document can be modified while keeping the same DOI and publication date, which is not compatible with the requirements of zoological nomenclature. Therefore, another system of registration of electronic documents as permanent and inalterable will have to be devised.
So, we have an identifier system for publications which currently has 63,793,212 registered DOIs (see CrossRef), includes key journals such as Zootaxa and ZooKeys, and which has tools to support versioning of papers (see CrossMark) but hey, let's have our own unique system. After all, zoological nomenclature is special, and our community has such a good track record of maintaining our own identifier system (LSIDs anyone?).
Now that the financial crisis faced by the ICZN has been averted (for three years at least) by a bail-out from the National University of Singapore, maybe the guardians of scientific names can focus on providing tools and services of value to the broader scientific community (or, indeed, taxonomists). As it stands, the ICZN can say little about the majority of animal names. Much better to focus on that than trying to rail against the practices of modern publishing.
November 11, 2013
Quick notes on taxonomic names (again). It's a continuing source of bafflement that the biodiversity community is making a dog's breakfast of names. It seems we are forever making it more complicated than it needs to be, forever minting new acronyms that pollute the landscape without actually contributing anything useful, and forever promising shiny new tools and services without ever actually delivering them. Meanwhile people and projects that build upon names are left to deal with a mess.
It seems to me that it would be nice if we had a single place to go to get definitive information on a name, and that place would give us a unique identifier that we could use in our own databases as a way to clean up and reconcile our data. For example, if we have a bibliographic database we can map citations to DOIs and then use those to identify the articles. If we have a list of journal names, we can map those to ISSNs and clean up our data. Likewise, if we have a classification such as GBIF or NCBI, we should be able to map the names in those classifications onto standard identifiers for taxonomic names.
The frustrating thing is we already have standard identifiers for taxonomic names. Since around 2005 we have been serving LSIDs for plant and animal names. We have Index Fungorum, IPNI, ION, and ZooBank, all serving LSIDs, all serving RDF, all using the same TDWG vocabulary.
The nomenclators vary in size and scope, but we have the three major, multicellular eukaryotes covered (circles proportional to number of names in each database):
There is some duplication, both within nomenclators (IPNI and ION I'm looking at you) and between nomenclators (ION and ZooBank have the same scope, although ZooBank is dwarfed by ION, anyone care to explain why we have both...?). All four databases are actively growing, partly through direct registration of new taxonomic names.
So, we're basically done, right? Surely all we need to do is harvest the LSIDs for all these names, put them into a single triple store, and wrap some basic services around them? If the nomenclators provide a list of recent changes (e.g., as an RSS feed) then we could continuously update the store with new names. Then any database or classification could reconcile its names with those in the nomenclators. They could also then augment their own records by making use of additional data the nomenclators have, such as objective synonymies and links to original descriptions. In other words, we could have a model like this:
Classifications represent a view of how taxa are related, the names associated with those taxa are stored in nomenclators. This means that classification databases like GBIF and NCBI are not in the business of managing names, they simply link to the nomenclators (in the same way that a bibliographic database can link to DOI, ISSNs, and author ids such as ORCID and VIAF).
We have almost all of this infrastructure in place already. In one of the unsung triumphs of TDWG we have all the nomenclators serving data in the same format using the same technology. And yet we have singularly failed to do anything useful with this extraordinary resource! Instead we seem more interested in contributing more projects to the acronym soup of biodiversity informatics. All around us projects to assign and link identifiers for publications (CrossRef), data (DataCite), and people (ORCID) are taking off. The infrastructure for taxonomic names has been in place since 2005, we could be doing the same sort of things CrossRef, DataCite and ORCID are doing in their domains. Why aren't we?
November 6, 2013
Here's another example of a Darwin Core Archive that is "broken" such that GBIF is missing some information. GBIF data set A checklist to the wasps of Peru (Hymenoptera, Aculeata) comes from Pensoft, and corresponds to the paper:
Rasmussen, C., & Asenjo, A. (2009). A checklist to the wasps of Peru (Hymenoptera, Aculeata). ZooKeys, 15(0). doi:10.3897/zookeys.15.196
As with the previous example GBIF says there are 0 georeferenced records in this dataset. This is odd, because the ZooKeys page for this article lists three supplementary files, including KML files for Google Earth. I've used one to create the image below:
So, clearly there is georeferenced data here. Looking at the Darwin Core Archive (which I've put on GitHub) there are a bunch of issues with this data. The occurrence.txt file has decimal latitude and longitude values with a comma rather than a decimal point, the file has some character encoding issues, and the columns with latitude and longitude data are labelled as "verbatim" fields not "decimal" fields. All of this means GBIF lacks all the point data for this dataset (over 2000 records). If we fix these problems, we get a map like this:
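Two of the fixes described here (decimal commas, and coordinates sitting in "verbatim" rather than "decimal" fields) can be sketched in a few lines. The field names are standard Darwin Core terms, but the function names and the sample record are invented for illustration, and this is not the actual clean-up applied to the dataset.

```python
def parse_coordinate(text):
    """Parse a coordinate that may use a decimal comma; None if not numeric."""
    cleaned = (text or "").strip().replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def fix_record(record):
    """Copy numeric verbatim coordinates into the decimal fields."""
    fixed = dict(record)
    lat = parse_coordinate(record.get("verbatimLatitude"))
    lon = parse_coordinate(record.get("verbatimLongitude"))
    # Only fill the decimal fields when both values are plausible.
    if lat is not None and lon is not None and -90 <= lat <= 90 and -180 <= lon <= 180:
        fixed["decimalLatitude"] = lat
        fixed["decimalLongitude"] = lon
    return fixed


# Invented example record with decimal commas.
record = {"verbatimLatitude": "-12,0464", "verbatimLongitude": "-77,0428"}
fixed = fix_record(record)
print(fixed)
```

Character encoding problems need separate handling and aren't addressed by this sketch.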
This illustrates one problem with publishing data, namely the data is rarely checked in the same way a manuscript is. Peer-review of data is a phrase that always struck me as odd, because you only get to be able to evaluate a data set by using it. In other words, data almost demands post- rather than pre-publication review. It's only when people start trying to use the data that problems emerge.
At the same time, we could improve checking of data prior to publication. In the case of the Darwin Core Archives I've looked at so far, it would be easier to find the problems if we had a simple tool that could take a Darwin Core Archive, extract the information and display it in various ways. If, for example, we have georeferenced records but we don't get a map, we would immediately wonder why that was, and figure out what the problem was. At the moment it seems easy to send data to GBIF, thinking you are contributing important information, whereas in fact that information never makes it onto a GBIF map.
Following on from Annotating and cleaning GBIF data: Darwin Core Archive, GitHub, ORCID, and DataCite here's a quick and dirty example of using GitHub to help clean up a Darwin Core Archive.
The dataset 3i - Cicadellinae Database has 2,152 species and 4,749 taxa, but GBIF says it has no georeferenced data. As a result, the map for this dataset looks like this:
I downloaded the Darwin Core Archive and was puzzled because the occurrence.txt file contained in the archive has latitude and longitude pairs for some of the records. How come there is no map? After a bit of fussing I discovered that the meta.xml file that describes the data is broken. It lists a column which doesn't appear in the data file, so everything after that column gets shifted along and hence the column headings for latitude and longitude are out of alignment with the data.
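This kind of mismatch can be caught mechanically by comparing the column indexes declared in meta.xml with the number of columns actually present in the data file. Below is a minimal sketch using only the standard library; the sample meta.xml and data row are invented (they are not the contents of the archive discussed here), and the check covers only the core file, not extensions.

```python
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core Archive meta.xml descriptors.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def declared_indexes(meta_xml):
    """Column indexes declared by <id> and <field> elements of the core file."""
    core = ET.fromstring(meta_xml).find("dwc:core", NS)
    return sorted(int(e.get("index")) for e in core if e.get("index") is not None)


def check_alignment(meta_xml, first_data_line, delimiter="\t"):
    """Report declared indexes that fall outside the columns in the data."""
    n_columns = len(first_data_line.rstrip("\n").split(delimiter))
    indexes = declared_indexes(meta_xml)
    return {
        "declared": len(indexes),
        "columns": n_columns,
        "out_of_range": [i for i in indexes if i >= n_columns],
    }


# Invented descriptor: it declares a locality column the data row lacks.
META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/locality"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
  </core>
</archive>"""
ROW = "1\tExample name\t-12.05\t-77.04"

report = check_alignment(META, ROW)
print(report)
```

A count mismatch only says that something is wrong; working out which declared column is the spurious one still means looking at the data.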
So, I loaded the Darwin Core Archive into GitHub (you can see it here), then fixed the error, and then for fun extracted the latitude and longitude pairs as a GeoJSON file. GitHub can display this on a map:
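Extracting the points as GeoJSON is straightforward once the columns line up. Here is a minimal sketch, assuming a tab-delimited file with a header row that uses the Darwin Core names decimalLatitude and decimalLongitude; the sample rows are invented. Note that GeoJSON puts longitude before latitude.

```python
import csv
import io
import json


def occurrences_to_geojson(tsv_text):
    """Build a GeoJSON FeatureCollection from tab-delimited occurrence rows."""
    features = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            lat = float(row["decimalLatitude"])
            lon = float(row["decimalLongitude"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without usable coordinates
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"scientificName": row.get("scientificName", "")},
        })
    return {"type": "FeatureCollection", "features": features}


# Invented sample data: the second row has no coordinates.
sample = (
    "scientificName\tdecimalLatitude\tdecimalLongitude\n"
    "Example species A\t-12.05\t-77.04\n"
    "Example species B\t\t\n"
)
geojson = occurrences_to_geojson(sample)
print(json.dumps(geojson, indent=2))
```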
Note that we now have a fairly extensive set of georeferenced data points for these insects, and this data hasn't made it onto a GBIF map because of a simple error in the metadata. I keep finding cases like this, which suggests that GBIF has more georeferenced data than it realises.
November 2, 2013
I have a love/hate relationship with the Catalogue of Life (CoL). On the one hand, it's an impressive achievement to have persuaded taxonomists to share names, and to bring those names together in one place. I suspect that Frank Bisby would feel that the social infrastructure he created is his lasting legacy. On the other hand, the social infrastructure is arguably more impressive than the informatics infrastructure; in particular, the Catalogue has consistently failed to support globally unique identifiers for its taxa.
If you visit the CoL web pages you will see Life Science Identifiers (LSIDs) for taxa, such as urn:lsid:catalogueoflife.org:taxon:d242422d-2dc5-11e0-98c6-2ce70255a436:col20130401 for the African elephant Loxodonta africana. The rationale for using LSIDs in CoL is explained in the following paper:
Jones, A. C., White, R. J., & Orme, E. R. (2011). Identifying and relating biological concepts in the Catalogue of Life. Journal of Biomedical Semantics, 2(1), 7. doi:10.1186/2041-1480-2-7
This paper describes the implementation in great detail, but this is all for nought as CoL LSIDs don't resolve. In fact, as far as I'm aware, CoL LSIDs have not resolved since 2009. Here is a major biodiversity informatics project that seems incapable of running a LSID service. These LSIDs are appearing in other projects (e.g., Darwin Core Archives harvested by GBIF), but they are non-functioning. Anyone using these LSIDs on the assumption that they are resolvable (or, indeed, that CoL cared enough about them to ensure they were resolvable) is sadly mistaken.
Jones et al. list some projects that use CoL LSIDs, including the Atlas of Living Australia (ALA). While I have seen CoL LSIDs used by ALA in the past, it now seems that they've abandoned them. Resolving a LSID such as urn:lsid:biodiversity.org.au:afd.name:433239 (Dromaius novaehollandiae) (using, say the TDWG resolver) we see the following LSID: urn:lsid:biodiversity.org.au:col.name:6847559. This corresponds to the record for Dromaius novaehollandiae for the 2011 edition of the Catalogue of Life. ALA have constructed their own LSID using an internal identifier from CoL. This is the very situation working CoL LSIDs should have made unnecessary. As Jones et al. note:
Prior to the introduction of LSIDs, the CoL was criticized for using identifiers which changed from year to year. The internal identifiers have never been intended to be used in other systems linking to the CoL, of course, but this criticism draws attention to the demand for persistent identifiers that are designed for use by other systems. The CoL still does not guarantee to maintain the same internal identifiers, because there appears to be no need to insist on this as a requirement, but it does now provide persistent globally unique, publicly available identifiers.
That would be fine if, in fact, the identifiers were persistent. But they aren't. Because CoL have been either unable or unwilling to support their own LSIDs, ALA has had to program around that by minting their own LSIDs for CoL content! Note that these ALA LSIDs are tied to a specific version of CoL. Record 6847559 exists in the 2011 edition (http://www.catalogueoflife.org/annual-checklist/2011/details/species/id/6847559) but not the latest (2013), where Dromaius novaehollandiae is now http://www.catalogueoflife.org/annual-checklist/2013/details/species/id/11908940.
One of features of LSIDs that has caused the most heartache is versioning. Just because this feature is there doesn't mean it is necessary to use it, and yet some LSID providers insist on versioning every LSID. CoL is such an example, so with every release the LSID for every taxon changes. In my opinion, versioning is one of the most discussed and most over-rated features of any identifier. Most people, I suspect, don't want a version, they want the latest version. They want to be able to have links that will always get them to the current version. This is how Wikipedia works, this is how DOIs work (see CrossMark). In both cases you can see the existence of other versions, and go to them if needed. But by putting versions front and centre, and by not enabling the user to simply link to the latest version, CoL have made things more complicated than they need to be.
Changing LSIDs
It needs to be understood that in relation to concepts the Catalogue is intentionally not stable, so if a client is wishing to link to a name, not a concept, the client should use any LSID available for the name (or just the name itself), not a CoL-supplied taxon LSID. It should also be noted that it is intended that deprecated concepts will be accessible via their LSIDs in perpetuity, and the metadata retrieved will include information about the concepts’ relationships to relevant current concepts (such as inclusion, etc.). - Jones et al. p. 14
Leaving aside the fact that CoL clearly has a different notion of "perpetuity" to the rest of us, the notion that identifiers change when content changes is potentially problematic. If a taxonomic concept changes CoL will mint a new LSID. While I understand the logic, imagine if other databases did this. Imagine if the NCBI decided that because the African elephant was two species instead of one (see doi:10.1126/science.1059936), they should change the NCBI tax_id of Loxodonta africana (tax_id 9785, first used in 1993) because our notion of what "Loxodonta africana" meant has now changed. Imagine the chaos this could cause downstream to all the databases that build upon the NCBI taxonomy, which would now link to an identifier the NCBI had dropped. Instead, NCBI simply added a new identifier for Loxodonta cyclotis. Yes, this means the notion of "Loxodonta africana" may now be ambiguous (if it was sequenced before 2001, did the authors sequence Loxodonta africana or Loxodonta cyclotis?), but given the choice I suspect most could live with that ambiguity (as opposed to rebuilding databases).
But, even if we accept CoL's approach of changing LSIDs if the concept changes, surely concepts that don't change should always have the same LSID (except for changes in the version at the end)? Turns out, this is not always the case. For example, here are the CoL LSIDs for Loxodonta africana from 2008 to 2013:
The core part of the LSID (the UUID highlighted in bold) has changed twice. But in each release of these versions of CoL there have only been two species of Loxodonta, L. africana and L. cyclotis. How is the 2008 concept of Loxodonta africana different from the 2010, or the 2011 concept?
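Comparing the "core" of two LSIDs while ignoring the version is simple to do in code, which is part of why a changing core is so frustrating. Here is a minimal sketch that splits an LSID into authority, namespace, object and optional revision, following the general urn:lsid:authority:namespace:object[:revision] layout. The function names are my own, and the parser is intentionally simplistic (it does not validate characters or handle escaping).

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID into its parts; the revision is None when absent."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


def same_object(a, b):
    """True when two LSIDs agree on everything except the revision."""
    return parse_lsid(a)[:3] == parse_lsid(b)[:3]


col = "urn:lsid:catalogueoflife.org:taxon:d242422d-2dc5-11e0-98c6-2ce70255a436:col20130401"
ala = "urn:lsid:biodiversity.org.au:afd.name:433239"

print(parse_lsid(col))
print(parse_lsid(ala))
```

With a helper like same_object, a client could treat every annual release of an unchanged concept as the same thing, provided the core really did stay the same.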
As we start to tackle issues such as data quality and annotation, having persistent, resolvable, globally unique identifiers will matter more than ever. Shared identifiers are the glue that helps us bind diverse data together. The tragedy of LSIDs is that they could have been this glue if our community had chosen to invest even a fraction of the effort CrossRef invested in DOIs. Unfortunately we are now left with web sites and databases littered with LSIDs that simply don't work (CoL is not the only offender in this regard).
Resolvable identifiers mean we can actually get information about the things identified, as well as serving as a litmus test of the credibility of a resource (if I give you a URL and the URL doesn't work, you may doubt the value of the information on the end of that link). In a networked world, the trustworthiness of a resource is closely bound to its ability to maintain identifiers. The Catalogue of Life fails this test.
November 1, 2013
This is a quick sketch of a way to combine existing tools to help clean and annotate data in GBIF, particularly (but not exclusively) occurrence data.
The data provider puts a Darwin Core Archive (expanded, not zipped) into a GitHub repository. GBIF forks the repository, cleans the data, and uploads that to GBIF to populate the database behind the portal.
When GBIF first loads the repository it assigns it a DOI (using, say, DataCite). Actually we assign two DOIs, one for this version of the data (e.g., 10.1234/data.v1) and one for all versions of the data, say 10.1234/data. The data is considered to be published, authorship is determined by the provider, which may be an individual, a project, an institution, etc.
Big scale annotation and cleaning
Anyone familiar with GitHub can fork the repository of data and do their own cleaning (e.g., fixing dates, latitudes and longitudes, links to taxon names, etc.).
Small scale, casual annotation
Anyone visiting the GBIF portal and noticing an error (or something that they want to comment on) does so on the portal. Behind the scenes these comments are stored as issues on the GBIF repository in GitHub. To do this GBIF can either (a) enable users with an existing GitHub account to link that to their GBIF user account, or (b) create a GitHub account for the user. The user need not actually interact directly with GitHub (a similar approach is described by Mark Holder for the social curation of phylogenetic studies).
This means all annotation, big or small, is in the open and on GitHub. There is very little programming to do, GBIF simply talks to GitHub using GitHub's API. GBIF could display known "issues" for a dataset, so portal users immediately know if any data has been flagged as problematic.
All the annotations belong to the "community", in the sense that each annotation is linked to a GitHub user (even if the user might not ever actually go to GitHub). This also means that the provider can, at any point, pull in those annotations so they can update their own data (and hence gain direct benefit from exposing it in the first place).
When GBIF decides that enough annotations have been made and resolved, the latest version of the repository is loaded into GBIF and gets a new DOI (e.g., 10.1234/data.v2). This means an analysis based on that version is citable. We add a link to the overall DOI so someone who doesn't care about versions can still cite the data.
Authorship and credit
Now we come to the fun part. The revision will include the input from a bunch of people. This will be recorded on GitHub, but that will only mean something to the handful of geeks who think GitHub is awesome. But, let's imagine that we do the following:
This approach has a number of benefits:
There are a couple of potential issues. Darwin Core Archive data files can be large, and GitHub can be less effective with large files (although it is ideally suited to the delimited-text files that Darwin Core Archive uses, see Git (and Github) for Data). One approach is to impose a limit on the size of an individual "occurrence.txt" file in the archive, so we may have multiple files, none of which is too big. Another task will be linking issues to specific occurrences (if they concern just one occurrence), as the GitHub issues will be at the level of the complete file. This could be handled in a form-based interface on GBIF that sent the occurrenceID as part of the issue report.
The key point of this proposal is that everything is in place already to do this. The ducks are lining up, and serious, credible projects are handling the things we need (versioning, identifiers, credit). Sometimes the smart thing is to do nothing and wait until someone else solves the problems you face. I think the waiting may be over.
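As a rough sketch of how a portal form might turn a comment into a GitHub issue that carries the occurrenceID, the following builds (but does not send) a request to GitHub's REST endpoint for creating issues. The repository owner and name, the token, the label, and the layout of the issue body are all placeholders I've made up; how GBIF would actually structure this is an open question.

```python
import json
import urllib.request


def build_issue_request(owner, repo, token, occurrence_id, comment):
    """Prepare a POST to /repos/{owner}/{repo}/issues without sending it."""
    payload = {
        "title": f"Annotation for occurrence {occurrence_id}",
        # Put the occurrenceID on its own line so it is easy to parse later.
        "body": f"occurrenceID: {occurrence_id}\n\n{comment}",
        "labels": ["annotation"],  # hypothetical label
    }
    return urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; nothing here refers to a real repository or token.
request = build_issue_request(
    "example-org", "example-dataset", "TOKEN",
    "occ-123", "Latitude and longitude look swapped.",
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request (for example with urllib.request.urlopen) would need a real token with permission to create issues on the repository.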
October 15, 2013
I've recently been appointed Chair of the Science Committee of the Global Biodiversity Information Facility (GBIF) http://www.gbif.org . The committee is a small group of people with a range of backgrounds, and one of our roles is to advise GBIF on matters scientific (e.g., what kinds of data GBIF should collect?, what kinds of scientific questions should GBIF help answer?, etc.).
There have been formal surveys (see the papers in the journal "Biodiversity Informatics" https://journals.ku.edu/index.php/jbi/issue/view/370/showToc ), meetings, and a "vision" statement (the "Global Biodiversity Informatics Outlook", http://www.biodiversityinformatics.org/ ). But there's always the chance that these fora may miss some points of view, so I'm keen to get feedback on what sort of things GBIF could do to improve the way it can help people tackle the scientific questions they are interested in.
For example, is there some fundamental limitation that GBIF has that prevents it being useful to you? Is there some feature/data type/geographic coverage/etc. that could be addressed that would make it more useful? Is there a role that GBIF should take on that it hasn't done so? A useful analogy might be to think of the central role GenBank plays in genomics, both as a place to archive your data (sequences), a repository of other people's data that you can access, and a research tool (e.g., BLAST searches to locate similar sequences). Is that the sort of thing you'd want from GBIF, or is it something entirely different?
I'd welcome any comments, suggestions, views, etc. Feel free to add them as comments to this blog, or email me (rdmpage at gmail.com).
I should stress that this is simply me trying to calibrate my perception of GBIF's role with what others think. Also, note if you have specific comments on things such as the GBIF web site please use the feedback tab on the site (that way it will reach the people who can do something about it).
For those unfamiliar with GBIF, its mission "is to make the world's biodiversity data freely and openly available via the Internet". At present the bulk of the data are observations of organisms (mostly multicellular eukaryotes, i.e., animals, plants and fungi) based on either museum collections or observations of living organisms. You can get an idea of the kind of science that uses GBIF-hosted data from this list of papers on Mendeley http://www.mendeley.com/groups/1068301/gbif-public-library/
Updates
Based on responses so far I'll compile a list below of suggestions/themes.
October 8, 2013
One reason I was able to build BioNames is because a significant fraction of the taxonomic literature for animals is now online, either due to the efforts of the Biodiversity Heritage Library, digital archives, commercial publishers, or individual institutions and scientific societies. However there are still big gaps in literature availability. To get a sense of these gaps I've constructed a table listing all the journals in BioNames that have an ISSN, ordered by the number of articles in BioNames (i.e., mostly articles that publish new names). The full table is here, I've reproduced part of it below (limited to those journals with at least 500 articles in BioNames). If you click on the ISSN in the table you can go to the corresponding page in BioNames to get full details of what BioNames currently knows about that journal.
The journals in red are the ones with the worst online presence (see complete key below). Note that BioNames is still a work in progress so there will be some journals that are online but I've simply not had a chance to add them to BioNames. With that in mind, there are some striking gaps in the digital availability of taxonomic publications. Several Russian journals (collectively publishing thousands of articles) are not online (the story here is somewhat complicated because some Russian journals also have English-language translations available but these are mostly recent articles). A number of large entomological journals are not available (perhaps not surprising given that most described animal taxa are insects).
We can think of this as a "league table" of literature availability. My hope is that digitising projects such as the Biodiversity Heritage Library will look at this and use it to help prioritise which journals to scan. In particular, if the journal is not pre-1923 (and therefore out of US copyright) I hope BHL will then contact the journal's publisher and see if they would be willing to add their journal to those (such as Proceedings of the Biological Society of Washington) that have opened up their complete back catalogue to being scanned by BHL.
I also hope that scientific societies or organisations that publish journals in the "red" or "orange" zones will consider digitising their journals and making their contents accessible to the wider community. We are reaching the point where if knowledge is not online then it effectively doesn't exist.
Key (% of articles digitised):
> 90%: Almost all are available
< 90%: Most are available
< 50%: Limited availability
< 10%: Mostly inaccessible

ISSN (click for details) | Journal | Articles | Digitised | % digitised
1175-5326 | Zootaxa | 8581 | 8189 | 95
0374-5481 | The Annals and magazine of natural history | 4463 | 3502 | 78
1000-0739 | 880-01 Dong wu fen lei xue bao. Acta zootaxonomica Sinica | 3403 | 2450 | 72
0006-324X | Proceedings of the Biological Society of Washington | 3384 | 3263 | 96
0022-3360 | Journal of paleontology | 3373 | 3121 | 93
0037-928X | Bulletin de la Société entomologique de France | 3012 | 244 | 8
0013-8797 | Proceedings of the Entomological Society of Washington | 2972 | 2805 | 94
0044-5134 | Zoologicheskiĭ zhurnal | 2812 | 16 | 1
0044-5231 | Zoologischer Anzeiger | 2761 | 594 | 22
0022-3395 | The Journal of parasitology | 2353 | 2222 | 94
0008-347X | The Canadian entomologist | 2260 | 2059 | 91
0003-0082 | American Museum novitates | 1942 | 1814 | 93
0035-418X | Revue suisse de zoologie | 1851 | 1581 | 85
0022-2933 | Journal of natural history | 1848 | 1823 | 99
0367-1445 | Entomologicheskoe obozrenie | 1803 | 3 | 0
0096-3801 | Proceedings of the United States National Museum | 1722 | 1365 | 79
0013-872X | Entomological news | 1691 | 1619 | 96
0370-2774 | Proceedings of the Zoological Society of London | 1580 | 1008 | 64
1000-7482 | 880-01 Kun chong fen lei xue bao = Entomotaxonomia | 1518 | 1127 | 74
0037-9271 | Annales de la Société entomologique de France | 1497 | 757 | 51
0031-031X | 880-01 Paleontologicheskiĭ zhurnal | 1472 | 31 | 2
0013-8746 | Annals of the Entomological Society of America | 1441 | 1383 | 96
0035-1814 | Revue de zoologie et de botanique africaines | 1400 | 47 | 3
0031-0603 | The Pan-Pacific entomologist | 1389 | 56 | 4
0323-6145 | Berliner entomologische Zeitschrift / herausgegeben von dem Entomologischen Vereine in Berlin | 1342 | 710 | 53
1148-8425 | Bulletin du Muséum National d'Histoire Naturelle réunion mensuelle des naturalistes du Muséum | 1303 | 506 | 39
0013-8908 | The Entomologist's monthly magazine | 1268 | 6 | 0
0044-586X | Acarologia | 1226 | 87 | 7
0045-8511 | Copeia | 1191 | 1095 | 92
0031-0239 | Palaeontology | 1185 | 1154 | 97
0001-6616 | 880-03 Gu sheng wu xue bao = Acta palaeontologica Sinica | 1127 | 0 | 0
0165-5752 | Systematic parasitology | 1082 | 1028 | 95
0454-6296 | 880-01 Kun chong xue bao = Acta entomologica Sinica / Zhongguo kun chong xue hui bian ji | 1054 | 902 | 86
0024-0672 | Zoologische mededeelingen / uitgegeven vanwege 's Rijksmuseum van Natuurlijke Historie te Leiden | 1039 | 997 | 96
0370-047X | Proceedings of the Linnean Society of New South Wales | 1038 | 742 | 71
0030-5316 | Oriental insects | 1035 | 916 | 89
0028-7199 | Journal of the New York Entomological Society | 1013 | 860 | 85
0521-4726 | Annales historico-naturales Musei Nationalis Hungarici = Természettudományi Múzeum évkönyve | 1007 | 886 | 88
0070-7279 | Reichenbachia / Staatliches Museum für Tierkunde in Dresden | 951 | 2 | 0
0022-8567 | Journal of the Kansas Entomological Society | 945 | 906 | 96
0373-3491 | Bollettino della Società entomologica italiana | 940 | 14 | 1
0037-2102 | Senckenbergiana biologica | 939 | 11 | 1
0002-8320 | Transactions of the American Entomological Society | 923 | 796 | 86
0374-9797 | Nouvelle revue d'entomologie | 923 | 1 | 0
0774-2819 | Lambillionea | 918 | 0 | 0
0034-7108 | Revista Brasileira de biologia | 916 | 6 | 1
0007-1595 | Bulletin of the British Ornithologists' Club | 911 | 459 | 50
0013-8843 | Entomologische Zeitschrift | 881 | 4 | 0
0253-116X | Linzer biologische Beiträge / Oberösterreiches Landesmuseum | 876 | 503 | 57
0272-4634 | Journal of vertebrate paleontology | 869 | 864 | 99
1217-8837 | Acta zoologica Academiae Scientiarum Hungaricae | 868 | 134 | 15
0011-216X | Crustaceana | 865 | 865 | 100
0085-5626 | Revista brasileira de entomologia | 863 | 260 | 30
0365-4389 | Annali del Museo civico di storia naturale "Giacomo Doria." | 855 | 503 | 59
0097-3157 | Proceedings of the Academy of Natural Sciences of Philadelphia | 848 | 500 | 59
0010-065X | The Coleopterists' bulletin | 831 | 804 | 97
1313-2989 | ZooKeys | 827 | 827 | 100
0024-4082 | Zoological journal of the Linnean Society | 823 | 821 | 100
0008-4301 | Canadian journal of zoology | 817 | 803 | 98
0028-1344 | The Nautilus | 814 | 501 | 62
0040-7496 | Tijdschrift voor entomologie | 804 | 580 | 72
0375-0434 | Proceedings of the Royal Entomological Society of London. Series B, Taxonomy | 796 | 783 | 98
0033-2615 | Psyche | 796 | 709 | 89
0164-7954 | International journal of acarology | 787 | 786 | 100
0003-0090 | Bulletin of the American Museum of Natural History | 776 | 488 | 63
0037-962X | Bulletin de la Société zoologique de France | 765 | 228 | 30
0181-0863 | Revue française d'entomologie | 765 | 6 | 1
1562-0891 | Wiener Entomologische Zeitung | 752 | 573 | 76
1000-3118 | 880-01 Gu ji zhui dong wu xue bao | 743 | 4 | 1
0003-0023 | Transactions of the American Microscopical Society | 731 | 728 | 100
0075-6547 | Koleopterologische Rundschau / herausgegeben von der Zoologisch-Botanischen Gesellschaft gemeinsam mit der Forstlichen Bundesversuchsanstalt | 706 | 339 | 48
0286-9810 | 880-01 The entomological review of Japan = Konchūgaku hyōron | 704 | 98 | 14
0867-1710 | Genus | 690 | 2 | 0
0042-3580 | Venus : Japanese journal of malacology = Kairuigaku zasshi | 687 | 531 | 77
0067-1975 | Records of the Australian Museum | 679 | 629 | 93
0006-6982 | The Journal of the Bombay Natural History Society | 677 | 81 | 12
0320-9180 | Zoosystematica rossica | 676 | 6 | 1
0084-5604 | Vestnik zoologii / Akademii︠a︡ nauk Ukrainskoĭ SSR, Institut zoologii | 672 | 37 | 6
0387-5733 | Elytra | 666 | 108 | 16
0043-0439 | Journal of the Washington Academy of Sciences | 664 | 603 | 91
0003-4541 | Annales zoologici / Polska Akademia Nauk, Instytut Zoologiczny | 661 | 336 | 51
0016-6995 | Geobios | 659 | 475 | 72
0004-2110 | Arkiv för zoologi / utgivet af K. Svenska vetenskaps-akademien | 658 | 59 | 9
0035-8894 | Transactions of the Royal Entomological Society of London | 655 | 495 | 76
0915-5805 | Japanese journal of entomology | 645 | 620 | 96
0013-8878 | The Entomologist | 645 | 14 | 2
0031-1820 | Parasitology | 641 | 614 | 96
0007-4853 | Bulletin of entomological research | 633 | 611 | 97
0375-099X | Records of the Indian Museum a journal of Indian zoology ed. by the Director, Zoological Survey of India | 630 | 213 | 34
1326-6756 | Australian journal of entomology | 629 | 629 | 100
0018-8158 | Hydrobiologia | 627 | 627 | 100
0013-8770 | 880-02 Konchū = Kontyū | 625 | 616 | 99
0217-2445 | The Raffles bulletin of zoology | 622 | 571 | 92
0372-1426 | Transactions of the Royal Society of South Australia, Incorporated | 622 | 450 | 72
0079-8835 | Memoirs of the Queensland Museum | 620 | 373 | 60
0003-4150 | Annales de parasitologie humaine et comparée | 612 | 355 | 58
0018-0130 | Proceedings of the Helminthological Society of Washington | 604 | 588 | 97
0015-4040 | The Florida entomologist | 602 | 601 | 100
0077-7749 | Neues Jahrbuch für Geologie und Paläontologie. Abhandlungen | 602 | 146 | 24
1066-5234 | The journal of eukaryotic microbiology | 601 | 572 | 95
0031-0220 | Paläontologische Zeitschrift | 601 | 58 | 10
0567-7920 | Acta palaeontologica Polonica | 599 | 578 | 96
0032-3780 | Polskie pismo entomologiczne. Bulletin entomologique de Pologne | 590 | 28 | 5
0027-4100 | Bulletin of the Museum of Comparative Zoology at Harvard College | 581 | 444 | 76
0042-3211 | The Veliger | 578 | 274 | 47
0181-0626 | Bulletin du Muséum national d'histoire naturelle. Section A, Zoologie, biologie et écologie animales | 574 | 564 | 98
0068-547X | Proceedings of the California Academy of Sciences | 573 | 261 | 46
0035-6387 | Rivista di parassitologia | 566 | 2 | 0
0003-5092 | Annotationes zoologicae Japonenses / auspiciis Societatis Zoologicae Tokyonensis seriatim editae = Nihon dōbutsugaku ihō | 562 | 545 | 97
0036-7575 | Mitteilungen der Schweizerischen entomologischen Gesellschaft = Bulletin de la Société entomologique suisse | 562 | 3 | 1
0251-074X | Revue de zoologie africaine | 560 | 18 | 3
0373-9465 | Folia entomologica Hungarica = Rovartani közlemények | 555 | 6 | 1
0206-0477 | 880-01 Trudy Zoologicheskogo instituta = Travaux de l'Institut zoologique de l'Académie des sciences de l'URSS / Akademii︠a︡ nauk Soi︠u︡za Sovetskikh Sot︠s︡ialisticheskikh Respublik | 554 | 2 | 0
1445-5226 | Invertebrate systematics | 550 | 550 | 100
0026-2803 | Micropaleontology | 548 | 409 | 75
0307-6970 | Systematic entomology | 537 | 526 | 98
0020-1804 | Insecta matsumurana | 536 | 514 | 96
0278-0372 | Journal of crustacean biology : a quarterly of the Crustacean Society for the publication of research on any aspect of the biology of crustacea | 531 | 531 | 100
0165-0424 | Aquatic insects | 525 | 525 | 100
1051-8932 | Bulletin of the Brooklyn Entomological Society | 523 | 3 | 1
0013-8711 | Entomologica scandinavica | 522 | 513 | 98
0341-8391 | Spixiana | 515 | 465 | 90
0013-8789 | Journal of the Entomological Society of Southern Africa | 515 | 392 | 76
0018-0831 | Herpetologica | 514 | 472 | 92
0323-7087 | Zoologische Jahrbücher. Abteilung für Systematik, Geographie und Biologie der Tiere | 513 | 176 | 34
0007-4977 | Bulletin of marine science | 510 | 397 | 78
0250-4413 | Entomofauna | 500 | 387 | 77
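For anyone wanting to redo this for their own list of journals, the percentage column and the key can be derived from the two counts. This is only a sketch: the function names are mine, I'm assuming the percentages are rounded to the nearest whole number (which matches the rows above), and the key doesn't say which band exactly 90% belongs to, so here it is treated as "Most are available".

```python
def percent_digitised(articles: int, digitised: int) -> int:
    """Share of a journal's articles that are digitised, as a whole-number percentage."""
    if articles <= 0:
        raise ValueError("articles must be positive")
    return round(100 * digitised / articles)


def availability(percent: int) -> str:
    """Map a percentage onto the four bands of the key.

    Assumption: exactly 90% falls in "Most are available"; the key leaves this open.
    """
    if percent < 10:
        return "Mostly inaccessible"
    if percent < 50:
        return "Limited availability"
    if percent <= 90:
        return "Most are available"
    return "Almost all are available"


# Two rows from the table: Zootaxa and Zoologicheskii zhurnal
for journal, articles, digitised in [("Zootaxa", 8581, 8189),
                                     ("Zoologicheskii zhurnal", 2812, 16)]:
    pct = percent_digitised(articles, digitised)
    print(journal, pct, availability(pct))
```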
October 4, 2013
Wednesday saw the launch of the Global Biodiversity Informatics Outlook (GBIO), based in large part on the Global Biodiversity Informatics Conference (GBIC). The aim is to provide a framework for biodiversity informatics and its applications in the hope that the field will unite around a shared vision of where we are and what needs to be done next:
We invite funders, policymakers, researchers, information technology specialists, educators and the general public to unite around the framework detailed in the following pages. The rewards of coordinated action will be as exciting and significant as the great scientific collaborations to advance our understanding of space, the human genome and the fundamental particles of matter.
There is a web site http://www.biodiversityinformatics.org/ with more details and links to related resources, and an invitation to get involved (although there doesn't appear to be an online forum where people can comment).
October 3, 2013
NESCent, EOL, and BHL have put together a research sprint:
We invite participants for an event that will pioneer the mining of the Encyclopedia of Life (http://eol.org) and the Biodiversity Heritage Library (http://www.biodiversitylibrary.org) to address outstanding and novel questions about the ecology and evolution of biodiversity. We aim to identify questions and data for which biologists may lack informatics skills and resources to address or analyze successfully; and symmetrically, to guide informaticians to pressing ecological and evolutionary questions. We seek to make actual discoveries through joint activities and to test the “computability” of major biodiversity databases.
Since I won't be applying to participate I thought I'd sketch some possible ideas here.
Co-occurrence of taxon names as proxy for ecological associations
Some time ago I noted that if you build a "tag tree" for taxonomic names in a BHL document you can get some interesting patterns, such as the names of hosts and their parasites occurring together. For example, searching BioNames for the rodent genus Praomys turns up papers with fleas, lice and cestodes. This suggests ways to mine BHL for ecological association data. It could be done by looking for general patterns of co-occurrence, or perhaps in a more targeted fashion (e.g., find all pages that have mammal and insect names together). Perhaps we could develop weighting schemes based on taxonomy whereby the co-occurrence of taxonomically unrelated groups is flagged as possibly significant (at the same time we'd want to avoid false positives such as tables of contents and indices).
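As a rough sketch of the co-occurrence idea, suppose we already have the taxon names found on each page and a lookup from name to higher taxon. Both inputs, and the function name, are assumptions of mine, and the page contents below are invented for illustration. We can then count how often names from different groups share a page:

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(pages, group_of):
    """Count pages on which two names from different higher taxa appear together.

    pages: dict mapping a page id to the list of taxon names found on that page.
    group_of: dict mapping a taxon name to its higher taxon (e.g. class).
    Pairs in the same group, or with an unknown group, are ignored.
    """
    counts = Counter()
    for names in pages.values():
        # De-duplicate so a name repeated on a page is counted once
        for a, b in combinations(sorted(set(names)), 2):
            ga, gb = group_of.get(a), group_of.get(b)
            if ga is None or gb is None or ga == gb:
                continue
            counts[(a, b)] += 1
    return counts


# Invented page contents, for illustration only
pages = {
    "page-1": ["Praomys", "Xenopsylla", "Praomys"],
    "page-2": ["Praomys", "Xenopsylla", "Hymenolepis"],
    "page-3": ["Praomys", "Mus"],
}
group_of = {
    "Praomys": "Mammalia",
    "Mus": "Mammalia",
    "Xenopsylla": "Insecta",
    "Hymenolepis": "Cestoda",
}

for pair, n in cooccurring_pairs(pages, group_of).most_common():
    print(pair, n)
```

Raw counts like these would still need the weighting and false-positive filtering (tables of contents, indices) mentioned above before they say anything about ecology.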
Mining article titles for ecological associations
Another approach is to try and interpret the text itself. Keeping with the host-parasite theme, often descriptions of new parasite species are of the form "new species x from y". Here are some examples I use in my Phyloinformatics course:
Eimeria azul sp. n. (Protozoa: Eimeriidae) from the eastern cottontail, Sylvilagus floridanus, in Pennsylvania
Mirandula parva gen. et sp. nov. (Cestoda, Dilepididae) from the long-nosed Bandicoot (Perameles nasuta Geoff.)
Hysterothylacium carutti n. sp. (Nematoda: Anisakidae) from the marine fish Johnius carutta Bloch of Bay of Bengal (Visakhapatnam)
Ctenascarophis lesteri n. sp. and Prospinitectus exiguus n. sp. (Nematoda: Cystidicolidae) from the skipjack tuna, Katsuwonus pelamis
Buticulotrema stenauchenus n. gen. n. sp. (Digenea: Opecoelidae) from Malacocephalus occidentalis and Nezumia aequalis (Macrouridae) from the Gulf of Mexico
Nubenocephalus nebraskensis n. gen., n. sp. (Apicomplexa: Actinocephalidae) from adults of Argia bipunctulata (Odonata: Zygoptera)
Studies on Stenoductus penneri gen. n., sp. n. (Cephalina: Monoductidae) from the spirobolid millipede, Floridobolus penneri Causey 1957
Species of Cloacina Linstow, 1898 (Nematoda: Strongyloidea) from the black-tailed wallaby, Wallabia bicolor (Desmarest, 1804) from eastern Australia
A new marine Cercaria (Digenea: Aporocotylidae) from the southern quahog Mercenaria campechiensis
A new species of Breinlia (Breinlia) (Nematoda: Filarioidea) from the south Indian flying squirrel Petaurista philippensis (Elliot)
Wordtrees are a great way to visualise these sentences and get insights into how to parse them (the word tree for the text above is here).
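A very crude first pass at parsing titles like these is to split on the first " from " and pull out anything that looks like a binomial on either side. This is a sketch under my own assumptions (the regular expression and the function name are not from any existing tool), not a real name-finding algorithm:

```python
import re

# A capitalised word followed by a lower-case word of three or more letters.
# The length limit keeps out phrases such as "Bay of" and "Species of".
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")


def parse_title(title):
    """Return (parasite candidates, host candidates) for a 'x from y' title.

    Both lists are empty if the title has no ' from '.
    """
    before, sep, after = title.partition(" from ")
    if not sep:
        return [], []
    return BINOMIAL.findall(before), BINOMIAL.findall(after)


title = ("Eimeria azul sp. n. (Protozoa: Eimeriidae) from the eastern "
         "cottontail, Sylvilagus floridanus, in Pennsylvania")
print(parse_title(title))
# (['Eimeria azul'], ['Sylvilagus floridanus'])
```

It copes with two hosts (the Buticulotrema title yields both Malacocephalus occidentalis and Nezumia aequalis), but it also returns "Indian flying" as a host for the Breinlia title, and it finds no parasite for titles that only give a genus, so candidates would need checking against an index of taxon names.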
There is a lot of geographic data in BHL, which could potentially fill in gaps in geographic databases such as GBIF (which feeds into EOL). Even extracting latitude and longitude pairs from the OCR text can be enough to build some interesting maps.
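To give a flavour of how little is needed to get started, here is a sketch that pulls degree (and optional minute) coordinates with hemisphere letters out of a block of text and converts them to decimal degrees. The pattern is my own assumption about one common way coordinates are written; real OCR text will have many more variants (seconds, garbled degree signs, and so on).

```python
import re

# Latitude then longitude, e.g. 17°30'S, 145°15'E or 8°N 77°E
COORD = re.compile(
    r"(\d{1,3})\s*°\s*(?:(\d{1,2})\s*['′]\s*)?([NS])[,;\s]+"
    r"(\d{1,3})\s*°\s*(?:(\d{1,2})\s*['′]\s*)?([EW])"
)


def to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and optional minutes to signed decimal degrees."""
    value = int(degrees) + (int(minutes) / 60 if minutes else 0)
    return -value if hemisphere in "SW" else value


def extract_coordinates(text):
    """Return a list of (latitude, longitude) pairs found in the text."""
    pairs = []
    for m in COORD.finditer(text):
        lat = to_decimal(m.group(1), m.group(2), m.group(3))
        lon = to_decimal(m.group(4), m.group(5), m.group(6))
        # Discard values that cannot be real coordinates (often OCR errors)
        if abs(lat) <= 90 and abs(lon) <= 180:
            pairs.append((lat, lon))
    return pairs


print(extract_coordinates("collected at 17°30'S, 145°15'E, in rainforest"))
# [(-17.5, 145.25)]
```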
Another approach is to extract images from BHL, ideally with the associated caption. Many taxonomic papers have illustrations of taxa, so this would be a quick way to build an image database. It might be possible to do some clever parsing of the figure caption to extract not only taxon names but also other data. For example, if the caption mentions a scale bar you could very quickly classify organisms into size categories (a 1mm scale bar versus a 1cm or 1m scale bar tells you something about the size of the organism).
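The scale bar idea is easy to prototype. The sketch below looks for "scale bar" followed by a number and a unit and returns the length in millimetres; the pattern, the unit list and the function name are my assumptions about how captions tend to be worded, and the caption is made up.

```python
import re

SCALE_BAR = re.compile(
    r"scale\s*bars?\D{0,10}?(\d+(?:\.\d+)?)\s*(µm|μm|mm|cm|m)\b",
    re.IGNORECASE,
)

# Conversion factors to millimetres (both common encodings of the micro sign)
TO_MM = {"µm": 0.001, "μm": 0.001, "mm": 1, "cm": 10, "m": 1000}


def scale_bar_mm(caption):
    """Return the scale bar length in millimetres, or None if none is found."""
    match = SCALE_BAR.search(caption)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * TO_MM[unit.lower()]


print(scale_bar_mm("Fig. 3. Dorsal view of holotype. Scale bar = 1 mm."))
# 1.0
```

Binning the result (say, below 1 mm, 1 mm to 1 cm, and so on) would then give the rough size categories.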
Complementing the idea of image extraction, how about a tool that identifies tables in BHL OCR text? These tables are potentially sources of useful data, if they can be pulled out and indexed by taxon name (for example) then they could be analysed further. BHL OCR of tables tends to be poor, but the OCR could be redone on just the table, and/or the table could be edited manually (perhaps with the help of crowd sourcing).
September 27, 2013
I've just come back from the II Iberian Congress of Biological Systematics (CISA2013) in Barcelona, where I had a great time. I gave a presentation on biodiversity informatics entitled "Biodiversity informatics: why aren't we there yet?". Instead of my usual complaining about what a disaster biodiversity informatics is, and how links are so important (etc., etc.) I tried something a little new and presented a series of charts and diagrams, together with some (not terribly well thought out) interpretations. What I had in mind in doing this was to ask the question "what do these charts tell us about the field?" or, put another way, "what, if anything, do these charts tell us we should be doing?". I envisaged someone in a company, say, looking at charts on changes in the market (e.g., numbers of PCs versus laptops, mobile versus desktop Internet consumption, or peak oil) and thinking about what the implications are for their business. By this stage it should be clear I've no idea what I'm talking about, but I hope you get the idea. So, here are some of the charts I showed in my talk, together with some commentary.
These two charts show that the cost of sequencing is plummeting, and the number of sequences going into GenBank is rising exponentially (note that the GenBank chart is old and predates the step-change in sequencing costs, so growth was exponential even before it became much cheaper). I realise that there is more to sequencing costs than the first chart implies (http://dx.doi.org/10.1186/gb-2011-12-8-125) but the bottom line is we have a flood of data.
The rate of publication of new animal names has been roughly constant in the last few decades. Exactly what this sort of graph means is problematic, but my suspicion is that it reflects a discipline working at capacity. There is a limit to how many taxa it can describe, and I suspect a limit to the kinds of taxa being described (i.e., those that can be fairly easily recognised morphologically).
So we have exponential growth of sequence data coupled with taxonomic output that is essentially flatlining. Perhaps then it's no surprise that we have dark taxa in GenBank (i.e., taxa that don't carry proper Linnaean names):
This chart shows the declining number of "invertebrate" taxa in GenBank that have proper scientific names. Unfortunately, it is not trivial to figure out whether these dark taxa represent previously undiscovered biodiversity (i.e., new species) or taxa that have already been described but which we are either unable or unwilling to identify. In any event, exponential growth versus flat line means there is a disconnect between genomics and taxonomy.
The literature gap
This chart (from BioStor) highlights two things. Firstly, the Biodiversity Heritage Library is not just about old (i.e., pre-1923) literature. Despite that, 1923 is a mass extinction event in terms of access to taxonomic literature. If we date modern open access as getting underway around 2003 (the birth of PLoS) then we have a period of time (1923-2003) where much of the literature about biodiversity is "dark", either not digitised or locked behind a paywall. Some museums and scientific societies are opening up their publications (this is mostly what comprises the second peak in the chart), but much of the 20th century literature is closed to us.
One reason the legacy literature matters is the "long tail" phenomenon. Above is a plot of the size of Wikipedia articles for mammals, where the pages are ranked from largest to smallest. A few mammals have really detailed pages, but the vast majority have small pages ("stubs"). So for most taxa we know only a little, and hence the most recent publication on those taxa might be quite old. This means that if we want to build comparative databases we will need the legacy literature.
The chart below is a plot of the dates of publication of the sources used by the PanTHERIA database. Many of these are in the gap between 1923 and 2003, and a few date back to the 19th century. Even for a well-studied group such as mammals, the old literature matters.
Who publishes taxonomy?
Based on data in BioNames the chart above shows the relative importance of different publishers in terms of how many articles describing new animal taxa they have published. BioStor, which harvests articles from BHL, is the single largest source, which emphasises how important BHL is (all its content is open access). There are some significant commercial publishers (Springer, Elsevier, Taylor and Francis, BioOne) who we would need to talk to about data mining. There is also a huge long tail (hard to see but represented by all the tiny dots) of very small journals that collectively publish quite a lot of taxonomy.
But one thing that is striking about modern animal taxonomy is the emergence of Zootaxa as a "mega journal". The chart below shows time lines of articles-per-decade for the major taxonomic journals in zoology. There is a colossal spike that is Zootaxa. So, if we are interested in data mining at scale Zootaxa looks like the place to start.
Where is the biodiversity?
GBIF makes some wonderful maps, like the one below. But it's worrying that it seems to bear more relation to economic development than where the actual biodiversity is. The Amazon basin barely registers, Africa is poorly covered (not to mention China) and there are obvious sampling tracks in the oceans.
Maybe crowd sourcing ("citizen science") can come to the rescue? Not so much if this next map is representative. It shows the distribution of photos in the EOL group on Flickr. This looks more like a map of where the iPhones are, rather than where the biodiversity is. If the crowd has the same economic and geographic bias as the experts, then it's not going to help us much.
GenBank as a biodiversity database
Another "crowd" is the people doing sequencing and depositing georeferenced sequences in GenBank. Many of these sequences are DNA barcodes, but some are simply well-documented sequence data. The map of animal DNA sequences from GenBank (above) is sparser than GBIF's and shares many of the same biases, but this map and the next diagram make me wonder whether it is useful to take another look at GenBank's role.
GenBank has a lot more information than just sequences. Many accessions have geographic information, as well as other useful data such as "host" associations (e.g., for parasites or other close ecological relationships). I played with this a while ago, and found some interesting patterns. Given that GenBank has taxonomy, some geography, and some ecology, and that we can compute phylogenetic relationships from the sequence data, it could enable a richer biodiversity database than GBIF. Put another way, if we were to build a GBIF-style database on top of GenBank data, what would we do differently?
Data is private
This is a diagram that I published a few years ago http://dx.doi.org/10.1038/npre.2007.1028.1 that showed the gap between published papers on molecular phylogenetics and the number of phylogenies that made their way into TreeBASE. I can't help thinking that this tells us something about what we actually think of the value of individual phylogenies (i.e., they are relatively disposable). This is not to say that phylogenies don't matter, just that any individual phylogeny is relevant for a shorter period of time than the data (e.g., DNA sequences) used to infer that phylogeny.
This is a small, very biased collection of diagrams. There are obviously other diagrams that could be created, and some much more sophisticated analyses that we could do to try and tease out some more implications. In this post I'm largely waving my arms about. But I think it might be useful to explore this further and try and ask some questions about where we are, and where we are going. Or, more to the point, what we should be doing right now.
September 20, 2013
In some recent posts I've been exploring the quality of GBIF's taxonomic data. I've done some further analyses and decided to write this up in something more than a blog post. I'm writing a draft which you can see on GitHub. It tackles just one issue, namely what happens when you combine taxonomic names from multiple sources and don't know that some of those names are synonyms. For example, below is a cluster map for mammal species names from the Catalogue of Life, Mammal Species of the World, and the IUCN Red List.
Each database has a set of names that it and it alone recognises, as well as names that two of the three agree on. Merging these three sets of names successfully requires knowing which are synonyms. As I've noted before some synonyms have ended up in GBIF as separate names, which can mean users get a rather distorted view of what GBIF actually knows about a species.
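To make the problem concrete, here is a toy sketch of merging name lists with and without a synonym table. The source labels, the assignment of names to sources and the function name are all invented for illustration; the point is only that without the synonym lookup the same animal shows up as three unrelated names.

```python
def merge_names(sources, synonym_of=None):
    """Merge name sets from several sources.

    sources: dict mapping a source label to a set of names.
    synonym_of: optional dict mapping a synonym to its accepted name.
    Returns a dict mapping each (accepted) name to the set of sources using it.
    """
    synonym_of = synonym_of or {}
    merged = {}
    for source, names in sources.items():
        for name in names:
            # Names missing from the synonym table are treated as accepted
            accepted = synonym_of.get(name, name)
            merged.setdefault(accepted, set()).add(source)
    return merged


# Toy data: which source uses which name is invented for this example
sources = {
    "source A": {"Hoolock hoolock"},
    "source B": {"Bunopithecus hoolock"},
    "source C": {"Hylobates hoolock"},
}
synonym_of = {
    "Bunopithecus hoolock": "Hoolock hoolock",
    "Hylobates hoolock": "Hoolock hoolock",
}

print(len(merge_names(sources)))              # three apparently distinct names
print(len(merge_names(sources, synonym_of)))  # one taxon once synonyms are resolved
```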
This issue doesn't just affect GBIF; projects like the Map of Life suffer the same problem. The gibbon example I used earlier crops up again. I had to do three separate searches of Map of Life using the three different synonyms for the hoolock gibbon to get a complete picture of our knowledge of its distribution:
The multiplicity of names for the same taxon is one of the main challenges facing anyone wanting to integrate biodiversity data, and hence this taxonomy meme seems rather appropriate:
September 12, 2013
A nice article by Brendan Borrell about the secret life of herpetologist Edward Taylor, and Rafe Brown's efforts to untangle his taxonomic legacy has appeared in Nature:
Borrell, B. (2013). Taxonomy: The spy who loved frogs. Nature, 501(7466), 150–153. doi:10.1038/501150a
Fascinating article, but as always I'm going to skip straight past the content and look at links. The article leads with Ptychozoon intermedium, the Philippine parachute gecko. Naturally, pedant that I am, I wanted to find the original description of this gecko (which wasn't cited in the Nature piece). I turned to BioNames, and got the name but no literature. A bit of Googling revealed that Taylor originally used the name Ptychozoon intermedia (note the ending "a" rather than "um", sigh). OK, BioNames has Ptychozoon intermedia, plus the original description:
Edward H Taylor (1915) New species of Philippine Lizards. Philippine Journal of Science Manila Sect 10(D): 89–109. http://biostor.org/reference/129464
Obviously I need to improve BioNames to handle multiple variants of the species name. Finding this article took a little tracking down, not quite on the level of uncovering a spy, perhaps, but sometimes the amount of detective work involved in tracking down taxonomic literature is tiresome.
To continue with the theme, in my experience when reading taxonomic papers the list of literature cited is often simply listed as a text string without a link to the place you can find it. This is in marked contrast to papers in other subjects (say, phylogenetics), where most if not all the literature cited is linked. For the Nature article on Edward Taylor here are the references cited:
So 8 of 10 references have no link (I'm ignoring the ISI link for the first reference). So, I spent a little time fussing with BioStor, JSTOR, and Google and came up with some more:
Not perfect, but better. My concern is that the lack of linked literature citations simply seems to confirm taxonomy's status as an intellectual backwater. In other subjects the reader can quickly visit the literature cited and navigate the web of papers relevant to the article. But in taxonomy we have to resort to Google and/or specialised tools such as JSTOR, BioStor and BHL to find the literature. This needs to change, unless we are happy with taxonomy being a digital backwater.