Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188



May 19, 2015


Text mining for museum specimen identifiers #TDM http://t.co/BsFXSAZJpK cc @rdmpage @robgural thoughts?

— Ross Mounce (@rmounce) May 19, 2015

This post is a response to Ross Mounce's post Text mining for museum specimen identifiers. As Ross notes in that post, mining literature for specimen codes is something I've been interested in for a while (search for specimen codes on iPhylo), and @Aime Rankin (formerly an undergraduate student at Glasgow) did some work on this as well. It's great to see progress in this area.

Here are some thoughts on Ross's post (I'm posting here rather than as a comment on Ross's blog because this is going to be long).

What questions to ask?

Obviously there's a lot of scope for metrics, such as numbers of citations for individual specimens, and league tables for collections (see GBIF specimens in BioStor: who are the top ten museums with citable specimens?). As Ross notes, there's also scope for updating out of date museum metadata with information from the literature (e.g., Linking data from the NHM portal with content in BHL), but even more interesting is the potential to cross-link databases in a way that permits novel queries. For example, if we have a paper on a disease that includes data we can link to a georeferenced specimen, then we can enable spatial queries for diseases (e.g., BHL and GBIF as biomedical databases).

Materials for mining

From my perspective the obvious corpus to mine is the Biodiversity Heritage Library (BHL). Ross repeats the erroneous view that BHL is just "legacy" literature. Apart from the obvious point that everything not published right now is, by definition, legacy, BHL has a lot of modern content (including papers published in the last couple of years).

Furthermore, there are journals that cite Natural History Museum specimens, including "in house" journals (e.g., Bulletin of the British Museum (Natural History) Zoology and Bulletin of the Natural History Museum. Zoology series), as well as the Bulletin of the British Ornithologists' Club which has published lots of new bird names for which the type specimen is often in the NHM.

I guess one issue is accessibility. Ross notes that:

"The PMC OA subset is fantastic & really facilitates this kind of research – I wish ALL of the biodiversity literature was aggregated like (some) of the open access biomedical literature is. You can literally just download a million papers, click, and go do your research. It facilitates rigorous research by allowing full machine access to full texts."

So, how can we make BHL content as accessible? For each article I've extracted from BHL and stored in BioStor you can get full text by simply appending ".text" to the BioStor URL, but this isn't quite the same as grabbing a big dump of text.

The other source of mining is GenBank, which has a lot of sequences that have NHM vouchers, but also a weird and wonderful array of ways of recording those specimens. This is one reason I'm building "Material examined", to cope with these codes. For example sequence KF281084 has voucher "TRING 1877111743" which more traditionally would be written as "BMNH 1877.11.17.43", which is "NHMUK 1877.11.17.43" in the NHM database. This is just one example of the horrors of matching specimen codes (for more see the code for Material examined).
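To make the KF281084 example concrete, here is a minimal Python sketch of one such normalisation. It assumes that a run-together number such as "1877111743" always breaks into a four-digit year, two-digit month, two-digit day, and a trailing item number; that assumption will not hold for every register number, and the function names are made up for illustration (this is not the code used in Material examined).

```python
import re


def dotted_register_number(digits: str) -> str:
    """Split a run-together register number into year.month.day.item.

    Assumption: 4-digit year, 2-digit month, 2-digit day, then the item number.
    """
    if not re.fullmatch(r"\d{9,}", digits):
        raise ValueError(f"not a run-together register number: {digits!r}")
    year, month, day, item = digits[:4], digits[4:6], digits[6:8], digits[8:]
    return f"{year}.{month}.{day}.{item}"


def voucher_variants(voucher: str) -> list:
    """Return the BMNH and NHMUK spellings of a 'TRING <digits>' voucher."""
    prefix, digits = voucher.split()
    if prefix != "TRING":
        raise ValueError(f"unexpected prefix: {prefix!r}")
    dotted = dotted_register_number(digits)
    return [f"{acronym} {dotted}" for acronym in ("BMNH", "NHMUK")]


print(voucher_variants("TRING 1877111743"))
```

A real matcher would need many more rules than this, which is rather the point of the example.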

One reason GenBank is useful is that the sequences are often linked to the literature, which means you get to make the link between specimen and literature without actually needing to mine the text itself (handy if access is problematic).

Bonus question: How should I publish this annotation data?

"But if I wanted to publish something a little better & a little more formal, what kind of RDF vocabulary can I use to describe “occurs in” or “is mentioned in”. What would be the most useful format to publish this data in so that it can be re-used and extended to become part of the biodiversity knowledge graph and have lasting value?"

Personally I'd avoid RDF because that way lies madness (or at least endless detours haggling about ontologies).

But making the output useful is an important question. Despite the fact that it is a bit clunky, I suspect Darwin Core Archives are the way to go. The core data is a CSV table, so it's easy to generate, and also easy to use. Let's say you analysed a particular corpus (e.g., PLoS ONE), you could then output the data in Darwin Core (making sure both specimen and publication had stable identifiers), then package it up and upload to Zenodo or Figshare and get a DOI. For bonus points, it would be great to see this data on GBIF, but this would require (a) mapping NHM specimen codes to GBIF ids (the NHM has this), and (b) GBIF being able to recognise that the data you're adding is not new specimens but rather annotations of existing specimens.
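As a rough illustration of how little is needed to produce the core table, here is a Python sketch that writes specimen-to-publication links as CSV. The column names are Darwin Core terms, but the choice of columns is my own guess at a reasonable minimum, and the identifiers in the example row are placeholders rather than real records. A complete Darwin Core Archive would also need a meta.xml descriptor, which is not shown.

```python
import csv
import io

# Darwin Core terms used as column headers (the selection is an assumption).
FIELDS = ["occurrenceID", "institutionCode", "catalogNumber", "associatedReferences"]


def annotations_to_csv(rows):
    """Serialise specimen-to-publication links as a CSV core table."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Placeholder values for illustration only.
example_rows = [
    {
        "occurrenceID": "example-occurrence-id",
        "institutionCode": "NHMUK",
        "catalogNumber": "1877.11.17.43",
        "associatedReferences": "example-publication-identifier",
    }
]

print(annotations_to_csv(example_rows))
```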

Things to think about

Here are a couple of additional things to think about.

Specimen finding as a service

In the same way that we have taxonomic name-finding services, it would be great if we had a specimen code-finding service. I have code that I use in BioStor, but it would be great to have something that is robust, stable, and generalisable across multiple specimen codes. My tool Material examined focusses on parsing a single string rather than parsing a block of text, but adding that functionality is an obvious thing to do.
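To give a flavour of what finding codes in a block of text might involve, here is a deliberately naive Python sketch. The acronym list and the pattern are assumptions chosen to fit a few codes mentioned on this blog; this is not the code used in BioStor or Material examined, and real specimen codes are far messier.

```python
import re

# Hypothetical, very incomplete list of institution acronyms.
ACRONYMS = ("AMNH", "BMNH", "FMNH", "MCZ", "NHMUK", "USNM")

# An acronym, whitespace, then digits optionally joined by "." or "-".
CODE_PATTERN = re.compile(
    r"\b(" + "|".join(ACRONYMS) + r")\s+(\d+(?:[.\-]\d+)*)"
)


def find_specimen_codes(text):
    """Return specimen codes found in text, in order of appearance."""
    return [f"{m.group(1)} {m.group(2)}" for m in CODE_PATTERN.finditer(text)]


sample = "Holotype FMNH 187122 and paratype BMNH 1877.11.17.43."
print(find_specimen_codes(sample))
```

A service version would wrap something like `find_specimen_codes` behind a URL, in the same way name-finding services do.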

Markup as output

One concern I have with work that involves mining text is that we hardly ever store the intermediate step of text + located elements. Instead we get to see summary output (e.g., this page has these three scientific names, and these 10 specimen codes). As Terry Catapano (@catapanoth) once wisely pointed out "indexing is markup", in that if you find a substring in some text, you have in effect marked up the text. Can we preserve the marked up text so that we can go back and look at it and improve our text mining methods, or make that markup available to others to build upon it? There are all sorts of things which could be built upon this information, for example, imagine if the results were given to BHL so that people could search by specimen code.
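One way to keep that intermediate step is to record where each match occurred rather than just what was matched. The Python sketch below stores character offsets alongside the matched string (so-called stand-off markup) and can regenerate an inline version on demand. The pattern, the field names, and the double-bracket notation are all assumptions made for illustration.

```python
import json
import re

# Hypothetical pattern: a known acronym followed by a number.
CODE_PATTERN = re.compile(
    r"\b(?:AMNH|BMNH|FMNH|MCZ|NHMUK|USNM)\s+\d+(?:[.\-]\d+)*"
)


def annotate(text):
    """Return stand-off annotations: offsets into text plus the matched string."""
    return [
        {"start": m.start(), "end": m.end(), "exact": m.group(0), "type": "specimenCode"}
        for m in CODE_PATTERN.finditer(text)
    ]


def to_inline_markup(text, annotations):
    """Rebuild the text with each annotated span wrapped in [[type:...]]."""
    parts, cursor = [], 0
    for a in sorted(annotations, key=lambda a: a["start"]):
        parts.append(text[cursor:a["start"]])
        parts.append(f"[[{a['type']}:{text[a['start']:a['end']]}]]")
        cursor = a["end"]
    parts.append(text[cursor:])
    return "".join(parts)


page = "Holotype MCZ 24351 examined."
annotations = annotate(page)
print(json.dumps(annotations, indent=2))
print(to_inline_markup(page, annotations))
```

Because the original text is left untouched, the annotations can be stored, shared, corrected, or re-run against a better pattern later.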

May 14, 2015


This is a quick writeup of an analysis I did to make the case that the list of names held by the Index of Organism Names (ION) (part of Thomson Reuters) would be very useful for GBIF. I must declare a bias, in that I've spent a good chunk of the last 3-4 years exploring the ION database and investigating ways to link the taxonomic names it contains to the primary taxonomic literature, culminating in building BioNames.

What makes ION special is its scope (it endeavours to have all names covered by the ICZN), and that many of its names have associated citation information (i.e., details on the publication that published the name). Like any name database it has duplications and errors, and some of the older content is a bit ropey, but it's a tremendous resource and from my perspective nothing else in zoology comes close.

But rather than rely on anecdote, I decided to do a quick analysis to see what ION could potentially add to GBIF. I've been doing some work on bird names recently, so as an exercise I searched GBIF for holotype specimens for birds. The search (13 May 2015) returned 11,664 records. I then filtered those on taxonomic names that GBIF could not match exactly (TAXON_MATCH_FUZZY) or names that GBIF could only match to a higher rank (TAXON_MATCH_HIGHERRANK). The query URL is:


This query found 6,392 records, so over half the bird holotype specimens in GBIF do not match a taxonomic name in GBIF. What this means is that GBIF can't accurately place these names in its own taxonomic hierarchy. It also makes it hard to do meaningful analyses of things such as "how long does it take from when a bird specimen is collected to when it is described as a new species?" because if you can't match the name then you can't get the date the name was published.

To explore this further, I downloaded the results of the query (the download has DOI http://doi.org/10.15468/dl.vce3ay). I then wrote a script to parse the specimen records and extract the GBIF occurrence id, catalogue number, and scientific name. I then used the GBIF API to retrieve (where available) the verbatim record for each specimen (using the URL http://api.gbif.org/v1/occurrence/{id}/verbatim where {id} is the occurrence id). This gives us the original name on the specimen, which I then looked up in BioNames using its API. If I got a hit I extracted the identifier of the name (the LSID in the ION database) and the corresponding publication id in BioNames (if available). If there was a publication associated with the name I then generated a human-readable citation using BioNames’s citeproc API. The code for all this is on GitHub.
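For anyone wanting to try the verbatim lookup step themselves, here is a small Python sketch. It is not the script on GitHub; the function names are mine, and I am assuming that the verbatim record is a JSON object whose keys are full Darwin Core term URIs. The BioNames lookup is left out because its API details aren't given above.

```python
import json
from urllib.request import urlopen

GBIF_API = "http://api.gbif.org/v1"

# Assumption: verbatim records key each field by its full Darwin Core term URI.
DWC_SCIENTIFIC_NAME = "http://rs.tdwg.org/dwc/terms/scientificName"


def verbatim_url(occurrence_id):
    """Build the URL of the verbatim record for a GBIF occurrence id."""
    return f"{GBIF_API}/occurrence/{occurrence_id}/verbatim"


def verbatim_name(record):
    """Pull the name as originally supplied out of a verbatim record, if present."""
    return record.get(DWC_SCIENTIFIC_NAME)


def fetch_verbatim(occurrence_id, timeout=30):
    """Fetch and decode the verbatim record (requires network access)."""
    with urlopen(verbatim_url(occurrence_id), timeout=timeout) as response:
        return json.load(response)


print(verbatim_url(883603238))
```

Calling `verbatim_name(fetch_verbatim(883603238))` would then give the name to look up in BioNames, assuming the record is still available.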

Here's a sample of the mapping:

Occurrence | Holotype | GBIF matched name | Verbatim name | ION | BioNames | Publication
883603238 | USNM PAL378357.3368464 | Porzana Vieillot, 1816 | Porzana severnsi | 879659 | 2c4f3... | Olson, S. L., & James, H. F. (1991). Descriptions of thirty-two new species of birds from the Hawaiian Islands: Part 1. Non-Passeriformes. Ornithological Monographs, 45, 1-88. doi:10.2307/40166794
858732312 | AMNH Skin-245914 | Otus choliba (Vieillot, 1817) | Otus choliba duidae | 4307811 | b3315... | Chapman, F. M., & History, T. D. E. of the A. M. of N. (1929). Descriptions of new Birds from Mt. Duida, Venezuela. American Museum Novitates, 380, 1-27. Retrieved from http://hdl.handle.net/2246/3988
858732345 | AMNH Skin-245936 | Atlapetes Wagler, 1831 | Atlapetes duidae | 4307791 | b3315... | Chapman, F. M., & History, T. D. E. of the A. M. of N. (1929). Descriptions of new Birds from Mt. Duida, Venezuela. American Museum Novitates, 380, 1-27. Retrieved from http://hdl.handle.net/2246/3988
858733764 | AMNH Skin-45339 | Leptotila Swainson, 1837 | Leptotila gaumeri Lawr. | | |
858744126 | AMNH Skin-218110 | Zosterops Vigors & Horsfield, 1827 | Zosterops alberti ablita | | |

The complete result of this mapping can be viewed here. Of the 6,392 holotypes with names not recognised by GBIF, nearly half (3,165, 49.5%) exactly matched a name in ION. Many of these are also linked to the publication that published that name.

So, adding ION helps us find half the missing holotype names. This is before doing anything more sophisticated, such as approximate string matching, resolving synonyms, etc. Hence, I'd argue that the names in ION would add a lot to GBIF's ability to interpret the occurrence records it receives from museums.

I've not had time for further analysis, but at first glance a lot of the missed names are subspecies, there are quite a few fossils, and many names are in the relatively older literature. However there are also some recently described taxa, such as the hawk-owl Ninox rumseyi Rasmussen et al. 2012, and a bunting subspecies from Tristan da Cunha (Nesospiza acunhae fraseri Ryan, 2008) that are missing from GBIF.

May 8, 2015


@rdmpage @BioDivLibrary @bouchoutdec Why aren't you creating an iphylo blog what metrics you expect to see so you will not be disappointed?

— Donat Agosti (@myrmoteras) May 1, 2015

"There are no requirements for signing up. A signature is first and foremost a statement of support for open data. Each signatory can determine how best to make progress towards the goal. Some recommendations are included in the declaration. We hope that signatories will become early adopters of the open access approach, that they will promote change in their institutions, societies and journals, and will position themselves and their institutions as leaders." (from http://www.bouchoutdeclaration.org/faqs/)

I've put off writing this post about the Bouchout Declaration for a number of reasons. I attended the meeting that launched the declaration last year, and from my perspective that was a frustrating meeting. Much talk about "Open Biodiversity Knowledge Management" with nobody seemingly willing or able to define it (see The vision thing - it's all about the links for some comments I made before attending the meeting), and as much as the signing of the Bouchout Declaration provided good theatre, it struck me as essentially an empty gesture. Public pronouncements are all well and good, but are ultimately of little value unless backed up by action. We have institutions that have signed the declaration yet have much of their intellectual output locked behind paywalls (e.g., JSTOR Global Plants). So much for being open.

So, since Donat challenged me, here's what I'd like to see happen. I'd like to see metrics of "openness" that we can use to evaluate just how open the signatories actually are. These metrics could be viewed as ways to try and persuade institutions into sharing data and other information, as a league table we can use to apply pressure, or as a way to survey the field and see what the impediments are to being open (are they financial, legal, cultural, resource, etc.).

Below are some of the things I think we could use to "score" the openness of biodiversity institutions.

Is the collection digitised and in GBIF?

Simple criterion that is easy to measure. If an institution has specimens or other biological material, are data and/or metadata on the collection freely available? What fraction of the collection has been digitised? How good is that digitisation (e.g., what fraction has been georeferenced)? We could define digitisation more broadly to include imaging and sequencing (both are methods of converting analogue specimens into digital objects).

Are the institutional publications digitised? Are they open access?

Some institutions have a history of digitising their in-house publications and making them freely available online (e.g., the AMNH), some even make them fully citable with CrossRef DOIs (e.g., the Australian Museum). But some institutions have, sadly, signed over their publications to commercial publishers or archives that charge for access (e.g., Kew's publications have been digitised by JSTOR, which limits their accessibility). As a footnote, I suspect that those institutions that lost confidence in their in-house publishing operations and outsourced them are the ones who have ended up losing control of their intellectual output, some of which is now closed off (e.g., some of the NHM London's journals are now the property of Cambridge University Press). Those institutions that maintained a culture of in-house publishing are the ones at the vanguard of digitising and opening up those publications.

Does the institution take part in the Biodiversity Heritage Library?

There are at least two ways to participate in the Biodiversity Heritage Library (BHL). One is by becoming a member and starting to scan books from institutional libraries. The other is by granting permission to BHL to scan institutional publications. BHL is often viewed as an archive of "old" literature, but in fact it has some very recent content. Some farsighted organisations have let BHL scan their journals, contributing to BHL becoming an indispensable resource for biodiversity research.

Do institution staff publish in open access journals?

A while ago I complained about how few new species descriptions were in open access journals (The top-ten new species described in 2010 and the failure of taxonomy to embrace Open Access publication). A measure of openness is whether an institution encourages its staff to publish their work in open access journals, and to make their data freely available as well. Some prefer to chase Nature and Science papers, but I'd like to think we could prioritise openness over journal impact factor.

These are just some of the more obvious things that could be used to measure openness. At the same time, it would be useful to develop ways to show the benefits of being open. For example, I've long argued that we could develop citation tracking for specimens. This gives researchers a means to track provenance of information (who said what about the identity of a specimen), and it also gives institutions a way to measure the impact of their collections. Doing this at scale is only going to be possible if collections are digitised, specimens have identifiers of some sort, and we can text mine the literature and associated data for those identifiers (in other words, the data and publications need to be open). So, perhaps one way to help make the case for being open is to develop metrics that are useful for the institutions themselves.

I guess I would have been much more enthusiastic about the Bouchout Declaration if these sorts of things had been in place at the start. Anyone can sign a document. Ideas are cheap, execution is everything.

April 21, 2015


Playing with the "material examined" tool I've been working on, I wondered whether I could make use of it in, say, a spreadsheet. Imagine that I have a spreadsheet of museum codes and want to look those up in GBIF. I could create a service for Open Refine, but Open Refine is a bit big and clunky: you have to fire up a Java application and point your browser at it, and Open Refine isn't as intuitive or as flexible as a spreadsheet.

It turns out that Google Spreadsheets supports custom functions, including importing JSON from a remote data source. Following How to import JSON data into Google Spreadsheets in less than 5 minutes here's what to do:

  1. Create a new Google Spreadsheet.
  2. Click on Tools -> Script Editor.
  3. Click Create script for Spreadsheet.
  4. Delete the placeholder content and paste the code from this script.
  5. Rename the script to ImportJSON.gs and click the save button.
  6. Back in the spreadsheet, in a cell, you can type “=ImportJSON()” and begin filling out its parameters.

Let's imagine we have a spreadsheet with a specimen code in cell A1, e.g. "FMNH 187122".

To call the material examined service, we need a function like this:

=ImportJSON(CONCATENATE("http://bionames.org/~rpage/material-examined/service/api.php?code=",A1,"&match&extend=10"), "/hits/key,/hits/scientificName", "noHeaders")

Paste this into cell B1 (i.e., just to the right of the specimen code) and after a short delay you should see something like this:

The three parameters supplied to ImportJSON are the query URL, written as a spreadsheet function that grabs the specimen code from cell A1, a list of the bits of data we want to extract from the result (expressed as JSON paths), and some options (in this case, don't show the headers). ImportJSON will grab the specimen code in cell A1, add it to the query URL, then output the results. You should see something like this:

The first column is the GBIF occurrence ID, the second is the scientific name (you can add more JSON paths to get more fields).

Note that we have multiple rows as there is more than one specimen with the code "FMNH 187122" in GBIF. Now, we can ask the material examined service to return only certain taxa (such as mammals) by adding the "scientificName" parameter:

=ImportJSON(CONCATENATE("http://bionames.org/~rpage/material-examined/service/api.php?code=",A10,"&scientificName=",B10,"&match&extend=10"), "/hits/key,/hits/scientificName", "noHeaders")

If you put the specimen code in cell A10, and the higher taxon "Mammalia" in cell B10, and paste the function above into cell C10, then you should see something like this:

Note that now we have a single row with the mammal specimen.
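If you'd rather do the same lookup from a script than from a spreadsheet, the two formulas can be mirrored in a few lines of Python. This is only a sketch: the URL and parameters are copied from the formulas, but the shape of the JSON response (a "hits" list whose items have "key" and "scientificName") is inferred from the JSON paths passed to ImportJSON, and the function names are my own.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

SERVICE = "http://bionames.org/~rpage/material-examined/service/api.php"


def service_url(code, scientific_name=None):
    """Build the query URL used in the spreadsheet formulas."""
    url = f"{SERVICE}?code={quote(code)}"
    if scientific_name:
        url += f"&scientificName={quote(scientific_name)}"
    return url + "&match&extend=10"


def extract_hits(result):
    """Mimic the JSON paths /hits/key and /hits/scientificName.

    Assumption: the response is an object with a "hits" list.
    """
    return [(hit.get("key"), hit.get("scientificName")) for hit in result.get("hits", [])]


def lookup(code, scientific_name=None, timeout=30):
    """Call the service and return (occurrence id, name) pairs (needs network access)."""
    with urlopen(service_url(code, scientific_name), timeout=timeout) as response:
        return extract_hits(json.load(response))


print(service_url("FMNH 187122", "Mammalia"))
```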

It's a little bit fussy (you need to get the ImportJSON script, and mess a bit with the parameters) but it's quick and flexible, and you get all the power of a spreadsheet to help clean the data before trying to match it to GBIF. Plus you can do it all in your browser.

April 15, 2015


The six finalists for the GBIF Ebbe Nielsen Challenge have been announced by GBIF:

“The creativity and ambition displayed by the finalists is inspiring”, said Roderic Page, chair of the Challenge jury and the GBIF Science Committee, who introduced the Challenge at GBIF’s 2014 Science Symposium in October.

“My biggest hope for the Challenge was that the biodiversity community would respond with innovative—even unexpected—entries,” Page said. “My expectations have been exceeded, and the Jury is eager to see what the finalists can achieve between now and the final round of judging.”

The finalists all receive a €1,000 prize, and now have the possibility to refine their work and compete for the grand prize of €20,000 (€5000 for second place). As the rather cheesy quote above suggests, I think the challenge has been a success in terms of the interest generated, and the quality of the entrants. While the finalists bask in glory, it's worth thinking about the future of the challenge. If it is regarded as a success, should it be run in the same way next year? The first challenge was very open in terms of scope (pretty much anything that used GBIF data), would it be better to target the challenge on a more focussed area? If so, which area needs the most attention? Food for thought.


I've put together a working demo of some code I've been working on to discover GBIF records that correspond to museum specimen codes. The live demo is at http://bionames.org/~rpage/material-examined/ and code is on GitHub.

To use the demo, simply paste in a specimen code (e.g., "MCZ 24351") and click Find and it will do its best to parse the code, then go off to GBIF and see what it can find. Some examples that are fun include MCZ 24351, KU:IT:00312, MNHN 2003-1054, and AMS I33708-051.

It's proof of concept at this stage, and the search is "live": I'm not (yet) storing any results. For now I simply want to explore how well it can find matches in GBIF.
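To give a rough idea of the parse-then-search approach (this is not the code on GitHub, which has to handle far more variation), here is a Python sketch that splits a code at the first space or colon and turns the two parts into a GBIF occurrence search. Treating the first part as the institution code and everything after it as the catalogue number is an assumption that will fail for plenty of real codes.

```python
import re
from urllib.parse import urlencode

GBIF_SEARCH = "http://api.gbif.org/v1/occurrence/search"


def parse_code(code):
    """Split a specimen code into (institution code, catalogue number).

    Assumption: the institution code is everything before the first
    space or colon.
    """
    parts = re.split(r"[:\s]+", code.strip(), maxsplit=1)
    if len(parts) != 2:
        raise ValueError(f"cannot parse specimen code: {code!r}")
    return parts[0], parts[1]


def gbif_search_url(code):
    """Build a GBIF occurrence search URL for a specimen code."""
    institution, catalogue = parse_code(code)
    query = urlencode({"institutionCode": institution, "catalogNumber": catalogue})
    return f"{GBIF_SEARCH}?{query}"


for example in ("MCZ 24351", "KU:IT:00312", "MNHN 2003-1054", "AMS I33708-051"):
    print(parse_code(example), gbif_search_url(example))
```

Much of the real work lies in trying alternative spellings of the same code when the obvious query returns nothing.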

By itself this isn't terribly exciting, but it's a key step towards some of the things I want to do. For example, the NCBI is interested in flagging sequences from type specimens (see http://dx.doi.org/10.1093/nar/gku1127 ), so we could imagine taking lists of type specimens from GBIF and trying to match those to voucher codes in GenBank. I've played a little with this, but unfortunately there seem to be lots of cases where GBIF doesn’t know that a specimen is, in fact, a type.

Another thing I’m interested in is cases where GBIF has a georeferenced specimen but GenBank doesn’t (or vice versa), as a stepping stone towards creating geophylogenies. For example, in order to create a geophylogeny for Agnotecous crickets in New Caledonia (see GeoJSON and geophylogenies ) I needed to combine sequence data from NCBI with locality data from GBIF.

It’s becoming increasingly clear to me that the data supplied to GBIF is often horribly out of date compared to what is in the literature. Often all GBIF gets is what has been scribbled in a collection catalogue. By linking GBIF records to specimen codes that are cited in the literature we could imagine giving GBIF users enhanced information on a given occurrence (and at the same time get citation counts for specimens; see The impact of museum collections: one collection ≈ one Nobel Prize).

Lastly, if we can link specimens to sequences and the literature, then we can populate more of the biodiversity knowledge graph.

March 10, 2015


The GBIF Ebbe Nielsen Challenge has closed and we have 23 submissions for the jury to evaluate. There's quite a range of project types (and media, including sound and physical objects), and it's going to be fascinating to evaluate all the entries (some of which are shown below). This is the first time GBIF has run this challenge, so it's gratifying to see so much creativity in response to the challenge. While judging itself is limited to the jury (of which I'm a member), I'd encourage anyone interested in biodiversity informatics to browse the submissions. Although you can't leave comments directly on the submissions within the GBIF Challenge pages, each submission also appears on the portfolio page of the person/organisation that created the entry, so you can leave comments there (follow the link at the bottom of the page for each submission to see it on the portfolio page).