Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188


August 28, 2014

My BioNames project has been going for over a year now, but I hadn't gotten around to providing bulk access to the data I've been collecting and cleaning. I've gone some way towards fixing this. You can now grab a snapshot of the BioNames database as a Darwin Core Archive here. This snapshot was generated on the 22nd August, so it is already a little out of date (BioNames is edited almost daily as I clean and annotate it when I should be doing other things).

The data dump doesn't capture all the information in BioNames, as I've tried to keep it simple and Darwin Core is a bit of a pain to deal with. The actual database is in CouchDB, which is (mostly) an absolute joy to work with. I replicate the database to Cloudant, which means there's a copy "in the cloud". A number of my other CouchDB projects are also in Cloudant; in the case of Australian Faunal Directory and BOL DNA Barcode Map, the data is also served directly from Cloudant.

August 25, 2014

Note to self for upcoming discussion with JournalMap.

As of Monday August 25th, BioStor has 106,617 articles comprising 1,484,050 BHL pages. From the full text for these articles, I have extracted 45,452 distinct localities (i.e., geotagged with latitude and longitude). 15,860 BHL pages in BioStor have at least one geotag, and these pages belong to 5,675 BioStor articles.

In summary, BioStor has 5,675 full-text articles that are geotagged. The largest number of geotags for an article is 2,421, for Distribución geográfica de la fauna de anfibios del Uruguay (doi:10.5479/si.23317515.134.1).

The SQL for the queries is here.
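As a rough illustration of the kind of counting involved (this is not the actual SQL linked above), here is a minimal sketch using Python's built-in sqlite3 module. The table name `page_geotag` and its columns are assumptions made up for this example, and the rows are toy data, not BioStor records.

```python
import sqlite3


def geotag_summary(conn):
    """Return (distinct localities, geotagged pages, geotagged articles)."""
    localities = conn.execute(
        "SELECT COUNT(*) FROM "
        "(SELECT DISTINCT latitude, longitude FROM page_geotag)"
    ).fetchone()[0]
    pages = conn.execute(
        "SELECT COUNT(DISTINCT page_id) FROM page_geotag"
    ).fetchone()[0]
    articles = conn.execute(
        "SELECT COUNT(DISTINCT article_id) FROM page_geotag"
    ).fetchone()[0]
    return localities, pages, articles


def most_geotagged_article(conn):
    """Return (article_id, number of geotags) for the article with most geotags."""
    return conn.execute(
        "SELECT article_id, COUNT(*) AS n FROM page_geotag "
        "GROUP BY article_id ORDER BY n DESC LIMIT 1"
    ).fetchone()


def demo_connection():
    """Build an in-memory database with a hypothetical schema and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_geotag ("
        "page_id INTEGER, article_id INTEGER, latitude REAL, longitude REAL)"
    )
    conn.executemany(
        "INSERT INTO page_geotag VALUES (?, ?, ?, ?)",
        [
            (1, 10, -34.9, -56.2),
            (1, 10, -33.0, -55.0),
            (2, 10, -34.9, -56.2),  # same locality on a second page
            (3, 11, 1.5, 36.8),
        ],
    )
    return conn
```

Calling `geotag_summary(demo_connection())` counts localities, pages, and articles in the toy table; the real numbers would of course depend on how BioStor actually stores its geotags.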

August 19, 2014

This is a guest post by Angelique Hjarding in response to discussion on this blog about the paper below.
Hjarding, A., Tolley, K. A., & Burgess, N. D. (2014, July 10). Red List assessments of East African chameleons: a case study of why we need experts. Oryx. Cambridge University Press (CUP). doi:10.1017/s0030605313001427

Thank you for highlighting our recent publication and for the very interesting comments. We wanted to take the opportunity to address some of the issues brought up in both your review and from reader comments.

One of the most important issues that has been raised is the sharing of cleaned and vetted datasets. It has been suggested that the datasets used in our study be uploaded to a repository that can be cited and shared. This is possible for data that was downloaded from GBIF as they have already done the legwork to obtain data sharing agreements with the contributing organizations. So as long as credit is properly given to the source of the data, publicly sharing data accessed through GBIF should be acceptable. At the time the manuscript was submitted for publication, we were unaware of sites such as http://figshare.com where the data could be stored and shared with no additional cost to the contributor. The dataset used in the study that used GBIF data has now been made available in this way.
Angelique Hjarding. (2014). Endemic Chameleons of Kenya and Tanzania. Figshare. doi:10.6084/m9.figshare.1141858

It starts to get tricky with doing the same for the expert vetted data. This dataset consists primarily of data gathered by the expert from museum records and published literature. So in this case it is not a question of why the expert doesn’t share. The question is why the museum data and any additional literature records are not on GBIF already. As has been pointed out in our analysis (and confirmed by Rod) most of these museums do not currently have data sharing agreements with GBIF. Therefore, the expert who compiled the data does not have the permission of the museums to share their data second hand. Bottom line, all of the data used in this study that was not accessed through GBIF is currently available from the sources directly. That is, for anyone who wants to take the time to contact the museums for permission to use their data for research and to compile it. We also do not believe there is blame on museums that have not yet shared their data with forums such as GBIF. Mobilisation of data is an enormous task, and near impossible if funds and staff are not available. With regards to the particular comment regarding the lack of data sharing by NHML and other museums, we need to recognise what the task at hand would mean, and rather address ways such a monumental, and valuable, collection could be mobilised. A further issue should be raised around literature records that are not necessarily encapsulated in museum collections, but are buried in old and obscure manuscripts. To our knowledge, there is no way to mobilise those records either, because they are not attached to a specimen. Further, because there are no specimens, extreme care must be taken if such records were to be mobilised in order to ensure quality control. Again, assistance of expert knowledge would be highly beneficial, yet these things take time and require funds.

Another issue that was raised is why didn’t we go directly to GBIF to fix the records? The point of our research was not to clean and update GBIF/museum data but to evaluate the effect of expert vetting and museum data mobilization in an applied conservation setting. As it has been pointed out, the lead author was working at GBIF during the course of the research. An effort was made to provide a checklist of the updated taxonomy to GBIF at the time, but there was no GBIF mechanism for providing updates. This appears to still be the case. In addition, two GBIF staff provided comments on the paper and were acknowledged for their input. We are happy to provide an updated taxonomy to help improve the data quality, should some submission tool for updates be made available.

Finally we would like to address the question, why use GBIF data if we know it needs some work before it can be used? We believe this is a very important debate for at least two reasons. First, when data is made public, we believe there are many researchers who work under the assumption that the data is ready for use with minimal further work. We believe they assume that the taxonomy is up to date; that the records are in the right place; and that the records provided relate to the name that is attached to those records. Many of the papers that have used GBIF data have undertaken broad scale macroecological analyses where, perhaps, the errors we have shown matter little. But some of these synthetic studies have also proposed that their results can be used for decision making by companies, which starts to raise concerns especially if the company wants to know the exact species that its activities could impact. As we have shown, for chameleons at least, such advice would be hard to provide using the raw GBIF data.

Second, we are aware that there is another group of researchers using GBIF data who "know that to use GBIF's data you need to do a certain amount of previous work and run some tests, and if the data does not pass the tests, you don't use it." We are not sure of the tests that are run, and it would be useful to have these spelled out for broader debate and potentially the development of some agreed protocols for data cleaning for various uses.

Our underlying reason for writing the paper was not to enter into debate of which data are best between GBIF and an expert compiled dataset. We are extremely pleased that GBIF data exist, and are freely available for the use of all. This certainly has to be part of the future of 'better data for better decisions', but we are concerned that we should not just accept that the data is the best we can get, but should instead look for ways to improve it, for all kinds of purposes. As such, we would like to suggest that the discussion focuses some energy on ways to address the shortcomings of the present system, but also that the community who would benefit from the data address ways to assist the dataholders to mobilise their information in terms of accessing the resources required to digitise and make data available, and maintain updated taxonomy for their holdings. In an era of declining funding for Museum-based taxonomy in many parts of the world this is certainly a challenge that needs to be addressed.

We welcome further discussion as this is a very important topic, not only for conservation but also in terms of improved access to biodiversity knowledge, which is critical for many reasons.

Angelique Hjarding http://orcid.org/0000-0002-9279-4893
Krystal Tolley
Neil Burgess

August 15, 2014

If we view biodiversity data as part of the "biodiversity knowledge graph" then specimens are a fairly central feature of that graph. I'm looking at ways to link specimens to sequences, taxa, publications, etc., and doing this across multiple data providers. Here are some rough notes on trying to model this in a simple way.

For simplicity let's suppose that we have this basic model:

A specimen comes from a locality (ideally we have the latitude and longitude of that locality), it is assigned to a taxon, we have data derived from that specimen (e.g., one or more DNA sequences), and we have one or more publications about that specimen (e.g., a paper that publishes a taxon name for which the specimen is a type, or a paper that publishes a sequence for which the specimen is a voucher).
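To make the basic model concrete, here is a minimal sketch in Python. The class and field names are my own shorthand for the relationships described above, not a schema used by any of the databases discussed, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Locality:
    name: str
    latitude: Optional[float] = None   # ideally known, often missing
    longitude: Optional[float] = None


@dataclass
class Specimen:
    code: str                                   # e.g. a museum voucher code
    taxon: Optional[str] = None                 # taxon the specimen is assigned to
    locality: Optional[Locality] = None         # where it was collected
    sequences: List[str] = field(default_factory=list)     # derived data, e.g. accession numbers
    publications: List[str] = field(default_factory=list)  # e.g. DOIs or PMIDs

    def is_georeferenced(self) -> bool:
        """True only if the locality has both latitude and longitude."""
        return (
            self.locality is not None
            and self.locality.latitude is not None
            and self.locality.longitude is not None
        )


# Placeholder values, purely for illustration.
example = Specimen(
    code="EXAMPLE 0001",
    taxon="Example taxon",
    locality=Locality("Example locality", latitude=-1.0, longitude=36.0),
    sequences=["ACCESSION-PLACEHOLDER"],
    publications=["doi:placeholder"],
)
```

Each of the data sources below fills in a different subset of these fields, which is what makes joining them up attractive.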

In GenBank we have sequences that have accession numbers, and these are linked to taxa (identified by NCBI tax ids). A nice feature of sequence databases is that taxa are explicitly defined by extension, that is, a taxon is the set of sequences assigned to a given taxon. Most (but not all, see Miller et al. doi:10.1186/1756-0500-2-101) sequences are also linked to a publication, which will usually have a PubMed id (PMID), and sometimes a DOI. Many sequences are also georeferenced (see Guest post: response to "Putting GenBank Data on the Map"). Most sequences aren't linked to a voucher specimen, but there is the implicit notion of a source (in RDF-speak, many specimens are "blank nodes"; see Blank nodes for specimens without URI). Some sequences are associated with a specimen that has a museum code, and some are explicitly linked to the specimen by a URL.

DNA barcodes
Barcodes, as represented in BOLD, are similar to sequences in GenBank. We have explicit taxa ("BINs") each of which has a URL, some also having DOIs. Most barcodes are georeferenced. There's some ambiguity about whether the URL for a barcode record identifies the barcode sequence, the specimen, or both. There may be a voucher code for the specimen. Some barcodes are linked to publications, but not (as far as I can see) in the data obtained from the API. Some barcodes are linked to the corresponding record in GenBank (which may or may not be suppressed, see Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)).

At its core GBIF has occurrence records (many of these are specimen-based, but the majority of data in GBIF is actually observation-based), each of which has a unique id, and which is linked to a taxon, also with a unique id. As with the sequence databases, a taxon is a set of occurrences that have been assigned to that taxon. Many records in GBIF are georeferenced. There are limited cross links to other databases - some occurrences list associated GenBank sequences. Some GBIF occurrences actually are sequences (e.g., the European Molecular Biology Laboratory Australian Mirror and the soon to be indexed Geographically tagged INSDC sequences), and barcodes are also making their way into GBIF (e.g., Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data). Links to publications are limited.

Museums and herbaria
Some individual natural history collections which are online provide specimen-level web pages and URLs (some even have DOIs, see DOIs for specimens are here, but we're not quite there yet), and some museums list associated GenBank sequences. In the diagram I've not linked the specimens to a taxon, because most specimens are tagged by a name, not an explicit taxon concept (unlike NCBI, BOLD, or GBIF).

Literature databases (represented here by BioStor, but could be other sources such as ZooKeys, for example) may contain articles that mention specimen codes. These articles may also mention taxon names, and geographic localities (including coordinates) (see, for example, Linking GBIF and the Biodiversity Heritage Library). Mining text for names, specimens, and localities is fairly easy, but linking these together is harder (i.e., this specimen is of this taxon, and was found at this locality).

Linking together
If we have these separate sources and this trivial model, then we can imagine trying to tie information about the same specimen together across the different databases. Why might we want to do this? Here are three reasons:

  1. Augmentation: Combining information can enhance our understanding of a specimen. Perhaps a specimen in GBIF is a geographic outlier. A publication that mentions the specimen includes it in a new taxon, perhaps discovered by sequencing DNA extracted from that specimen. Linking this information together resolves the problematic distribution.
  2. Provenance: What is the evidence that a particular specimen belongs to a particular taxon, or was collected at a particular locality? If we connect specimens to the literature we can review the evidence for ourselves. If we have sequences we can run BLAST, build a tree, and see if we should rethink our classification of that sequence. Imagine being able to browse GBIF and see the evidence for each dot on the map?
  3. Citation: Mentions in the literature, use as vouchers for DNA barcoding or other forms of sequencing can be thought of as a "citation" of that specimen. Museums hosting that material could use metrics based on this to demonstrate the value of their collection (see also The impact of museum collections: one collection ≈ one Nobel Prize).
Making the links
All this is well and good; the trick is to actually make the links. Here things get horribly messy very quickly. Museum specimens are cited in inconsistent ways, we don't have widely used unique, resolvable specimen identifiers, and even if we did have these identifiers we don't have a global discovery mechanism for matching voucher codes to identifiers. GBIF would be an obvious part of a "global discovery mechanism" (a bit like CrossRef but for specimens), but GBIF can have multiple records for the same specimen. Sometimes this is because GBIF not only aggregates data from primary sources (such as museums) but also other aggregations which may themselves already include specimens harvested from primary sources. GBIF can also have multiple records because museums keep messing with their databases, try new variants of the Darwin Core triple, etc., resulting in records that look "new" to GBIF. Whole collections can be duplicated in this way.

One way to tackle this multiplicity of specimen records is to think in terms of "clusters" of specimens that are, in some sense, the same thing across multiple databases. For example, clustering a set of duplicated GBIF records together with the sequences derived from those specimens, perhaps including a DNA barcode, and a list of papers that mention that specimen. This is represented by the yellow bar through the diagram, it connects all the different pieces of information about a specimen into a single cluster. More *cough* later.
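One possible way to build such clusters, assuming we already have pairwise "same specimen" links between records, is a simple union-find pass. This is only a sketch: the identifier strings are made up for illustration, and deciding that two records really are the same specimen is the hard part that this code takes as given.

```python
from collections import defaultdict


def cluster_records(links):
    """Group record identifiers into clusters given pairwise 'same specimen' links.

    `links` is an iterable of (id_a, id_b) pairs. Returns a sorted list of
    sorted clusters so that the output is deterministic.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical identifiers: two duplicate GBIF records, a sequence derived
# from the specimen, and an unrelated barcode mentioned in an article.
links = [
    ("gbif:1", "gbif:2"),
    ("gbif:2", "genbank:A"),
    ("bold:X", "biostor:9"),
]
clusters = cluster_records(links)
```

Here `clusters` contains two groups: one holding both GBIF records and the sequence, and one holding the barcode and the article that mentions it.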

August 14, 2014

Update: Angelique Hjarding and her co-authors have responded in a guest post on iPhylo.
The quality and fitness for use of GBIF-mobilised data is a topic of interest to anyone that uses GBIF data. As an example, a recent paper on African chameleons comes to some rather alarming conclusions concerning the utility of GBIF data:

Hjarding, A., Tolley, K. A., & Burgess, N. D. (2014, July 10). Red List assessments of East African chameleons: a case study of why we need experts. Oryx. Cambridge University Press (CUP). doi:10.1017/s0030605313001427
Here's the abstract (unfortunately the paper is behind a paywall):

The IUCN Red List of Threatened Species uses geographical distribution as a key criterion in assessing the conservation status of species. Accurate knowledge of a species’ distribution is therefore essential to ensure the correct categorization is applied. Here we compare the geographical distribution of 35 species of chameleons endemic to East Africa, using data from the Global Biodiversity Information Facility (GBIF) and data compiled by a taxonomic expert. Data screening showed 99.9% of GBIF records used outdated taxonomy and 20% had no locality coordinates. Conversely the expert dataset used 100% up-to-date taxonomy and only seven records (3%) had no coordinates. Both datasets were used to generate range maps for each species, which were then used in preliminary Red List categorization. There was disparity in the categories of 10 species, with eight being assigned a lower threat category based on GBIF data compared with expert data, and the other two assigned a higher category. Our results suggest that before conducting desktop assessments of the threatened status of species, aggregated museum locality data should be vetted against current taxonomy and localities should be verified. We conclude that available online databases are not an adequate substitute for taxonomic experts in assessing the threatened status of species and that Red List assessments may be compromised unless this extra step of verification is carried out.
The authors used two data sets, one from GBIF, the other provided by an expert to compute the conservation status for each chameleon species endemic to Kenya and/or Tanzania. After screening the GBIF data for taxonomic and geographic issues, a mere 7% of the data remained - 93% of the 2304 records downloaded from GBIF were discarded.

This study raises a number of questions, some of which I will touch on here. Before doing so, it's worth noting that it's unfortunate that neither of the two data sets used in this study (the data downloaded from GBIF, and the expert data set assembled by Colin Tilbury) are provided by the authors, so our ability to further explore the results is limited. This is a pity, especially now that citable data repositories such as Dryad and Figshare are available. The value of this paper would have been enhanced if both datasets were archived.

Below is Table 1 from the paper, "Museums from which locality records for East African chameleons were obtained for the expert and GBIF datasets":

Museum; Expert dataset; GBIF
Afrika Museum, The Netherlands: x
American Museum of Natural History, USA: x
Bishop Museum, USA: x
British Museum of Natural History, UK: x
Brussels Museum of Natural Sciences, Belgium: x
California Academy of Sciences, USA: x
Ditsong Museum, South Africa: x x
Los Angeles County Museum of Natural History, USA: x
Museum für Naturkunde, Germany: x
Museum of Comparative Zoology (Harvard University), USA: x
Naturhistorisches Museum Wien, Austria: x
Smithsonian Institution, USA: x
South African Museum, South Africa: x
Trento Museum of Natural Sciences, Italy: x
University of Dar es Salaam, Tanzania: x
Zoological Research Museum Alexander Koenig, Germany: x

It is striking that there is virtually no overlap in data sources available to GBIF and the sources used by the expert. Some of the museums have no presence in GBIF, including some major collections (I'm looking at you, The Natural History Museum), but some museums do contribute to GBIF, but not their herpetology specimens. So, GBIF has some work to do in mobilising more data (Why is this data not in GBIF? What are the impediments to that happening?). Then there are museums that have data in GBIF, but not in a form useful for this study. For example, the American Museum of Natural History has 327,622 herpetology specimens in GBIF, but not one of these is georeferenced! Given that there are records in GenBank for AMNH specimens that are georeferenced, I suspect that the AMNH collection has deliberately not made geographic coordinates available, which raises the obvious question - why?

GBIF coverage
I had a quick look at GBIF to get some idea of the geographic coverage of the relevant herpetology collections (or animal collections if herps weren't separated out). Below are maps for some of these collections. The AMNH is empty, as is the smaller Zoological Research Museum Alexander Koenig collection (which supplied some of the expert data).

American Museum of Natural History, USA
Bishop Museum, USA
California Academy of Sciences, USA
Ditsong Museum, South Africa
Los Angeles County Museum of Natural History, USA
Museum für Naturkunde, Germany
Museum of Comparative Zoology (Harvard University), USA
Smithsonian Institution, USA
Zoological Research Museum Alexander Koenig, Germany

Some collections are relevant, such as the California Academy of Sciences, but a number of the collections in GBIF simply don't have georeferenced data on chameleons. Then there are several museums that are listed as sources for the expert database and which contribute to GBIF, but haven't digitised their herp collections, or haven't made these available to GBIF.

The other issue encountered by Hjarding et al. 2014 is that the GBIF taxonomy for chameleons is out of date (2302 of 2304 GBIF-sourced records needed to be updated). Chameleons are a fairly small group, and it's not like there are hundreds of new species being discovered each year (see timeline in BioNames); 2006 was a bumper year with 12 new taxonomic names added. But there has been a lot of recent phylogenetic work which has clarified relationships, and as a result species get shuffled around different genera, resulting in a plethora of synonyms. GBIF's taxonomy has lagged behind current research, and also manages to horribly mangle the chameleon taxonomy it does have. For example, the genus Trioceros is not even placed within the chameleon family Chamaeleonidae but is simply listed as a reptile, which means anyone searching for data on the family Chamaeleonidae will miss all the Trioceros species.

The use case for this study seems one of the most basic that GBIF should be able to meet - given some distributions of organisms, compute an assessment of their conservation status. That GBIF-mobilised data is so patently not up to the task in this case is cause for concern.

However, I don't see this as simply a case of expert data set versus GBIF data; I think it's more complicated than that. A big issue here is data availability, and also the extent of data release (assuming that the AMNH is actively withholding geographic coordinates for some, if not most of its specimens). GBIF should be asking those museums that provide data why they've not made georeferenced data available, and if it's because the museums simply haven't been able to do this, then how can it help this process? It should also be asking why museums which are part of GBIF haven't mobilised their herpetology data, and again, what can it do to help? Lastly, in an age of rapid taxonomic change driven by phylogenetic analysis, GBIF needs to overhaul the glacial pace at which it incorporates new taxonomic information.

August 4, 2014

I stumbled across this paper (found on the GBIF Public Library):
Oldman, D., de Doerr, M., de Jong, G., Norton, B., & Wikman, T. (2014, July). Realizing Lessons of the Last 20 Years: A Manifesto for Data Provisioning and Aggregation Services for the Digital Humanities (A Position Paper) System. D-Lib Magazine. CNRI Acct. doi:10.1045/july2014-oldman
The first sentence of the abstract makes the paper sound a bit of a slog to read, but actually it's great fun, full of pithy comments on the state of digital humanities. Almost all of this is highly relevant to mobilising natural history data. Here are the paper's main points (emphasis added):
  1. Cultural heritage data provided by different organisations cannot be properly integrated using data models based wholly or partly on a fixed set of data fields and values, and even less so on 'core metadata'. Additionally, integration based on artificial and/or overly generalised relationships (divorced from local practice and knowledge) simply create superficial aggregations of data that remain effectively siloed since all useful meaning is available only from the primary source. This approach creates highly limited resources unable to reveal the significance of the source information, support meaningful harmonisation of data or support more sophisticated use cases. It is restricted to simple query and retrieval by 'finding aids' criteria.
  2. The same level of quality in data representation is required for public engagement as it is for research and education. The proposition that general audiences do not need the same level of quality and the ability to travel through different datasets using semantic relationships is a fiction and is damaging to the establishment of new and enduring audiences.
  3. Thirdly, data provisioning for integrated systems must be based on a distributed system of processes in which data providers are an integral part, and not on a simple and mechanical view of information system aggregation, regardless of the complexity of the chosen data models. This more distributed approach requires a new reference model for the sector. This position contrasts with many past and existing systems that are largely centralised and where the expertise and practice of providers is divorced.

Recommended reading.

June 11, 2014


@rdmpage @AlexHardisty @proibiosphere Well, take part in the process of clarification!

— Pensoft Publishers (@Pensoft) June 10, 2014
I've been involved in a few Twitter exchanges about the upcoming pro-iBiosphere meeting regarding the "Open Biodiversity Knowledge Management System (OBKMS)", which is the topic of the meeting. Because for the life of me I can't find an explanation of what "Open Biodiversity Knowledge Management System" is, other than vague generalities and appeals to the magic pixie dust that is "Linked Open Data" and "RDF", I've been grumbling away on Twitter.

So, here's my take on what needs to be done. Fundamentally, if we are going to link biodiversity information together we need to build a network. What we have (for the most part) at the moment is a bunch of nodes (which you can think of as data providers such as natural history collections, databases, etc., or different kinds of data, such as names, publications, sequences, specimens, etc.).

We'd like a network, so that we can link information together, perhaps to discover new knowledge, to serve as a pathway for analyses that combine different sorts of data, and so on:

A network has nodes and links. Without the links there's no network. The fundamental problem as I see it is that we have nodes that have clear stakeholders (e.g., individual museums, herbaria, publishers, database owners, etc.). They often build links, but they are typically incomplete (they don't link to everything that is relevant), and transitory (there's no mechanism to facilitate persistence of the links). There is no stakeholder for whom the links are most important. So, we have this:

This sucks. I think we need an entity, a project, an organisation, whatever you want to call it, for whom the network is everything. In other words, they see the world like this:

If this is how you view the world, then your aim is to build that network. You live or die based on the performance of that network. You make sure the links exist, they are discoverable, and that they persist. You don't have the same interests as the nodes, but clearly you need to provide value to them because they are the endpoints of your links. But you also have users who don't need the nodes per se, they need the network.

If you buy this, then you need to think about how to grow the network. Are there network effects that you can leverage, in the same way CrossRef has with publishers submitting lists of literature cited linked to DOIs, or in social media where you give access to your list of contacts to build your social graph?

If the network is the goal, you don't just think "let's just stick HTTP URLs on everything and it will all be good". You can think like that if you are a node, because if the links die you can still persist (you'll still have people visiting your own web site). But if you are a network and the links die, you are in big trouble. So you develop ways to make the network robust. This is one reason why CrossRef uses an identifier based on indirection: it makes it easier to ensure the network persists in the face of change in how the nodes serve their data. What is often missed is that this also frees up the nodes, because they don't need to commit to serving a given URL in perpetuity; indirection shields them from this.
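The idea of indirection can be sketched in a few lines of Python. This is a toy illustration, not how CrossRef or the DOI system is implemented; the `Resolver` class, the identifier, and the example.org URLs are all made up.

```python
class Resolver:
    """Toy resolver: links store a stable identifier, never the node's URL."""

    def __init__(self):
        self._targets = {}

    def register(self, identifier, url):
        """Set (or update) the current location for an identifier."""
        self._targets[identifier] = url

    def resolve(self, identifier):
        """Return the current URL for an identifier, or None if unknown."""
        return self._targets.get(identifier)


resolver = Resolver()

# A node registers where it currently serves a record (hypothetical values).
resolver.register("specimen:0001", "http://old.example.org/specimens/1")

# Everyone else links to the identifier, not to the URL.
link = "specimen:0001"
before = resolver.resolve(link)

# The node reorganises its web site and updates a single entry;
# none of the links pointing at "specimen:0001" need to change.
resolver.register("specimen:0001", "http://new.example.org/collection/0001")
after = resolver.resolve(link)
```

The link itself is unchanged throughout, which is the point: the cost of a node changing how it serves data is one update to the resolver rather than a trail of broken links.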

In order to serve users of the network, you want to ensure you can satisfy their needs rapidly. This leads to things like caching links and basic data about the end points of those links (think how Google caches the contents of web pages so if the site is offline you may still find what you are looking for).

If your business depends on the network, then you need to think how you can create incentives for nodes to join. For example, what services can you offer them that make you invaluable to the nodes? Once you crack that, then all sorts of things can happen. Take structured markup as an example. Google is driving this on the web using schema.org. If you want to be properly indexed by Google, and have Google display your content in a rich form (e.g., thumbnails, review ratings, location, etc.) you need to mark up your page in a way Google understands. Given that some businesses live or die based on their Google ranking, there's a strong incentive for web sites to adopt this markup. There's a strong incentive for Google to encourage markup so that it can provide informative results for its users (otherwise they might rely on "social search" via Facebook and mobile apps). This is the kind of thing you want the network to aim for.

In summary, this is my take on where we are at in biodiversity informatics. The challenge is that the organisations in the room discussing this are typically all nodes, and I'd argue that by definition they aren't in a position to solve the problem. You need to pivot (ghastly word) and think about it from the perspective of the network. Imagine you were to form a company whose mission was to build that network. How would you do it, how would you convince the nodes to engage, what value would you offer them, what value would you offer users of the network? If we start thinking along those lines, then I think we can make progress.

June 7, 2014

I'm adding more charts to the GBIF Chart tool, including some to explore the type status of specimens from the Solomon Islands. There are nearly 500 holotypes from this region, so quite a few new species have been discovered there.

Inspired by the Benoît Fontaine et al. paper on the lag time between a species being discovered and subsequently described (see Species wait 21 years to be described - show me the data) I thought I would do a quick and dirty plot of the difference between the year a specimen was collected and the year the name of the taxon it belongs to was published (from the authorship string for the scientific name). Plotting the results was *cough* interesting:

In theory, the difference between the two dates should be negative (if you subtract publication year from collection year), and the smaller its magnitude, the shorter the wait for description. But I found some large positive numbers, implying that taxa had been described long before the types were discovered! Something is clearly wrong. What seems to be happening here is that GBIF has failed to match the species name for an occurrence, and so goes up the taxonomic hierarchy and just records the genus. For example, http://gbif.org/occurrence/472764211 was collected in 1965 and is the type of Pandanus guadalcanalius St.John. GBIF doesn't recognise this name, and so matches the occurrence to the genus Pandanus Linnaeus, 1782. Hence it looks like we've used a time machine to describe a taxon in 1782 based on a specimen from 1965.
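The calculation itself is simple enough to sketch. The following is an illustration of the idea rather than the code behind the chart: it assumes the publication year is the last four-digit year in the name string, and the function names are made up for this example.

```python
import re

# Four-digit years from 1700 to 2099; assumes the authorship year is the
# last such number in the scientific name string.
YEAR = re.compile(r"\b(1[7-9]\d{2}|20\d{2})\b")


def publication_year(scientific_name):
    """Extract the publication year from a name with authorship, if present."""
    years = YEAR.findall(scientific_name)
    return int(years[-1]) if years else None


def description_lag(collection_year, scientific_name):
    """Collection year minus publication year (negative is the expected case)."""
    published = publication_year(scientific_name)
    if published is None:
        return None
    return collection_year - published


def is_suspicious(lag):
    """A positive lag means the name predates the type specimen's collection."""
    return lag is not None and lag > 0


# The Pandanus example from the text: matched to the genus, not the species.
pandanus_lag = description_lag(1965, "Pandanus Linnaeus, 1782")
```

For the Pandanus record this gives a lag of 183 years in the wrong direction, which `is_suspicious` flags. A name with no year in its authorship string, such as "Pandanus guadalcanalius St.John", yields no lag at all and would need the publication year from another source.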

At the other end of the spectrum, there are a lot of specimens that seem to have waited over 200 years for description! It turns out these are mostly specimens from the MCZ that have their collection date recorded by GBIF as "1700-01-01". This seems an arbitrary date, and it turns out to be an artefact. The MCZ records "unknown" collection dates as the range 1700-01-01 - 2100-01-01 (see http://mczbase.mcz.harvard.edu/guid/MCZ:IZ:DIPL-4985). Unfortunately, when it generates the export for GBIF, these get truncated to 1700-01-01, and GBIF then (not unreasonably) treats that as the actual collection date. Somewhere in the middle of the plot of lag between collection and description is some interesting information, but it's a pity that most of this is obscured by some serious data errors.
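To make the lag calculation and the two artefacts above concrete, here is a minimal Python sketch. It is not the code behind the chart. The function names, the flag strings, and the way the year is pulled from the authorship string are all assumptions made for illustration, and the formatted name "Cypselurus crockeri Seale, 1935" is simply the name and year from this post written as an authorship string.

```python
import re

# Assumption: the placeholder date that ends up in GBIF for "unknown" MCZ dates.
SENTINEL_DATE = "1700-01-01"

# Any four-digit year from 1600 to 2099, e.g. the "1865" in "(Bleeker, 1865)".
YEAR_PATTERN = re.compile(r"\b(1[6-9]\d{2}|20\d{2})\b")


def publication_year(scientific_name):
    """Return the last four-digit year in the name string, or None."""
    years = YEAR_PATTERN.findall(scientific_name)
    return int(years[-1]) if years else None


def description_lag(scientific_name, event_date):
    """Return (lag, flag), where lag = collection year - publication year."""
    if not event_date or event_date.startswith(SENTINEL_DATE):
        return None, "unknown collection date"
    published = publication_year(scientific_name)
    if published is None:
        return None, "no year in authorship"
    lag = int(event_date[:4]) - published
    # A type specimen cannot be collected after its taxon was described,
    # so a positive lag points to a name-matching or modelling problem.
    flag = "suspect: described before collected" if lag > 0 else "ok"
    return lag, flag


if __name__ == "__main__":
    print(description_lag("Pandanus Linnaeus, 1782", "1965-01-01"))
    print(description_lag("Cypselurus crockeri Seale, 1935", "1933"))
    print(description_lag("Some name Author, 1900", "1700-01-01"))
```

Flagging rather than dropping the suspect records keeps them available for inspection, which is where the interesting data errors turn up.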

For me the bigger lesson here is the power of visualisation to explore the data and to expose errors. This is why I was underwhelmed by the new charts GBIF is releasing. Plots of ever upward trends are ultimately not very useful. They don't give much insight into the data, nor do they help tackle interesting questions. I think we need a much richer set of visualisations to really understand the strengths and limitations of the data in GBIF.

Investigating further, there are some other reasons for the "back to the future" types. For example, http://www.gbif.org/occurrence/188826624 (CAS 5506 from FishBase) was collected in 1933 and is recorded as a holotype, with the scientific name Cypselurus opisthopus (Bleeker, 1865). 1933 - 1865 = 68, so the taxon was named 68 years before it was collected(!).

A bit of investigation using BioNames, BioStor, and GBIF (http://www.gbif.org/occurrence/473244692, another record for CAS 5506) reveals that CAS 5506 is the holotype for Cypselurus crockeri, shown below in a plate from its original description (published in 1935):
Seale A (1935) The Templeton Crocker Expedition to western Polynesian and Melanesian islands, 1933. No. 27. Fishes. Proceedings of the California Academy of Sciences 21: 337–378. http://biostor.org/reference/59326
So, in fact this species was described shortly after its collection, with a lag of 1933 - 1935 = -2 years.

Apart from the duplication issue (FishBase has replicated some of the CAS dataset, sigh), the other problem is one of modelling the data. The CAS record has the original taxon name for which CAS 5506 is the type (Cypselurus crockeri), the FishBase record has the currently accepted name for the taxon (Cypselurus opisthopus). These two different approaches have very different implications for the charts I'm making, and simply reinforce my feeling that the GBIF data is both fascinating and full of "gotchas!".

June 6, 2014

Note to self on citation matching.

Looking for this paper "Fishes of the Marshall and Marianas islands. Vol. I. Families from Asymmetrontidae through Siganidae" I Googled it, adding "biostor" as a search term to see if I'd already added it to BioStor. The Google search:


found several hits in BioStor:

What is interesting is that these hits are to full text of references that cite the article I'm after, not the article itself. I'm sure many have had this experience, where you are searching for an obscure article and you keep finding papers that cite it, rather than the actual paper you're after. But this suggests another strategy for building the citation graph for an article. If you have a decent corpus of full text articles, search for the article (using, say title, journal, pagination) in the text of those articles and store the hits. Those are the references that cite the article (OK, not all, but some of them). This may be a more attractive way of building the citation graph, rather than parsing citations in articles and trying to locate them. Indeed, it could be extended to help marking up those citations. Imagine grabbing blocks of text from near the end of an article, searching for those in a database of citations, using close matches to flag the corresponding block as a citation.
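As a rough illustration of this strategy, the sketch below searches a small full-text corpus for a normalised article title and returns the articles that mention it. Everything here is an assumption made for the example: the corpus is just a dictionary mapping hypothetical article ids to text, and matching is a plain substring test on normalised text, which is much cruder than a real search engine and will not cope with OCR errors.

```python
import re


def normalise(text):
    """Lower-case and reduce punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_citing_articles(title, corpus):
    """Return ids of articles whose full text mentions the given title.

    corpus maps an article id to its full text (a hypothetical structure).
    """
    needle = f" {normalise(title)} "
    # Padding with spaces stops the title matching in the middle of a word.
    return [article_id for article_id, text in corpus.items()
            if needle in f" {normalise(text)} "]


if __name__ == "__main__":
    corpus = {
        "article-1": "... see Fishes of the Marshall and Marianas Islands, Vol. I ...",
        "article-2": "A paper about something else entirely.",
    }
    print(find_citing_articles("Fishes of the Marshall and Marianas islands", corpus))
```

The target article itself will also match if it is in the corpus, so in practice its own id would need to be excluded from the hits.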

Need to think about this a little more...


@CameronNeylon @rdmpage Your take reminds me of http://t.co/lo3n1q4XeD I attended ICDMW 2011 where he had this paper.

— Tuija Sonkkila (@ttso) June 8, 2014

The paper is:

Polepeddi, L., Agrawal, A., & Choudhary, A. (n.d.). Poll: A Citation Text Based System for Identifying High-Impact Contributions of an Article. 2011 IEEE 11th International Conference on Data Mining Workshops. IEEE. doi:10.1109/icdmw.2011.136

Following on from the previous post on visualising GBIF data, I've added some more interactivity. If you click on a pane in the treemap widget you get a list of the corresponding taxa, together with an image from EOL (if one exists). It's a fun way to quickly see what sort of species are present (in this case in the Solomon Islands). You can try it at http://bionames.org/~rpage/gbif-stats/.

Pro tip
It's not obvious from the site, but to go back up the taxonomic hierarchy in the treemap, right click (ctrl-click on a Mac) on the grey bar corresponding to the higher taxon.

June 4, 2014

Tim Robertson and the team at GBIF are working on some nice visualisations of GBIF data, and have made an early release available for viewing: http://analytics.gbif-uat.org. For a given country, say, the Solomon Islands, you can see numerous plots, mostly like this:

Ever the critic, as much as I like this (and appreciate the scale of the task underlying doing analytics on data at the scale of GBIF), what I would really like to see is something that more closely resembles Google Analytics. I want graphs that I can use to get some insight into the data, and which lead me to ask questions (and make it easy for me to discover the answers).

So, I put together a crude, live demo of the sort of thing I'd like to see. You can see it at http://bionames.org/~rpage/gbif-stats (can't promise that this link will be long-lived), and below is a screen shot:

What I've done is fetch all the occurrence records for the Solomon Islands from GBIF (using the API), dumped that into CouchDB, and generated some simple queries. I display the results using Google Charts. There are some similarities with the tools developed by Javier Otegui, Arturo Ariño, and colleagues.
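For anyone wanting to try the first step, here is a hedged Python sketch of paging through the GBIF occurrence search API for one country. It is not the code behind the demo. It assumes the current v1 endpoint and its country, limit, and offset parameters, and the function names are made up for the example. GBIF caps both the page size and how deep you can page, so check the current API documentation before relying on this for a large country. Loading the records into CouchDB is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the v1 occurrence search endpoint of the GBIF API.
API = "https://api.gbif.org/v1/occurrence/search"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def occurrences(country, page_size=300, fetch=fetch_json):
    """Yield occurrence records for an ISO country code, page by page."""
    offset = 0
    while True:
        query = urlencode({"country": country, "limit": page_size, "offset": offset})
        page = fetch(f"{API}?{query}")
        yield from page["results"]
        # Stop when the API says there is nothing more to fetch.
        if page.get("endOfRecords", True):
            break
        offset += page_size


if __name__ == "__main__":
    # "SB" is the ISO 3166 code for the Solomon Islands; print the first record.
    for record in occurrences("SB"):
        print(record.get("scientificName"), record.get("year"))
        break
```

Passing the fetch function in as a parameter means the paging logic can be exercised with canned responses, without hitting the live API.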

Otegui, J., Ariño, A. H., Encinas, M. A., & Pando, F. (2013, January 25). Assessing the Primary Data Hosted by the Spanish Node of the Global Biodiversity Information Facility (GBIF). (G. P. S. Raghava, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0055144

Otegui, J., & Arino, A. H. (2012, August 15). BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics. Oxford University Press (OUP). doi:10.1093/bioinformatics/bts359

For fun I've also added a map of the GBIF occurrences (also served from CouchDB).

Here's a quick guide to some of the charts. Below you can see (left) a plot of species accumulation over time, that is, the total number of species that have been collected up to that time. If we had collected all the species we'd expect this to asymptote (flatten out). If it keeps going up, then we still need to do some sampling. On the right is the number of occurrences recorded for each year. You can see that collecting is highly episodic.
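The two charts boil down to two simple counts. The sketch below is only an illustration, not the CouchDB queries used in the demo. It assumes each occurrence has already been reduced to a (year, species) pair, and the species names in the example are placeholders.

```python
from collections import Counter


def accumulation(records):
    """Return (occurrences per year, cumulative distinct species per year).

    records is an iterable of (year, species) pairs.
    """
    per_year = Counter()
    first_seen = {}
    for year, species in records:
        per_year[year] += 1
        # Remember the earliest year in which each species was collected.
        if species not in first_seen or year < first_seen[species]:
            first_seen[species] = year
    new_per_year = Counter(first_seen.values())
    cumulative, total = {}, 0
    for year in sorted(per_year):
        total += new_per_year[year]
        cumulative[year] = total
    return dict(per_year), cumulative


if __name__ == "__main__":
    records = [(1944, "A"), (1944, "B"), (1965, "A"), (1965, "C"), (1933, "B")]
    per_year, cumulative = accumulation(records)
    print(per_year)
    print(cumulative)
```

A species only adds to the cumulative curve in the first year it appears, which is why the curve flattens once repeat collecting stops turning up anything new.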

To get a little more information on this, I've generated a crude chart where the rows are institutions (e.g., museums and herbaria) that have specimens, and the number of occurrences collected each decade is represented by the shaded boxes (the rightmost box is the current decade; if you hover over a box you will see a popup with the decade). To the right is the total number of occurrences.

From this we can see that there have been some major collections at various times (e.g., Kew in the 1960's, the Australian Museum in the 1970's to 1990's). Strangely, the MCZ has lots of specimens from the 1700's; I suspect we have a data quality issue here. There are certainly some issues with dates in this data set, with about a quarter of occurrences having no date:

Note that the data for the Solomon Islands comes from all around the world, mostly from the US. There is a big spike in the date of collection curve in 1944, suggesting a lot of material may be the result of collecting by US servicemen in WW2.

I use a treemap to display the taxonomic distribution of the records, and a donut chart to summarise the taxonomic level to which the occurrences are identified:

The treemap is dominated by vertebrates, which I suspect is a poor reflection of the actual taxonomic composition of the Solomon Islands biota. Over 3/4 of the occurrences are identified to species level, which is encouraging, but there's clearly a lot of material that needs some taxonomic work.

Where next
This has been made in a rush, and there is a lot which could be done. For example, some of the charts would be more useful if you could drill down and explore further. This could be done via the GBIF API or portal (for example, by constructing a URL that shows the portal results for the Solomon Islands for a given year of collection).

There are, of course, issues of scalability. I've made this for the 83,364 occurrences currently in the GBIF portal for the Solomon Islands. There would need to be some thought given to how this could be scaled to larger data sets. But I think this is worth pursuing so that we can get further insights into the remarkable database that GBIF is building.

June 2, 2014

It is almost a year to the day since I released BioNames, a database of "taxa, texts, and trees". This project was my entry in EOL's Computable Data Challenge. Since it went live (after much late night programming by myself and Ryan Schenk) I've been tweaking the interface, cleaning (so much cleaning), and adding data (mostly DOIs, links to BioStor, and PDFs). I also wrote a paper describing the project, published in PeerJ (http://dx.doi.org/10.7717/peerj.190).

Why BioNames?
I'm building BioNames to scratch a very specific itch. To me it is a source of enormous frustration that one of the most basic questions we can ask about a name (where was it first published?) is difficult to answer using current taxonomic databases. And if there is an answer, it is usually given as a text string describing the publication (i.e., a literature citation) rather than an identifier such as a DOI that enables me to (a) go to the publication, (b) refer to the publication in a database in an unambiguous way, and (c) discover further information about that publication by querying services that recognise that identifier.

There are enormous digitisation efforts underway by commercial publishers, digital archives, and libraries, and all of this is putting more and more literature online. This is the primary evidence base for taxonomy: it is where new names are published, taxa are described, and hypotheses of synonymy and relationship are proposed, and we should be actively linking to it. Of course, there are some projects that do this, but these are typically restricted in taxonomic or geographic scope. I want all this information together in one place. Hence, BioNames.

Of course, I could wait until projects like ZooBank have all the animal names, but as I pointed out in Why the ICZN is in trouble, the ICZN and ZooBank have only a tiny fraction of the published names:

This renders ZooBank barely usable for my purposes. There are millions of animal names in circulation, and our inability to discover much about them leads to all sorts of headaches, such as the errors in GBIF that I've mentioned earlier on this blog. I want a tool that can help me interpret those errors, and I want it now, hence BioNames.

What is in BioNames?
The original data comes from the LSID metadata served by ION. At the moment BioNames has 4,880,925 names, 1,549,152 of which are linked to a bibliographic citation. The bulk of the time I spend on BioNames consists of cleaning and clustering these citations, and linking them to digital identifiers.

To get some insight into what is left to be done I created a CSV dump of the publication data underlying BioNames, and loaded it into Google's Cloud Storage (http://storage.googleapis.com/ion-names/names3.csv). I then used Google's BigQuery to write some simple SQL queries. You can find more details here: https://github.com/rdmpage/bionames-bigquery.

Here is a summary table of the number of names that are published in an article with one of the identifiers that I track. These include DOIs, PMIDs, as well as whether the article is in BioStor, has a URL (typically to a publisher's web site), or a PDF.
Identifier    Number of names
DOI                   196,915
BioStor               130,792
JSTOR                  23,483
CiNii                  11,296
PMID                    8,886
URL                    72,754
PDF                   161,474
(any)                 489,029

The final row is the number of names published in an article that has at least one identifier (some articles have multiple identifiers, such as a DOI and a link to BioStor). Given that there are approximately 1.5 million names with bibliographic citations, and around 490,000 have an identifier, the user has a roughly 30% chance of finding the original description for an animal name picked at random. Obviously, BioNames has gaps (ION has missed a number of names and/or publications), the taxonomic coverage of bibliographic identifiers is uneven (depending on the publications chosen by taxonomists to publish in, and the level of digitisation of those publications), and there is still a lot of data cleaning to do. But an almost 1 in 3 chance of finding something useful for a name seems a reasonable level of progress.
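For anyone who wants to check the arithmetic, here is a small Python sketch that turns the counts in the table into percentages of the 1,549,152 names that have a bibliographic citation. The numbers are the ones quoted in this post; the variable and function names are just for illustration.

```python
# Names linked to a bibliographic citation (figure quoted above).
NAMES_WITH_CITATION = 1_549_152

# Number of names per identifier type, copied from the summary table.
NAMES_BY_IDENTIFIER = {
    "DOI": 196_915,
    "BioStor": 130_792,
    "JSTOR": 23_483,
    "CiNii": 11_296,
    "PMID": 8_886,
    "URL": 72_754,
    "PDF": 161_474,
    "(any)": 489_029,
}


def coverage(count, total=NAMES_WITH_CITATION):
    """Return count as a percentage of total."""
    return 100 * count / total


if __name__ == "__main__":
    for identifier, count in NAMES_BY_IDENTIFIER.items():
        print(f"{identifier:<8} {count:>8,} {coverage(count):5.1f}%")
```

The "(any)" row works out at a little under 32%, which is where the "almost 1 in 3" figure comes from.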

Out of interest I created some quick and dirty charts in Excel for different categories of identifier. Here, for example, is the percentage of names published each year that are linked to a publication with a DOI:

Over 80% of names published in 2013 were in an article with a DOI, so we are fast heading to a situation where modern zoological taxonomy is fully part of the citation graph of science. Much of this spike in 2013 is due to the adoption of DOIs by Zootaxa, which is far and away the dominant journal in animal taxonomy.

Here is the same chart for publications in BioStor.

The big spike at the start is for names where the year of publication is missing. Leaving that aside, we can see the impact of the 1923 copyright cut-off in the US, which puts a big dent in the Biodiversity Heritage Library's digitisation efforts. Note, however, that BHL has a lot of post-1923 content.

Does anyone use BioNames?

I use BioNames almost every day, and have devoted way more time than is healthy to populating it. As I explore issues like the quality of the taxonomy in GBIF, I find it useful to see the original description of a taxon, and its fate in subsequent revisions. In the early days I'd spend more time adding missing papers to help answer a question, but increasingly I'm finding that the content is already there. So, I find it useful, but what (gulp) if I'm the only one?

Below is the number of "sessions" per day since BioNames was launched (data from Google Analytics for May 1st, 2013 to May 31st, 2014). After an initial flurry of interest, web traffic pretty quickly died off. Since then it's been slowly gaining more visitors, and then (for reasons which escape me) it started getting a lot more traffic from April onwards:

To give these numbers some context, for the same period BioStor (my archive of articles from BHL) had the following traffic:

Note the different scales: BioStor is getting around 500 sessions a day during week days, while BioNames gets around 200. By way of comparison, GBIF gets up to 4000 sessions a day, and this blog typically has 50-100 sessions per day.

Where next?
There are a couple of directions for the future. There is still a lot of data cleaning and linking to do. Last year I did a quick analysis of which taxonomic journals should be digitised next. I've updated this by creating a spreadsheet that ranks the journals in BioNames by the number of names each has published, and each is coloured by the fraction of those names for which I've found a digital identifier for the paper in which they are published. This table is incomplete, and reflects not only the extent of digitisation, but also the extent to which I've managed to locate the journals online. But it is a starting point for thinking about which journals to prioritise for digitisation, or, if they are already digitised, which journals I need to target for addition to BioNames. The spreadsheet is available as a Google sheet.

Another direction is data mining. In addition to the obvious task, namely locating and indexing taxonomic names, there are other things to be done. In BioStor I extract geographic point localities and specimen codes from the OCR text. These could be indexed to enable geographic or specimen-based searching. The same approach could be generalised to the literature in BioNames, so that we could track the mentions of a particular specimen, or retrieve lists of publications about a specific locality (e.g., all taxonomic papers that refer to a particular mountain range, deep sea vent, or island).

BioNames also does some limited analysis of taxonomic name co-occurrence, for example suggesting that species names with the same specific epithet but different generic names are possible synonyms if they occur on the same page. There is a lot of scope for expanding this. I'm also keen to explore citation indexing, that is, extracting lists of literature cited from articles in BioNames, and linking those to the corresponding record in BioNames. Ultimately I want to be able to navigate through the taxonomic literature along these citation links, so that we can trace the fate of names through time.
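The co-occurrence heuristic can be sketched in a few lines of Python. This is an illustration of the idea rather than the code BioNames runs. It assumes the names found on each page are already available as plain binomials, and the names and page ids in the example are placeholders.

```python
from collections import defaultdict
from itertools import combinations


def candidate_synonyms(names_by_page):
    """Yield (page, name_a, name_b) for binomials on the same page that
    share a specific epithet but have different generic names.
    """
    for page, names in names_by_page.items():
        by_epithet = defaultdict(set)
        for name in names:
            parts = name.split()
            if len(parts) == 2:  # binomials only
                by_epithet[parts[1].lower()].add(name)
        for group in by_epithet.values():
            for a, b in combinations(sorted(group), 2):
                if a.split()[0] != b.split()[0]:
                    yield page, a, b


if __name__ == "__main__":
    pages = {
        "page-1": ["Aus bus", "Cus bus", "Aus dus"],
        "page-2": ["Aus bus"],
    }
    for hit in candidate_synonyms(pages):
        print(hit)
```

Exact matching on the epithet will miss pairs where the ending changes with the gender of the genus, so a more forgiving comparison would be needed in practice.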

But this is still only a start. Papers such as Seltmann et al. illustrate other things that are possible once we have a large corpus of taxonomic literature available:

Seltmann, K. C., Pénzes, Z., Yoder, M. J., Bertone, M. A., & Deans, A. R. (2013, February 18). Utilizing Descriptive Statements from the Biodiversity Heritage Library to Expand the Hymenoptera Anatomy Ontology. (C. S. Moreau, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0055674

So, a lot still to be done. I hope to have achieved some of this if and when I write a follow up post on the status of BioNames in a year's time.