Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed.ISSN 2051-8188 View this blog in Magazine View.


XML feed

Last update

58 min 27 sec ago

February 20, 2015


Quick notes on another example of data duplication in GBIF. I'm in the process of building a tool to map specimen codes to GBIF records, and came across the following example. Consider the specimen code "AM M.22320", which is the voucher for the sequence KJ532444 (GenBank gives the voucher as M22320, but the associated paper doi:10.1016/j.ympev.2014.03.009 clarifies that this specimen comes from the Australian Museum). Locating this specimen in GBIF I found a series of records that were identical except for the catalogNumbers, which looked like this: M.22320.001, M.22320.002, etc. What gives?

Initially I thought this may be a simple case of data duplication (maybe the suffixes represent different versions of the same record?). Then I managed to locate the records on the Australian Museum web site:.

  • M.22320.009 - Wet Preparation - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.008 - Skull Preparation - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.001 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.005 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.006 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.007 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.003 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.004 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.002 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.010 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990

Turns out we have 10 records for "M.22320", which include various preparations and tissue samples. The holotype specimen for Pteralopex taki (originally described in doi:10.1071/AM01145, see BioNames) has generated 10 different records., all of which have ended up in GBIF. Anyone using GBIF occurrence data and interpreting the number of occurrence records as a measure of how abundant an organism is at a given locality is clearly going to be misled by data like this.

One way to tackle this problem would be if GBIF (or the data provider) could cluster the records that represent the "same" specimen, so GBIF doesn't end up duplicating the same information (in this case, 10-fold). The Australian Museum records don't seem to specify a direct link between the 10 records. I then located the same records in OZCAM, the data provider that feeds GBIF. Here is the OZCAM record for "M.22320.001": http://ozcam.ala.org.au/occurrence/223d1549-1322-419e-8af4-649a4b145064. OZCAM doesn't have the information on whether the record is a skull, a wet preparation, or a tissue sample, that information has been lost, and hence doesn't make it as far as GBIF.

Note that OZCAM has resolvable identifiers for each specimen in the form of UUIDs that are appended to "http://ozcam.ala.org.au/occurrence/". The corresponding UUIDs are included in the Darwin Core dump that OZCAM makes available to GBIF. Here they are for the parts of M.22320:


But when GBIF parses the dump it ignores these UUIDs, which means the GBIF user can't easily go to the OZCAM site (which has a bunch of other useful information, compare http://ozcam.ala.org.au/occurrence/223d1549-1322-419e-8af4-649a4b145064 with http://www.gbif.org/occurrence/774916561/verbatim ). It also means that GBIF has stripped out an identifier that we might make use of to unambiguously refer to each record (and, presumably, this UUID doesn't change between harvests of OZCAM data).

In summary, this is a bit of a mess: we have multiple records that are really just bits of the same specimen but which are not linked together by any data provider, and as the data is transmitted up the chain to GBIF clues as to what is going on are stripped out. For a user like me who is trying to link the GenBank sequence to its voucher this is frustrating, and ultimately all rather avoidable if we took just a little more care in how we represent data about specimens, and how we treat that data as it gets transmitted between data bases.

February 19, 2015


PubPeer is a web site where people can discuss published articles, anonymously if they prefer. I finally got a chance to play with it a few days, it it was a fascinating experience. You simply type in the DOI or PMID for an article and see if anyone has said anything about that article. It also automatically pulls comments from PubMed Commons, for example the article Putting GenBank data on the map has a comment that was originally published as a guest post on this blog. PubPeer knows about this blog post via Altmetric, which is another nice feature. PubPeer also has browser extensions which, if you install one, automatically flags DOIs on web pages that have comments on PubPeer. Also nice.

So, I took PubPeer for a spin. While browsing GenBank and GBIF, as you do, I came across the following paper: "Conservation genetics of Australasian sailfin lizards: Flagship species threatened by coastal development and insufficient protected area coverage" doi:10.1016/j.biocon.2013.10.014. Some of the sequences from this paper, such as KF874877 are flagged as "UNVERIFIED". Puzzled by this, I raised the issue on PubPeer (see https://pubpeer.com/publications/D1090D7AF8178B1A10C4C45AC1006E ). A little further digging led to the suggestion that they were numts. After raising the issue on Twitter, one of the authors (Cameron Siler) got in touch and reported that there had been an accidental deletion of a single nucleotide in an alignment. Cameron is updating the Dryad data (http://dx.doi.org/10.5061/dryad.1fs7c ) and GenBank sequences.

I like the idea that there is a place we can go to discuss the contents of a paper. It's not controlled by the journal, and you can either identify yourself or remain anonymous if you prefer. Not everyone is a fan of this mode of commentary, especially it is possible for people to make all sorts of accusations while remaining anonymous. But it's a fascinating project, and well worth spending some time browsing around (what IS it with physicists?). For anyone interested in annotating data, it's also a nice example of one possible approach.

January 28, 2015


Below I sketch what I believe is a straightforward way GBIF could tackle the issue of annotating and cleaning its data. It continues a series of posts Annotating GBIF: some thoughts, Rethinking annotating biodiversity data, and More on annotating biodiversity data: beyond sticky notes and wikis on this topic.

Let's simplify things a little and state that GBIF at present is essentially an aggregation of Darwin Core Archive files. These are for the most part simply CSV tables (spreadsheets) with some associated administrivia (AKA metadata). GBIF consumes Darwin Core Archives, does some post-processing to clean things up a little, then indexes the contents on key fields such as catalogue number, taxon name, and geographic coordinates.

What I'm proposing is that we make use of this infrastructure, in that any annotation is itself a Darwin Core Archive file that GBIF ingests. I envisage three typical use cases:

  1. A user downloads some GBIF data, cleans it for their purposes (e.g., by updating taxonomic names, adding some georeferencing, etc.) then uploads the edited data to GBIF as a Darwin Core Archive. This edited file gets a DOI (unless the user has go one already, say by storing the data in a digital archive like Zenodo).
  2. A user takes some GBIF data and enhances it by adding links to, for example, sequences in GenBank for which the GBIF occurrences are voucher specimens, or references which cite those occurrences. The enhanced data set is uploaded to GBIF as a Darwin Core Archive and, as above, gets a DOI.
  3. A user edits an individual GBIf record, say using an interface like this. The result is stored as a Darwin Core Archive with a single row (corresponding to the edit occurrence), and gets a DOI (this is a nanopublication, of which more later)

Note that I'm ignoring the other type of annotation, which is to simply say "there is a problem with this record". This annotation doesn't add data, but instead flags an issue. GBIF has a mechanism for doing this already, albeit one that is deeply unsatisfactory and isn't integrated with the portal (you can't tell whether anyone has raised an issue for a record).

Note also that at this stage we've done nothing that GBIF doesn't already do, or isn't about to do (e.g., minting DOIs for datasets). Now, there is one inevitable consequence of this approach, namely that we will have more than one record for the same occurrence, the original one in GBIF, and the edited record. But, we are in this situation already. GBIF has duplicate records, lots of them.


As an example, consider the following two occurrences for Psilogramma menephron:

occurrencetaxonlongitudelatitudecatalogue numbersequence 887386322Psilogramma menephron Cramer, 1780145.86301-17.44BC ZSM Lep 01337 1009633027Psilogramma menephron Cramer, 1780145.86-17.44KJ168695KJ168695

These two occurrences come from the Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data and Geographically tagged INSDC sequences data sets, respectively. They are for the same occurrence (you can verify this by looking at the metadata data for the sequence KJ168695 where the specimen_voucher field is "BC ZSM Lep 01337").

What do we do about this? One approach would be to group all such occurrences into clusters that represent the same thing. We are then in a position to do some interesting things, such as compare different estimates of the same values. In the example above, there is clearly a difference in precision of geographic locality between the two datasets. There are some nice techniques available for synthesising multiple estimates of the same value (e.g., Bayesian belief networks), so we could provide for each cluster a summary of the possible values for each field. We can also use these methods to build up a picture of the reliability of different sources of annotation.

In a sense, we can regard one record (1009633027) as adding an annotation to the other (887386322), namely adding the DNA sequence KJ168695 (in Darwin Core parlance, "associatedSequences=[KJ168695]").

But the key point here is that GBIF will have to at some point address the issue of massive duplication of data, and in doing so it will create an opportunity to solve the annotation problem as well.

Github and DOIsIn terms of practicalities, it's worth noting that we could use github to manage editing GBIF data, as I've explored in GBIF and Github: fixing broken Darwin Core Archives. Although github might not be ideal (there some very cool alternatives being developed, such as dat, see also interview with Max Ogden) it has the nice feature that you can publish a release and get a DOI via its integration with Zenodo. So people can work on datasets and create citable identifiers at the same time.

NanopublicationsIf we consider that a Darwin Core Archive is basically a set of rows of data, then the minimal unit is a single row (corresponding to a single occurrence). This is the level at which some users will operate. They will see an error in GBIF and be able to edit the record (e.g., by adding georeferencing, an identification, etc.). One challenge is how to create incentives for doing this. One approach is to think in terms of nanopublications, which are: A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.A nanopublication comprises three elements:
  1. The assertion: In this context the Darwin Core record would be the assertion. It might be a minimal record in that, say, it only listed the fields relevant to the annotation.
  2. The provenance: the evidence for the assertion. This might be the DOI of a publication that supports the annotation.
  3. The publication information: metadata for the nanopublication, including a way to cite the nanopublication (such as a DOI), and information on the author of the nanopublication. For example, the ORCID of the person annotating the GBIF record.

As an example, consider GBIF occurrence 668534424 for specimen FMNH 235034, which according to GBIF is a specimen of Rhacophorus reinwardtii. In a recent paper

Matsui, M., Shimada, T., & Sudin, A. (2013, August). A New Gliding Frog of the Genus Rhacophorus from Borneo . Current Herpetology. Herpetological Society of Japan. doi:10.5358/hsj.32.112Matsui et al. assert that FMNH 235034 is actually Rhacophorus borneensis based on a phylogenetic analysis of a sequence (GQ204713) derived from that specimen. In which case, we could have something like this:

The nanopublication standard is evolving, and has a lot of RDF baggage that we'd need to simplify to make fit the Darwin Core model of a flat row of data, but you could imagine having a nanopublication which is a Darwin Core Archive that includes the provenance and publication information, and gets a citable identifier so that the person who created the nanopublication (in the example above I am the author of the nanopublication) can get credit for the work involved in creating the annotation. Using citable DOIs and ORCIDs to identify the nanpublication and its author embeds the nanopublication in the wider citation graph.

Note that nanopublications are not really any different from larger datasets, indeed we can think of a dataset of, say, 1000 rows as simply an aggregation of nanopublications. However, one difference is that I think GBIF would have to setup the infrastructure to manage the creation of nanopublications (which is basically collect user's input, add user id, save and mint DOI). Whereas users working with large datasets may well be happy to work with those on, say github or some other data editing environment, people willing to edit single records are unlikely to want to mess with that complexity.

What about the original providers?Under this model, the original data provider's contribution to GBIF isn't touched. If a user adds an annotation that amounts to adding a copy of the record, with some differences (corresponding to the user's edits). Now, the data provider may chose to accept those edits, in which case they can edit their own database using whatever system they have in place, and then the next time GBIF re-harvests the data, the original record in GBIF gets updated with the new data (this assumes that data providers have stable ids for their records). Under this approach we free ourselves from thinking about complicated messaging protocols between providers and aggregators, and we also free ourselves from having to wait until an edit is "approved" by a provider. Any annotation is available instantly.

SummaryMy goal here is to sketch out what I think is a straightforward way to tackle annotation that makes use of what GBIF is already doing (aggregating Darwin Core Archives) or will have to do real soon now (cluster duplicates). The annotated and cleaned data can, of course, live anywhere (and I'm suggesting that it could live on github and be archived on Zenodo), so people who clean and edit data are not simply doing it for the good of GBIF, they are creating data sets that can be used independently and be cited independently. Likewise, even if somebody goes to the trouble of fixing a single record in GBIF, they get a citable unit of work that will be linked to their academic profile (via ORCD).

Another aspect of this approach is that we don't actually need to wait for GBIF to do this. If we adopt Darwin Core Archive as the format for annotations, we can create annotations, mint DOIs, and build our own database of annotated data, with a view to being able to move that work to GBIF if and when GBIF is ready.

January 22, 2015


For the last few weeks I've been working on a little project to display phylogenies on web-based maps such as OpenStreetMap and Google Maps. Below I'll sketch out the rationale, but if you're in a hurry you can see a live demo here: http://iphylo.org/~rpage/geojson-phylogeny-demo/, and some examples below.

The first is the well-known example of Banza katydids from doi:10.1016/j.ympev.2006.04.006, which I used in 2007 when playing with Google Earth.

The second example shows DNA barcodes similar to ABFG379-10 for Proechimys guyannensis and its relatives.


People have been putting phylogenies on computer-based maps for a while, but in most cases these have required stand-alone software, such as Google Earth, or GeoJSON for encoding geographic information. Despite the obvious appeal of placing trees in maps, and calls for large-scale geophylogeny databases (e.g., do:10.1093/sysbio/syq043), computerised drawing trees on maps has remained a bit of a niche activity. I think there are several reasons for this:

  1. Drawing trees on maps needs both a tree and geographic localities for the nodes in the tree. The later are not always readily available, or may be in different databases to the source of phylogenetic data.
  2. There's no accepted standard for encoding geographic information associated with the leaves in a tree, so everyone pretty much invents their own format.
  3. To draw the tree we typically need standalone software. This means users have to download software, instead of work on the web (which is where all the data is).
  4. Geographic formats such as KML (used by Google Earth) are not particularly easy to store and index in databases.

So there are a number of obstacles to making this easy. The increasing availability of geotagged sequences in GenBank (see Guest post: response to "Putting GenBank Data on the Map"), especially DNA barcodes, helps. For the demo I created a simple pipeline to take a DNA barcode, query BOLD for similar sequences, retrieve those, align them, build a neighbour joining tree, annotate the tree with latitude and longitudes, and encode that information in a NEXUS file.

To layout the tree on a map (say OpenStreetMap using Leaflet or Google Maps) I convert the NEXUS file to GeoJSON. There are a couple of problems to solve when doing this.Typically when drawing a phylogeny we compute x and y coordinates for a device such as a computer screen or printer where these coordinates have equal units and are linear in both horizontal and vertical dimensions. In web maps coordinates are expressed in terms of latitude and longitude, and in the widely-used Web Mercator projection the vertical axis (latitude) is non-linear. Furthermore, on a web map the user can zoom in and out, so pixel-based coordinates only make sense with respect to a given zoom level.

To tackle this I compute the layout of the tree in pixels at zoom level 0, when the web map comprises a single "tile".

The tile coordinates are then converted to latitude and longitude, so that they can be placed on the map. The map applications take care of zooming in and out, so the tree scales appropriately. The actual sampling localities are simply markers on the map. Another problem is to reduce the visual clutter that results from criss-crossing lines connecting connecting the tips of the tree and the associated sampling localities. To make the diagram more comprehensible, I adopt the approach used by GenGIS to reorder the nodes in the tree to minimise the crossings (see algorithm in doi:10.7155/jgaa.00088). The tree and the lines connecting it to the localities are encoded as "LineString" objects in the GeoJSON file.

There are a couple of things which could be done with this kind of tool. The first is to add it as a visualisation to a set of phylogenies or occurrence data. For example, imagine my "million barcode map" having the ability to display a geophylogeny for any barcode you click on.

Another use would be to create a geographically indexed database of phylogenies. There are databases such as CouchDB that store JSON as a native format, and it would be fairly straightforward to consume GeoJSON for a geophylogeny, ignore the bits that draw the tree on the map, and index the localities. We could then search for trees in a given region, and render them on a map.

There's still some work to do (I need to make the orientation of the tree optional and there are some edges case that need to be handled), but it's starting to reach the point when it's fun just to explore some examples, such as these microendemic Agnotecous crickets in New Caledonia (data from doi:10.1371/journal.pone.0048047 and GBIF).

January 20, 2015


A couple of articles in the tech press got me thinking this morning about Bitcoin, Ted Nelson, Xanadu, and the web that wasn't. The articles are After The Social Web, Here Comes The Trust Web and Transforming the web into a HTTPA 'database'. There are some really interesting ideas being explored based on centralised tracking of resources (including money, think Bitcoin, and other assets, think content). I wonder whether these developments may make lead to renewed interest in some of the ideas of Ted Nelson.

I've always had a soft-spot for Ted Nelson and his Xanadu project (see my earlier posts on translcusion and Nature's ENCODE app). To get a sense of what he was after, we can compare the web we have with what Nelson envisaged.

The web we have today:

  1. Links are one-way, in that it's easy to link to a site (just use the URL), but it's hard for the target site to find out who links to it. Put another way, like writing a scientific paper, it's easy to cite another paper, but non-trivial to find out who is citing your own work.
  2. Links to another document are simply launching pads to go to that other document, whether it's still there or not.
  3. Content is typically either "free" (paid for by advertising or in exchange for personal data), or behind a paywall and hence expensive.

Nelson wanted:

  1. Links that were bidirectional (so not only did you get cited, but you knew who was citing you)
  2. "Transclusion", where documents would not simply link (=cite) to other documents but would include snippets of those documents. If you cited another document, say to support a claim, you would actually include the relevant fragment of that document in your own document.
  3. A micropayment system so that if your work was "transcluded" you could get paid for that content.

The web we have is in many ways much easier to build, so Nelson's vision lost out. One-way links are easy to create (just paste in a URL), and the 404 error (that you get when a web page is missing) makes it robust to failure. If a page vanishes, things don't collapse, you just backtrack and go somewhere else.

Nelson had a more tightly linked web. He wanted to keep track of who links to whom automatically. Doing this today is the preserve of big operations such as Google (who count links to rank search results) or the Web of Science (who count citations to rank articles and journals - note that I'm using the web and the citation network pretty much interchangeably in this post). Because citation tracking isn't built into the web, you need create this feature, and that costs money (and hence nobody provides access to citation data for free).

In the current web, stuff (content) is either given away for "free" (or simply copied and pasted as if it was free), or locked behind paywalls. Free, of course, is never free. We are either handing over data, being the targets of advertising (better targeted the more data we hand over), or we pay for freedom directly (e.g., open access publication fees in the case of scientific articles). Alternatively, we have the paywalls well know to academics, where much of the world's knowledge is held behind expensive paywalls (in part because publishers need some way to make money, and there's little middle ground between free and expensive).

Nelson's model envisaged micropayments, where content creators would get small payments every time their content was used. Under the transclusion model, only small bits of your content might be used (in the context of a scientific paper, imagine just a single fact or statement was used). You didn't get everything for free (that would destroy the incentive to create), but nor was everything locked up behind prohibitively expensive paywalls. Nelson's model never took off, in part I suspect because there was simply no way to (a) track who was using the content, and (b) collect micropayments.

What is interesting is that Bitcoin seems to deal with the micropayments issue, and the HTTPA protocol (which uses much the same idea as Bitcoin to keep an audit trail of who has accessed and used data) may provide a mechanism to track usage. How is this going to change the current web? Might there be ways to use these ideas to reimagine academic publishing, which at the moment seems caught between steep open access fees or expensive journal subscriptions?

January 9, 2015


Each year about this time, as I ponder what to devote my time on in the coming year, I get exasperated and frustrated that each year will be like the previous one, and biodiversity informatics will seem no closer to getting its act together. Sure, we are putting more and more data online, but we are no closer to linking this stuff together, or building things that people can use to do cool science with. And each year I try and figure out why we are still flaying about and not getting very far. This year, I've settled on the lack of "platforms".

In 2011 Steve Yegge (accidentally) published a widely read document known as the "Google Platforms Rant". It's become something of a classic, and I wonder if biodiversity informatics can learn from this rant (it's long but well worth a read).

One way to think about this is to look at how we build things. In the early days, people would have some data and build a web site:

In the diagram above "dev" is the web developer who builds the site, and "DBA" is the person who manages the data (for many projects this is one and the same person). The user is presented with a web site, and that's the only way they can access the data. If the web site is well designed this typically works OK, but the user will come up against limitations. Why do I have to manually search for each record? How can I combine this data with some other data? These questions lead to some users doing things like screen scrapping, anything to get the data and do more than the web site permits (I spend a lot of my time doing exactly this). In contrast, the person (or team) building the site ("dev") can access the data and tools directly.

Eventually some sites realise that they could add value to their users if they added an API, so typically we get something like this:

Now we have an API (yay), but notice that it is completely separate from the web site. Now the site developers have to manage two different things, and two sets of users (web site visitors, and user programming against the API). Because the site and the API are different, and the site gets more users, typically what happens is the API lacks much of the functionality of the site, which frustrates users of the API. For example, when Mendeley launched it's API its limited functionality and lack of documentation drove me nuts. Similarly, the Encyclopedia of Life (EOL) API is pretty sucky. If anyone from EOL is reading this, for the love of God add user authentication and the ability to create and edit collections to the API. Until you do, you'll never have an ecosystem of apps.

A solution to sucky APIs is "dogfooding":

Dogfooding is the idea that your product is so good you'd use it yourself. In the case of web development, if we build the web site on top of the same API that we expose to users, then the site developers have a strong incentive to make the API well-documented and robust, because their web site runs on the same services. As a result the interests of the web developers and users who are programmers are much more aligned. If a user finds a bug in the API, or the API lacks a feature, it's much more likely to get fixed. An example of a biodiversity informatics project that "gets" dogfooding is GBIF, which has a nice API that powers much of their web site. This is a good example of how to tell if an API is any good, namely, can you recreate the web site yourself just using the API?

But the example above leaves one aspect of the whole system still intact and not accessible to users. Typically a company or organisation has data, tools, and processes that it uses to manage whatever is central to its operations. These are kept separate from users, who only get to access these indirectly through the web site or the API.

A "platform" takes things one step further. Steve Yegge summarises Jeff Bezos' memo that outlined Amazon's move to a platform:

  1. All teams will henceforth expose their data and functionality through service interfaces.
  2. Teams must communicate with each other through these interfaces.
  3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
  4. It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols -- doesn't matter. Bezos doesn't care.
  5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
  6. Anyone who doesn't do this will be fired.
  7. Thank you; have a nice day!
All the core bits of infrastructure that powered Amazon were to become services, and different bits of Amazon could only talk to each other through these services. The point of this is that it enabled Amazon to expose its infrastructure to the outside world (AKA paying customers) and now we have Amazon cloud services for storing data, running compute jobs, and so on. By exposing its infrastructure as services, Amazon now runs a big chunk of the startup economy. By insisting that Amazon itself uses these services (dogfooding at the infrastructure level), Amazon ensures that this infrastructure works (because its own business depends on it).

There are some things Google does that are like a platform (despite the complaints in the "Google Platforms Rant"). For example, you could imagine that most workers at Google use tools such as Google Docs to create and share documents. Likewise, Google Scholar is unlikely to be a simple act of altruism. If you have a team of world class researchers you need a tool that enables them to find existing research. Google Scholar does this. If you then expose it to the outside world you get more users, and an incentive for commercial publishers to open up their paywall journals to being indexed by Google's crawlers, and incentive that would be missing if Scholar was purely an internal service.

Now, giant companies like Amazon and Google might seem a world away from biodiversity informatics, but I think there are things we can learn from this. Looking around, I think there are other examples of platforms that may seem closer to home. For example, the NCBI runs GenBank and PubMed, and these are very like platforms. GenBank provides tools, such as BLAST, that it provides to the user community, but which it also uses internally to cluster sequences into related sets. Consider PubMed, which has gone from a simple index to the biomedical literature to a publishing platform. PubMed has driven the standardisation of XML across biomedical publishers. It is quite possible to visit the NCBI site, explore data, then read full text for the associated publications in PubMed Central, without ever leaving the NCBI site. No wonder some commercial publishers are deeply worried about PubMed Central.

A key thing about platforms is that the people running the platform have a deep interest in many of the same things as the users of that platform (note the "users" scattered all over the platform diagram above). Instead of user being a separate category that you try and serve by figuring out what they want, developers are users too.

To try and flesh this out a little more, what would a "taxonomic" platform look like? At the moment, we have lots of taxonomic web sites that pump out lists of names and little else. This is not terribly useful. If we think about what goes into making lists of names, it requires access to the scientific literature, it requires being able to read that literature and extract statements about names (e.g., this is the original description, these two names are synonyms, etc.), and it requires some way of summarising what we know about those names and the taxa that we label with those names. Typically these are all things that happen behind the scenes, then the user simply gets a list of names. A platform would expose all of the data, tools, and processes that went into making that list. It would provide the literature in both human and computer readable forms, it would provide tools for extracting information, tools to store knowledge about those names, and tools to make inferences using that knowledge. All of these would be exposed to users. And these some services and tools would be used by the people building those services and tools.

This last point means that you also need people working on the same problems as "users". For example, consider something like GBIF. At the moment GBIF consumes output of taxonomic research (such as lists of names) and tries to make sense of these before serving them back to the community. There is little alignment between the interests of taxonomists and GBIF itself. For GBIF to become a taxonomic platform, it would need to provide the data, tools and services for people to do taxonomic research, and ideally it would actually have taxonomists working at GBIF using those tools (these taxonomists could, for example, be visiting fellows working on particular taxa, rather than permanent employees). These tools would greatly help the taxonomic community, but also help GBIF make sense of the millions of names it has to interpret.

It's important to note here the the goal of the platform is NOT to "help" users - that simply reinforces the distinction between you and the "users". Instead it is to become a user. You may have more resources, and work on a different scale (few business Amazon's services support will be anything like as big as Amazon), but you are ultimately "just" another user.

December 24, 2014

One of my guilty pleasures on a Sunday morning is browsing new content on the Biodiversity Heritage Library (BHL). Indeed, so addicted am I to this that I have an IFTTT.com feed set to forward the BHL RSS feed to my iPhone (via the Pushover app. So, when I wake most Sunday mornings I have a red badge on Pushover announcing fresh BHL content for me to browse, and potentially add to BioStor.

But lately, there has been less and less content that is suitable for BioStor, and this reflects two trends that bother me. The first, which I've blogged about before, is that an increasing amount of BHL content is not hosted by BHL itself. Instead, BHL has links to external providers. For the reasons I've given earlier, I find this to be a jarring user experience, and it greatly reduces the utility of BHL (for example, this external content is not taxonomically searchable).

The other trend that worries me is that recently BHL content has been dominated by a single provider, namely the U.S. Department of Agriculture. To give you a sense of how dominant the USDA now is, below is a chart of the contribution of different sources to BHL over time.

I built this chart by querying the BHL API and extracting data on each item in BHL (source code and raw data available on github). Unfortunately the API doesn't return information on what each item was scanned, but because the identifier for each item (its ItemID) is an increasing integer, if we order the items by their integer ID then we order them by the date they were added. I've binned the data into units of 1000 (in other words, every item with an ItemID < 1000 is in bin 0, ItemIDs 1000 to 1999 are in bin 1, and so on). The chart shows the top 20 contributors to BHL, with the Smithsonian as the number one contributor.

The chart shows a number of interesting patterns, but there are a couple I want to highlight. The first is the noticeable spikes representing the addition of externally hosted material (from the American Museum of Natural History Library and the Biblioteca Digital del Real Jardin Botanico de Madrid). The second is the recent dominance of content from the USDA.

Now, to be fair, I must acknowledge that I have my own bias as to what BHL content is most valuable. My own focus is on the taxonomic literature, especially original descriptions, but also taxonomic revisions (useful for automatically extracting synonyms). Discovering these in BHL is what motivated me to build BioStor, and then BioNames, the later being a database that aims to link every animal taxon name to its original description. BioNames would be much poorer if it wasn't for BioStor (and hence BHL).

If, however, your interest is agriculture in the United States, then the USDA content is obviously a potential goldmine of information on topics such as past crop use, pricing policies, etc. But this a topic that is both taxonomically narrow (economically important organisms are a tiny fraction of biodiversity), and, by definition, geographically narrow.

To be clear, I don't have any problem with BHL having USDA content as such, it's a tremendous resource. But I worry that lately BHL has been pretty much all USDA content. There is still a huge amount of literature that has yet to be scanned. I'd like to see BHL actively going after major museums and libraries that have yet to contribute. I especially want to see more post-1923 content. BHL has managed to get post-1923 content from some of its contributors, it needs a lot more. On obvious target is those institutions that signed the Bouchout Declaration. If you've signed up to providing "free and open use of digital resources about biodiversity", then let's see something tangible from that - open up your libraries and your publications, scan them, and make them part of BHL. I'm especially looking at European institutions who (with some notable exceptions) really should be doing a lot better.

It's possible that the current dominance of USDA content is a temporary phenomenon. Looking at the chart above, BHL acquires content in a fairly episodic manner, suggesting that it is largely at the mercy of what its contributors can provide, and when they can do so. Maybe in a few months there will be a bunch of content that is taxonomically and geographically rich, and I will be spending weekends furiously harvesting that content for BioStor. But until then, my Sundays are not nearly as much fun as they used to be.

December 18, 2014

One reason I'm excited by the launch of the NHM data portal is that it opens up opportunities to link publications about specimens i the NHM to the record of the specimens themselves. For example, consider specimen 1977.3097, which is in the new portal as http://data.nhm.ac.uk/dataset/collection-specimens/resource/05ff2255-c38a-40c9-b657-4ccb55ab2feb/record/2336568 (possibly the ugliest URL ever).

This specimen is of the bat Pteralopex acrodonta, shown in the image to the right (by William N. Beckon, taken from the EOL page for this species). This species was described in the following paper:
Hill JE, Beckon WN (1978) A new species of Pteralopex Thomas, 1888 (Chiroptera: Pteropodidae) from the Fiji Islands. Bulletin of the British Museum (Natural History) Zoology 34(2): 65–82. http://biostor.org/reference/8This paper is in my BioStor project, and if you visit BioStor you'll see see that BioStor has extracted a specimen code (BM(NH) 77.3097) and also has a map of localities extracted from the paper.

Looking at the paper we discover that BM(NH) 77.3097 is the type specimen of Pteralopex acrodonta:
HOLOTYPE. BM(NH) 77.3097. Adult . Ridge about 300 m NE of the Des Voeux Peak Radio Telephone Antenna Tower, Taveuni Island, Fiji Islands, 16° 50½' S, 179° 58' W, c. 3840ft (1170 m). Collected 3 May 1977 by W. N. Beckon, died 6-7 May 1977. Caught in mist net on ridge summit : bulldozed land with secondary scrubby growth, adjacent to primary forest. Original number 104. Skin and skull.Note that the NHM data portal doesn't know that 1977.3097 is the holotype, nor does it have the latitude and longitude. Hence, if we can link 1977.3097 to BM(NH) 77.3097 we can augment the information in the NHM portal.

This specimen has also been cited in a subsequent paper:
Helgen, K. M. (2005, November). Systematics of the Pacific monkey‐faced bats (Chiroptera: Pteropodidae), with a new species of Pteralopex and a new Fijian genus . Systematics and Biodiversity. Informa UK Limited. doi:10.1017/s1477200005001702You can read this paper in BioNames. In this paper Helgen creates a new genus, Mirimiri for Pteralopex acrodonta, and cites the holotype (as BMNH 1977.3097). Hence, if we could extract that specimen code from the text and link it to the NHM record we could have two citations for this specimen, and note that the taxon the specimen belongs to is also known as Mirimiri acrodonta.

Imagine being able to do this across the whole NHM data portal. The original description of this bat was published in a journal published by the NHM (and part of a volume contributed by the NHM to the Biodiversity Heritage Library). With a *cough* little work we could join up these two NHM digital resources (specimen and paper) to provide a more detailed view what we know about this specimen. From my perspective this cross-linking between the different digital assets of an institution such as the NHM (as well as linking to external data such as other publications, GenBank sequences, etc.) is where the real value of digitisation lies. It has the potential to be much more than simply moving paper catalogues and publications online.

December 17, 2014

The Natural History Museum has released their data portal (http://data.nhm.ac.uk/). As of now it contains 2,439,827 of the Museum's 80 million specimens, so it's still early days. I gather that soon this data will also appear in GBIF, ending the unfortunate situation where data from one of the premier natural history collections in the world was conspicuous by its absence.

I've not had a chance to explore it in much detail, but one thing I'm keen to do is see whether I can link citations of NHM specimens in the literature (e.g., articles in BioStor) with records in the NHM portal. Being able to dip this would enable all sorts of cool things, such as being able to track what researchers have said about particular specimens, as well as develop citation metrics for the collection.

On a recent trip to the Natural History Museum, London, the subject of DNA barcoding came up, and I got the clear impression that people at the NHM thought classical DNA barcoding was pretty much irrelevant, given recent developments in sequencing technology. For example, why sequence just COI when you can use shotgun sequencing to get the whole mitogenome? I was a little taken aback, although this is a view that's getting some traction, e.g. [1,2]. There is also the more radical view that focussing on phylogenetics is itself less useful than, say, "evolutionary gene networks" based on massive sequencing of multiple markers [3].

At the risk of seeming old-fashioned in liking DNA barcoding, I think there's a bigger issue at stake (see also [4]). DNA barcoding isn't simply a case of using a single, short marker to identify animal species. It's the fact that it's a globalised, standardised approach that makes it so powerful. In the wonderful book "A Vast Machine" [5], Paul Edwards talks about "global data" and "making data global". The idea is that not only do we want data that is global in coverage ("global data"), but we want data that can be integrated ("making data global"). In other words, not only do we want data from everywhere in the world, say, we also need an agreed coordinate system (e.g., latitude and longitude) in order to put each data item in a global context. DNA barcoding makes data global by standardising what a barcode is (a given fragment of COI), and what metadata needs to be associated with a sequence to be a barcode (e.g., latitude and longitude) (see, e.g. Guest post: response to "Putting GenBank Data on the Map"). By insisting on this standardisation, we potentially sacrifice the kinds of cool things that can be done with metagenomics, but the tradeoff is that we can do things like put a million barcodes on a map:

To regard barcoding as dead or outdated we'd need an equivalent effort to make metagenomic sequences of animals global in the same way that DNA barcoding is. Now, it may well be that the economics of sequencing is such that it is just as cheap to shotgun sequence mitogenomes, say, as to extract single markers such as COI. If that's the case, and we can get a standardised suite of markers across all taxa, and we can do this across museum collections (like Hebert et al.'s [6] DNA barcoding "blitz" of 41,650 specimens in a butterfly collection), then I'm all for it. But it's not clear to me that this is the case.

This also leaves aside the issue of standardising other things's much as the metadata. For instance, Dowton et al. [2] state that "recent developments make a barcoding approach that utilizes a single locus outdated" (see Collins and Cruickshank [4] for a response). Dowton et al. make use of data they published earlier [7,8]. Out of curiosity I looked at some of these sequences in GenBank, such as JN964715. This is a COI sequence, in other words, a classical DNA barcode. Unfortunately, it lacks a latitude and longitude. By leaving off latitude and longitude (despite the authors having this information, as it is in the supplemental material for [7]) the authors have missed an opportunity to make their data global.

For me the take home message here is that whether you think DNA barcoding is outdated depends in part what your goal is. Clearly barcoding as a sequencing technology has been superseded by more recent developments. But to dismiss it on those grounds is to miss the bigger picture of what is a stake, namely the chance to have comparable data for millions of samples across the globe.

  1. TAYLOR, H. R., & HARRIS, W. E. (2012, February 22). An emergent science on the brink of irrelevance: a review of the past 8 years of DNA barcoding. Molecular Ecology Resources. Wiley-Blackwell. doi:10.1111/j.1755-0998.2012.03119.x
  2. Dowton, M., Meiklejohn, K., Cameron, S. L., & Wallman, J. (2014, March 28). A Preliminary Framework for DNA Barcoding, Incorporating the Multispecies Coalescent. Systematic Biology. Oxford University Press (OUP). doi:10.1093/sysbio/syu028
  3. Bittner, L., Halary, S., Payri, C., Cruaud, C., de Reviers, B., Lopez, P., & Bapteste, E. (2010). Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biol Direct. Springer Science + Business Media. doi:10.1186/1745-6150-5-47
  4. Collins, R. A., & Cruickshank, R. H. (2014, August 12). Known Knowns, Known Unknowns, Unknown Unknowns and Unknown Knowns in DNA Barcoding: A Comment on Dowton et al. Systematic Biology. Oxford University Press (OUP). doi:10.1093/sysbio/syu060
  5. Edwards, Paul N. A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. MIT Press ISBN: 9780262013925
  6. Hebert, P. D. N., deWaard, J. R., Zakharov, E. V., Prosser, S. W. J., Sones, J. E., McKeown, J. T. A., Mantle, B., et al. (2013, July 10). A DNA “Barcode Blitz”: Rapid Digitization and Sequencing of a Natural History Collection. (S.-O. Kolokotronis, Ed.)PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0068535
  7. Meiklejohn, K. A., Wallman, J. F., Pape, T., Cameron, S. L., & Dowton, M. (2013, October). Utility of COI, CAD and morphological data for resolving relationships within the genus Sarcophaga (sensu lato) (Diptera: Sarcophagidae): A preliminary study. Molecular Phylogenetics and Evolution. Elsevier BV. doi:10.1016/j.ympev.2013.04.034
  8. Meiklejohn, K. A., Wallman, J. F., Cameron, S. L., & Dowton, M. (2012). Comprehensive evaluation of DNA barcoding for the molecular species identification of forensically important Australian Sarcophagidae (Diptera). Invertebrate Systematics. CSIRO Publishing. doi:10.1071/is12008

December 9, 2014

The following is a guest post by Bob Mesibov.

The i4Life project has very kindly liberated Catalogue of Life (CoL) data from its database, and you can now download the latest CoL as a set of plain text, tab-separated tables here.

One of the first things I did with my download was check the 'taxa.txt' table for species name popularity*. Here they are, the top 10 species names for animals and plants, with their frequencies in the CoL list and their usual meanings:

2732 gracilis = slender
2373 elegans = elegant
2231 bicolor = two-coloured
2066 similis = similar
1995 affinis = near
1937 australis = southern
1740 minor = lesser
1718 orientalis = eastern
1708 simplex = simple
1350 unicolor = one-colouredPlants
1871 gracilis = slender
1545 angustifolia = narrow-leaved
1475 pubescens = hairy
1336 parviflora = few-flowered
1330 elegans = elegant
1324 grandiflora = large-flowered
1277 latifolia = broad-leaved
1155 montana = (of a) mountain
1124 longifolia = long-leaved
1102 acuminata = pointed

Take the numbers cum grano salis. The first thing I did with the CoL tables was check for duplicates, and they're there, unfortunately. It's interesting, though, that gracilis tops the taxonomists' poll for both the animal and plant kingdoms.

*With the GNU/Linux commands

awk -F"\t" '($11 == "Animalia") && ($8 == "species") {print $20}' taxa.txt | sort | uniq -c | sort -nr | head
awk -F"\t" '($11 == "Plantae") && ($8 == "species") {print $20}' taxa.txt | sort | uniq -c | sort -nr | head