Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188
September 1, 2015
A little over a week ago I was at the 6th International Barcode of Life Conference, held at Guelph, Canada. It was my first barcoding conference, and was quite an experience. Here are a few random thoughts.
Attendees
It was striking how diverse the conference crowd was. Apart from a few ageing systematists (including veterans of the cladistics wars), most people were young(ish), and from all over the world. There is clearly something about the simplicity and low barrier to entry of barcoding that has enabled its widespread adoption. This also helps give barcoding a cohesion: no matter what the taxonomic group or the problem you are tackling, you are doing much the same thing as everybody else (but see below). While ageing systematists (like myself) may hold their noses regarding the use of a single, short DNA sequence and a tree-building method some would dismiss as "phenetic", in many ways the conference was a celebration of global-scale phylogeography.
Standards aren't enough
And yet, standards aren't enough. I think what contributes to DNA barcoding's success is that sequences are computable. If you have a barcode, there's already a bunch of barcode sequences you can compare yours to. As others add barcodes, your sequences will be included in subsequent analyses, analyses which may help resolve the identity of what you sequenced.
To put this another way, we have standard image file formats, such as JPEG. This means you can send me a bunch of files, safe in the knowledge that because JPEG is a standard I will be able to open those files. But this doesn't mean that I can do anything useful with them. In fact, it's pretty hard to do anything with images apart from look at them. But if you send me a bunch of DNA sequences for the same region, I can build a tree, BLAST GenBank for similar sequences, etc. Standards aren't enough by themselves: to get the explosive growth that we see in barcodes, the thing you standardise on needs to be easy to work with, and have a computational infrastructure in place.
Next generation sequencing and the hacker culture
Classical DNA barcoding for animals uses a single, short mtDNA marker that people were sequencing a couple of decades ago. Technology has moved on, such that we're seeing papers such as An emergent science on the brink of irrelevance: a review of the past 8 years of DNA barcoding. As I've argued earlier (Is DNA barcoding dead?) this misses the point about the power of standardisation on a simple, scalable method.
At the same time, it was striking to see the diversity of sequencing methods being used in conference presentations. Barcoding is a broad church, and it seemed like it was a natural home for people interested in environmental DNA. There was excitement about technologies such as the Oxford Nanopore MinION™, with people eager to share tips and techniques. There's something of a hacker culture around sequencing (see also Biohackers gear up for genome editing), just as there is for computer hardware and software.
Community
The final session of the conference started with some community bonding, complete with Paul Hebert versus Quentin Wheeler wielding light sabres. If, like me, you weren't a barcoder, things started getting a little cult-like. But there's no doubt that Paul's achievement in promoting a simple approach to identifying organisms, and then translating that into a multi-million dollar, international endeavour is quite extraordinary.
After the community bonding came a wonderful talk by Dan Janzen. The room was transfixed as Dan made the case for conservation, based on his own life experiences, including Area de Conservación Guanacaste where he and Winnie Hallwachs have been involved since the 1970s. I sat next to Dan at a dinner after the conference, and showed him iNaturalist, a great tool for documenting biodiversity with your phone. He was intrigued, and once we found pictures taken near his house in Costa Rica, he was able to identify the individual animals in the photos, such as a bird that has since been eaten by a snake.
Dark taxa
My own contribution to the conference was a riff on the notion of dark taxa, and mostly consisted of me trying to think through how to respond to DNA barcoding.
Two graphs, three responses from Roderic Page
The three responses to barcoding that I came up with are:
"@rdmpage Thanks for spreading the word! Looks like an interesting conference, which is a rare thing indeed." (Nakensnegl, @kueda, August 23, 2015)
Yes, the barcoding conference was that rare thing, a well organised (including well-fed), interesting, indeed eye-opening, conference.
August 14, 2015
Yet another barely thought out project, although this one has some crude code. If some 16,000 new taxonomic names are published each year, then that is roughly 40 per day. We don't have a single place that aggregates these, so any major biodiversity project is by definition out of date. GBIF itself hasn't had an updated list of fungi or plant names for several years, and at present doesn't have an up to date list of animal names. You just have to follow the Twitter feeds of ZooKeys and Zootaxa to feel swamped in new names.
And yet, most nomenclators are pumping out RSS feeds of new names, or have APIs that support time-based queries (i.e., send me the names added in the last month). Wouldn't it be great to have a single aggregator that took these "name streams", augmented them by adding links to the literature (it could, for example, harvest RSS feeds and Twitter streams of the relevant journals), and provided the biodiversity community with a feed of new names and associated supporting information? We could watch new discoveries of new biodiversity unfold in near real time, as well as provide a stream of data for projects such as GBIF and others to ingest and keep their databases up to date.
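A minimal sketch of the aggregation step, assuming each source can hand over its new names as date-sorted records. The `NameRecord` type, the source labels, and the placeholder names are all hypothetical; fetching real RSS feeds or API responses is left out.

```python
import heapq
from datetime import date
from typing import Iterable, Iterator, NamedTuple


class NameRecord(NamedTuple):
    published: date  # date the name was added to the source
    name: str        # the new taxonomic name
    source: str      # label for the stream the record came from


def merge_name_streams(*streams: Iterable[NameRecord]) -> Iterator[NameRecord]:
    """Merge date-sorted streams into one date-sorted stream, skipping names already seen."""
    seen = set()
    for record in heapq.merge(*streams, key=lambda r: r.published):
        if record.name not in seen:
            seen.add(record.name)
            yield record


# Placeholder records standing in for a nomenclator feed and a journal feed.
nomenclator = [
    NameRecord(date(2015, 8, 1), "Aus bus", "nomenclator"),
    NameRecord(date(2015, 8, 3), "Aus cus", "nomenclator"),
]
journal_feed = [
    NameRecord(date(2015, 8, 2), "Xus yus", "journal-feed"),
    NameRecord(date(2015, 8, 3), "Aus cus", "journal-feed"),  # same name, second source
]

combined = list(merge_name_streams(nomenclator, journal_feed))
for record in combined:
    print(record.published.isoformat(), record.name, record.source)
```

Matching on the exact name string is the crudest possible de-duplication; a real aggregator would presumably need to reconcile spelling variants and authorship as well.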
I need more time to sketch this out fully, but I think a case can be made for a taxonomy-centric (or, perhaps more usefully, a biodiversity-centric) clone of PubMed Central.
Here are some reasons:
August 10, 2015
One of the limitations of the Biodiversity Heritage Library (BHL) is that, unlike say Google Books, its search functions are limited to searching metadata (e.g., book and article titles) and taxonomic names. It doesn't support full-text search, by which I mean you can't just type in the name of a locality, specimen code, or a phrase and expect to get back much in the way of results. In fact, in many cases when I Google a phrase that occurs in BHL content I'm more likely to find that phrase in content from the Internet Archive, and then it's a matter of following the links to the equivalent item in BHL.
So, as an experiment I've created a live demo of what full-text search in BHL could look like. I've done this using the same infrastructure the new BioStor is built on, namely CouchDB hosted by Cloudant. Using BHL's API I've grabbed some volumes of the British Ornithological Club's Bulletin and put them into CouchDB (BHL's API serves up JSON, so this is pretty straightforward to do). I've added the OCR text for each page, and asked Cloudant to index that. This means that we can now search on a phrase in BHL (in the British Ornithological Club's Bulletin) and get a result.
I've made a quick and dirty demo of this approach and you can see it in the "Labs" section on BioStor, so you can try it here. You should see something like this:
The page image only appears if you click on the blue labels for the page. None of this is robust or optimised, but it is a workable proof-of-concept of how full-text search could work.
What could we do with this? Well, all sorts of searches are now possible. We can search for museum specimen codes, such as 1900.2.27.13. This specimen is in GBIF (see http://bionames.org/~rpage/material-examined/www/?code=BMNH%201900.2.27.13) so we could imagine starting to link specimens to the scientific literature about that specimen. We can also search for locations (such as Mt. Albert Edward), or common names (such as crocodile).
Note that I've not completed uploading all the page text and XML. Once I do I'll have a better idea of how scalable this approach is. But the idea of having full-text search across all of BHL (or, at least the core taxonomic journals) is tantalising.
Technical details
Initially I simply displayed a list of the pages that matched the search term, together with a fragment of text with the search term highlighted. Cloudant's version of CouchDB provides these highlights, and a "group_field" that enabled me to group together pages from the same BHL "item" (roughly corresponding to a volume of a journal).
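To make the shape of that first result list concrete, here is a toy in-memory stand-in written in Python. It is not Cloudant or CouchDB: the `search` function, the item and page identifiers, and the sample OCR text are all made up for illustration. It returns matching pages grouped by item, each with a fragment in which the hit is marked.

```python
from collections import defaultdict

# Hypothetical pages: (item_id, page_id, OCR text). Real pages would come from BHL.
PAGES = [
    ("item-1", "page-10", "Collected on Mt. Albert Edward at 3000 m."),
    ("item-1", "page-11", "A second skin, also from Mt. Albert Edward."),
    ("item-2", "page-50", "The crocodile was seen near the river mouth."),
]


def search(pages, phrase, context=15):
    """Return {item_id: [(page_id, fragment), ...]} for pages containing the phrase."""
    hits = defaultdict(list)
    needle = phrase.lower()
    for item_id, page_id, text in pages:
        pos = text.lower().find(needle)
        if pos == -1:
            continue
        start = max(0, pos - context)
        stop = pos + len(phrase)
        end = min(len(text), stop + context)
        # Mark the matched phrase so it can be highlighted when displayed.
        fragment = text[start:pos] + "<em>" + text[pos:stop] + "</em>" + text[stop:end]
        hits[item_id].append((page_id, fragment))
    return dict(hits)


results = search(PAGES, "mt. albert edward")
for item_id, page_hits in results.items():
    print(item_id)
    for page_id, fragment in page_hits:
        print("  ", page_id, fragment)
```

A linear scan like this obviously won't scale to the whole of BHL, which is the point of handing the indexing to a search service.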
This was a nice start, but I really wanted to display the hits on the actual BHL page. To do this I grabbed the DjVu XML for each BHL page for British Ornithological Club's Bulletin, and used an XSLT style-sheet that renders the OCR text on top of the page image. You can't see the text because I set the colour of the text to "rgba(0, 0, 0, 0)" (see http://stackoverflow.com/a/10835846) and set the "overflow" style to "hidden". But the text is there, which means you can select it with the mouse and copy and paste it. This still leaves the problem of highlighting the text that matches the search term. I originally wrote the code for this to handle species names, which comprise two words. So, each DIV in the HTML has a "data-one-word" and "data-two-words" attribute set, which contains the first (and first plus second) word in the search term, respectively. I then use a jQuery selector to set the CSS of each DIV that has a "data-one-word" or "data-two-words" attribute that matches the search term(s). Obviously, this is terribly crude, and doesn't do well if you've more than two words in your search query.
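The matching logic can be sketched in Python (the demo itself uses jQuery and HTML attributes, which this does not reproduce). This is one reading of the approach: every OCR word gets a one-word key and a two-word key starting at that word, and a query is compared against those keys. The function names and sample words are hypothetical.

```python
def word_keys(words):
    """For each OCR word, build the one-word and two-word keys starting at that word."""
    keys = []
    for i, word in enumerate(words):
        one = word.lower()
        two = one + " " + words[i + 1].lower() if i + 1 < len(words) else None
        keys.append({"one_word": one, "two_words": two})
    return keys


def matching_positions(words, query):
    """Return indexes of the words to highlight for a query."""
    terms = query.lower().split()
    if not terms:
        return []
    keys = word_keys(words)
    if len(terms) == 1:
        return [i for i, k in enumerate(keys) if k["one_word"] == terms[0]]
    # Only the first two words of the query are used, which is the crude part.
    wanted = " ".join(terms[:2])
    positions = []
    for i, k in enumerate(keys):
        if k["two_words"] == wanted:
            positions.extend([i, i + 1])
    return positions


ocr_words = ["The", "type", "of", "Homo", "sapiens", "is", "not", "a", "crocodile"]
print(matching_positions(ocr_words, "crocodile"))        # [8]
print(matching_positions(ocr_words, "Homo sapiens"))     # [3, 4]
print(matching_positions(ocr_words, "Homo sapiens is"))  # [3, 4], third word ignored
```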
As an added feature, I use CSS to convert the BHL page scan to a black-and-white image (works in Webkit-based browsers).
August 9, 2015
One of the less glamorous but necessary tasks of data cleaning is mapping "strings to things", that is, taking strings such as "George A. Boulenger" and mapping them to identifiers, such as ISNI: 0000 0001 0888 841X. In the case of authors such as George Boulenger, one way to do this would be through Wikipedia, which has entries for many scientists, often linked to identifiers for those people (see the bottom of the Wikipedia page for George A. Boulenger and look at the "Authority Control" section).
How could we make these mappings? Simple string matching is one approach, but it seems to me that a more robust approach could use bibliographic data. For example, if I search for George A. Boulenger in BioStor, I get lots of publications. If at least some of these were listed on the Wikipedia page for this person, together with links back to BioStor (or some other external identifier, such as DOIs), then we could do the following:
Based on my limited browsing of Wikipedia, there seems to be little standardisation of entries for people, certainly little in how their published works are listed (the section heading, format, how many, etc.). The project I'm proposing would benefit from a consistent set of guidelines for how to include a scholar's output.
What makes this project potentially useful is that it could help flesh out Wikipedia pages by encouraging people to add lists of published works, it could aid bibliographic repositories like my own BioStor by increasing the number of links they get from Wikipedia, and if the Wikipedia page includes external identifiers then it helps us go from strings to things by giving us a way to locate globally unique identifiers for people.
Following on from Testing the GBIF taxonomy with the graph database Neo4J I've added a more complex test that relies on linking taxa to names. In this case I've picked some legume genera (Coursetia and Poissonia) where there have been frequent changes of name. By mapping the GBIF taxa to IPNI names (and associated LSIDs) we can build a graph linking taxa to names, and then to objective synonyms (by resolving the IPNI LSIDs and following the links to the basionym), see http://gist.neo4j.org/?4df5af75d42e0f963e5d.
In this example we find species that occur twice in the GBIF taxonomy, which logically should not happen as the names are objective synonyms. We can detect these problems if we have access to nomenclatural data. In this case, because IPNI has tracked the name changes, we can infer that, say, Coursetia heterantha and Poissonia heterantha are synonyms, and hence only one of these should appear in the GBIF classification. This is an example that illustrates the desirability of separating names and taxa, see Modelling taxonomic names in databases.
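The test itself is small once the name-to-basionym links are in hand. Here is a rough Python sketch (the linked gist uses Neo4J, not this code). The identifiers are placeholders rather than IPNI LSIDs, and treating Tephrosia heterantha as the shared basionym is an assumption made for illustration.

```python
from collections import defaultdict

# Placeholder name records: name id -> (name string, basionym id or None).
NAMES = {
    "n1": ("Tephrosia heterantha", None),  # assumed basionym, for illustration only
    "n2": ("Coursetia heterantha", "n1"),
    "n3": ("Poissonia heterantha", "n1"),
    "n4": ("Aus bus", None),               # made-up name with no synonyms
}

# Placeholder taxa in a classification: taxon id -> id of the name it uses.
TAXA = {"t1": "n2", "t2": "n3", "t3": "n4"}


def basionym_of(name_id):
    """A name with no basionym link is treated as its own basionym."""
    _, basionym = NAMES[name_id]
    return basionym or name_id


def duplicated_taxa(taxa):
    """Group taxa by basionym and keep the groups containing more than one taxon."""
    groups = defaultdict(list)
    for name_id in taxa.values():
        groups[basionym_of(name_id)].append(NAMES[name_id][0])
    return {b: sorted(names) for b, names in groups.items() if len(names) > 1}


duplicates = duplicated_taxa(TAXA)
print(duplicates)  # {'n1': ['Coursetia heterantha', 'Poissonia heterantha']}
```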
Possible project: #itaxonomist, combining taxonomic names, DOIs, and ORCID to measure taxonomic impact
Imagine a web site where researchers can go, log in (easily) and get a list of all the species they have described (with pretty pictures and, say, a GBIF map), and a list of all DNA sequences/barcodes (if any) that they've published. Imagine that this is displayed in a colourful way (e.g., badges), and the results tweeted with the hashtag #itaxonomist.
Imagine that you are not a taxonomist, but if you have worked with one (e.g., published a paper), you can go to the site, log in, and discover that you “know” a taxonomist. Imagine if you are a researcher who has cited taxonomic work, you can log in and discover that your work depends on a taxonomist (think six degrees of Kevin Bacon).
#itaxonomist relies on three things:
Under the hood this builds part of the “biodiversity knowledge graph”, and uses ideas I and others have been playing around with (e.g., see David Shorthouse’s neat proof of concept http://collector.shorthouse.net/agent/0000-0002-7260-0350 and my now defunct Mendeley project http://iphylo.blogspot.co.uk/2011/12/these-are-my-species-finding-taxonomic.html).
For a subset of people and names we could build this very quickly. Some taxonomists already have ORCIDs, and some nomenclators have limited numbers of DOIs. I am currently building lists of DOIs for primary taxonomic literature, which could be used to seed the database.
The “i am a taxonomist” query is simply a map between ORCID to DOI to name in nomenclator. The “i know a taxonomist” is a map between ORCID and DOI that you share with a taxonomist, but there are no names associated with that DOI (e.g., a paper you have co-authored with a taxonomist that wasn’t on taxonomy, or at least didn’t describe a new species). The “six degrees of taxonomy” relies on the existence of open citation data, which is trickier, but some is available in PubMed Central and/or could be harvested from Pensoft publications.
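The three queries can be sketched over plain dictionaries in Python. Everything here is hypothetical: the identifiers are placeholders, and the three lookup tables stand in for data that would have to be assembled from ORCID, DOI metadata, a nomenclator, and an open citation source.

```python
# Placeholder identifiers throughout.
AUTHORS_OF = {                     # DOI -> ORCIDs of its authors
    "doi:A": {"orcid:taxonomist"},
    "doi:B": {"orcid:taxonomist", "orcid:colleague"},
    "doi:C": {"orcid:ecologist"},
}
NAMES_IN = {"doi:A": {"Aus bus"}}  # DOI -> names published in that work
CITES = {"doi:C": {"doi:A"}}       # citing DOI -> cited DOIs


def works_by(orcid):
    """DOIs of works that list this ORCID as an author."""
    return {doi for doi, authors in AUTHORS_OF.items() if orcid in authors}


def names_described(orcid):
    """'I am a taxonomist': ORCID to DOI to name."""
    names = set()
    for doi in works_by(orcid):
        names |= NAMES_IN.get(doi, set())
    return names


def taxonomists_known(orcid):
    """'I know a taxonomist': co-authors, on works without names, who have described names."""
    known = set()
    for doi in works_by(orcid):
        if doi in NAMES_IN:
            continue  # only works that did not publish names count here
        known |= {o for o in AUTHORS_OF[doi] if o != orcid and names_described(o)}
    return known


def taxonomists_cited(orcid):
    """One step of 'six degrees': authors of name-bearing works that you cite."""
    cited = set()
    for doi in works_by(orcid):
        for cited_doi in CITES.get(doi, set()):
            if cited_doi in NAMES_IN:
                cited |= AUTHORS_OF[cited_doi]
    return cited


print(names_described("orcid:taxonomist"))
print(taxonomists_known("orcid:colleague"))
print(taxonomists_cited("orcid:ecologist"))
```

Following citations for more than one step would turn the last query into a graph traversal, which is where the citation data becomes the limiting factor.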
August 7, 2015
I've been playing with the graph database Neo4J to investigate aspects of the classification of taxa in GBIF's backbone classification. A number of people in biodiversity informatics have been playing with Neo4J. Nicky Nicolson at Kew has a nice presentation on using graph databases to handle names (Building a names backbone), and the Open Tree of Life project use it in their tree machine.
One of the striking things about Neo4J is how much effort has gone into making it easy to play with. In particular, you can create GraphGists, which are simple text documents that are transformed into interactive graphs that you can query. This is fun, and I think it's also a great lesson in how to publicise a technology (compare this with RDF and SPARQL, which are in no way fun to work with).
I created some GraphGists that explore various problems with the current GBIF taxonomy. The goal is to find ways to quickly test the classifications for logical errors, and wherever possible I want to use just the information in the GBIF classification itself.
The first example is a version of the "papaya plots" that I played with in an earlier post (see also an unfinished manuscript Taxonomy as impediment: synonymy and its impact on the Global Biodiversity Information Facility's database). For various reasons, GBIF has ended up with the same species occurring more than once in its backbone classification, usually because none of its source databases has enough information on synonymy to prevent this happening.
As an example, I've grabbed the classification for the bat family Molossidae, converted it to a Neo4J graph, and then tested for the existence of species in different genera that have the same specific epithet. This is a useful (but not foolproof) test of whether there are undetected synonyms, especially if the generic placement of a set of species has been in flux (this is certainly true for these bats). If you visit the gist you will see a list of species that are potential synonyms.
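The gist does this with a Neo4J query; the same check can be sketched in a few lines of Python. The function name and the placeholder binomials are made up, and the sketch compares epithets as exact strings.

```python
from collections import defaultdict


def shared_epithets(species_names):
    """Map each specific epithet to its genera, keeping epithets found in two or more genera."""
    genera_by_epithet = defaultdict(set)
    for name in species_names:
        genus, epithet = name.split()[:2]
        genera_by_epithet[epithet].add(genus)
    return {e: sorted(g) for e, g in genera_by_epithet.items() if len(g) > 1}


# Placeholder binomials, not real bat names.
names = ["Aus bus", "Cus bus", "Aus dus", "Cus eus"]
candidates = shared_epithets(names)
print(candidates)  # {'bus': ['Aus', 'Cus']}
```

Exact string matching will miss pairs where the epithet's ending changed to agree with the gender of the new genus, which is one more reason the test is not foolproof.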
A related test catches cases where one classification treats a taxon as a subspecies whereas another treats it as a full species, and GBIF has ended up with both interpretations in the same classification (e.g., the butterfly species Heliopyrgus margarita and the subspecies Heliopyrgus domicella margarita).
Another GraphGist tests that the genus name for a species matches the genus it is assigned to. This seems obvious (the species Homo sapiens belongs in the genus Homo) but there are cases where GBIF's classification fails this test, such as the genus Forsterinaria. Typically this test fails due to problematic generic names (e.g., homonyms), incorrect spellings, etc.
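As a rough Python equivalent of that test (not the Neo4J query in the gist), assuming we have each species name paired with the genus it sits under in the classification. The placements below are placeholders apart from Homo sapiens.

```python
def genus_mismatches(species_to_parent_genus):
    """Return (species, genus) pairs where the name's first word is not the parent genus."""
    return [
        (species, genus)
        for species, genus in species_to_parent_genus.items()
        if species.split()[0] != genus
    ]


placements = {
    "Homo sapiens": "Homo",
    "Aus bus": "Aus",
    "Cus dus": "Aus",  # placeholder for a misplaced or misspelled record
}
mismatches = genus_mismatches(placements)
print(mismatches)  # [('Cus dus', 'Aus')]
```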
The last test is slightly more pedantic, but revealing nevertheless. It relies on the convention in zoology that when you write the authorship of a species name, if the name is not in the original genus then you enclose the authorship in parentheses. For example, it's Homo sapiens Linnaeus, but Homo erectus (Dubois, 1894) because Dubois originally called this species Pithecanthropus erectus.
Because you can only move a species to a genus that has been named, it follows that if a species is described before the genus name was published, then if the species is in that newer genus the authorship must be in parentheses. For example, the lepidopteran genus Heliopyrgus was published in 1957, and includes the species willi Plötz, 1884. Since this species was described before 1957, it must have been originally placed in a different genus, and so the species name should be Heliopyrgus willi (Plötz, 1884). However, GBIF has this as Heliopyrgus willi Plötz, 1884 (no parentheses). The GraphGist tests for this, and finds several species of Heliopyrgus that are incorrectly formed. This may seem pedantic, but it has practical consequences. Anyone searching for the original description of Heliopyrgus willi Plötz, 1884 might think that they should be looking for the text string "Heliopyrgus willi" in literature from 1884, but the name didn't exist then and so the search will be fruitless.
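The rule reduces to a comparison of two years plus a check for parentheses. A small Python sketch of it, using the Heliopyrgus example (the function names are hypothetical, and the year is pulled from the authorship string with a simple regular expression):

```python
import re


def authorship_year(authorship):
    """Extract the first four-digit year from an authorship string, or None."""
    match = re.search(r"\b(\d{4})\b", authorship)
    return int(match.group(1)) if match else None


def missing_parentheses(authorship, genus_year):
    """True if the species predates its genus but the authorship lacks parentheses."""
    year = authorship_year(authorship)
    if year is None:
        return False  # cannot decide without a year
    text = authorship.strip()
    in_parentheses = text.startswith("(") and text.endswith(")")
    return year < genus_year and not in_parentheses


HELIOPYRGUS_YEAR = 1957  # year the genus name was published, as given above

print(missing_parentheses("Plötz, 1884", HELIOPYRGUS_YEAR))    # True
print(missing_parentheses("(Plötz, 1884)", HELIOPYRGUS_YEAR))  # False
```

The test only runs in one direction: a species described after the genus was published may or may not need parentheses, so those cases are left alone.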
I think there's a lot of scope for developing tests like these, including some that make use of external data as well. In an earlier post (A use case for RDF in taxonomy) I mused about using RDF to perform tests like this. However, Neo4J is so much easier to work with that I suspect it makes better sense to develop standard queries in its query language (Cypher) and use those.
August 4, 2015
Note to self about a possible project. This PLoS ONE paper:
Tibély, G., Pollner, P., Vicsek, T., & Palla, G. (2013, December 31). Extracting Tag Hierarchies. (P. Csermely, Ed.) PLoS ONE. Public Library of Science (PLoS). http://doi.org/10.1371/journal.pone.0084133
describes a method for inferring a hierarchy from a set of tags (and cites related work that is of interest). I've grabbed the code and data from http://hiertags-beta.elte.hu/home/ and put it on GitHub.
Possible project
Use the Tibély et al. method (or others) on taxonomic names extracted from BHL text (or other) and see if we can reconstruct taxonomic classifications. How do the classifications compare to those in databases? Can we enhance existing databases using this technique (e.g., extract classifications from literature for groups poorly represented in existing databases)? Could be part of a larger study of what we can learn from co-occurrence of taxonomic names, e.g. Automatically extracting possible taxonomic synonyms from the literature.
Note to anyone reading this: if this project sounds interesting, by all means feel free to do it. These are just notes about things that I think would be fun/interesting/useful to do.
July 31, 2015
One of my pet projects is BioStor, which has been running since 2009 (gulp). BioStor extracts articles from the Biodiversity Heritage Library (details here: http://dx.doi.org/10.1186/1471-2105-12-187), and currently has over 110,000 articles, all open access. The site itself is showing its age, both in terms of performance and design, so I've wanted to update it for a while now. I made a demo in 2012 of BioStor in the Cloud, but other stuff got in the way of finishing it, and the service that it ran on (Pagodabox) released a new version of their toolkit, so BioStor in the Cloud died.
At last I've found the time to tackle this again, motivated in part because I've had to move BioStor to a new server, and its performance has been pretty poor. The next version of BioStor is currently sitting at http://biostor.gopagoda.io (the images and map views are good ways to enter the site). It's still being populated, and there is code to tweak, but it's starting to look good enough to use. It has a cleaner article display, built-in search (making things much more findable), support for citation styles using citeproc-js, and display of altmetrics (e.g., Another variation on the gymnure theme: description of a new species of Hylomys (Lipotyphla, Erinaceidae, Galericinae)).
Once all the data has been moved across and I've cleaned up a few things I plan to make biostor.org point to this new version.
July 28, 2015
Quick notes on modelling taxonomic names in databases, as part of an ongoing discussion elsewhere about this topic.
We have a table for taxa and we don't distinguish between taxa and their names. The taxonomic hierarchy is represented by the parentID field, which points to your parent. If you don't have a (non NULL) value for parentID you are not an accepted taxon (i.e., you are a synonym), and the field acceptedID points to the accepted taxon. Simple, fits in a single database table (or, let's be honest, an Excel spreadsheet).
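A minimal sketch of this single-table model in Python, assuming rows keyed by id with the parentID and acceptedID fields described above. The `ROOT` sentinel for the top of the tree and the helper names are my own additions, not part of the model as described.

```python
ROOT = 0  # assumed sentinel parent for the top of the tree

TAXA = {
    1: {"name": "Poissonia", "parentID": ROOT, "acceptedID": None},
    2: {"name": "Poissonia heterantha", "parentID": 1, "acceptedID": None},
    3: {"name": "Coursetia heterantha", "parentID": None, "acceptedID": 2},
}


def is_accepted(taxon_id):
    """A row with a parentID is an accepted taxon; without one it is a synonym."""
    return TAXA[taxon_id]["parentID"] is not None


def accepted(taxon_id):
    """Follow acceptedID for synonyms; accepted taxa map to themselves."""
    return taxon_id if is_accepted(taxon_id) else TAXA[taxon_id]["acceptedID"]


def lineage(taxon_id):
    """Names from the accepted taxon up to the top of the tree."""
    current = accepted(taxon_id)
    names = []
    while current in TAXA:
        names.append(TAXA[current]["name"])
        current = TAXA[current]["parentID"]
    return names


print(lineage(3))  # ['Poissonia heterantha', 'Poissonia']
```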
The tradeoff is that you conflate names and taxa: you can't easily describe name-only relationships (e.g., homonyms, nomenclatural synonyms) without inventing "taxa" for each name.
Separating names and taxa
The next model, which I've drawn rather clunkily below as if you were doing this in a relational database, is based on the TDWG LSID vocabularies. One day someone will explain why the biodiversity informatics community basically ignored this work, despite the fact that all the key nomenclators use it.
In this model we separate out names as first-class objects with globally unique identifiers. The taxa table refers to the names table when it mentions a name. Any relationships between names are handled separately from taxa, so we can easily handle things like replacement names for homonyms, basionyms, etc. Note that we can also remove a lot of extraneous stuff from the taxa table. For example, if we decide that Poissonia heterantha is the accepted name for a taxon, we don't need to create taxa for Coursetia heterantha or Tephrosia heterantha, because by definition those names are synonyms of Poissonia heterantha.
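The separation can be sketched with two small record types in Python. This is an illustration rather than the TDWG vocabulary itself: the class and field names are invented, the identifiers are placeholders, and Tephrosia heterantha is assumed to be the basionym of the other two names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Name:
    id: str                            # globally unique identifier (placeholder values here)
    text: str
    basionym_id: Optional[str] = None  # name-to-name link, independent of any taxon


@dataclass(frozen=True)
class Taxon:
    id: str
    accepted_name_id: str              # taxa refer to names rather than containing them
    parent_id: Optional[str] = None


NAMES = {
    n.id: n
    for n in [
        Name("name:1", "Tephrosia heterantha"),  # assumed basionym, for illustration
        Name("name:2", "Coursetia heterantha", basionym_id="name:1"),
        Name("name:3", "Poissonia heterantha", basionym_id="name:1"),
    ]
}
TAXA = {"taxon:1": Taxon("taxon:1", accepted_name_id="name:3")}


def synonyms_of(taxon):
    """Names sharing a basionym with the accepted name, found without any extra taxa."""
    accepted = NAMES[taxon.accepted_name_id]
    root = accepted.basionym_id or accepted.id
    return sorted(
        n.text
        for n in NAMES.values()
        if n.id != accepted.id and (n.basionym_id or n.id) == root
    )


print(synonyms_of(TAXA["taxon:1"]))  # ['Coursetia heterantha', 'Tephrosia heterantha']
```

There is one taxon and three names, and the synonymy falls out of the name-to-name links alone.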
The other great advantage of this model is that it enables us to take the work the nomenclators have done straight without having to first shoe-horn it into the Darwin Core format, which assumes that everything is a taxon.
July 27, 2015
In my previous post I discussed the potential for the Biodiversity Data Journal (BDJ) to be a venue for nano (or near-nano) publications. In this post I want to draw attention to what I think is a serious stumbling block, which is the lack of machine readable statements in the journal.
Given that the journal is probably the most progressive in the field (indeed, I suspect that there are few journals in any field as advanced in publishing technology as BDJ) this may seem an odd claim to make. The journal provides XML of its text, and typically provides data in Darwin Core Archive format, which is harvested by GBIF. The article XML is marked up to flag taxonomic names, localities, etc. Surely this is the very definition of machine readable?
The problem becomes apparent when you ask "what claims or assertions are the papers making?", and "how are those assertions reflected in the article XML and/or the Darwin Core Archive?".
For example, consider the following paper:
Gil-Santana, H., & Forero, D. (2015, June 16). Aristathlus imperatorius Bergroth, a newly recognized synonym of Reduvius iopterus Perty, with the new combination Aristathlus iopterus (Perty, 1834) (Hemiptera: Reduviidae: Harpactorinae). BDJ. Pensoft Publishers. http://doi.org/10.3897/bdj.3.e5152
The title gives the key findings of this paper: Aristathlus imperatorius = Reduvius iopterus, and Reduvius iopterus = Aristathlus iopterus. Yet these statements are nowhere to be found in the Darwin Core Archive for the paper, which simply lists the name Aristathlus iopterus. The XML markup flags terms as names, but says nothing about the relationships between the names.
Here is another example:
Starkevich, P., & Podenas, S. (2014, December 30). New synonym of Tipula (Vestiplex) wahlgrenana Alexander, 1968 (Diptera: Tipulidae). BDJ. Pensoft Publishers. http://doi.org/10.3897/bdj.2.e4237
Indeed, I've yet to find an example of a paper in BDJ where a synonymy asserted in the text is reflected in the Darwin Core Archive!
The issue here is that neither the XML markup nor the associated data files are capturing the semantics of the paper, in the sense of what the paper is actually saying. The XML and DwCA files capture (some of) the names and localities mentioned, but not the (arguably) most crucial new pieces of information.
There is a disconnect between what the papers are saying (which a human reader can easily parse) and what the machine-readable files are saying, and this is worrying. Surely we should be ensuring that the Darwin Core Archives and/or XML markup are capturing the key facts and/or assertions made by the paper? Otherwise databases downstream will remain none the wiser about the new information the journal is publishing.
I stumbled across this intriguing paper:
Do, L., & Mobley, W. (2015, July 17). Single Figure Publications: Towards a novel alternative format for scholarly communication. F1000Research. F1000 Research, Ltd. http://doi.org/10.12688/f1000research.6742.1
The authors are arguing that there is scope for a unit of publication between a full-blown journal article (often not machine readable, but readable) and the nanopublication (a single, machine readable statement, not intended for people to read), namely the Single Figure Publication (SFP):
The SFP, consisting of a figure, the legend, the Material and Methods section, and an optional Results/Discussion section, reduces the unit of publication to a more tractable size. Importantly, it results in a markedly decreased time from data generation to publication. As such, SFPs represent a new means by which to communicate scientific research. As with the traditional journal article, the content of the SFPs is readily understandable by the scientist. Coupled with additional tools that aid in structuring content (e.g. describing in detail the methods using pre-defined steps from protocols), the SFP represents a “bottom-up” means by which scholars can structure the content of their findings in a modular and piece-wise fashion wedded to everyday laboratory life.
It seems to me that this is something that the Biodiversity Data Journal is potentially heading towards. Some of the papers in that journal are short, reporting, say, new occurrence records for a single species, e.g.:
Ang, Y., Rohner, P., & Meier, R. (2015, June 26). Across the Baltic: a new record for an enigmatic black scavenger fly, Zuskamira inexpectata (Pont, 1987) (Sepsidae) in Finland. BDJ. Pensoft Publishers. http://doi.org/10.3897/bdj.3.e4308
Imagine if we have even shorter papers that are essentially a series of statements of fact, or assertions (linked to supporting evidence). These could potentially be papers that annotated and/or clarified data in an external database, such as GBIF. For example, let's imagine we find two names in GBIF that GBIF treats as being different taxa, but a recent publication asserts are actually synonyms. We could make that information machine readable (say, using Darwin Core Archive format), link it to the source(s) of the assertion (i.e., the DOI of the paper making the synonymy), then publish that as a paper. As the Darwin Core Archive is harvested by GBIF, GBIF then has access to that information, and when the next taxonomic indexing occurs it can make use of that information.
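As a rough illustration of what such a machine readable assertion might look like, here is a Python sketch that writes a two-row taxon table using Darwin Core term names as column headers. It borrows the synonymy from the Gil-Santana and Forero paper in BDJ as example data. The choice of terms, the taxonID values, and the helper functions are my assumptions, and a complete Darwin Core Archive would also need its metadata files, which are omitted.

```python
import csv
import io

# Darwin Core term names used as column headers (an assumed, minimal selection).
FIELDS = ["taxonID", "scientificName", "taxonomicStatus",
          "acceptedNameUsage", "nameAccordingToID"]


def synonymy_rows(synonym, accepted, source_doi):
    """Build two rows: the accepted name, and the synonym pointing at it, both citing the source."""
    return [
        {"taxonID": "1", "scientificName": accepted, "taxonomicStatus": "accepted",
         "acceptedNameUsage": accepted, "nameAccordingToID": source_doi},
        {"taxonID": "2", "scientificName": synonym, "taxonomicStatus": "synonym",
         "acceptedNameUsage": accepted, "nameAccordingToID": source_doi},
    ]


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = synonymy_rows(
    synonym="Aristathlus imperatorius Bergroth",
    accepted="Aristathlus iopterus (Perty, 1834)",
    source_doi="http://doi.org/10.3897/bdj.3.e5152",
)
table = to_csv(rows)
print(table)
```

The important part is the last column: every row carries the identifier of the publication making the assertion, so the provenance travels with the data.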
One reason for having these "micropublications" is that sometimes resolving an issue in a dataset can take some time. I've often found errors in databases and have ended up spending a couple of hours finding names, literature, etc. to figure out what is going on. As fun as that is, in a sense it's effort that is wasted if it's not made more widely available. But if I can wrap that couple of hours scholarship into a citable unit, publish it, and have it harvested and incorporated into, say, GBIF, then the whole exercise seems much more rewarding. I get credit for the work, and GBIF users get (hopefully) a tiny bit of improvement, and they can see the provenance of that improvement (i.e., it is evidence-based).
This seems like a simple mechanism for providing incentives for annotating databases. In some ways the Biodiversity Data Journal could be thought of as doing this already. However, as I'll discuss in the next blog post, there's an issue that is preventing it being as useful as it could be.
July 23, 2015
These are some quick thoughts on the games on the BHL site, part of the Purposeful Gaming and BHL Project. As mentioned on Twitter, I had a quick play of the Beanstalk game and got bored pretty quickly. I should stress that I'm not a gamer (although my family includes at least one very serious gamer, and a lot of casual players). Personally, if I'm going to spend a large amount of time with a computer I want to be creating something, so gaming seems like a big time sink. Hence, I may not be the best person to review the BHL games. Anyhow...
It seems to me that there are a couple of ways games like this might work:
The BHL games are trying to get you to do one activity (type in the text shown in a fragment of a BHL book) and this means, say, a tree grows bigger. To me this feels like a huge disconnect (cf. point 2 above), there is no connection between what I'm doing and the outcome.
Worse, BHL is an amazing corpus of text and images, and this is almost entirely hidden from me. If I see a cool looking word, or some old typeface, there's no way for me to dig deeper (what text did that come from?, what does that phrase mean?). I get no sense of where the words come from, or whether I'm doing anything actually useful. For things like ReCAPTCHA (where you helped OCR books) this doesn't matter because I don't care about the books, I want my tickets. But for BHL I do care (and BHL should want at least some of the players to care as well).
So, remembering that I'm not a gamer, here are some quick ideas for games.
Find that species
One reason BHL is so useful is it contains original taxonomic descriptions. Sometimes the OCR is too poor for the name to be extracted from the description. Imagine a game where the player has a list of species (with cute pictures) and is told to go find them in the text. Imagine that we have a pretty good idea where they are (from bibliographic data we could, for example, know the page the name should occur on). The player hunts for the word on the page, and when they find it they mark it. BHL then gets corrected text and confirmation that the name occurs on that page. Players could select taxa (e.g., birds, turtles, mosses) that they like.
Find lat/longsBHL text is full of lat/long pairs, often the OCR is not quite good enough to extract them. Imagine that we can process BHL to find things that look like lat/long pairs. Imsgine that we can read enough of the text to get a sense of where in the world the text refers to. Now, have a game where we pick a spot on a map and find things related to that spot. Say we get presented with OCR text that may refer to that locality, we fix it, and the map starts get populated. A bit like Yelp and Four Square, we could imagine badges for the most articles found about a place.
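To make the "find things that look like lat/long pairs" step a little more concrete, here is a rough Python sketch. The pattern and the sample string are my own illustrative assumptions (nothing here comes from BHL), and real OCR text would need a much more forgiving pattern.

```python
import re

# Hypothetical pattern for one coordinate: degrees, optional minutes,
# optional seconds, then a hemisphere letter. It tolerates "o" or "*"
# where the OCR has misread the degree sign.
COORD = (
    r"(\d{1,3})\s*[°o*]\s*"
    r"(?:(\d{1,2})\s*['′]\s*)?"
    r'(?:(\d{1,2})\s*["″]\s*)?'
    r"([NSEW])"
)
# A pair is two coordinates separated by whitespace, commas or semicolons.
PAIR = re.compile(COORD + r"[\s,;]+" + COORD)


def find_latlong_candidates(text):
    """Return substrings of `text` that look like a latitude/longitude pair."""
    candidates = []
    for match in PAIR.finditer(text):
        lat_hemisphere, lon_hemisphere = match.group(4), match.group(8)
        # Only keep pairs written as latitude first, longitude second.
        if lat_hemisphere in "NS" and lon_hemisphere in "EW":
            candidates.append(match.group(0))
    return candidates


# Made-up OCR fragment with a misread degree sign in the longitude.
sample = "collected near the river, 12° 30' S, 45o 15' E, in dry forest"
print(find_latlong_candidates(sample))
```

The candidates found this way are exactly the fragments a player would be asked to check and fix.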
Find the letter/font
There are lots of cool symbols and fonts in BHL, and someone might be interested in collecting these. Simple things might be diphthongs such as æ. Older BHL texts are full of these, often misinterpreted. Other examples are male and female symbols. Perhaps we could have a game where we try and guess what symbol the OCR text actually matches. In other words, show the OCR text first, the player tries to guess the actual symbol, then the image appears, and then the player types in the actual symbol. The goal is to get good at predicting OCR errors.
Games like this would really benefit if the player could see (say, on the side) the complete text. Imagine that you correct a word, then you see it comes from a gorgeous plate of a bird. Imagine you could then correct any of the other words on that page.
Word eaters
Imagine the player is presented with a page with text and, a bit like Minecraft's monsters, things appear which start to "eat" the words. You need to check as many words as possible before the text is eaten. Perhaps structure things in such a way that checked words form a barrier to the word-eating creatures and buy you some time, or, like Minecraft, fixing a bad OCR word blasts a radius free of the word eaters. As an option (again, like Minecraft) turn off the eaters and just correct the words at your leisure.
Countdown
Based on the UK game show: present a set of random letters (as images), the player makes the longest word they can, and we then check it against a dictionary. This tells us what letters the player thinks the images represent.
Falling words
Have page fragments fall from the top of the screen, and have a key word displayed (say, "sternum", or enable the player to type a word in), then display images of words whose OCR text resembles this (in other words, have a bunch of OCR text indexed using methods that allow for errors). As the word images fall, the player taps on an image that matches the word and they are collected. Maybe add features such as a timeline to show when the word was used (i.e., the date of the BHL text), give the meaning of the word, lightly chide players who enter words like "f**k" (that'd be me), etc.
Summary
Like comedy, I imagine that designing games is really, really hard. But the best games I've seen create a world that the player is immersed in and which makes sense within the rules of that world. Regardless of whether these ideas are any good, my concern is that the BHL games seem completely divorced from context, and the game play bears no relation to outcomes in the game.
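As a toy illustration of finding OCR text that "resembles" a key word, here is a sketch using Python's standard difflib. The word list and the 0.7 cutoff are made up, and a real index over BHL would need something far more scalable than comparing the key word against every token.

```python
import difflib


def matching_ocr_words(keyword, ocr_words, cutoff=0.7):
    """Return OCR tokens that resemble `keyword`, best matches first."""
    # Compare case-insensitively, but return the tokens as they were OCR'd.
    by_lowercase = {word.lower(): word for word in ocr_words}
    hits = difflib.get_close_matches(
        keyword.lower(),
        list(by_lowercase),
        n=max(1, len(by_lowercase)),  # n must be positive
        cutoff=cutoff,
    )
    return [by_lowercase[hit] for hit in hits]


# Made-up OCR output: "rn" misread as "m", and "m" misread as "rn".
ocr_words = ["stemum", "sternurn", "femur", "sternum", "thorax"]
print(matching_ocr_words("sternum", ocr_words))
```

Each hit would correspond to a word image that falls down the screen for the player to accept or reject.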
July 22, 2015
Browsing JSTOR's Global Plants database I was struck by the number of comments people have made on individual plant specimens. For example, for the Holotype of Scorodoxylum hartwegianum Nees (K000534285) there is a comment from Håkan Wittzell that the "Collection number should read 1269 according to Plantae Hartwegianae". In JSTOR the collection number is 1209.
Now, many (if not all) of these specimens will also be in GBIF. Indeed, K000534285 is in GBIF as http://www.gbif.org/occurrence/912442645, also with collection number 1209. A GBIF user will have no idea that there is some doubt about one item of metadata about this specimen.
So, an obvious thing to do would be to make the link between the JSTOR and GBIF records. Implementing this would need some fussing because (sigh) unlike DOIs for articles we don't have agreed upon identifiers for specimens. So we'd need to do some mapping between the specimen barcode K000534285, the JSTOR URL http://plants.jstor.org/stable/10.5555/al.ap.specimen.k000534285, and the GBIF record http://www.gbif.org/occurrence/912442645.
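A minimal sketch of what one row of that mapping might look like in Python. It assumes the JSTOR URL is always the fixed prefix plus the lower-cased barcode, which is a guess generalised from this single example, and it assumes the GBIF occurrence id has already been found by some other means. The names `SpecimenLinks` and `link_specimen` are hypothetical.

```python
from dataclasses import dataclass

# Prefixes taken from the example URLs in the text.
JSTOR_PREFIX = "http://plants.jstor.org/stable/10.5555/al.ap.specimen."
GBIF_PREFIX = "http://www.gbif.org/occurrence/"


@dataclass(frozen=True)
class SpecimenLinks:
    """One row of a (hypothetical) barcode to JSTOR to GBIF lookup table."""
    barcode: str
    jstor_url: str
    gbif_url: str


def link_specimen(barcode, gbif_occurrence_id):
    """Build the three identifiers for a specimen.

    Assumption: the JSTOR URL is the prefix plus the lower-cased barcode.
    """
    return SpecimenLinks(
        barcode=barcode,
        jstor_url=JSTOR_PREFIX + barcode.lower(),
        gbif_url=GBIF_PREFIX + str(gbif_occurrence_id),
    )


links = link_specimen("K000534285", 912442645)
print(links.jstor_url)
print(links.gbif_url)
```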
In addition to providing users with more information, it might also be useful in kickstarting annotation on the GBIF site. At the moment GBIF has no mechanism for annotating data, and if it did, then it would have to start from scratch. Imagine that a person visiting occurrence 912442645 sees that it has already attracted attention elsewhere (e.g., JSTOR). They might be encouraged to take part in that conversation (because at least one person cared enough to comment already). Likewise, we could feed annotations on the GBIF site to JSTOR.
A variation on this idea is to think of annotations such as those in the JSTOR database as being analogous to the tweets, blog posts, and bookmarking that altmetric tracks for academic papers. Imagine if we applied the same logic to GBIF and had a way to show users that a specimen has been commented on in JSTOR Plants? Thinking further down the track, we could imagine adding other sorts of "attention", such as citations by papers, vouchers for DNA sequences, etc.
It would be a fun project to see whether the Disqus API enabled us to create a tool that could match JSTOR Global Plants comments to GBIF occurrences.
Steve Baskauf has concluded a thoughtful series of blog posts on RDF and biodiversity informatics with http://baskauf.blogspot.co.uk/2015/07/confessions-of-rdf-agnostic-part-7.html. In this post he discussed the "Rod Page Challenge", which was a series of grumpy posts I wrote (starting with this one) where I claimed RDF basically sucked, and to illustrate this I issued a challenge for people to do something interesting with some RDF I provided. Since this RDF didn't have a stable home I've put it on GitHub and it has a DOI http://dx.doi.org/10.5281/zenodo.20990 courtesy of GitHub's integration with Zenodo.
I argued that the RDF typically available was basically useless because it wasn't adequately linked (see Reflections on the TDWG RDF "Challenge"). Two of the RDF files I provided were created specifically to tackle this problem (derived from my projects iPhylo Linkout http://dx.doi.org/10.1371/currents.RRN1228 and the precursor to BioNames http://dx.doi.org/10.7717/peerj.190). This marked pretty much the end of any interest I had in pursuing RDF.
Towards the end of Steve's post he writes:
At the close of my previous blog post, in addition to revisiting the Rod Page Challenge, I also promised to talk about what it would take to turn me from an RDF Agnostic into an RDF Believer. I will recap the main points about what I think it will take in order for the Rod Page Challenge to REALLY be met (i.e. for machines to make interesting inferences and provide humans with information about biodiversity that would not be obvious otherwise):
Steve's point 1 is essentially the point I was making with the challenge. At the time of the challenge, RDF from major biodiversity informatics projects was in silos, with few (if any) links to external resources (the kinds of things Steve refers to in his point 2). As a result, the promised benefits from RDF simply haven't materialised. The lesson I took from this is that we need rich, dense cross-links between data sources (the "biodiversity knowledge graph"), and that's one reason I've been obsessed with populating BioNames, which links animal names to the primary literature (I'm planning to extend this to plants as well). Turns out, creating lots of cross links is really hard work, much harder than simply pumping out a bunch of RDF and waiting for it to automagically coalesce into an all-connected knowledge graph.
I posed the challenge back in 2011, and since then I think the landscape has changed to the extent that I wonder if trying to "fix" RDF is really the way forward.
XML is dead
Anyone (sane) developing for the web and wanting to move data around is using JSON; XML is hideous and best avoided. Much of the early work on RDF used XML, which only made things even harder than they already were. JSON beats XML, to the extent that RDF itself now has a JSON serialisation, JSON-LD. But JSON-LD is about more than the semantic web (see JSON-LD and Why I Hate the Semantic Web), and has the great advantage that you can actually ignore all the RDF cruft (i.e., the namespaces) and simply treat the data as key-value pairs (yay!). Once you do that, then you can have fun with the data, especially with databases such as CouchDB ("fun" and "database" in the same sentence, I know!).
Key-value pairs, document stores, and graph databases
The NoSQL "movement" has thrown up all sorts of new ways to handle data and to think about databases. We can think of RDF as describing a graph, but it carries the burden of all the namespaces, vocabularies, and ontologies that come with it. Compare that with the fun (there's that word again) of graph databases such as Neo4J with its graph gists. The Neo4J folks have done a great job of publicising their approach, and making it easy and attractive to play with.
So, we're in an interesting time when there are a bunch of technologies available, and I think maybe it's time to ask whether the community's allegiance to RDF and the Semantic Web has been somewhat misplaced...
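To show what "ignore the RDF cruft" can mean in practice, here is a small Python sketch. The JSON-LD document is made up for illustration (it borrows the specimen from the JSTOR post), and the helper `plain_pairs` is hypothetical. It only handles prefixed keys such as `dwc:catalogNumber`, not keys written as full URIs.

```python
import json

# A made-up JSON-LD document: a context, an identifier, and two properties.
doc = json.loads("""
{
  "@context": {"dwc": "http://rs.tdwg.org/dwc/terms/"},
  "@id": "http://www.gbif.org/occurrence/912442645",
  "dwc:catalogNumber": "K000534285",
  "dwc:recordNumber": "1209"
}
""")


def plain_pairs(jsonld):
    """Drop JSON-LD keywords (@context, @id, ...) and namespace prefixes."""
    return {
        key.split(":")[-1]: value
        for key, value in jsonld.items()
        if not key.startswith("@")
    }


print(plain_pairs(doc))
```

What is left is an ordinary dictionary that could be dropped straight into a document store.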
June 25, 2015
Two ongoing challenges in biodiversity informatics are getting data into a form that is usable, and linking that data across different projects and platforms. A recent and interesting approach to this problem is "data journals", as exemplified by the Biodiversity Data Journal. I've been exploring some data from this journal that has been aggregated by GBIF and EOL, and have come across a few issues. In this post I'll first outline the standard format for moving data between biodiversity projects, the Darwin Core Archive, then illustrate some of the pitfalls.
Darwin Core Archive
Firstly a quick digression on the Darwin Core Archive format, which has a few gotchas for newcomers to the format (such as myself). The Darwin Core Archive supports a "star schema" like this.
At the centre of the star is a table containing data either about taxa or occurrences. We can have additional tables with other sorts of data, and we also have a meta.xml file which tells us what all the data columns are and how the different tables are related to the core table.
For example, if we have taxa as our core, then we can have a table like this where each taxon has a unique taxon_id:
taxon_id | taxon
1 | stuff
2 | stuff
3 | stuff
Now, imagine that we have a reference for each of these taxa (say it's the paper that originally described these species). Then we could add a unique identifier for that reference reference_id to the taxon table:
taxon_id | reference_id | taxon
1 | a | stuff
2 | a | stuff
3 | a | stuff
Now, if we were building a relational database we could have a separate table for the references, and link the two tables using the reference_id as a primary key for the references and as a foreign key in the taxon table, like this:
reference_id | reference
a | stuff
This means that we need only have the reference stored once, which means there's no redundancy. If we need to update the reference data, we only need to do it once.
However, this is not how Darwin Core Archive works. Because it's a star schema, we need to have a references table like this:
reference_id | taxon_id | reference
a | 1 | stuff
a | 2 | stuff
a | 3 | stuff
Note that we have added the taxon_id to link the reference to each taxon, and that the same reference occurs three times (once for each taxon it refers to), hence we have redundancy. Note also that if we don't include the taxon_id key then there's no way for a Darwin Core Archive reader to link the reference to the corresponding taxa (we'll come back to this below).
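The difference between the two layouts can be sketched in a few lines of Python. The tables are the toy ones from this post, held as plain dictionaries, and `star_schema_references` is a hypothetical helper, not part of any Darwin Core library.

```python
# Relational layout: each taxon points at a reference that is stored once.
taxa = [
    {"taxon_id": 1, "reference_id": "a", "taxon": "stuff"},
    {"taxon_id": 2, "reference_id": "a", "taxon": "stuff"},
    {"taxon_id": 3, "reference_id": "a", "taxon": "stuff"},
]
references = {"a": {"reference": "stuff"}}


def star_schema_references(taxa, references):
    """Repeat each reference once per taxon, keyed back to the core taxon_id."""
    rows = []
    for taxon in taxa:
        reference_id = taxon["reference_id"]
        rows.append({
            "reference_id": reference_id,
            "taxon_id": taxon["taxon_id"],  # the link back to the core table
            **references[reference_id],
        })
    return rows


for row in star_schema_references(taxa, references):
    print(row)
```

One stored reference becomes three rows in the extension table, which is the redundancy described above.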
I've said that the references are in their own table. In fact, we can have everything in one big table, and use the meta.xml file to tell a Darwin Core Archive reader to process that same table but extract different data each time (the Mammal Species of the World checklist http://doi.org/10.15468/csfquc is an example of this). Hence, we could extract taxon_id and taxon stuff for the taxa, then reference_id, reference stuff for the references.
taxon_id | reference_id | taxon | reference
1 | a | stuff | stuff
2 | a | stuff | stuff
3 | a | stuff | stuff
The other thing to remember is that the meta.xml file is responsible for describing the data. It does this in two ways: (1) it defines the type of data a given table contains (e.g., taxa, occurrence, image, etc.), and (2) it defines what each column in the data represents, using a controlled vocabulary.
The type of data each table contains is defined by a URI, and the list of these "registered extensions" is available from GBIF. The two "core" extensions are for taxa and occurrences, the two things GBIF primarily deals with, while the other extensions enable richer data to be added. Of course, a Darwin Core Archive consumer that doesn't understand these extensions can simply ignore them. Rather unfortunately, some extensions, such as the EOL media and references extensions, overlap with the GBIF multimedia and references extensions. Hence, if you have, say, images or bibliographic data, you have two extensions to choose from. If you choose EOL's then EOL will import your data, but GBIF won't. Furthermore, the extensions vary in richness. If you have bibliographic data then GBIF's vocabulary for references looks sparse and lacking many of the fields one might expect, whereas EOL's is quite rich.
Problems with Biodiversity Data Journal and GBIF
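As a rough illustration of those two jobs, the Python sketch below reads a tiny meta.xml and lists, for each table, its type and what each column means. The XML is a made-up example written from memory of the Darwin Core text format (it is not taken from any real archive), and `describe_tables` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}

# Made-up meta.xml: a taxon core plus one references extension.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxa.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
  <extension rowType="http://eol.org/schema/reference/Reference">
    <files><location>references.csv</location></files>
    <coreid index="0"/>
    <field index="1" term="http://purl.org/dc/terms/bibliographicCitation"/>
  </extension>
</archive>"""


def describe_tables(meta_xml):
    """List each table's file, row type and column index -> meaning."""
    tables = []
    for table in ET.fromstring(meta_xml):
        columns = {}
        for child in table:
            index = child.get("index")
            if index is None:  # e.g. the <files> element
                continue
            tag = child.tag.split("}")[-1]  # strip the XML namespace
            columns[int(index)] = child.get("term") or tag
        tables.append({
            "file": table.find("dwca:files/dwca:location", NS).text,
            "row_type": table.get("rowType"),
            "columns": columns,
        })
    return tables


for table in describe_tables(META_XML):
    print(table)
```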
With that background, let's take a look at what happens to Biodiversity Data Journal (BDJ) data once it enters GBIF. Take, for example, the species Eupolybothrus cavernicolus, described using "transcriptomic, DNA barcoding and micro-CT imaging data" (http://dx.doi.org/10.3897/BDJ.1.e1013). Data from this paper is in GBIF as both an occurrence dataset (http://doi.org/10.15468/zpz4ls) and checklist dataset (http://doi.org/10.15468/rpavbl).
Images
The checklist dataset includes both media and references. The images don't appear in GBIF, but are visible in EOL (e.g., http://eol.org/data_objects/26558840).
Because the type for the media is set to a type (http://eol.org/schema/media/Document) that only EOL recognises, GBIF doesn't harvest the images, and hence misses out on all this extra multimedia goodness.
References
The references in the BDJ dataset don't appear in either GBIF or EOL (see http://eol.org/pages/38177334/literature). Presumably they don't appear in GBIF because BDJ uses EOL's extension, but why don't they appear in EOL? Looking at the raw data, the references.csv file in the Darwin Core Archive lacks the coreid field needed to link the references to the corresponding taxon (the field is defined in the meta.xml file, but there is no corresponding column in the references.csv file). Looking at other BDJ Darwin Core Archives this seems to be a common problem.
Map
Strangely the BDJ paper shows a map with a point locality, but the same data in GBIF does not (see http://doi.org/10.15468/zpz4ls).
A look at the occurrences.csv shows that the file has verbatim latitude and longitude but not decimal versions of the coordinates, which is what GBIF uses to locate records on the map. So the BDJ data set isn't contributing any geographical data. Clearly a lot of BDJ data is georeferenced (see map), but not this example.
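For what it's worth, going from degrees, minutes and seconds to decimal degrees is simple arithmetic, as in the Python sketch below. I'm assuming the verbatim values have already been split into their parts (how they are actually written in the BDJ file isn't shown here), and the numbers in the example are invented, not the centipede's locality.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds plus a hemisphere to decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal notation.
    return -value if hemisphere in ("S", "W") else value


# Invented example values for a latitude and a longitude.
print(dms_to_decimal(43, 30, 0, "N"))
print(dms_to_decimal(20, 15, 0, "W"))
```

The results are the kind of values that would go in the decimalLatitude and decimalLongitude columns.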
Taxa
The centipede Eupolybothrus cavernicolus is not in GBIF's backbone classification. This is a common issue, especially with newly described taxa. GBIF does not have access to recent nomenclatural data, and so even though the BDJ data comes with a ZooBank LSID urn:lsid:zoobank.org:act:6F9A6F3C-687A-436A-9497-70596584678C for the name Eupolybothrus cavernicolus, GBIF itself doesn't know about it, and so if you do a default search on the name Eupolybothrus cavernicolus you get only the genus.
Summary
Here are the issues I uncovered after a little bit of messing about:
What both puzzles and frustrates me is that a much trumpeted collaboration between these projects has significant problems which seem to have gone undetected. It seems as if it is enough to have a pipeline between a data journal and a project, without actually testing whether that pipeline loses or misrepresents the data. In some cases, very little of the data in a BDJ archive actually makes it into GBIF, which is wasteful and rather defeats the point of having a data journal to database pipeline in the first place.
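Two of the problems above are the sort of thing a very crude automated check could flag before an archive is published. The Python sketch below is only an illustration under my own assumptions: the declared columns are passed in as a dictionary of index to name (as might be read from meta.xml), the CSV snippets are invented, and both function names are hypothetical.

```python
import csv
import io


def missing_columns(declared, csv_text):
    """Return declared column names whose index is beyond the widest CSV row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    width = max((len(row) for row in rows), default=0)
    return [name for index, name in sorted(declared.items()) if index >= width]


def lacks_decimal_coordinates(csv_text):
    """True if the header row has no decimalLatitude/decimalLongitude columns."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return not {"decimalLatitude", "decimalLongitude"} <= set(header)


# Invented data: meta.xml declares three columns, the file only has two.
declared = {0: "identifier", 1: "bibliographicCitation", 2: "coreid"}
references_csv = "identifier,bibliographicCitation\nref1,Some paper 2013\n"
occurrences_csv = "occurrenceID,verbatimLatitude,verbatimLongitude\n1,x,y\n"

print(missing_columns(declared, references_csv))
print(lacks_decimal_coordinates(occurrences_csv))
```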
June 24, 2015
I spent last Friday and Saturday at ReCon (Research in the 21st Century: Data, Analytics and Impact, hashtag #ReCon_15) in Edinburgh. Friday 19th was conference day, followed by a hackday at CodeBase. There's a Storify archive of the tweets so you can get a sense of the meeting.
Sitting in the audience a few things struck me.
GitHub is becoming more and more important, not only as a repository of scientific code and data, but as a useful model of the sorts of things we need to be doing. Arfon Smith gave a fascinating talk on GitHub. Apart from the obvious things such as version control, Arfon discussed the tools and mindset of open source programmers, and how that could be applied to scientific data. For example, software on GitHub is often automatically tested for bugs (and GitHub displays a badge saying whether things are OK). Imagine doing this for a data set, having it automatically checked for errors and/or internal consistency. Reproducibility is a big topic in science, but open source software has to be reproducible by default in the sense that it has to be able to be downloaded and compiled on a user's computer. These are just a couple of the things Arfon covered, see his slides for more.
Transitive Credit
One idea which particularly struck me was that of "transitive credit": Katz, D. S. (2014, February 10). Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. JORS. Ubiquity Press, Ltd. http://doi.org/10.5334/jors.be
From the above paper:
The idea of transitive credit is as follows: The credit map for product A, which is used by product B, feeds into the credit map for product B. For example, product A is a software package equally written by two authors and its credit map is that 50 percent of the credit for this should go the lead developer, 20 percent to the second developer, and 10 percent to the third developer. In addition, 5 percent should go to each of the four libraries that are needed to run the code. When this product is created and registered, this credit map is registered along with it. Product B is a paper that obtains new science results, and it depended on Product A. The person who registers the publication also registers its credit map, in this case 75 percent to her/himself, and 25 percent to the software code previous mentioned. Credit is now transitive, in that the lead software developer of the code can be given credit for 12.5 percent of the paper. If another paper is later written that extends the product B paper and gives 10% credit to that paper, the lead software package developer will also have 1.25% credit for the new paper.
The idea of being able to track credit across derived products is interesting, and is especially relevant to projects such as GBIF, where users can download large datasets that are themselves aggregations of data from numerous different providers (making it possible to calculate the relative contributions of each provider). If we then track citations of that data (and citations of those citations) we could give data providers a better estimate of the actual impact of their data.
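The arithmetic in the quoted example is easy to check with a few lines of Python. This is just my own sketch of the idea, not anything from Katz's paper: credit maps are dictionaries of shares, and anything that itself has a credit map gets expanded recursively (so the maps must not contain cycles). The 90 percent going to the author of the later paper is my assumption; the quote only gives the 10 percent.

```python
from fractions import Fraction


def percent(value):
    """Exact share, so that 12.5% and 1.25% come out without rounding."""
    return Fraction(value, 100)


credit_maps = {
    "software A": {
        "lead developer": percent(50),
        "second developer": percent(20),
        "third developer": percent(10),
        "library 1": percent(5),
        "library 2": percent(5),
        "library 3": percent(5),
        "library 4": percent(5),
    },
    "paper B": {"author of B": percent(75), "software A": percent(25)},
    # Assumption: the remaining 90% goes to the new paper's own author.
    "paper C": {"author of C": percent(90), "paper B": percent(10)},
}


def transitive_credit(product, credit_maps):
    """Expand a product's credit map until only end contributors remain."""
    totals = {}
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:  # a product with its own credit map
            inherited = transitive_credit(contributor, credit_maps)
            for name, sub_share in inherited.items():
                totals[name] = totals.get(name, 0) + share * sub_share
        else:
            totals[contributor] = totals.get(contributor, 0) + share
    return totals


print(float(transitive_credit("paper B", credit_maps)["lead developer"]))
print(float(transitive_credit("paper C", credit_maps)["lead developer"]))
```

For a GBIF download the same expansion would run over data providers rather than developers.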
Impact
Euan Adie of Altmetric talked about "impact", and remarked on an example of a paper being cited in a policy document and this being picked up by Altmetric and seen by the authors of the paper, who had no idea that their work had influenced a policy document. This raises some intriguing possibilities, related to the idea of "transitive credit" above.
doi:10.1017/S0968047002000018
This paper has no recent "buzz" (e.g., Twitter, Facebook, Mendeley) but is cited on three Wikipedia pages. So, this paper has impact, albeit in social media. Many papers like this will slip below the social media radar but will be used by various databases and may contribute to subsequent work. Perhaps we could expand altmetrics sources of information to include some of those databases. For example, if a paper has been aggregated/cited by a major database (such as GBIF) then it would be nice to see that on the Altmetric donut. For authors this gives them another example of the impact of their work, but for the databases it's also an opportunity to increase engagement (if people have relevant work that doesn't appear in the donut they can take steps to have that work included in the aggregation). Obviously there are issues about what databases to count as providing signal for altmetrics, but there's scope here to broaden and quantify our notion of impact.
Hackday
The ReCon hackday was a pretty informal event held at CodeBase, just down from Edinburgh Castle and apparently the largest start-up incubator in the European tech scene. It was a pretty amazing place, and a great venue for a hackday.
June 20, 2015 I spent the day looking at the ORCID API and seeing if I could create some mashups with Journal Map and my own BioNames. One goal was to see if we could generate a map of a researcher's study sites starting with their ORCID, using ORCID's API to retrieve a list of their publications, then talking to the Journal Map API to get point localities for those papers. The code worked, but the results were a little disappointing because Jim Caryl and I were focussing on University of Glasgow researchers, and they had few papers in Journal Map. The code, such as it is, is in GitHub.
My original idea was to focus on BioNames, and see how many authors of taxonomic papers had ORCIDs. Initial experiments seemed promising (see GitHub for code and data). Time was limited, so I got as far as building lists of DOIs from BioNames and discovering the associated ORCIDs. The next steps would be (a) providing ORCID login to BioNames, and (b) using ORCID to help cluster author name strings in BioNames. Still much to do.
I've not been to many hackdays/hackathons, but I find them much more rewarding than simply sitting in a lecture theatre and listening to people talk. Combining both types of meeting is great, and I look forward to similar events in the future.
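The first step of that pipeline (ORCID to list of DOIs) might look something like the Python sketch below. It is not the hackday code. It assumes the works have already been fetched as JSON, and that the JSON is shaped like the works summary I remember from a later version of ORCID's public API (a "group" list whose entries carry "external-ids"); the API as it was in 2015 may well have differed. The sample data is invented apart from the BioNames DOI, and the Journal Map step is left out because I don't want to guess at its API.

```python
def dois_from_orcid_works(works):
    """Collect the distinct DOIs from a works summary (assumed JSON shape)."""
    dois = []
    for group in works.get("group", []):
        external_ids = (group.get("external-ids") or {}).get("external-id") or []
        for external_id in external_ids:
            if external_id.get("external-id-type") != "doi":
                continue
            doi = external_id.get("external-id-value", "").lower()
            if doi and doi not in dois:
                dois.append(doi)
    return dois


# Invented response fragment: one work with a DOI, one with an ISSN only.
sample_works = {
    "group": [
        {"external-ids": {"external-id": [
            {"external-id-type": "doi", "external-id-value": "10.7717/peerj.190"},
        ]}},
        {"external-ids": {"external-id": [
            {"external-id-type": "issn", "external-id-value": "2051-8188"},
        ]}},
    ]
}

print(dois_from_orcid_works(sample_works))
```

Each DOI would then be looked up in Journal Map to see whether it has any point localities.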
I've published a short note on my work on geophylogenies and GeoJSON in PLoS Currents Tree of Life: Page R. Visualising Geophylogenies in Web Maps Using GeoJSON. PLOS Currents Tree of Life. 2015 Jun 23. Edition 1. doi:10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e.
At the time of writing the DOI hasn't registered, so the direct link is here. There is a GitHub repository for the manuscript and code.
I chose PLoS Currents Tree of Life because it is (supposedly) quick and cheap. Unfortunately a perfect storm of delays in reviewing together with licensing issues resulted in the paper taking nearly three months to appear. The licensing issues were a headache. PLoS uses the Creative Commons CC-BY license for all its content. Unfortunately, the original submission included maps from Google Maps and Open Street Map (OSM), to show that the GeoJSON produced by my tool could work with either. Google Maps tile imagery is not freely available, so I had to replace that in order for PLoS to be able to publish my figures. At first I simply replaced the tiles Google Maps displays with ones from OSM, but those tiles are CC-BY-SA, which is incompatible with PLoS's use of CC-BY. Argh! I got stroppy about this on Twitter:
FFS. So it appears I can't use either Google Maps or Open Street Map in a @PLOSCurrents article. Open licensing somehow feels worse than ©
— Roderic Page (@rdmpage) June 16, 2015
Eventually I discovered maps from CartoDB that have CC-BY licenses, and so could be used in the PLoS Currents article. After replacing Google's and OSM's tiles with these maps (and trimming off the "Google" logo) the figures were acceptable to PLoS. Increasingly I think Creative Commons has resulted in a mess of mutually incompatible licenses that make mashing up things hard. The idea was great ("skip the intermediaries" by declaring that your content can be used), but the outcome is messy and frustrating.
But, enough grumbling. The article is out, the code is in GitHub. Now to think about how to use it.
The Genealogical World of Phylogenetic Networks
BMC Evolutionary Biology