iPhylo

Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed.

ISSN 2051-8188
URL: http://iphylo.blogspot.com/

May 23, 2013 03:06
Cool! @alaskamuseum has DOIs for specimens, e.g. dx.doi.org/10.7299/X78052… (minted by @ezidcdl) - Roderic Page (@rdmpage), May 22, 2013

I've been banging on about having citable, persistent identifiers for specimens, so was suitably impressed when Derek Sikes posted a comment on iPhylo that Arctos already does this. For example, here is a DOI for a specimen: http://dx.doi.org/10.7299/X7VQ32SJ.

So, we're all done, right? Not quite. DOIs by themselves don't get us where we (OK, where I think we) want to be. The DOI identifies a specimen, which is great (see the discussion in "iDigBio: You are putting identifiers on the wrong thing" for why this matters). We can also get machine-readable metadata using the DOI (by using the URL http://data.datacite.org/10.7299/X7VQ32SJ). The metadata is limited (ideally we'd want something like Darwin Core), but it is a start. It's not clear how we get from the DOI to Darwin Core.

There are at least two issues that remain to be tackled. The first is that we now have a bunch of identifiers for the same thing.
Most of these identifiers don't know about each other (for example, GBIF doesn't know about the DOI, nor does Arctos link to GBIF). So we have disconnected pieces of information about the same thing.

The second issue is how we discover a specimen DOI. CrossRef supports services where you can take a bibliographic citation, e.g. "Phylogeny and biogeography of ice crawlers (Insecta: Grylloblattodea) based on six molecular loci: designating conservation status for Grylloblattodea species", and get back a DOI (in this case, http://dx.doi.org/10.1016/j.ympev.2006.04.013). This makes it possible for publishers to take lists of literature cited in authors' manuscripts and quickly add DOIs to those citations. We don't have an equivalent service for specimens, which is going to make our task of linking specimens to sequences and the literature something of a challenge.

We are making progress, but there is some way to go. Identifiers are only part of the solution; we also need services.

May 16, 2013 07:19
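As an aside on the specimen DOI post above: besides the data.datacite.org URL, DOI resolvers support content negotiation, where the client asks for metadata by sending an Accept header. The following is a minimal sketch that only builds the request and does not send it. The resolver base URL and the function name are my assumptions, and whether a given DOI returns a given format depends on its registration agency.

```python
from urllib.request import Request

# Assumption: the current resolver address (the post itself uses http://dx.doi.org/)
DOI_RESOLVER = "https://doi.org/"


def metadata_request(doi, media_type="application/vnd.datacite.datacite+xml"):
    """Build (but do not send) a content-negotiation request for a DOI.

    The default media type asks for DataCite XML.
    """
    return Request(DOI_RESOLVER + doi, headers={"Accept": media_type})


# The specimen DOI quoted in the post
req = metadata_request("10.7299/X7VQ32SJ")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the lookup; that step is left out so the sketch runs offline.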
iPhylo: GBIF specimens in BioStor: who are the top ten museums? (Unfortunately not @amnh) iphylo.blogspot.com/2012/02/gbif-s… - Susan Perkins (@NYCuratrix), May 14, 2013

Ideas on measuring the "impact" of a natural history collection have been bubbling along, as reflected in recent comments on iPhylo, and some offline discussions I've been having with David Blackburn and Alan Resetar. My focus has been at the specimen level, with a view to motivating the adoption of persistent specimen-level identifiers so that we can track citations of specimens over time (e.g., in publications and databases such as GenBank). Not only does this provide a measure of the "impact" of a collection, it helps with provenance. If we sequence a specimen that is subsequently assigned to a different taxon and we have a way of tracking that specimen via its identifier, then we can transmit that new identification to other consumers of data based on that specimen. For example, we could automatically notify GenBank that what we thought was an x is actually a y.

So I made a simple "league table" of museum collections based on specimens cited in BioStor. There are all sorts of issues with this approach. Once you rank collections, people may use that to argue some can be axed and more resources funnelled into others. A more positive approach would be to identify collections that are underused, and try to figure out why. And in the same way that taxonomic papers may have a long citation life, specimens may sit in a museum for a long time before being cited (for example, when eventually recognised as a new species, doi:10.1016/j.cub.2012.10.029). So, metrics can be a double-edged sword.

Citing specimens is a useful metric, but not all citations are equal, and not all citations are immediate. A specimen that yields DNA sequences that are published in, say, Nature, arguably has more weight than a specimen listed in a rarely cited paper. Likewise, subsequent citations of a paper that cites a specimen should confer more weight on the value of that specimen. Elsewhere (doi:10.1093/bib/bbn022, preprint here: hdl:10101/npre.2008.1760.1) I've argued for a Google PageRank-style way to measure the impact of a specimen that takes into account papers and other objects derived from a specimen (e.g., images, sequences).

Meanwhile, Morgan Jackson alerted me to a quicker way to get a measure of the impact of the collection.

@rdmpage @nycuratrix Check Nature for a recent note about this. A bird museum calc. their collection's h-score from papers citing specimens - Morgan Jackson (@BioInFocus), May 16, 2013

The "short note" Morgan refers to is by Kevin Winker and Jack J. Withrow:

Winker, K., & Withrow, J. J. (2013). Natural history: Small collections make a big impact. Nature, 493(7433), 480–480. doi:10.1038/493480b

They constructed a Google Scholar profile and collected papers that cite the University of Alaska Museum's bird collection (see here for full details). The h-score of this collection of papers is 42, which Winker and Withrow note is "equivalent to an average Nobel laureate in physics". Here's the graph of citations over time:

It's a neat trick, if a little time consuming. But one advantage it has is that it puts collections on a similar footing to individual researchers. You could imagine asking the question "how much money would you spend supporting a researcher at this level?" How does this compare to the resources actually being spent?

One thing I hope will emerge from discussions like this is a desire to make specimens first-class citizens of the web, with stable identifiers that enable them to be cited in the same way we cite papers and, increasingly, data sets.

May 2, 2013 04:55
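As an aside on the h-score mentioned in the collections post above: the h-index of a set of papers is the largest h such that h of the papers each have at least h citations. The following is a minimal sketch of that calculation. The citation counts are made up for illustration and are not the University of Alaska Museum data.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Made-up citation counts for five papers citing a collection
example_counts = [10, 8, 5, 4, 3]
print(h_index(example_counts))
```

The time-consuming part is the curation (finding the papers that cite the collection), not the arithmetic.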
Bob Mesibov (who has been a guest author on this blog) recently published a paper on data quality in ZooKeys:

Mesibov, R. (2013). A specialist’s audit of aggregated occurrence records. ZooKeys, 293(0), 1–18. doi:10.3897/zookeys.293.5111

In this paper Bob documents some significant discrepancies between data in his Millipedes of Australia (MoA) database and the equivalent data in the Atlas of Living Australia and GBIF (disclosure: I was a reviewer of the paper, and also sit on GBIF's science committee). This paper spawned a thread on TAXACOM, and also came up at the GBIF meeting I was at earlier this week.

One thing lacking from the discussion is a clear sense of just how big the discrepancies between GBIF and MoA data are, so I grabbed the data provided by Bob (http://dx.doi.org/10.3897/zookeys.293.5111.app) and extracted the records where GBIF and MoA disagreed. I converted these to GeoJSON and threw them on Google Maps:

You can see a live version here: http://bl.ocks.org/rdmpage/raw/5501293/ (it can take a little while for the map to appear). I've connected the MoA and GBIF localities for the same occurrence by a straight line, and the MoA records are encircled by an estimate of their uncertainty (for many records the circle is invisible at this scale). There are some fairly spectacular discrepancies, and a lot of relatively small-scale displacements of records.

Does this matter? The answer to this question will depend on what people want to do with the data. You may regard the discrepancies as serious (certainly it's interesting that there are so many differences between the two data sets), or minor given the geographic scale. But visualising them at least makes it possible to form a judgement.

April 25, 2013 04:47
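As an aside on the MoA versus GBIF map above: each discrepancy can be expressed as a GeoJSON LineString joining the two localities, with the great-circle distance attached as a property. The following is a minimal sketch under assumptions of mine. The function names, the property names, and the coordinates are made up for illustration and are not taken from the paper's data, and this is not necessarily how the map in the post was built.

```python
import json
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def discrepancy_feature(record_id, moa, gbif):
    """Build a GeoJSON Feature; moa and gbif are (latitude, longitude) pairs."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            # GeoJSON positions are [longitude, latitude]
            "coordinates": [[moa[1], moa[0]], [gbif[1], gbif[0]]],
        },
        "properties": {
            "id": record_id,
            "distance_km": round(haversine_km(*moa, *gbif), 1),
        },
    }


# Made-up coordinates, one degree of latitude apart
feature = discrepancy_feature("example-1", (-42.0, 147.0), (-41.0, 147.0))
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection, indent=2))
```

Sorting the features by `distance_km` would give a quick sense of how many discrepancies are spectacular and how many are small displacements.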
Things are finally coming together, at least enough to have a functioning demo. It looks awful, but shows the main things I want BioNames to do. One thing I'm most concerned about at this stage is the possible confusion users might experience between taxon names and concepts. For example, there are two pages about Pteropus, one about the name Pteropus, the other about the bat that bears this name (as understood by GBIF).
The demo is live at http://bionames.org/bionames-api/mockup_index.php (note that this is a temporary URL so I can't guarantee it will be online when you read this).

BioNames live mockup (video by Roderic Page on Vimeo)

April 22, 2013 03:05
Over on Google Plus (yeah, me neither) Donat Agosti is giving me a hard time regarding the quality of some data that I am using. I've responded to Donat directly, but here I just want to quickly outline two different approaches to cleaning and reconciling bibliographic metadata.
The problem addressed by Donat is the issue of multiple strings for the same journal (e.g., the plethora of different abbreviations and permutations people use to refer to the same journal). In trying to make sense of this mess there are a couple of strategies we can use. One is to cluster the strings into sets that we think refer to the same thing. We could then synthesise the preferred journal name from this set. We could make some sort of consensus string, for example. There are also some quite nice Bayesian methods for combining contradictory metadata.

Another approach, which I use, is to map the strings to a third-party identifier, in this case an ISSN. Once I've done this I can use the identifier to refer to the journal, hence ultimately I don't particularly care what string is best for the journal (indeed, I can defer to a third party for this decision).

The point is that obsessing over clean, "correct" bibliographic metadata is something of a fool's errand. Obviously, it's nice to have clean metadata if you can get it, but in many cases there is no exact answer to what the correct metadata is. Some journals have multiple names (e.g., in different languages), some run different volume numbering schemes in parallel, and date of publication can be rather problematic (see my Mendeley group on publication dates). If we can map a publication to a globally unique identifier, such as a DOI, then we can sidestep this issue and focus on what I think really matters: linking data together.

April 17, 2013 23:59
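As an aside on the journal-name post above: mapping variant strings to an identifier can be as simple as normalising each string to a crude key and looking that key up. The following is a minimal sketch under assumptions of mine. The journal name, its abbreviations, and the ISSN "0000-0000" are placeholders invented for illustration, and the normalisation rule is only one of many possible choices.

```python
import re
import unicodedata


def normalise(journal_string):
    """Reduce a journal string to a crude key: ASCII, lower case, letters and digits only."""
    ascii_text = (
        unicodedata.normalize("NFKD", journal_string)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", " ", ascii_text.lower()).strip()


# Hypothetical lookup table: every variant string points at the same placeholder ISSN
ISSN_BY_KEY = {
    normalise(variant): "0000-0000"
    for variant in ["Journal of Examples", "J. Examples", "J Examples"]
}


def issn_for(journal_string):
    """Return the ISSN mapped to this string, or None if the string is unknown."""
    return ISSN_BY_KEY.get(normalise(journal_string))


print(issn_for("J. EXAMPLES"))
print(issn_for("Some Other Journal"))
```

Once every variant resolves to the identifier, the question of which string is "correct" can be deferred to whoever maintains the identifier.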
This seems to be the season for big, arm-wavy documents about the future of biodiversity informatics (see A decadal view of biodiversity informatics: challenges and priorities). An equivalent document is being drafted based on the Global Biodiversity Informatics Conference (GBIC 2012). Writing these documents is hard work: they have to balance a set of conflicting visions, predict the future, and communicate a coherent plan to people who either could help make this happen, or feel they have a stake in the outcome.
Leaving all those constraints behind, and waving arms wildly, here's one take on the future of biodiversity informatics. I see three themes.

1. Knowing what we know

We have a limited grasp of how much we actually know, and crap tools to summarise this knowledge. I want a Google Analytics for biodiversity data where I can see at a glance the current state of our knowledge (e.g., what is the rate of sequencing of environmental samples in the Mediterranean? How much of Indonesia's amphibian fauna is in protected areas?). These are fairly trivial queries. If Google can analyse web traffic from sites being hit over a million times per day (~365 million hits per year) we can do the same thing on GBIF-scale databases. There is huge scope here for cool visualisation of the growth of our knowledge, such as this:

If biologists were explorers (Mammalia)... (video by Andrew W Hill on Vimeo)

Imagine the GBIF classification like this:

filesystem visualisation (video by wonderful websolutions on Vimeo)

2. Life stream

Terrible title, but this is where we monitor change, both "organic" and anthropogenic. This is where we use data mining to do a sentiment analysis of the biosphere, looking to detect changes such as outbreaks of disease, invasive species, etc. This builds on 1 but focusses on change. Imagine a "news service" for biology along the lines of tools available to financial markets (e.g., Silobreaker). This is where we interface with decision makers: to the extent that Braulio Dias's statement "I am convinced that the lack of adequate biodiversity monitoring is at the heart of our difficulties to make convincing arguments" is true, this theme tackles that problem.

3. Modelling the biosphere

"Time to model all life on Earth" (http://dx.doi.org/10.1038/493295a) is our equivalent of a moon shot (oh how I hate that analogy). Purves et al. have made the case; this is the task that will galvanise people outside the taxonomy/biodiversity community. This is real megascience (1. is data collection, 2. is data mining and analysis). Climate modellers and oceanographers get to do this. Can we do the same?

07:49
In an earlier post I discussed using Open Refine (formerly Google Refine) to clean and reconcile taxon names. I've added an additional service that can be used to reconcile author names that uses the Virtual International Authority File (VIAF) API. Using this service we can match authors to VIAF identifiers (you may have noticed these appearing on people's pages in Wikipedia, e.g. Mary J. Rathbun's Wikipedia page lists her VIAF as 61796012).
To use the service follow the instructions in the earlier post but add the service:

http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php

This service is fairly crude; in particular, I make no attempt to score the matches that VIAF returns because this would require parsing and normalising author names. This could be added if needed.

If you want some example names to try, here are some taxonomists:

George A Boulenger
G A Boulenger
Wilhelm Michaelsen
W Michaelsen
Colin Campbell Sanborn
Suzanne Hand
Philip Hershkovitz
Yehudah Leopold Werner
W B Spencer
Norman Platnick

April 16, 2013 13:27
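As an aside on the VIAF reconciliation service above: Open Refine talks to reconciliation services by sending a `queries` parameter containing a JSON object of numbered queries. The following is a minimal sketch that only builds such a URL and does not call the service. It assumes this particular service follows that standard convention and is still online, neither of which I have verified; the function name is mine.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Service URL as given in the post
SERVICE = "http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php"


def reconciliation_url(names, service=SERVICE):
    """Build a batch reconciliation URL: one numbered query (q0, q1, ...) per name."""
    queries = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    return service + "?" + urlencode({"queries": json.dumps(queries)})


url = reconciliation_url(["George A Boulenger", "Wilhelm Michaelsen"])
print(url)

# Decode the query string again to show what the service would receive
sent = json.loads(parse_qs(urlsplit(url).query)["queries"][0])
print(sent)
```

In normal use Open Refine builds these requests itself; the sketch is only meant to show what goes over the wire.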
BMC Ecology has published Alex Hardisty and Dave Roberts' white paper on biodiversity informatics:
Hardisty, A., & Roberts, D. (2013). A decadal view of biodiversity informatics: challenges and priorities. BMC Ecology, 13(1), 16. doi:10.1186/1472-6785-13-16 Here are their 12 recommendations (with some comments of my own):
Food for thought. I suspect the gaggle of biodiversity informatics projects will seek to align themselves with some of these goals, carving up the territory. Sadly, we have yet to find a way to coalesce critical mass around tackling these challenges. It's a cliché, but I can't help thinking "what would Google do?" or, more precisely, "what would a Google of biodiversity look like?"

April 11, 2013 09:29
Quick notes on "taxon concepts". In order to navigate through taxon names I plan to have at least one taxonomic classification in BioNames. GBIF makes the most sense at this stage.
The model I'm adopting is that the classification is a graph where nodes have the id used by the external database (in this case GBIF). Each node has one or more names attached, and where possible the names are linked to the original description. Where we have synonyms it would be nice to link the synonymy to publication(s) that proposed that relationship.

April 10, 2013 03:59
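As an aside on the taxon concept model above: the description (nodes keyed by an external id, each carrying names that may link to publications) maps naturally onto a couple of small record types. The following is a minimal sketch under assumptions of mine. The class names, field names, and the identifier "gbif:0000000" are hypothetical placeholders, not the BioNames schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Name:
    """A name string attached to a node in the classification."""
    name_string: str
    original_description: Optional[str] = None  # publication identifier, if known
    synonymy_sources: List[str] = field(default_factory=list)  # publications proposing a synonymy


@dataclass
class TaxonNode:
    """A node in the classification graph, identified by the external database's id."""
    external_id: str
    names: List[Name] = field(default_factory=list)
    children: List["TaxonNode"] = field(default_factory=list)


# Placeholder identifier; the real GBIF id is not given in the post
node = TaxonNode(external_id="gbif:0000000", names=[Name("Pteropus")])
print(node.external_id, [n.name_string for n in node.names])
```

Keeping names and nodes as separate records mirrors the distinction the earlier post draws between a page about a name and a page about the taxon that bears it.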
Donald Hobern drew my attention to the nice way iNaturalist displays taxonomic splits:

In this example, observations identified as Rhipidura fuliginosa are being split into Rhipidura fuliginosa and Rhipidura albiscapa.

This immediately reminds me of an idea which keeps circulating, namely using version control tools to manage taxonomic classification. Some years ago David Shorthouse proposed managing taxonomic classifications using version control (see Taxonomic Consensus as Software Creation). I discussed this in Taxonomy on a hard disk, and Pierre Lindenbaum has an interesting post on treating the NCBI taxonomy as a file system (A FUSE-based filesystem reproducing the NCBI Taxonomy hierarchy). The idea is that a taxonomy, such as the GBIF backbone taxonomy, could be placed in GitHub where people could clone it, annotate, correct, edit, or otherwise mess with it, then GBIF could pull in those edits and release an updated, cleaner taxonomy.

If software version control seems a bit esoteric, it's worth noting that use of GitHub is rapidly becoming much more mainstream in science, and not just for software development. People are using it to store versions of data analysis (e.g., https://github.com/dwinter/Fungal-Foray) and collaboratively write manuscripts (e.g., https://github.com/weecology/data-sharing-paper). The journal eLife is depositing articles there (e.g., https://github.com/elifesciences/elife-articles).

In addition to all the infrastructure GitHub provides (the ability to identify who did what and when, to roll back changes, to fork classifications, etc.) there is also the attraction of not creating yet more software, but simply editing a classification by moving folders around on your local filesystem. The idea seems irresistible…

April 7, 2013 15:27
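As an aside on the "classification as folders" idea above: if each taxon is a directory, the parent-child relationships can be read straight off the filesystem. The following is a minimal sketch, assuming a layout of my own invention in which a genus folder contains one folder per species. It builds a throwaway tree in a temporary directory so that it runs anywhere.

```python
import os
import tempfile


def classification_edges(root):
    """Return (parent, child) taxon pairs read from a directory tree rooted at root."""
    edges = []
    for dirpath, dirnames, _filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so the output is repeatable
        parent = os.path.basename(dirpath)
        for child in dirnames:
            edges.append((parent, child))
    return edges


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout using the species from the iNaturalist example
    root = os.path.join(tmp, "Rhipidura")
    os.makedirs(os.path.join(root, "Rhipidura fuliginosa"))
    os.makedirs(os.path.join(root, "Rhipidura albiscapa"))
    edges = classification_edges(root)

print(edges)
```

With the tree under version control, moving a folder becomes a recorded, revertible edit to the classification.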
Best. Flowchart. EVER! #ias13 twitter.com/jcolman/status… - Jonathon Colman (@jcolman), April 7, 2013

Came across this paper recently:

Liu, C., Shi, L., Xu, X., Li, H., Xing, H., Liang, D., Jiang, K., et al. (2012). DNA Barcode Goes Two-Dimensions: DNA QR Code Web Server. (R. DeSalle, Ed.) PLoS ONE, 7(5), e35146. doi:10.1371/journal.pone.0035146

Despite QR Codes being uncool, there's something appealing about the idea of compressing a DNA barcode sequence into a small image. Imagine having a specimen label with a QR Code, pointing a smart phone at the label using an app that converts the QR Code to a sequence, sends it to BLAST and returns a phylogeny that includes DNA from that specimen (perhaps using a service like http://iphylo.org/~rpage/phyloinformatics/blast).

April 4, 2013 07:16
I'm working on displaying OCR text from BHL using SVG, and these are just some quick notes on font size. Specifically how SVG font size corresponds to the size of letters, and how you work out what point size was used to print text on a BHL page.
SVG font-size corresponds to the EM square of the font. Hence, if I specify a font-size of 100px then text looks like this (you'll need a browser that supports SVG to see this):

The yellow box is the EM square (in this example 100px by 100px). The height of the letter "M" is set by the properties of the font, which in this case is Times-Roman, which has a capheight of 662. This value (and others) are defined in the font description file (Adobe-Core35_AFMs-314.tar.gz). Below is a diagram showing attributes of Times Roman with respect to the 1000 x 1000 EM square:

A couple of things to note. The first is that the height of a digital font is not given by simply adding the capheight and descender; the height of the font is the EM square. If you know the capheight and the font metrics you can compute the size of the EM square (for Times Roman, capheight / 0.662 gives you the EM square). Hence it is possible to fairly accurately reproduce printed text in SVG.

I had hopes that I could then go on to infer the actual point size used on the printed page (being able to say "this is 10pt" seemed more elegant than "this font is x pixels"). It turns out that "point size" is a terribly elusive concept, see Point Size and the Em Square: Not What People Think. I've clearly lots to learn about typography. BHL would be a gold mine for anyone interested in the development of typefaces and printing technology over time.

March 26, 2013 03:21
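As an aside on the font-size post above: the capheight relationship can be turned into a tiny calculation that goes from a measured capital-letter height to an SVG font-size. The following is a minimal sketch, assuming Times-Roman metrics as quoted in the post (capheight 662 in a 1000-unit EM square). The function names and the idea of feeding in a height measured from OCR output are my assumptions.

```python
from xml.sax.saxutils import escape

TIMES_ROMAN_CAPHEIGHT = 662  # from the post: capheight in font units
UNITS_PER_EM = 1000          # from the post: the EM square is 1000 x 1000 units


def em_square_px(capheight_px, capheight_units=TIMES_ROMAN_CAPHEIGHT):
    """Infer the EM square (the SVG font-size) from a measured capital-letter height."""
    return capheight_px * UNITS_PER_EM / capheight_units


def svg_text(text, capheight_px):
    """Return an SVG text element sized so that capitals are capheight_px tall."""
    size = em_square_px(capheight_px)
    return (
        f'<text x="0" y="{size:.1f}" font-family="Times" '
        f'font-size="{size:.1f}px">{escape(text)}</text>'
    )


# A capital letter measured at 66.2px implies an EM square of about 100px
print(em_square_px(66.2))
print(svg_text("Millipedes & more", 66.2))
```

The same arithmetic works for any font whose capheight is known from its metrics file.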
The new look Biodiversity Heritage Library includes articles extracted from BioStor, which is a step forwards in making the "legacy" biodiversity literature more accessible. But we still have some way to go. In particular the articles lack the obvious decoration of a modern article, the DOI. Consequently these articles still live in a twilight zone where they are cited in the literature but not linked to. DOIs are becoming more common for taxonomic articles. ZooKeys has them, and now Zootaxa has adopted them (and will be applying them retrospectively to thousands of already published articles). Major archives of back issues digitised by Taylor and Francis, and Wiley, for example, also have DOIs.

One obstacle to assigning CrossRef DOIs to articles in BHL is the convention that DOIs are typically managed by the publisher of the journal. But in a number of cases the publisher may no longer exist, the journal may no longer be published, or the publisher may lack the commercial resources to support DOIs. In these cases perhaps BHL could adopt the role of publisher?

Another approach is that adopted by a number of other digital archives, whereby the archive assigns DOIs to articles, but these DOIs are registered not through CrossRef but with another DOI registration agency, such as DataCite. For example the Swiss Electronic Academic Library Service (SEALS) archive assigns DOIs to individual articles, such as http://dx.doi.org/10.5169/seals-88913.

There are some limitations to not using CrossRef DOIs; in particular, you don't get the full benefits of their metadata-based services such as getting metadata from a DOI, discovering DOIs from metadata, or citation linking. But all is not lost. Some services support both CrossRef and DataCite DOIs, such as http://crosscite.org/citeproc. For example, for the DOI 10.5169/seals-88913 we get some basic formatting:

Perret, Jean-Luc. (1961). Etudes herpétologiques africaines III. Société Neuchâteloise des Sciences Naturelles. doi:10.5169/seals-88913

This still leaves us lacking some services, such as finding DOIs for articles cited in a manuscript. However this is a service we can provide, and will have to anyway if we want to find all the digitised literature available (e.g., archives such as SEALS as well as numerous instances of DSpace).

My preference would be for CrossRef DOIs, but if that proves problematic we can still get much of the functionality we need using other DOI providers.

March 20, 2013 07:57
One of the things BioNames will need to do is match taxon names to classifications. For example, if I want to display a taxonomic hierarchy for the user to browse through the names, then I need a map between the taxon names that I've collected and one or more classifications. The approach I'm taking is to match strings, wherever possible using both the name and taxon authority. In many cases this is straightforward, especially if there is only one taxon with a name. But often we have cases where the same name has been used more than once for different taxa. For example, here is what ION has for the name "Nystactes":

Nystactes Bohlke        2735131
Nystactes               2787598
Nystactes Gloger 1827   4888093
Nystactes Kaup 1829     4888094

If I want to map these names to GBIF then these are the corresponding taxa with the name "Nystactes":

Nystactes Böhlke, 1957  2403398
Nystactes Gloger, 1827  2475109
Nystactes Kaup, 1829    3239722

Clearly the names are almost identical, but there are enough little differences (presence or absence of comma, "o" versus "ö") to make things interesting. To make the mapping I construct a bipartite graph where the nodes are taxon names, divided into two sets based on which database they came from. I then connect the nodes of the graph by edges, weighted by how similar the names are. For example, here is the graph for "Nystactes" (displayed using Google images):

I then compute the maximum weighted bipartite matching using a C++ program I wrote. This matching corresponds to the solid lines in the graph above. In this way we can make a sensible guess as to how names in the two databases relate to one another.

March 19, 2013 02:39
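As an aside on the name-matching post above: for a handful of names, a maximum weighted bipartite matching can be found by brute force, trying every assignment and keeping the one with the highest total similarity. The following is a minimal sketch and not the C++ program mentioned in the post. The similarity measure (difflib's ratio) and the function names are my assumptions, and the post does not say how its edge weights were computed.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a, b):
    """Edge weight: string similarity between 0 and 1 (difflib ratio, my choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matching(left, right):
    """Return the (left, right) pairs whose total similarity is highest.

    Brute force over all assignments, so only suitable for small sets of names.
    """
    swapped = len(left) > len(right)
    small, large = (right, left) if swapped else (left, right)
    best_score, best_pairs = -1.0, []
    for perm in permutations(large, len(small)):
        score = sum(similarity(s, l) for s, l in zip(small, perm))
        if score > best_score:
            best_score, best_pairs = score, list(zip(small, perm))
    if swapped:
        best_pairs = [(l, s) for s, l in best_pairs]
    return best_pairs


# Name strings from the post
ion = ["Nystactes Bohlke", "Nystactes", "Nystactes Gloger 1827", "Nystactes Kaup 1829"]
gbif = ["Nystactes Böhlke, 1957", "Nystactes Gloger, 1827", "Nystactes Kaup, 1829"]

for ion_name, gbif_name in best_matching(ion, gbif):
    print(f"{ion_name!r} -> {gbif_name!r}")
```

For larger sets, an assignment solver such as the Hungarian algorithm would be the usual replacement for the brute-force loop.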
One of the fun things about developing web sites is learning new tricks, tools, and techniques. Typically I hack away on my MacBook, and when something seems vaguely usable I stick it on a web server. For BioNames things need to be a little more formalised, especially as I'm collaborating with another developer (Ryan Schenk). Ryan is focussing on the front end, I'm working on the data (harvesting, cleaning, storing). In most projects I've worked on, the code to talk to the database and the code to display results have been the same; it was ugly but it got things done. For this project these two aspects have to be much more cleanly separated so that Ryan and I can work independently. One way to do this is to have a well-defined API that Ryan can develop against. This means I can hide the sometimes messy details of how to communicate with the data, and Ryan doesn't need to worry about how to get access to the data.

Nice idea, but to be workable it requires that the API is documented (if it's just me then the documentation is in my head). Documentation is a pain, and it is easy for it to get out of sync with the code such that what the docs say an API does and what it actually does are two separate things (sound familiar?). What would be great is a tool that enables you to write the API documentation, and make that "live" so that the API output can be tested against it. In other words, a tool like apiary.io.

Apiary.io is free, very slick, and comes with GitHub integration. I've started to document the BioNames API at http://docs.bionames.apiary.io/. These documents are "live" in that you can try out the API and get live results from the BioNames database.

I'm sure this is all old news to real software developers (as opposed to people like me who know just enough to get themselves into trouble), but it's quite liberating to start with the API first before worrying about what the web site will look like.

March 18, 2013 03:50
Tomorrow the new & improved #bhlib launches!! ow.ly/iVeZb Explore the changes in our Guide! ow.ly/iVf1W - BHL (@BioDivLibrary), March 17, 2013

The new look Biodiversity Heritage Library has just launched. It's a complete refresh of the old site, based on the Biodiversity Heritage Library–Australia site. If you want an overview of what's new, BHL have published a guide to the new look site. Congrats to all involved in the relaunch.

One of the new features draws on the work I've been doing on BioStor. The new BHL interface adds the notion of "parts" of an item, which you can see under the "Table of Contents" tab. For example, the scanned volume 109 of the Proceedings of the Entomological Society of Washington now displays a list of articles within that volume:

This means you can now jump to individual articles. Before, you had to scroll through the scan, or click through page numbers until you found what you were after. The screenshot above shows the article "Three new species of chewing lice (Phthiraptera: Ischnocera: Philopteridae) from Australian parrots (Psittaciformes: Psittacidae)". The details of this article have been extracted from BioStor, where this article appears as http://biostor.org/reference/55323. You can go directly to this article in BHL using the link http://www.biodiversitylibrary.org/part/69723.

As an aside, I've chosen this article because it helps demonstrate that BHL has modern content as well as pre-1923 literature, and this article names a louse, Neopsittaconirmus vincesmithi, after a former student of mine, Vince Smith. You're nobody in this field unless you've had a louse named after you ;)

BioStor has over 90,000 articles, but this is a tiny fraction of the articles contained in BHL content, so there's a long way to go until the entire archive is indexed to article level. There will also be errors in the article metadata derived from BioStor. If we invoke Linus's Law ("given enough eyeballs, all bugs are shallow") then having this content in BHL should help expose those errors more rapidly.

As always, I have a few niggles about the site, but I'll save those for another time. For now, I'm happy to celebrate an extraordinary, open access archive of over 40 million pages. BHL represents one of the few truly indispensable biodiversity resources online.

March 15, 2013 12:21
One of the biggest pains (and self-inflicted wounds) in taxonomy is synonymy, the existence of multiple names for the same taxon. A common cause of synonymy is moving species to different genera in order to have their name reflect their classification. The consequence of this is any attempt to search the literature for basic biological data runs into the problem that observations published at different times by different researchers (e.g., taxonomists, ecologists, parasitologists) may use different names for the same taxon.
Existing taxonomic databases often have lists of synonyms, but these are incomplete, and typically don't provide any evidence why two names are synonyms. Reading literature extracted from the Biodiversity Heritage Library I'm struck by how often I come across papers such as taxonomic revisions, museum catalogues, and checklists, that list two names as synonyms. Wouldn't it be great if we could mine these to automatically build lists of synonyms? One quick and dirty way to do this is to look for sets of names that have the same species name but different generic names, e.g.
If such names appear on the same page (i.e., in close proximity) there's a reasonable chance they are synonyms. So, one of the features I'm building in BioNames is an index of names like this. Hence, if we are displaying a page for the name Atlantoxerus getulus that page could also display Sciurus getulus and Xerus getulus as possible synonyms.

There's a lot more that could be done with this sort of approach. For example, this approach only works if the species name remains unchanged. To improve it we'd need to do things like handle changes to the ending of a species name to agree with the gender of the genus, and cases where the taxa are demoted to subspecies (or promoted to species). If we were even cleverer we'd attempt to parse synonymy lists to extract even more synonyms (for an example see Huber and Klump, PDF available here):

Huber, R., & Klump, J. (2009). Charting taxonomic knowledge through ontologies and ranking algorithms. Computers & Geosciences, 35(4), 862–868. doi:10.1016/j.cageo.2008.02.016

Then there's the broader topic of looking at co-occurrence of taxonomic names in general. As I noted a while ago there are examples of pages in BHL that list taxonomically unrelated taxa that are ecologically closely associated (e.g., hosts and parasites). Hence we could imagine automatically building host-parasite databases by mining the literature.

Initially we could simply display lists of names that co-occur frequently. Ideally we'd filter out "accidental" co-occurrences, such as indexes or tables of contents, but there seems to be a lot of potential in automating the extraction of basic information from the taxonomic literature.

04:27
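As an aside on the synonym-mining post above: the "same species name, different genus, same page" heuristic amounts to grouping the binomials found on a page by their species epithet. The following is a minimal sketch, assuming the names on a page have already been extracted as plain strings. The function name is mine, and it deliberately ignores the harder cases the post lists (gender endings, subspecies).

```python
from collections import defaultdict


def candidate_synonyms(names_on_page):
    """Group binomials by species epithet; keep epithets that occur with more than one genus."""
    genera_by_epithet = defaultdict(set)
    for name in names_on_page:
        parts = name.split()
        if len(parts) == 2:  # plain binomials only
            genus, epithet = parts
            genera_by_epithet[epithet].add(genus)
    return {
        epithet: sorted(f"{genus} {epithet}" for genus in genera)
        for epithet, genera in genera_by_epithet.items()
        if len(genera) > 1
    }


# The three getulus names are from the post; Sciurus vulgaris is added as a non-match
page_names = ["Atlantoxerus getulus", "Sciurus getulus", "Xerus getulus", "Sciurus vulgaris"]
print(candidate_synonyms(page_names))
```

These are only candidates: co-occurrence on a page suggests synonymy without demonstrating it, which is why linking back to the page matters.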
Yet another taxonomic database, this time I can't blame anyone else because I'm the one building it (with some help, as I'll explain below).
BioNames was my entry in EOL's Computable Data Challenge (you can see the proposal here: http://dx.doi.org/10.6084/m9.figshare.92091). In that proposal I outlined my goal:

BioNames aims to create a biodiversity “dashboard” where at a glance we can see a summary of the taxonomic and phylogenetic information we have for a given taxon, and that information is seamlessly linked together in one place. It combines classifications from EOL with animal taxonomic names from ION, and bibliographic data from multiple sources including BHL, CrossRef, and Mendeley. The goal is to create a database where the user can drill down from a taxonomic name to see the original description, track the fate of that name through successive revisions, and see other related literature. Publications that are freely available will be displayed in situ. If the taxon has been sequenced, the user can see one or more phylogenetic trees for those sequences, where each sequence is in turn linked to the publication that made those sequences available. For a biologist the site provides a quick answer to the basic question “what is this taxon?”, coupled with graphical displays of the relevant bibliographic and genomic information.

The bulk of the funding from EOL is going into interface work by Ryan Schenk (@ryanschenk), author of synynyms among other cool things. EOL's Chief Scientist Cyndy Parr (@cydparr) is providing adult supervision ("Chief Scientist", why can't I have a title like that?).

Development of BioNames is taking place in the open as much as we can, so there are some places you can see things unfold:
I've lots of terrible code scattered around which I am in the process of organising into something usable, which I'll then post on GitHub. Working with Ryan is forcing me to be a lot more thoughtful about coding this project, which is a good thing. Currently I'm focussing on building an API that will support the kinds of things we want to do. I'm hoping to make this public shortly. The original proposal was a tad ambitious (no, really). Most of what I hope to do exists in one form or another, but making it robust and usable is a whole other matter. As the project takes shape I hope to post updates here. If you have any suggestions feel free to make them. The current target is to have this "out the door" by the end of May.
March 13, 2013 04:39
This is not a post I'd thought I'd write, because OpenURL is an awful spec. But last week I ended up in vigorous debate on Twitter after I posted what I thought was a casual remark:
"If you publish bibliographic data and don't use COinS ocoins.info you are doing it wrong (I'm looking at you @europepmc_news)" (Roderic Page, @rdmpage, March 8, 2013)
This ended up being a marathon thread about OpenURL, accessibility, bibliographic metadata, and more. It spilled over onto a previous blog post (Tight versus loose coupling) where Ed Summers and I debated the merits of Context Object in Span (COinS). This debate still nags at me because I think there's an underlying assumption that people making bibliographic web sites know what's best for their users. Ed wrote: "I prefer to encourage publishers to use HTML's metadata facilities using the tag and microdata/RDFa, and build actually useful tools that do something useful with it, like Zotero or Mendeley have done." That's fine, I like embedded metadata, both as a consumer and as a provider (I provide Google Scholar-compatible metadata in BioStor). What I object to is the idea that this is all we need to do. Embedded metadata is great if you want to make individual articles visible to search engines: tools like Google (or bibliographic managers like Mendeley and Zotero) can "read" the web page, extract structured data, and do something with that. Nice for search engines, nice for repositories (metadata becomes part of their search engine optimisation strategy). But this isn't the only thing a user might want to do. I often find myself confronted with a list of articles on a web site (e.g., a bibliography on a topic, a list of references cited in a paper, the results of a bibliographic search) and those references have no links. Often those links may not have existed when the original web page was published, but may exist now. I'd like a tool that helped me find those links. If a web site doesn't provide the functionality you need then, luckily, you are not entirely at the mercy of the people who made the decisions about what you can and can't do.
Tools like Greasemonkey pioneered the idea that we can hack a web page to make it more useful. I see COinS as an example of this approach. If the web page doesn't provide links, but has embedded COinS then I can use those to create OpenURL links to try and locate those references. I am no longer bound by the limitations of the web page itself. This strikes me as very powerful, and I use COinS a lot where they are available. For example, CrossRef's excellent search engine supports COinS, which means I can find a reference using that tool, then use the embedded COinS to see whether there is a version of that article digitised by the Biodiversity Heritage Library. This enables me to do stuff that CrossRef itself hasn't anticipated, and that makes their search engine much more valuable to me. In a way this is ironic because CrossRef is predicated on the idea that there is one definitive link to a reference, the DOI. So, what I found frustrating about the conversation with Ed was that it seemed to me that his insistence on following certain standards was at the expense of functionality that I found useful. If the client is the search engine, or the repository, then COinS do indeed seem to offer little apart from God-awful HTML messing up the page. But if you include the user and accept that users may want to do stuff that you don't (indeed can't) anticipate then COinS are useful. This is the "genius of and", why not support both approaches? Now, COinS are not the only way to implement what I want to do, we could imagine other ways to do this. But to support the functionality that they offer we need a way to encode metadata in a web page, a way to extract that metadata and form a query URL, and a set of services that know what to do with that URL. OpenURL and COinS provide all of this right now and work. 
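The three pieces listed above (metadata encoded in the page, a way to extract it and form a query URL, and a service that understands that URL) can be sketched with nothing but the Python standard library. A COinS is a `span` with class `Z3988` whose `title` attribute holds an OpenURL context object, so building a link amounts to appending that string to a resolver's base URL. The resolver address `http://resolver.example.org/openurl`, the sample HTML, and the names `CoinsParser` and `coins_to_openurls` are assumptions made up for this example; substitute whichever OpenURL resolver you actually use.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs


class CoinsParser(HTMLParser):
    """Collect the title attribute of every <span class="Z3988"> (a COinS)."""

    def __init__(self):
        super().__init__()
        self.context_objects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "Z3988" in classes and attrs.get("title"):
            # HTMLParser has already turned &amp; back into & for us.
            self.context_objects.append(attrs["title"])


def coins_to_openurls(html, resolver):
    """Return one OpenURL per COinS found in the HTML."""
    parser = CoinsParser()
    parser.feed(html)
    return [f"{resolver}?{ctx}" for ctx in parser.context_objects]


# Hypothetical resolver and an invented reference, for illustration only.
RESOLVER = "http://resolver.example.org/openurl"
html = (
    "<li>Some reference text "
    '<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;'
    "rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;"
    'rft.atitle=An+example+article&amp;rft.volume=12&amp;rft.spage=34"></span></li>'
)

urls = coins_to_openurls(html, RESOLVER)
fields = parse_qs(urls[0].split("?", 1)[1])
print(urls[0])
print(fields["rft.atitle"])
```

The same extraction could run in a browser userscript in the Greasemonkey spirit; the point is that the page author doesn't need to have anticipated which resolver the reader wants to query.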
I'd be all for alternative tools that did this more simply than the Byzantine syntax of OpenURL, but in the absence of such tools I stick by my original tweet: If you publish bibliographic data and don't use COinS you are doing it wrong 04:03
I spend a lot of time searching the web for bibliographic metadata and links to digitised versions of publications. Sometimes I search Google and get nothing, sometimes I get the article I'm after, but often I get something like this:
If I search for Die cestoden der Vogel in Google I get masses of hits for the same thing from multiple sources (e.g., Google Books, Amazon, other booksellers, etc.). For this query we can happily click through pages and pages of results that are all, in some sense, the same thing. Sometimes I get similar results when searching for an article: multiple hits from sites with metadata on that article, but few, if any, with an actual link to the article itself. One byproduct of putting bibliographic metadata on the web is that we are starting to pollute web space with repetitions of the same (or closely similar) metadata. This makes searching for definitive metadata difficult, never mind actually finding the content itself. In some cases we can use tools such as Google Scholar, which clusters multiple versions of the same reference, but Google Scholar is often poor for the kind of literature I am after (e.g., older taxonomic publications). As Alan Ruttenberg (@alanruttenberg) points out, books would seem to be a case where Google could extend its knowledge graph and cluster the books together (using ISBNs, title matching, etc.). But meantime if you think simply pumping out bibliographic metadata is a good thing, spare a thought for those of us trying to wade through the metadata soup looking for the "good stuff". |
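To make the "title matching" idea concrete, here is a minimal sketch of clustering search hits by a normalised title. It is not how Google or Google Scholar do it. The records, the source labels, and the helper names are invented for the example, and a real version would also want to use ISBNs and fuzzier matching.

```python
import re
import unicodedata
from collections import defaultdict


def normalise_title(title):
    """Lower-case and strip accents and punctuation so near-identical titles match."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.findall(r"[a-z0-9]+", without_accents.lower()))


def cluster_by_title(records):
    """Group records whose normalised titles are identical."""
    clusters = defaultdict(list)
    for record in records:
        clusters[normalise_title(record["title"])].append(record)
    return list(clusters.values())


# Invented search hits, for illustration only.
hits = [
    {"source": "bookseller A", "title": "Die Cestoden der Vögel"},
    {"source": "bookseller B", "title": "Die cestoden der Vogel"},
    {"source": "library catalogue", "title": "Die Cestoden der Vogel."},
    {"source": "bookseller A", "title": "An unrelated title"},
]

clusters = cluster_by_title(hits)
for cluster in clusters:
    print(len(cluster), normalise_title(cluster[0]["title"]))
```

Exact matching on a normalised string is crude, but even this collapses the three variants of the same book into one cluster, which is the kind of de-duplication the metadata soup needs.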
|