Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188.
March 7, 2014
As part of a project exploring GBIF data I've been playing with displaying GBIF data on Google Maps. The GBIF portal doesn't use Google Maps, which is a pity because Google's terrain and satellite layers are much nicer than the layers used by GBIF (I gather the level of traffic GBIF receives is above the threshold at which Google starts charging for access).
But because the GBIF developers have a nice API it's pretty easy to put GBIF data on Google Maps, like this (the map is live):
The source code for this map is available as a gist, and you can see it live above, and at http://bl.ocks.org/rdmpage/9411457.
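If you want to skip the gist, the core of the approach is just a call to the GBIF occurrence API. Here's a minimal Python sketch of that call (the gist itself uses JavaScript and the Google Maps API; the taxon key below is an arbitrary example, and you'd feed the resulting points to whatever map layer you like):

```python
# Fetch georeferenced GBIF occurrences for a taxon, ready to plot on a map.
# Minimal sketch against the public GBIF API (http://api.gbif.org/v1).
import requests

def fetch_occurrences(taxon_key, limit=100):
    """Return (latitude, longitude) pairs for occurrences of a GBIF taxon key."""
    r = requests.get(
        "http://api.gbif.org/v1/occurrence/search",
        params={"taxonKey": taxon_key, "hasCoordinate": "true", "limit": limit},
    )
    r.raise_for_status()
    return [
        (rec["decimalLatitude"], rec["decimalLongitude"])
        for rec in r.json()["results"]
        if "decimalLatitude" in rec and "decimalLongitude" in rec
    ]

# Example taxon key (look one up via /species/match?name=... for your taxon).
for lat, lng in fetch_occurrences(5219404, limit=20):
    print(lat, lng)
```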
March 3, 2014
A quick note to myself to document a problem with the GBIF classification of liverworts (I've created issue POR-1879 for this).
While building a new tool to browse GBIF data I ran into a problem: the taxon "Jungermanniales" popped up in two different places in the GBIF classification, which broke a graphical display widget I was using.
If you search GBIF for Jungermanniales you get two results, both listed as "accepted":
Plantae > Marchantiophyta > Jungermanniopsida
Plantae > Bryophyta > Jungermanniopsida
According to the Wikipedia page for Marchantiophyta, liverworts such as the Jungermanniales were traditionally included in the Bryophyta, but are now placed in the Marchantiophyta. The GBIF classification has both the old and the new placement for liverworts (sigh).
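You can see the duplication directly from the GBIF API. A minimal Python sketch, assuming the standard /species/search endpoint and the dataset key of the GBIF backbone taxonomy:

```python
# Search the GBIF backbone for "Jungermanniales" and print each match with
# its higher classification, showing the same name in two places at once.
import requests

BACKBONE = "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c"  # GBIF backbone dataset key

r = requests.get(
    "http://api.gbif.org/v1/species/search",
    params={"q": "Jungermanniales", "datasetKey": BACKBONE},
)
r.raise_for_status()
for usage in r.json()["results"]:
    print(usage.get("key"), usage.get("scientificName"),
          "->", usage.get("phylum"), ">", usage.get("class"))
```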
As an aside, I couldn't figure out why Wikipedia gave the following as the reference for the publication of the name "Marchantiophyta":
Stotler, R., & Crandall-Stotler, B. (1977). A Checklist of the Liverworts and Hornworts of North America. The Bryologist. JSTOR. doi:10.2307/3242017
This paper doesn't mention "Marchantiophyta" anywhere. The other authority Wikipedia gives is:
Crandall-Stotler, B., Stotler, R. E., Long, D. G., & Goffinet, B. (2001). Morphology and classification of the Marchantiophyta. (A. J. Shaw, Ed.) Bryophyte Biology. Cambridge University Press. doi:10.1017/cbo9780511754807.002
which is behind a paywall that my University doesn't subscribe to. However, the following reference clears things up:
Stotler, R., & Crandall-Stotler, B. (2008). Correct Author Citations for Some Upper Rank Names of Liverworts (Marchantiophyta). Taxon, 57(1):289-292. jstor:25065970
Marchantiophyta was validly published (Crandall-Stotler & Stotler, 2000) by reference to the Latin description of Hepatophyta in Stotler & Crandall-Stotler (1977:425), a designation formed from an illegitimate generic name, Hepatica Adans. non Mill., and hence not validly published under Article 16.1, cross-referenced in Article 32.1(c) of the ICBN (McNeill & al., 2006). Marchantiophyta phylum nov. is redundant in Doweld (2001) who established a later isonym by likewise citing the Latin description in Stotler & Crandall-Stotler (1977) for his proposed "new name."
And who said taxonomy couldn't be fun?
February 19, 2014
There is a great post by Jeni Tennison on the Open Data Institute blog entitled Five Stages of Data Grief. It resonates so much with my experience working with biodiversity data (such as building BioNames, or exploring data errors in GBIF) that I've decided to reproduce it here.
Five Stages of Data Grief
As organisations come to recognise how important and useful data could be, they start to think about using the data that they have been collecting in new ways. Often data has been collected over many years as a matter of routine, to drive specific processes or sometimes just for the sake of it. Suddenly that data is repurposed. It is probed, analysed and visualised in ways that haven’t been tried before.
Data analysts have a maxim:
If you don’t think you have a quality problem with your data, you haven’t looked at it yet.
Every dataset has its quirks, whether it’s data that has been wrongly entered in the first place, automated processing that has introduced errors, irregularities that come from combining datasets into a consistent structure or simply missing information. Anyone who works with data knows that far more time is needed to clean data into something that can be analysed, and to understand what to leave out, than in actually performing the analysis itself. They also know that analysis and visualisation of data will often reveal bugs that you simply can’t see by staring at a spreadsheet.
But for the people who have collected and maintained such data — or more frequently their managers, who don't work with the data directly — this realisation can be a bit of a shock. In our last ODI Board meeting, Sir Tim Berners-Lee suggested that the process data curators go through is something like the five stages of grief described by the Kübler-Ross model.
So here is an outline of what that looks like.

Denial
This can’t be right: there’s nothing wrong with our data! Your analysis/code/visualisation must be doing something wrong.
At this stage data custodians can't believe what they are seeing. Maybe they have been using the data themselves but never run into issues with it because they were only using it in limited ways. Maybe they had only ever been collecting the data, and not actually using it at all. Or maybe they had been viewing it in a form where the issues with data quality were never surfaced (it's hard to spot additional spaces, or even zeros, when you just look at a spreadsheet in Excel, for example).
So the first reason that they reach for is that there must be something wrong with the analysis or code that seems to reveal issues with the data. There may follow a wild goose chase that tries to track down the non-existent bug. Take heart: this exercise is useful in that it can pinpoint the precise records that are causing the problems in the first place, which forces the curators to stop denying them.

Anger
Who is responsible for these errors? Why haven’t they been spotted before?
As the fact that there are errors in the data comes to be understood, the focus can come to rest on the people who collect and maintain the data. This is the phase that the maintainers of data dread (and can be a reason for resisting sharing the data in the first place), because they get blamed for the poor quality.
This painful phase should eventually result in an evaluation of where errors occur — an evaluation that is incredibly useful, and should be documented and kept for the Acceptance phase of the process — and what might be done to prevent them in future. Sometimes that might result in better systems for data collection but more often than not it will be recognised that some of the errors are legacy issues or simply unavoidable without massively increasing the maintenance burden.

Bargaining
What about if we ignore these bits here? Can you tweak the visualisation to hide that?
And so the focus switches again to the analysis and visualisations that reveal the problems in the data, this time with an acceptance that the errors are real, but a desire to hide the problems so that they’re less noticeable.
This phase puts the burden on the analysts who are trying to create views over the data. They may be asked to add some special cases, or tweak a few calculations. Areas of functionality may be dropped in their entirety or radically changed as a compromise is reached between utility of the analysis and low quality data to feed it.

Depression
This whole dataset is worthless. There’s no point even trying to capture this data any more.
As the number of exceptions and compromises grows, and a realisation sinks in that those compromises undermine the utility of the analysis or visualisation as a whole, a kind of despair sets in. The barriers to fixing the data or collecting it more effectively may seem insurmountable, and the data curators may feel like giving up trying.
This phase can lead to a re-examination of the reasons for collecting and maintaining the data in the first place. Hopefully, this process can aid everyone in reasserting why the data is useful, regardless of some aspects that are lower quality than others.

Acceptance
We know there are some problems with the data. We’ll document them for anyone who wants to use it, and describe the limitations of the analysis.
In the final stage, all those involved recognise that there are some data quality problems, but that these do not render the data worthless. They will understand the limits of analyses and interpretations that they make based on the data, and they try to document them to avoid other people being misled.
The benefits of the previous stages are also recognised. Denial led to double-checking the calculations behind the analyses, making them more reliable. Anger led to re-examination of how the data was collected and maintained, and documentation that helps everyone understand the limits of the data better. Bargaining forced analyses and visualisations to be focused and explicit about what they do and don’t show. Depression helped everyone focus on the user needs from the data. Each stage makes for a better end product.
Of course doing data analysis isn't actually like being diagnosed with a chronic illness or losing a loved one. There are things that you can do to remedy the situation. So I think we need to add a sixth stage to the five stages of data grief described above:

Hope
This could help us spot errors in the data and fix them!
Providing visualisations and analysis provides people with a clearer view about what data has been captured and can make it easier to spot mistakes, such as outliers caused by using the wrong units when entering a value, or new categories created by spelling mistakes. When data gets used to make decisions by the people who capture the data, they have a strong motivation to get the data right. As Francis Irving outlined in his recent Friday Lunchtime Lecture at ODI, Burn the Digital Paper, these feedback loops can radically change how people think about data, and use computers within their organisations.
Making data open for other people to look at provides lots more opportunities for people to spot errors. This can be terrifying — who wants people to know that they are running their organisation based on bad-quality data? — but those who have progressed through the five stages of data grief find hope in another developer maxim:
Given enough eyeballs, all bugs are shallow.
The more people look at your data, the more likely they are to find the problems within it. The secret is to build in feedback mechanisms which allow those errors to be corrected, so that you can benefit from those eyes and increase your data quality to what you thought it was in the first place.
February 10, 2014
I gave a remote presentation at a pro-iBiosphere workshop this morning. The slides are below (to try and make it a bit more engaging than a deck of PowerPoint slides I played around with Prezi).
There is a version on Vimeo that has audio as well.
I sketched out the biodiversity "knowledge graph", then talked about how markup relates to this, finishing with a few questions. The question that seems to have gotten people a little agitated is the relative importance of markup versus, say, indexing. As Terry Catapano pointed out, in a sense this is really a continuum. If we index content (e.g., locate a string that is a taxonomic name) and flag that content in the text, then we are adding markup (if we don't, we are simply indexing, but even then we have markup at some level, e.g. "this term occurs somewhere on this page"). So my question is really what level of markup do we need to do useful work? Much of the discussion so far has centered around very detailed markup (e.g., the kind of thing ZooKeys does to each article). My concern has always been how scalable this is, given the size of the taxonomic literature (in which ZooKeys is barely a blip). It's the usual trade-off: do we go for breadth (all content indexed, but little or no markup), or do we go for depth (extensive markup for a subset of articles)? Where you stand on that trade-off will determine to what extent you want detailed markup, versus whether indexing is "good enough".
January 24, 2014
Scott Federhen told me about a nice new feature in GenBank that he's described in a piece for NCBI News. The NCBI taxonomy database now shows a list of type material (where known), and the GenBank sequence database "knows" about types. Here's the summary:
The naming, classification and identification of organisms traditionally relies on the concept of type material, which defines the representative examples ("name-bearing") of a species. For larger organisms, the type material is often a preserved specimen in a museum drawer, but the type concept also extends to type bacterial strains as cultures deposited in a culture collection. Of course, modern taxonomy also relies on molecular sequence information to define species. In many cases, sequence information is available for type specimens and strains. Accordingly, the NCBI has started to curate type material from the Taxonomy database, and is using this data to label sequences from type specimens or strains in the sequence databases. The figure below shows type material as it appears in the NCBI taxonomy entry and a sequence record for the recently described African monkey species, Cercopithecus lomamiensis.
You can query for sequences from type material using the query "sequence from type"[filter]. This could lead to some nice automated tools. If you had a bunch of distinct clusters of sequences that were all labelled with the same species name, and one cluster includes a sequence from the type specimen, then the other clusters are candidates for being described under new names.
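As a quick illustration, here's a minimal sketch of that query using Biopython's Entrez module (the organism is the monkey from the example above; the email address is a placeholder):

```python
# Search GenBank for sequences derived from type material, using the new
# "sequence from type" filter. Minimal sketch with Biopython's Entrez module.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address

query = 'Cercopithecus lomamiensis[Organism] AND "sequence from type"[filter]'
handle = Entrez.esearch(db="nucleotide", term=query, retmax=20)
result = Entrez.read(handle)
handle.close()

print(result["Count"], "sequences from type material")
print(result["IdList"])
```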
VertNet has announced that they have implemented issue tracking using GitHub. This is a really interesting development, as figuring out how to capture and make use of annotations in biodiversity databases is a problem that's attracting a lot of attention. VertNet have decided to use GitHub to handle annotations, but in a way that hides most of GitHub from users (developers tend to love things like GitHub, regular folks, not so much, see The Two Cultures of Computing).
The VertNet blog has a detailed walk through of how it works. I've made some comments on that blog, but I'll repeat them here.
At the moment the VertNet interface doesn't show any evidence of issue tracking (there's a link to add an issue, but you can't see whether there are any existing issues). For example, visiting an example record, CUMV Amphibian 1766, I don't see any evidence on that page that there is an issue for this record (there is one, see https://github.com/cumv-vertnet/cumv-amph/issues/1). I think it's important that people see evidence of interaction (that way you might encourage others to participate). This would also enable people to gauge how active collection managers are in resolving issues ("gee, they fixed this problem in a couple of days, cool").
Likewise, it would be nice to have a collection-level summary in the portal. For example, looking at CUMV Amphibian 1766 I'm not able to click through to a page for CUMV Amphibians (there needs to be a way to get to the collection from a record) to see how many issues there are for the whole collection, and how fast they are being closed.
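Both displays could be driven straight from the GitHub API. A minimal sketch of the kind of collection-level summary I have in mind (using the cumv-vertnet repository from the example above; note this only counts the first page of issues per state):

```python
# Summarise open and closed issues for a VertNet collection repository
# via the public GitHub API (v3).
import requests

def issue_summary(owner, repo):
    """Count issues by state (capped at the first 100 per state, for brevity)."""
    counts = {}
    for state in ("open", "closed"):
        r = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/issues",
            params={"state": state, "per_page": 100},
        )
        r.raise_for_status()
        counts[state] = len(r.json())
    return counts

print(issue_summary("cumv-vertnet", "cumv-amph"))
```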
I think the approach VertNet are using has a lot of potential, although it sidesteps some of the most compelling features of GitHub, namely forking and merging code and other documents. I can't, for example, take a record, edit it, and have those edits merged into the data. It's still a fairly passive "hey, there's a problem here", which means that the burden is still on curators to fix the issue. This raises the whole question of what to do with user-supplied edits. There's a nice paper regarding validating user input into Freebase that is relevant here, see "Trust, but Verify: Predicting Contribution Quality for Knowledge Base Construction and Curation" (http://dx.doi.org/10.1145/2556195.2556227 [not live yet], PDF here).
January 15, 2014
More for my own benefit than anything else I've decided to list some of the things I plan to work on this year. If nothing else, it may make sobering reading this time next year.
A knowledge graph for biodiversity
Google's introduction of the "knowledge graph" gives us a happy phrase to use when talking about linking stuff together. It doesn't come with all the baggage of the "semantic web", or the ambiguity of "knowledge base". The diagram below is my mental model of the biodiversity knowledge graph (this comes from http://dx.doi.org/10.7717/peerj.190, but I sketched most of this for my Elsevier Challenge entry in 2008, see http://dx.doi.org/10.1038/npre.2008.2579.1).
Parts of this knowledge graph are familiar: articles are published in journals, and have authors. Articles cite other articles (represented by a loop in the diagram below). The topology of this graph gives us citation counts (number of times an article has been cited), impact factor (citations for articles in a given journal), and author-based measures such as the H-index (a function of the distribution of citations for each article you have authored). Beyond simple metrics this graph also gives us the means to track the provenance of an idea (by following the citation trail).
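To make that concrete, here's a minimal sketch of the h-index calculation mentioned above:

```python
# Compute an author's h-index from per-article citation counts:
# the largest h such that the author has h articles with >= h citations each.
def h_index(citations):
    cites = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(cites, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # -> 4
```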
The next step is to grow this graph to include the other things we care about (e.g., taxa, taxon names, specimens, sequences, phylogenies, localities, etc.).
I spent a good deal of last year building BioNames (for background see my blog posts or read the paper in PeerJ http://dx.doi.org/10.7717/peerj.190). BioNames represents a small corner of the biodiversity knowledge graph, namely taxonomic names and their associated publications (with the added chocolatey goodness of links to taxon concepts and phylogenies). In 2014 I'll continue to clean this data (I seem to be forever cleaning data). So far BioNames is restricted to animal names, but now that the plant folks have relaxed their previously restrictive licensing of plant data (see post on TAXACOM) I'm looking at adding the million or so plant names (once I've linked as many as possible to digital identifiers for the corresponding publications).
Now that I've become more involved in GBIF I'm spending more time thinking about spatial indexing, and our ability to find biodiversity data on a map. There's a great Google ad that appeared on UK TV late last year. In it, Julian Bayliss recounts the use of Google Earth to discover virgin rainforest (the "Google forest") on Mount Mabu in Mozambique.
It's a great story, but I keep looking at this and wondering "how did we know that we didn't know anything about Mount Mabu?" In other words, can we go to any part of the world and see what we know about that area? GBIF goes a little way there with its specimen distribution maps, which gives some idea of what is now known from Mount Mabu (although the map layers used by GBIF are terrible compared to what Google offers).
But I want to be able to see all the specimens now known from this region (including the new species that have been discovered, e.g. see http://dx.doi.org/10.1007/s12225-011-9277-9 and http://dx.doi.org/10.1080/21564574.2010.516275). Why can't I have a list of publications relevant to this area (e.g., species descriptions, range extensions, ecological studies, conservation reports)? What about DNA sequences from material in this region (e.g., from organismal samples, DNA barcodes, metagenomics, etc.)? If GBIF is to truly be a "Global Biodiversity Information Facility" then I want it to be able to provide me with a lot more information than it currently does. The challenge is how to enable that to happen.
January 9, 2014
Given that it's the start of a new year, and I have a short window before teaching kicks off in earnest (and I have to revise my phyloinformatics course), I'm playing with a few GBIF-related ideas. One topic which comes up a lot is annotating and correcting errors. There has been some work in this area, but it strikes me as somewhat complicated. I'm wondering whether we couldn't try and keep things simple.
From my perspective there are a bunch of problems to tackle. The first is that occurrence data that ends up in GBIF may be incorrect, and it would be nice if GBIF users could (at the very least) flag those errors, and even better fix them if they have the relevant information. For example, it may be clear that a frog apparently in the middle of the ocean is there because latitude and longitudes were swapped, and this could be easily fixed.
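Checks like this are easy to write. A minimal sketch of a swapped-coordinates test (in practice the bounding box would come from the record's stated country rather than being hard-coded):

```python
# Flag records whose coordinates fall outside the expected region but whose
# swapped coordinates fall inside it - a classic sign of lat/lng transposition.
def looks_swapped(lat, lng, bbox):
    """bbox = (min_lat, max_lat, min_lng, max_lng) for the expected region."""
    min_lat, max_lat, min_lng, max_lng = bbox
    inside = min_lat <= lat <= max_lat and min_lng <= lng <= max_lng
    swapped_inside = min_lat <= lng <= max_lat and min_lng <= lat <= max_lng
    return (not inside) and swapped_inside

# Hypothetical record with latitude and longitude entered the wrong way round:
print(looks_swapped(10.2, 33.9, bbox=(30.0, 38.0, 7.0, 12.0)))  # -> True
```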
Another issue is that data on an occurrence may not be restricted to a single source. It's tempting to think, for example, that the museum housing a specimen has the authoritative data on that specimen, but this need not be the case. Sometimes museums either lack (or decide not to make available) data such as geographic coordinates, but this information is available from other sources (such as the primary literature, or GenBank, see e.g. Linking GBIF and GenBank). Speaking of GenBank, there is a lot of basic biodiversity data in GenBank (such as georeferenced voucher specimens) and it would be great to add that data to GBIF. One issue, however, is that some of the voucher specimens in GenBank will already be in GBIF, potentially creating duplicate records. Ideally each specimen would be represented just once in GBIF, but for a bunch of reasons this is tricky to do (for a start, few specimens have globally unique identifiers, see DOIs for specimens are here, but we're not quite there yet), hence GBIF has duplicate specimen records. So, we are going to have to live with multiple records for the "same" thing.
Lastly there is the ongoing bugbear that URLs for GBIF occurrences are not stable. This is frustrating in the extreme because it defeats any attempt to link these occurrences to other data (e.g., DNA sequences, the Biodiversity Heritage Library, etc.). If the URLs regularly break then there is little incentive to go to the trouble of creating links between different databases, and biodiversity data will remain in separate silos.
So, we have three issues: user edits and corrections of data hosted by GBIF, multiple sources of data on the same occurrence, and the lack of persistent links to occurrences.
If we accept that the reality is we will always have duplicates, then the challenge becomes how to deal with them. Let's imagine that we have multiple instances of data on the same occurrence, and that we have some way of clustering those records together (e.g., using the specimen code, the Darwin Core Triple, additional taxonomic information, etc.). Given that we have multiple records we may have multiple values for the same item, such as locality, taxon name, geo-coordinates, etc. One way to reconcile these is to use an approach developed for handling bibliographic metadata derived from citations, described in Councill et al. (2006) (PDF here). If you are building a bibliographic database from lists of literature cited, you need to cluster the citations that are sufficiently similar to be likely to be the same reference. You might also want to combine those records to yield a best estimate of the metadata for the actual reference (in other words, one author might have cited the article with an abbreviated journal name, another author might have cited only the first page, etc., but all might agree on the volume the article occurs in). Councill et al. use Bayesian belief networks to derive an estimate of the correct metadata.
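As an aside on the clustering step, here's a minimal sketch of grouping records on the Darwin Core Triple (institutionCode, collectionCode, catalogNumber); real matching would need fuzzier keys, since codes are written inconsistently:

```python
# Group occurrence records by a normalised Darwin Core Triple.
from collections import defaultdict

def cluster_by_triple(records):
    clusters = defaultdict(list)
    for rec in records:
        key = (
            rec.get("institutionCode", "").strip().upper(),
            rec.get("collectionCode", "").strip().upper(),
            rec.get("catalogNumber", "").strip(),
        )
        clusters[key].append(rec)
    return clusters

records = [
    {"institutionCode": "CUMV", "collectionCode": "Amph",
     "catalogNumber": "1766", "source": "museum"},
    {"institutionCode": "cumv", "collectionCode": "AMPH",
     "catalogNumber": "1766", "source": "GenBank voucher"},
]
for triple, recs in cluster_by_triple(records).items():
    print(triple, "->", [r["source"] for r in recs])
```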
What is nice about the Councill et al. approach is that you retain all the original data, and you can weight each source by some measure of its reliability (i.e., the "prior"). Hence, we could weight a user's edits based on some measure, such as the acceptance of other edits they've made or, say, their authority (a user who is the author of a taxonomic revision of a group might know quite a bit about the specimens belonging to those taxa). If a user edits a GBIF record (say, by adding latitude and longitude values) we could add that as a "new" record, linked to the original, and containing just the edited values (we could also enable the user to confirm that other values are correct).
So, what do we show regular users of GBIF if we have multiple records for the same occurrence? In effect we compute a "consensus" based on the multiple records, taking into account the prior probabilities that each source is reliable. What about the museums (or other "providers")? Well, they can grab all the other records (e.g., the user edits, the GenBank information, etc.) and use them to update their records, if they so choose. If they do so, next time GBIF harvests their data, the GBIF version of that data is updated, and we can recompute the new "consensus". It would be nice to have some way of recording whether the other edits/records were accepted, so we can gauge the reliability of those sources (a user whose edits are consistently accepted gets "up voted"). The provider could explicitly tell GBIF which edits it accepted, or we could infer them by comparing the new and old versions.
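Here's a minimal sketch of that consensus idea, using a simple weighted vote over field values (Councill et al. use Bayesian belief networks; the sources, priors, and field values below are made up for illustration):

```python
# Compute a "consensus" record from multiple records for the same occurrence,
# weighting each candidate value by the prior reliability of its source.
from collections import defaultdict

def consensus(records, priors):
    """records: list of (source, {field: value}); priors: {source: weight}."""
    votes = defaultdict(lambda: defaultdict(float))
    for source, fields in records:
        for field, value in fields.items():
            if value is None:
                continue  # a missing value casts no vote
            votes[field][value] += priors.get(source, 0.1)
    return {field: max(vals, key=vals.get) for field, vals in votes.items()}

records = [
    ("museum",  {"locality": "Mt Mabu", "decimalLatitude": None}),
    ("genbank", {"locality": "Mount Mabu", "decimalLatitude": -16.28}),
    ("user42",  {"decimalLatitude": -16.28}),
]
priors = {"museum": 0.9, "genbank": 0.6, "user42": 0.4}
print(consensus(records, priors))
```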
To retain a version history we'd want to keep both the new and old provider records. This could be done using timestamps: every record has a creation date and an expiry date. By default the expiry date is far in the future, but if a record is replaced its expiry date is set to that time, and it is ignored when indexing the data.
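A minimal sketch of that versioning scheme (all names are hypothetical):

```python
# Version records using creation and expiry timestamps: replacing a record
# sets the old version's expiry date rather than deleting it.
import datetime

FAR_FUTURE = datetime.datetime(9999, 12, 31)

class VersionedStore:
    def __init__(self):
        self.versions = {}  # occurrence_id -> list of (created, expires, data)

    def put(self, occ_id, data):
        now = datetime.datetime.utcnow()
        history = self.versions.setdefault(occ_id, [])
        if history:
            created, _, old = history[-1]
            history[-1] = (created, now, old)  # expire the previous version
        history.append((now, FAR_FUTURE, data))

    def current(self, occ_id):
        """Return the live version, i.e. the one whose expiry is in the future."""
        created, expires, data = self.versions[occ_id][-1]
        return data if expires == FAR_FUTURE else None

store = VersionedStore()
store.put("occ-1", {"locality": "Mt Mabu"})
store.put("occ-1", {"locality": "Mount Mabu, Mozambique"})
print(store.current("occ-1"))
print(len(store.versions["occ-1"]), "versions retained")
```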
How does this relate to duplicates? Well, GBIF has a habit of deleting whole sets of data if it indexes data from a provider and that provider has done something foolish, such as change the fields GBIF uses to identify the record (another reason why globally unique identifiers for specimens can't come soon enough). Instead of deleting the old records (and breaking any links to those records) GBIF could simply set their expiry date but keep them hanging around. They would not be used to create consensus records for an occurrence, but if someone used a link that had a now deleted occurrence id they could be redirected to the current cluster that corresponds to that old id, and hence the links would be maintained (albeit pointing to possibly edited data).
This is still a bit half-baked, but I think the challenge GBIF faces is how to make the best of messy data which may lack a single definitive source. The ability for users to correct GBIF-hosted data would be a big step forward, as would the addition of data from GenBank and the primary literature (the latter has the advantage that in many cases it will presumably have been scrutinised by experts). The trick is to make this simple enough that there is a realistic chance of it being implemented.
Wang, Z., Dong, H., Kelly, M., Macklin, J. A., Morris, P. J., & Morris, R. A. (2009). Filtered-Push: A Map-Reduce Platform for Collaborative Taxonomic Data Management. 2009 WRI World Congress on Computer Science and Information Engineering (pp. 731–735). Institute of Electrical and Electronics Engineers. doi:10.1109/CSIE.2009.948
Morris, R. A., Dou, L., Hanken, J., Kelly, M., Lowery, D. B., Ludäscher, B., Macklin, J. A., et al. (2013). Semantic Annotation of Mutable Data. (I. N. Sarkar, Ed.) PLoS ONE, 8(11), e76093. doi:10.1371/journal.pone.0076093
Councill, I. G., Li, H., Zhuang, Z., Debnath, S., Bolelli, L., Lee, W. C., Sivasubramaniam, A., et al. (2006). Learning metadata from the evidence in an on-line citation matching scheme. Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries - JCDL '06 (p. 276). Association for Computing Machinery. doi:10.1145/1141753.1141817
December 12, 2013
The following is a guest blog post by David Schindel and colleagues, written in response to the paper by Antonio Marques et al. in Science (doi:10.1126/science.341.6152.1341-a).
Marques, Maronna and Collins (1) rightly call on the biodiversity research community to include latitude/longitude data in database and published records of natural history specimens. However, they have overlooked an important signal that the community is moving in the right direction. The Consortium for the Barcode of Life (CBOL) developed a data standard for DNA barcoding (2) that was approved and implemented in 2005 by the International Nucleotide Sequence Database Collaboration (INSDC; GenBank, ENA and DDBJ) and revised in 2009. All data records that meet the requirements of the data standard include the reserved keyword 'BARCODE'. The required elements include: (a) information about the voucher specimen from which the DNA barcode sequence was derived (e.g., species name, unique identifier in a specimen repository, country/ocean of origin); (b) a sequence from an approved gene region with minimum length and quality; and (c) primer sequences and the forward and reverse trace files. Participants in the workshops that developed the data standard decided to include latitude and longitude as strongly recommended elements but not as strict requirements, for two reasons. First, many voucher specimens from which BARCODE records are generated may have been collected before GPS devices were available. Second, barcoding projects such as the Barcode of Wildlife Project (4) are concentrating on rare and endangered species. Publishing the GPS coordinates of collecting localities would facilitate illegal collecting and trafficking that could contribute to biodiversity loss.
The BARCODE data standard is promoting precisely the trend toward georeferencing called for by Marques, Maronna and Collins. Table 1 shows that there are currently 346,994 BARCODE records in INSDC (3). Of these BARCODE records, 83% include latitude/longitude data. Despite not being a required element in the data standard, this level of georeferencing is much higher than for all records of the cytochrome c oxidase I gene (COI, the BARCODE region), of 16S rRNA, and of cytochrome b (cytb), another mitochondrial region that was used for species identification prior to the growth of barcoding. Data are also presented on the numbers and percentages of data records that include information on the voucher specimen from which the nucleotide sequence was obtained. In an increasing number of cases, these voucher specimen identifiers in INSDC are hyperlinked to the online specimen data records in museums, herbaria and other biorepositories. Table 2 provides the same data for the time interval used in the Marques et al. letter (1). These tables indicate the clear effect that the BARCODE data standard is having on the community's willingness to provide more complete data documentation.
Table 1. Summary of metadata for GPS coordinates and voucher specimens associated with all data records.
Categories of data records | Total number of GenBank records | With latitude/longitude | With voucher or culture collection specimen IDs
BARCODE  | 347,349   | 286,975 (83%) | 347,077 (~100%)
All COI  | 751,955   | 365,949 (49%) | 531,428 (71%)
All 16S  | 4,876,284 | 461,030 (9%)  | 138,921 (3%)
All cytb | 239,796   | 7,776 (3%)    | 84,784 (35%)

Table 2. Summary of metadata for GPS coordinates and voucher specimens associated with data records submitted between 1 July 2011 and 15 June 2013.

Categories of data records | Total number of GenBank records | With latitude/longitude | With voucher or culture collection specimen IDs
BARCODE  | 160,615   | 132,192 (82%) | 160,615 (100%)
All COI  | 302,507   | 166,967 (55%) | 231,462 (77%)
All 16S  | 1,535,364 | 232,567 (15%) | 49,150 (3%)
All cytb | 74,631    | 2,920 (4%)    | 24,386 (33%)
The DNA barcoding community's data standard is demonstrating two positive trends: better documentation of specimens in natural history collections, and new connectivity between databases of species occurrences and DNA sequences. We believe that these trends will become standard practices in the coming years as more researchers, funders, publishers and reviewers acknowledge the value of, and begin to enforce compliance with, the BARCODE data standard and related minimum information standards for marker genes (5).
DAVID E. SCHINDEL, MICHAEL TRIZNA, SCOTT E. MILLER, ROBERT HANNER, PAUL D. N. HEBERT, SCOTT FEDERHEN, ILENE MIZRACHI