BIOSCAN and Taxonomy

The BIOSCAN program offers significant opportunities to support taxonomic science and to accelerate description of the world’s biota. This thread is for general exploration of the linkages between BIOSCAN and taxonomic research. These linkages are in both directions, with BIOSCAN needing expert knowledge to interpret and curate data for different taxonomic groups, and with BIOSCAN offering data and insights for separating and characterising species. Early planning will help to maximise benefits and to avoid potential pitfalls. If appropriate, subtopics will be broken out into separate threads for more focused discussion. This opening post will be maintained as an introduction and overview of discussion topics.

Discuss linkages between BIOSCAN and taxonomy below.

BINs, OTUs and Species

The Barcode Index Numbers (BINs) generated by BOLD clustering serve as the units of discovery for barcoding and metabarcoding. They serve in many contexts as operational taxonomic units (OTUs). Many BINs are already identified as representing known and named species. Others may represent known species or currently unnamed species. In all cases, there may be many-to-many mappings between BINs and biological species. BINs are typically associated with one or more scientific names or OTU labels, based on the identifications submitted with the barcode sequences. What enhancements to BOLD or additional tools and services could be developed to improve the representation and usefulness of this information as a tool for taxonomy?

If the BIN system is maintained as a stable identifier scheme, with predictable rules, particularly for handling cases where additional sequences lead to the need to fuse or split the cluster of specimens associated with a BIN, it can provide a predictable and indeed computable basis for tracking understanding of OTUs and their relationships with known species. What features are required to make this possible?

The species identifications offered when submitting barcode sequences to BOLD vary in their reliability. The same BIN may be associated with several divergent identifications. In some cases, this will represent failure of the barcode sequences to separate related species. In other cases, it will represent mistaken identifications. On the other hand, a single binomial may be associated with multiple BINs, which may indicate intraspecific variation, cryptic species or further misidentifications. Information contained in associated metadata may sometimes help with resolving these questions. In particular, barcode sequences associated with type specimens would reliably anchor the reference for a name within a BIN. However, in general, expert knowledge is required to assist with resolving such issues. What tools should be offered, and what assistance can be given to taxonomists and others, to make such curation feasible and efficient?
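
As a rough illustration of the curation problem described above, the sketch below flags BINs that carry more than one submitted name, names that are spread across more than one BIN, and names anchored by a type specimen. The record layout, the `BIN-A` style identifiers and the function names are all hypothetical; they are not BOLD's data model or API.

```python
from collections import defaultdict

# Hypothetical records: (BIN identifier, submitted name, is_type_specimen).
# Identifiers and names are invented for illustration only.
records = [
    ("BIN-A", "Aus bus", False),
    ("BIN-A", "Aus cus", False),
    ("BIN-B", "Aus bus", True),
    ("BIN-C", "Aus dus", False),
]


def build_mappings(records):
    """Index the many-to-many relation between BINs and submitted names."""
    names_by_bin = defaultdict(set)
    bins_by_name = defaultdict(set)
    for bin_id, name, _is_type in records:
        names_by_bin[bin_id].add(name)
        bins_by_name[name].add(bin_id)
    return names_by_bin, bins_by_name


def curation_candidates(records):
    """Return the cases that would need expert review."""
    names_by_bin, bins_by_name = build_mappings(records)
    # One BIN, several names: unresolved species or misidentifications.
    mixed_bins = {b: sorted(n) for b, n in names_by_bin.items() if len(n) > 1}
    # One name, several BINs: variation, cryptic species or misidentifications.
    split_names = {n: sorted(b) for n, b in bins_by_name.items() if len(b) > 1}
    # A type specimen ties its name to the BIN that contains it.
    type_anchors = {name: bin_id for bin_id, name, is_type in records if is_type}
    return mixed_bins, split_names, type_anchors


mixed_bins, split_names, type_anchors = curation_candidates(records)
print(mixed_bins)
print(split_names)
print(type_anchors)
```

A report like this only lists candidates for review. Deciding whether a mixed BIN reflects unresolved species or a misidentification still needs expert judgement.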

BINs and New Species

Mapping the barcodes from millions of specimens inevitably reveals new and apparently unnamed OTUs, many of which will correspond to valid species. Under what circumstances, following what assessment and according to what threshold is it appropriate for these to be described as new species? This has particular relevance in the context of hyperdiverse genera, where massive numbers of apparent species may await description and where it may be difficult to offer reliable morphological characters for well-delimited OTUs. The challenge of offering diagnostic characters grows steeply with the size of the group, since each species must be distinguished from every other member.

How should this challenge be addressed? Should BINs be promoted for wider use as interim identifiers for these taxonomic units? (This would again be a reason for ensuring that the BIN system offers predictability and transparency.) Or should new binomials be assigned to these units subject to best-practice criteria and thresholds? (This would have the advantage of placing these taxonomic units within the most widely understood framework for tracking and reporting biodiversity, but may increase the number of names requiring synonymisation in the future.) Finding the right balance will help to ensure that DNA barcoding and the BIN system are supportive of and supported by the taxonomic community.

This is a helpful introduction to a very important topic.

I would place great emphasis on BIN stability and on having a means of tracking old BIN names as new specimens accumulate and BIN allocations change. I may be using the term loosely, but BINs are akin to DOIs and there needs to be a way of relating new ones to old ones. Until that happens, BINs indicate OTUs but don’t provide a useful means of labelling them before they enter formal taxonomy.

A minor point - it would be helpful if more information could be placed in BIN and specimen comment fields in BOLD.

Thanks, Charles - stability is certainly very important and identifier schemes such as DOIs do offer us some of what we need, although there are many ways this could be implemented. Essentially, we need identifiers that stably and predictably reference something that (according to some agreed approach) we recognise as a continuing OTU concept. The OTU will continue following addition of new specimens whose sequences fall within the cluster for the BIN, and we need a versioning system that allows us to retrieve the content for any BIN at any time. We also need to handle splitting and lumping events that arise when new sequences cause the BOLD algorithms to cluster differently. A lumping event would correspond to a special state for one of the BINs, versioning it as now subsumed under the other BIN (effectively turning it into a subjective synonym). Splitting events are only slightly more complicated and require a new BIN cluster to be spawned (or resurrected from a synonymised BIN). If we can establish models that work this way, we will have the predictable labelling we need, and the BIN OTU model will be intuitive to classical taxonomists.
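
To make the versioning idea more concrete, here is a minimal sketch of a registry that keeps every past membership of a BIN, marks a lumped BIN as subsumed under another, and resurrects it on a later split. The class names, method names and identifiers are hypothetical, and versions are addressed by index rather than by date to keep the example short. It does not describe how BOLD currently works.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BinRecord:
    bin_id: str
    # Each entry is an immutable snapshot of the member specimen IDs.
    versions: list = field(default_factory=list)
    # Set when this BIN has been lumped into another one.
    merged_into: Optional[str] = None


class BinRegistry:
    def __init__(self):
        self.bins = {}

    def create(self, bin_id, specimens):
        self.bins[bin_id] = BinRecord(bin_id, [frozenset(specimens)])

    def add_specimens(self, bin_id, specimens):
        rec = self.bins[bin_id]
        rec.versions.append(rec.versions[-1] | frozenset(specimens))

    def lump(self, absorbed_id, surviving_id):
        """Subsume one BIN under another; the old identifier stays resolvable."""
        absorbed, survivor = self.bins[absorbed_id], self.bins[surviving_id]
        survivor.versions.append(survivor.versions[-1] | absorbed.versions[-1])
        absorbed.merged_into = surviving_id

    def split(self, source_id, new_id, moved):
        """Move specimens out of a BIN into a new or resurrected BIN."""
        moved = frozenset(moved)
        source = self.bins[source_id]
        source.versions.append(source.versions[-1] - moved)
        if new_id in self.bins:  # resurrect a previously lumped BIN
            rec = self.bins[new_id]
            rec.merged_into = None
            rec.versions.append(moved)
        else:
            self.create(new_id, moved)

    def resolve(self, bin_id):
        """Follow lumping links to the BIN that currently holds the concept."""
        while self.bins[bin_id].merged_into is not None:
            bin_id = self.bins[bin_id].merged_into
        return bin_id

    def members(self, bin_id, version=-1):
        return self.bins[bin_id].versions[version]


reg = BinRegistry()
reg.create("BIN-A", {"s1", "s2"})
reg.create("BIN-B", {"s3"})
reg.add_specimens("BIN-A", {"s4"})
reg.lump("BIN-B", "BIN-A")
resolved_after_lump = reg.resolve("BIN-B")
reg.split("BIN-A", "BIN-B", {"s3"})
resolved_after_split = reg.resolve("BIN-B")
print(resolved_after_lump, resolved_after_split)
```

Because snapshots are only ever appended, any earlier state of a BIN can still be retrieved, for example `reg.members("BIN-A", 0)` for its original content.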

I’m replying separately to your comment on the need for more information to go into BIN and specimen comment fields. It would be good if we could identify the kinds of information that would be useful. These might include e.g.:

  1. Plain text notes
  2. Structured annotations that propose values (including replacements for existing values) for properties of the BIN or specimen (e.g. proposing scientific name, correcting locality)
  3. Structured proposals to allocate specimens in different ways from the BOLD clustering algorithm
  4. Responses to these structured annotations and proposals (agreeing or disagreeing) - ideally with some standard mechanism for resolving disagreements
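
The four kinds of information listed above could share one record type, as in the sketch below. The field names, the `kind` values and the `tally` helper are assumptions made for illustration; none of them are existing BOLD fields. The sketch only counts agreement and disagreement and does not propose a mechanism for resolving disputes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    author: str
    target: str  # identifier of the BIN or specimen being annotated
    kind: str    # "note", "value_proposal", "reallocation" or "response"
    text: str = ""                        # 1. plain text note
    property_name: Optional[str] = None   # 2. property being proposed or corrected
    proposed_value: Optional[str] = None  # 2. proposed replacement value
    new_bin: Optional[str] = None         # 3. proposed alternative allocation
    responds_to: Optional[int] = None     # 4. index of the annotation answered
    agrees: Optional[bool] = None         # 4. agreement or disagreement


def tally(annotations, proposal_index):
    """Count responses agreeing and disagreeing with one proposal."""
    responses = [
        a for a in annotations
        if a.kind == "response" and a.responds_to == proposal_index
    ]
    agree = sum(1 for a in responses if a.agrees is True)
    disagree = sum(1 for a in responses if a.agrees is False)
    return agree, disagree


# Invented example data.
annotations = [
    Annotation("reviewer1", "BIN-A", "note", text="Several specimens are worn."),
    Annotation("reviewer1", "specimen-1", "value_proposal",
               property_name="scientific_name", proposed_value="Aus bus"),
    Annotation("reviewer2", "specimen-1", "response", responds_to=1, agrees=True),
    Annotation("reviewer3", "specimen-1", "response", responds_to=1, agrees=False),
    Annotation("reviewer2", "specimen-2", "reallocation", new_bin="BIN-B"),
]

result = tally(annotations, 1)
print(result)
```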

Are there other obvious types of information that you think should be supported?

I think your proposed new fields are very useful, and I think they would cover a lot (perhaps something else might be needed, but I find your four points a VERY good start). I also believe implementing this in BOLD should not be too difficult.
At the very minimum, a “plain text notes” field would be an extremely useful addition, as it could in theory accommodate ‘any’ comment…
And yes, the management of BINs should be similar to that of species (as done in more traditional taxonomy). At the end of the day, a BIN is/should be a testable hypothesis, pretty much as “Species X from author Y” is also a testable hypothesis. That means that changes, synonyms, resurrection from synonymy, etc., would also apply.

Thanks, @Fernandez-Triana. I’m keen to find ways to address these needs as soon as we can.