How can we ensure that BOLD data become public as quickly as possible?

I would like to open a discussion on data sharing policy and principles. While this topic matters for the whole of iBOL and BIOSCAN, I believe each nation should also have a clear strategy for it. I discussed the topic with Donald, and we decided to continue the discussion here. I would be grateful for active participation from all of you.

In Finland, we have decided to make ALL barcode data, including sequences and metadata, public almost immediately. A short embargo will be applied to validate the correctness of the data once the sequences are available in BOLD, but after this step the data will be made public regardless of whether they have been published in a scientific context. In Finland, barcoding is financially supported by governmental funders (presently the Academy of Finland), which means that we use taxpayers’ money to generate barcodes. Data generated with public money should also be made available to the public and not kept private by researchers for a long time.

This principle may of course result in a situation that an individual researcher finds undesirable, as the data (perhaps based on specimens collected by her/him) become available to other researchers before they are published in a scientific context. However, I find this risk rather small, especially when it comes to taxonomic research. I think that we should recognize the importance of broad accessibility of data, as indicated by point 4 of the BIOSCAN Strategic Plan, which states:

“Ensure that data in BOLD and mBRAVE are to the fullest extent possible, well documented and accessible under open licenses and following the FAIR principles: Follow best practice internationally for accessibility and reuse of BIOSCAN data.”

If I have understood correctly, the FAIR principles also help to protect against data generated by other researchers being used in an undesirable manner.

I know that some researchers won’t be happy with the idea of making “their” data available that rapidly, but when I turned to the BOLD project managers in Finland and asked whether they had anything against making the projects fully public, no one was opposed. I believe that making the generated barcode data as widely open as possible would benefit not only the community but also individual researchers, and would enhance collaboration between researchers.

Indeed, it would be really nice to learn about your thoughts regarding this topic.

Many thanks, @mamutane, for posting your thoughts on this fundamental matter. I’ve moved this to a new topic of its own so that we can focus on it. As we expand the visibility of this forum, I plan to direct people here to add their thinking.

My own perspective is that we should aim for universal public access to BOLD data and work to resolve any issues that make this difficult for researchers and contributors, whether those issues relate to national practice, institutional expectations or personal wishes.

Clearly, the widest possible access to the fullest possible dataset of barcodes gives the best basis for confident use in barcoding and metabarcoding applications and offers a good foundation for detecting discrepancies and curating the data as a whole. Public open access to all BOLD projects would be good for science.

At a time when natural systems are also facing unprecedented pressures from climate change and the biodiversity crisis, making these data open is something we can all do to assist scientists and policy makers around the world to find ways to monitor change and respond to these pressures. If we keep these data hidden, we cannot support good planning and sustainable outcomes.

Marko and Donald, many thanks for initiating discussion on this matter. I agree with you that we should have many more BOLD data open for public access than we actually have. We had the same discussion about data release policy many years ago, and I still fear that there will not be a simple solution that can be applied to everyone, every region and every case. I share your opinion, Marko, that the risk of being scooped (anticipated) by other, data-mining colleagues is rather small in the vast majority of cases, but in certain groups (e.g. butterflies, hawkmoths, emperor moths) the risk may be larger. Quite often the funding party of a project (research foundation, ministry, private persons who pay for sequencing) does not want the data to be public before the project is finished or before the related scientific papers are published. All this needs to be considered. It seems that German entomologists are more reluctant and scared than Finnish entomologists are. I have several partners who are quite reluctant to have their data public, mainly in those cases where they have invested private money. One last thing is probably quite specific to me personally: I was very lucky to have had the opportunity to generate a huge amount of data, I guess 270,000 barcodes from our museum belonging to more than 55,000 BINs. Validation of these data is very time-consuming (mainly for species from tropical faunas), and I am working hard on it although I have 1000 other tasks here in the museum. Sure, making all data public (many are already public…) may help to get these data validated by the international community, but in most cases the number of potential “validators” (experts) is very small and they are known to me. Thus, a pre-validation step may also be helpful, to avoid making the most awful errors (too) public.
I have now carried out such a pre-validation step with the African geometrids, assembling all the data in one dataset and opening it to some experienced specialists. Samples from tropical countries are often submitted by me as “Lepidoptera” or just (roughly) identified to family level. For such cases an additional field on BOLD (on the specimen page or, better, in the record list) saying “taxonomy unvalidated” might be helpful. But, as I said at the beginning, I agree that we should have many more data public.
With best wishes

Thanks @ahausmann - you are correct that we should explore enhancements to the BOLD workflows and data presentation as a linked issue. The goal should be not just to make more data accessible but also to build the community around the data that helps to improve it as a reliable and trustworthy resource.

Excellent point, that’s exactly what I mean.

Donald, Axel, many thanks for your insightful comments. It is clear that sequences paid for privately are “owned” by the payer. However, I am hoping that when the sequencing is paid for with public money (i.e. by taxpayers), the data would become open after a short embargo, as this would, as Donald pointed out, benefit science. We should perhaps inform people about the low risks and about the FAIR principles. The situation is more complicated when the material used originates from private collections, as is often the case in Finland.

Certainly there are exceptions, for example funded fixed-term research projects. Under such circumstances, there are certainly reasons to keep the data private for several years. However, hopefully not forever.

I have sometimes suggested that we, as a whole consortium with all project managers together, could publish all barcodes in a single massive data release paper. With these data, general patterns of barcode variability could be studied (without caring too much about the identifications), and because the dataset is so massive, such a paper, if well done, could be published even in a top-level journal such as Nature or Science. But this would of course require that as many managers as possible join the team.

One addition: I also believe that being able to show that we not only publish a lot but also publish in top journals would help in gaining funding for barcoding in many countries. Of course, the open question is who would take responsibility for the data analysis. (S)he should be someone who is an expert in handling massive datasets and searching for interesting patterns in them. Well, I am clearly drifting too far into a different topic now.

Thanks again @mamutane. I think such a paper may be a very useful tool for us. As well as analysing what we already have, it could serve as an overview of gaps and needs and how additional investments could point the way to delivering specific results and outcomes.