‘I would like to better identify citations for data’: Community feedback and use cases for the first release of the Data Citation Corpus

Kicking off the year, we had the pleasure of announcing the first release of the Data Citation Corpus. In response to the multiple expressions of interest in learning more about the project and in using the data file for the corpus, we held a webinar dedicated to the Data Citation Corpus.

With over 300 attendees and plenty of follow-up, the discussions we are having with community members are helping to inform the next steps for the project and to prioritize the areas that the community has signaled as requiring further work. Drawing on these conversations, we want to highlight some of the themes that have arisen and address questions posed during the webinar. We also share the link to the webinar recording below for those who missed it or want to review details.

Scope of the first release

The data file for the first release of the Data Citation Corpus includes 1.3 million data citations from DataCite Event Data and eight million data mentions that the Chan Zuckerberg Initiative (CZI) identified by mining the full text of articles. Anyone interested in the data file can contact us via this form; we are happy to share it, and use the form only so that we can follow up with users to share further information or discuss possible collaborations. There is also a public dashboard that provides a high-level visualization of the contents of the data file, accessible at http://corpus.datacite.org/dashboard.

The data file covers data-paper links identified through 1) DataCite Event Data, based on the metadata relationships that designate data citations and 2) CZI’s machine learning approach, where a mention of a dataset identifier found in the text of the article is designated as a citation.
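As a rough illustration of how links from the two sources can sit side by side, the sketch below models a data-paper link as a small record and merges batches so that each dataset-article pair appears once. The field names, source labels and sample identifiers are hypothetical and do not reflect the actual schema of the data file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPaperLink:
    dataset_id: str  # DOI or accession number of the dataset (hypothetical field)
    article_id: str  # identifier of the citing article (hypothetical field)
    source: str      # made-up label for where the link came from


def merge_links(*batches):
    """Combine batches of links, keeping one record per (dataset, article) pair."""
    merged = {}
    for batch in batches:
        for link in batch:
            # DOIs are case-insensitive; lowercasing accession numbers as well
            # is a simplification made for this sketch.
            key = (link.dataset_id.lower(), link.article_id.lower())
            merged.setdefault(key, link)  # the first source seen wins
    return list(merged.values())


event_data = [
    DataPaperLink("10.1234/example-dataset", "10.5678/example-article", "datacite-event-data"),
]
text_mined = [
    DataPaperLink("10.1234/EXAMPLE-DATASET", "10.5678/example-article", "czi-text-mining"),
    DataPaperLink("GSE00000", "10.5678/example-article", "czi-text-mining"),
]
links = merge_links(event_data, text_mined)
```

In this toy example the DOI-based link found by both sources is kept once, and the accession-based link is added alongside it.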

By design, the Data Citation Corpus is focused on citations to data. We recognize the importance of understanding usage of and citations to all kinds of open outputs (including samples, software, protocols and others), but the scope of the corpus lies with datasets.

The relationship between the dataset and the article relies on the identifier for the dataset. We recognize that open datasets can vary in how they are created and shared: some include a single file and others multiple components. The community has raised questions about how best to handle citations to individual datasets and collections of data, and how best to propagate citations from collections to their individual components. While the corpus does not address this use case at this time, it is something we will continue to explore as part of the corpus development.

Coverage & quality

The text mining completed by CZI involved a list of over 40 repositories (see slide 8 here for the repository list). The repositories were selected because of their standing in the community and because curated terms exist for their accession numbers, but it is important to bear in mind that this group of repositories focuses on life sciences disciplines. CZI completed text mining on five million open access papers available via Europe PMC (as the open licenses for those articles permit content mining), so the literature included also falls mostly within the life sciences. We acknowledge that the text mining therefore covers only a slice of the literature. One of the challenges in data citation raised over the years has been the separate handling of DOIs and accession numbers; the collaboration between CZI and DataCite provides a proof of principle for addressing this challenge and, for the first time, brings together citations to data with accession numbers and DOIs. As we move ahead, expanding the disciplines covered by the Data Citation Corpus is in our pipeline.

During the webinar, Ana-Maria Istrate (CZI) also touched on the fact that the machine-learning model employed to search the article text would have picked up some false positives, for example where a string matching an accession number was also used to designate a grant number or another research entity. This is another aspect of the data citations that we will continue to work on, and we are happy to collaborate with groups interested in improving the model or in completing further curation of the citations included.
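To illustrate why such false positives arise, the sketch below matches a simplified accession-style pattern and flags matches that appear in a sentence containing grant-related wording. This is a toy heuristic under stated assumptions, not the machine-learning model CZI used; the pattern, the cue words and the sample sentences are all made up for illustration.

```python
import re

# Simplified, hypothetical accession-style pattern: three capitals plus 5-6 digits.
ACCESSION_LIKE = re.compile(r"\b[A-Z]{3}\d{5,6}\b")
# Hypothetical cue words suggesting the string may be a grant number instead.
GRANT_CUES = re.compile(r"\b(grant|award|funded|funding)\b", re.IGNORECASE)


def find_candidates(sentence):
    """Return (matched_string, looks_like_grant) pairs for one sentence."""
    looks_like_grant = bool(GRANT_CUES.search(sentence))
    return [(m.group(0), looks_like_grant) for m in ACCESSION_LIKE.finditer(sentence)]


sentences = [
    "Expression data are available under accession GSE12345.",
    "This work was funded by grant ABC123456.",
]
results = [find_candidates(s) for s in sentences]
```

Both strings match the pattern, but only the second is flagged as a likely grant number. A real curation step would need far richer context than a keyword check.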

We also discussed metadata elements for the data citations, and some known metadata gaps. While we took steps to identify metadata for affiliation, funder and subject information where available, those metadata are not yet available for many citations, and subject area in particular is recorded for under 1% of the data-paper links. This problem is not specific to the Data Citation Corpus; it relates to the broader challenge of metadata completeness for both datasets and articles. Even so, we recognize it as a priority area for additional work. We would like to explore approaches to infer discipline information for the datasets, or leverage AI or other approaches to enrich the discipline-level categorization for the citations.
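For readers who want to gauge such gaps in their own copy of the data, a minimal sketch of a completeness check follows. The record layout and field names are assumptions made for illustration and may not match the actual data file.

```python
def completeness(records, field):
    """Fraction of records with a non-empty value for the given field."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field))
    return filled / len(records)


# Made-up records; None and empty strings both count as missing.
records = [
    {"dataset_id": "10.1234/a", "funder": "Example Funder", "subject": None},
    {"dataset_id": "10.1234/b", "funder": None, "subject": None},
    {"dataset_id": "10.1234/c", "funder": None, "subject": "Biology"},
    {"dataset_id": "10.1234/d", "funder": "Example Funder", "subject": ""},
]
report = {field: completeness(records, field) for field in ("funder", "subject")}
```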

Uses for the Data Citation Corpus

We have received over 70 requests for the data file of the corpus, from groups and individuals in a variety of roles. Those requesting the file have expressed a common interest in better understanding open data practices and the reach of open data, but their specific use cases are tailored to their professional needs:

  • Researchers: Researchers are interested in using the corpus for bibliometric analyses, to study practices around data sharing and reuse, and to examine correlations between specific parameters of datasets (e.g. metadata quality, whether data are associated with an article) and the level of reuse.
  • Librarians: Many of the librarians seek to identify citations for data hosted by their institutions and are exploring ways to integrate data citations into scholarship assessment reports.
  • Infrastructure Providers: A key focus for this group is to improve data citation coverage in existing platforms and services, and they are looking to check for additional citations in the corpus to complement those they already store or expose. Infrastructure providers also seek to leverage data citations to enhance their search and discovery features for digital objects. 
  • Institutional Administrators: This group’s main interest is to analyze data citations from their institutions to evaluate research impact, and in turn incentivize open science practices.
  • Publishers: Publisher representatives are looking to identify citations for data, compare those to their existing indices, and analyze bibliographic patterns, especially in the context of open access publications.
  • Data Repositories: These seek to compare citations for their datasets with those found in other sources, to showcase the impact and relevance of their datasets.

Other community members have raised additional, more nuanced use cases, and we are keen to hear about the different ways in which the Data Citation Corpus can serve different data evaluation needs. It is encouraging to see that the value of the Data Citation Corpus is understood across so many sectors of the scholarly community.

We are still at an early stage and much work lies ahead to enhance the Data Citation Corpus as a tool that addresses the needs of these diverse communities, but we are encouraged by the response we have received so far. We are confident that the corpus aligns with a recognized need to better understand the use and reach of datasets. As we work to improve the current data file and incorporate additional data sources, we will continue to actively engage with the community and report on the many ways the Data Citation Corpus is being used.

If you would like to request access to the current data file (and be on the mailing list for further iterations of the corpus as they become available), please complete this form.  

You can find the slides from the talks at Zenodo (here & here) and watch the recording of the webinar on YouTube.

DataCite launches first release of the Data Citation Corpus

First-of-its-kind aggregation brings together millions of data citations to advance understanding of data usage

https://doi.org/10.60804/r14z-mw10

DataCite, in partnership with the Chan Zuckerberg Initiative (CZI), is delighted to announce the first release of the Data Citation Corpus. A major milestone in the Make Data Count initiative, the release makes eight million data citations openly available and usable for the first time via an interactive dashboard and public data file. We invite the community to engage with the data and provide feedback on this collaborative effort.

As highlighted by Make Data Count, the lack of a centralized resource for citations to datasets has hindered the evaluation of how open data is being used. To address this gap, DataCite, with funding from the Wellcome Trust, has developed an innovative aggregation that, for the first time, brings together data citations from diverse sources into a comprehensive and publicly accessible resource for the global community.

“There is a pressing need to understand how open data is used, but we have lacked a resource to access this information in a centralized and open manner. The Data Citation Corpus will allow the community to gain access to critical insights on data usage,” said Iratxe Puebla, Director of Make Data Count. “We are thrilled to share the progress from our collaboration with CZI to bring together citations from different sources, and look forward to working with others in the community to expand the breadth and coverage of the corpus.”

The first release of the corpus includes data citations in DataCite and Crossref metadata as well as asserted data citations contributed by CZI, available to the community via a data citation store and dashboard developed by Coko. Leveraging accession numbers from Europe PMC, CZI applied a machine-learning model to a large set of full-text articles and preprints to extract mentions of datasets. This has enabled the first-ever aggregation of citations for datasets with DOIs and accession numbers into a single corpus, giving a more complete picture of data usage.

“As an organization that invests in research data and reference datasets, we believe it is critical to understand how data is shared and reused to enable new scientific discoveries,” said Patricia Brennan, Vice President of Science Technology at the Chan Zuckerberg Initiative. “DataCite has been a leader in this space, providing critical infrastructure for data citation and for tracking its reuse. We’re proud to support them in their vision to build a comprehensive global corpus of actionable data citations.”

The interactive dashboard of the corpus allows users to visualize and report on citations by a variety of facets, such as funder, data repository, or the journal where the article citing the data is published.

A complete data file of all of the citations is also available for additional analysis and evaluation. Request the data file via this form.

Forthcoming releases will focus on addressing existing metadata gaps, for example, related to the disciplinary information for the datasets, and on incorporating feedback from early adopters. DataCite will also pursue new collaborations with additional citation aggregators to expand the breadth and scale of data citations in the corpus. 

Community input is an integral part of this project, and DataCite invites researchers, institutions, funders and infrastructure providers to provide feedback on the first release of the corpus and future development work. Please join us for a webinar on February 22 to learn more about the first release of the corpus and how to use it. Register now to participate in this interactive session.

About DataCite

DataCite is a global community that shares a common interest: to ensure that research outputs and resources are openly available and connected so that their reuse can advance knowledge across and between disciplines, now and in the future. 

About Chan Zuckerberg Initiative

The Chan Zuckerberg Initiative was founded in 2015 to help solve some of society’s toughest challenges — from eradicating disease and improving education, to addressing the needs of our communities. Through collaboration, providing resources and building technology, our mission is to help build a more inclusive, just and healthy future for everyone.