Kicking off the year, we had the pleasure of announcing the first release of the Data Citation Corpus. In response to the multiple expressions of interest in learning more about the project and in using the data file for the corpus, we held a webinar dedicated to the Data Citation Corpus.

With over 300 attendees, and plenty of follow on, the discussions we are having with community members are helping us inform the next steps for the project and prioritize areas that the community has signaled as requiring further work. Drawing on these conversations, we wanted to highlight some of the themes that have arisen and address questions posed during the webinar. We also share the link to the webinar recording below for those who missed it or want to review details.

Scope of the first release

The data file for the first release of the Data Citation Corpus includes 1.3 million data citations from DataCite Event Data, and eight million data mentions identified by the Chan Zuckerberg Initiative (CZI) by mining the full text of articles. Anyone interested in the data file can contact us via this form; we are happy to share the data file, and are just using this format to be able to follow up with users with further information or regarding possible collaborations. There is also a public dashboard that provides high-level visualization of the contents of the data file, accessible at http://corpus.datacite.org/dashboard.

The data file covers data-paper links identified through 1) DataCite Event Data, based on the metadata relationships that designate data citations and 2) CZI’s machine learning approach, where a mention of a dataset identifier found in the text of the article is designated as a citation.

By design, the Data Citation Corpus is focused on citations to data. We recognize the importance of understanding usage and citations to all kinds of open outputs (including samples, software, protocols and other) but the scope of the corpus lies with datasets.

The relationship between the dataset and the article relies on the identifier for the dataset. We recognize that open datasets can vary in how they are created and shared, some include a single file and some multiple components. The community has raised questions about how to best handle citations to individual datasets and collections of data, and how to best propagate citations from collections to their individual components. While the corpus does not address this use case at this time, it is something we will continue to explore as part of the corpus development.

Coverage & quality

The text mining completed by CZI involved a list of over 40 repositories (see slide 8 here for the repository list). The repositories were selected due to their standing in the community and because curated terms exist for their accession numbers, but it is important to bear in mind that the group focuses on life sciences disciplines. CZI completed text mining on five million open access papers available via Europe PMC (as the open licenses for those articles permit content mining), and thus, the literature included also falls mostly within the life sciences. We thus acknowledge that the text mining completed includes a slice of the literature. One of the challenges in data citation that had been expressed over the years was the separate handling of DOIs and accession numbers; the collaboration between CZI and DataCite provides a proof-of-principle to address this challenge and, for the first time, bring together citations to data with accession numbers and DOIs. As we move ahead, it is in our pipeline for the Data Citation Corpus to expand coverage of the disciplines represented.

During the webinar, Ana-Maria Istrate (CZI) also touched on the fact that the machine-learning model employed to search the article text would have picked some false positives, for example, if a string matching an accession number was also used to designate a grant number or another research entity. This is another aspect of the data citations that we will continue to work on, and we are happy to collaborate with groups interested in looking at improvements to the model, or in completing further curation of the citations included.

We also discussed metadata elements for the data citations, and some known metadata gaps. While we took steps to identify metadata for affiliation, funder and subject information where available, those metadata are not yet available for many citations, and subject area in particular is only recorded for under 1% of the data-paper links. This is not a problem specific to the Data Citation Corpus, and rather relates to the broader challenge around metadata completeness for both datasets and articles, but we recognize it as a priority area for additional work. We would like to explore approaches to infer discipline information for the datasets, or leverage AI or other approaches to enrich the discipline-level categorization for the citations.

Uses for the Data Citation Corpus

We have received over 70 requests for the data file of the corpus, from groups and individuals in a variety of roles. Those requesting the file have expressed a common interest in better understanding open data practices and the reach of open data, but their specific use cases are tailored to their professional needs:

  • Researchers: Researchers are interested in using the corpus for bibliometric analyses, to study practices around data sharing and reuse, and correlations between specific parameters of datasets (e.g. metadata quality, whether are data are associated with an article) and the level of reuse.
  • Librarians: Many of the librarians seek to identify citations for data hosted by their institutions and are exploring ways to integrate data citations into scholarship assessment reports.
  • Infrastructure Providers: A key focus for this group is to improve data citation coverage in existing platforms and services, and they are looking to check for additional citations in the corpus to complement those they already store or expose. Infrastructure providers also seek to leverage data citations to enhance their search and discovery features for digital objects. 
  • Institutional Administrators: This group’s main interest is to analyze data citations from their institutions to evaluate research impact, and in turn incentivize open science practices.
  • Publishers: Publisher representatives are looking to identify citations for data, compare those to their existing indices, and analyze bibliographic patterns, especially in the context of open access publications.
  • Data Repositories: These seek to compare citations for their datasets with those found in other sources, to showcase the impact and relevance of their datasets.

There are also additional nuanced use cases by other community members and we are keen to hear about the different ways in which the Data Citation Corpus can serve different data evaluation needs. While it is still early days, it is encouraging to see that the value of the Data Citation Corpus is understood across so many sectors of the scholarly community.  

We are still at an early stage and much work lies ahead to enhance the Data Citation Corpus as a tool that addresses the needs of these diverse communities, but we are encouraged by the response we have received so far. We are confident that the corpus aligns with a recognized need to better understand the use and reach of datasets. As we work to improve the current data file and incorporate additional data sources, we will continue to actively engage with the community and report on the many ways the Data Citation Corpus is being used.

If you would like to request access to the current data file (and be on the mailing list for further iterations of the corpus as they become available), please complete this form.  

You can find the slides from the talks at Zenodo (here & here) and watch the recording of the webinar on Youtube.