A trusted central aggregate of all data citations to further our understanding of data usage and advance meaningful data metrics

Make Data Count seeks to work with the community to further our understanding of data usage to monitor impact, inform future funding, and improve the dissemination of research. The development of a trusted central aggregate of all citations to research data across articles, preprints, government documents, and other outputs will help achieve our goal of building responsible, meaningful data metrics.

In 2023, the Wellcome Trust awarded funds to build an open Data Citation Corpus to dramatically transform the data citation landscape. Through this award, DataCite has partnered with Chan Zuckerberg Initiative, EMBL-EBI, and other organizations that identify and assert data citations.

Why the corpus

Information about the use of data is currently stored in disparate locations, which limits our understanding of the reach and impact of open data. The Data Citation Corpus addresses this challenge by providing a comprehensive, centralized resource that compiles data citations from a variety of sources and makes them accessible to the community. The corpus aims to enable different stakeholders, including funders and institutions, to evaluate the reach of open datasets produced and shared by researchers, and to facilitate large-scale analyses that build evidence on data usage practices across institutions and disciplines. The corpus will be made available as an open CC0 community resource.

Data Sources

The corpus aggregates data citations collected via:

  • Persistent identifier authorities: Sources that collect citations as part of their DOI registration workflow, such as DataCite and Crossref.
  • Third-Party Aggregators: Sources that aggregate or discover citations through various techniques, such as full-text mining and curation. For example, the Chan Zuckerberg Initiative (CZI) has contributed data citations identified through mining the text of publications via a machine-learning algorithm.

To support the trustworthiness of the information stored, the corpus will expose where multiple sources have provided the same citation and indicate the sources for each citation. Citations will be deduplicated for aggregation, but users will be able to access, and filter by, the provenance of records.
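The deduplication-with-provenance idea can be illustrated with a minimal sketch. This is not the corpus's actual implementation: the record layout, the identifiers, the source labels, and the function names `deduplicate` and `filter_by_source` are all hypothetical, chosen only to show how a citation asserted by several sources can collapse into one entry while every source stays visible.

```python
from collections import defaultdict

# Hypothetical records: (dataset_id, publication_id, source).
# The identifiers and source labels are illustrative, not the corpus schema.
records = [
    ("10.5061/dryad.example1", "10.1234/article.a", "datacite"),
    ("10.5061/DRYAD.example1", "10.1234/article.a", "czi"),
    ("10.5061/dryad.example2", "10.1234/article.b", "crossref"),
]


def deduplicate(records):
    """Merge records asserting the same citation, keeping every source."""
    merged = defaultdict(set)
    for dataset_id, publication_id, source in records:
        # Assumption: identifiers are compared case-insensitively.
        key = (dataset_id.lower(), publication_id.lower())
        merged[key].add(source)
    # Sort the sources so the provenance list is stable across runs.
    return {key: sorted(sources) for key, sources in merged.items()}


def filter_by_source(citations, source):
    """Keep only citations that were asserted by the given source."""
    return {key: srcs for key, srcs in citations.items() if source in srcs}


citations = deduplicate(records)
czi_only = filter_by_source(citations, "czi")
```

Here the first two records describe the same dataset-publication pair, so they collapse into a single citation whose provenance lists both sources, and that citation still appears when filtering by either source.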

First release of the corpus

The first release of the corpus demonstrates the value of incorporating data citations from different sources and the ways in which users will be able to interact with the corpus.

The first release is based on a seed file that includes data citations from the following sources:

  • Data citations from DataCite and Crossref DOI metadata, via Event Data.
  • Data citations from the CZI Science Knowledge Graph, identified via a Named Entity Recognition model that searches for mentions of datasets in the full text of journal articles and preprints in Europe PMC.

The citations in the first release of the corpus are available via a data file.

The corpus also includes a dashboard that allows users to visualize the current content of the corpus or narrow the results according to specific filters, such as the affiliation associated with the dataset or the repository where the dataset is hosted.

The dashboard visualizations include:

  • Citation counts over time: Count of data citations from 2013 to 2023. These can be filtered by Repository, Subject, Affiliation, Funder, Journal and Publisher.
  • Citation counts by publisher: Count of data citations by publisher. These can be filtered by Repository, Subject, Affiliation and Funder.
  • Counts of unique repositories, journals, subjects, affiliations, funders: Breakdown of the current coverage in the corpus for journals, affiliations, repositories and subjects.
  • Citation counts by subject: Count of data citations per dataset subject (where available). These can be filtered by Affiliation, Funder and Publisher.
  • Citation counts by source of citation: Counts for citations ingested from DataCite Event Data and CZI Science Knowledge Graph.
  • Data citations corpus growth: Citation counts and ingest date (into the corpus) by identifier type over time.
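For readers who want to reproduce this kind of filtered count from their own citation records, the sketch below shows the general pattern. It assumes each citation is a dictionary with hypothetical `year`, `repository`, and `publisher` keys; these names and the sample values are illustrative and do not describe the format of the corpus data file or how the dashboard is built.

```python
from collections import Counter

# Hypothetical citation rows; keys and values are illustrative only.
rows = [
    {"year": 2021, "repository": "Repo A", "publisher": "Publisher X"},
    {"year": 2021, "repository": "Repo B", "publisher": "Publisher X"},
    {"year": 2022, "repository": "Repo A", "publisher": "Publisher Y"},
    {"year": 2023, "repository": "Repo A", "publisher": "Publisher X"},
]


def count_by(rows, field, **filters):
    """Count rows per value of `field`, keeping rows that match all filters."""
    selected = (
        row for row in rows
        if all(row.get(key) == value for key, value in filters.items())
    )
    return Counter(row[field] for row in selected)


# Citation counts over time, narrowed to one repository.
per_year = count_by(rows, "year", repository="Repo A")

# Citation counts by publisher, with no filter applied.
per_publisher = count_by(rows, "publisher")
```

Passing several keyword filters at once, for example `count_by(rows, "year", repository="Repo A", publisher="Publisher X")`, narrows the count to rows that match all of them.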

Next steps

The next stages of our work on the Data Citation Corpus will involve addressing existing gaps in metadata for data citations, enhancing the dashboard and corpus visualizations, and ingesting data citations from additional sources.

We will pursue additional work to achieve broad coverage of data citations in the corpus. Ultimately, we hope that the corpus becomes a valuable resource as part of processes that evaluate the impact and reach of datasets, and facilitates bibliometric analyses to build evidence on data usage. With tools that enable greater transparency in the assessment of the use and reach of open datasets, we can incentivize greater data sharing, and in turn, accelerate and enhance the rigor of research discoveries.

Get involved

Would you like to learn more about the Data Citation Corpus or become a pilot partner in the project? Please complete this form and we will follow up with you.

Here are ways in which community members can contribute to the corpus:

Repositories

Submit citations via the metadata when registering DOIs. DataCite provides documentation on Contributing Citations and References via DataCite DOI metadata. DOI metadata can be updated to include citations after the initial registration.

Track data citations for the hosted datasets and display this and other usage information on the landing pages of the dataset records. DataCite provides information on consuming data citations via DataCite Event Data.

Organizations that identify or collect data citations via their own processes

Submit citations to the Data Citation Corpus. We welcome expressions of interest; do get in touch if you are interested in learning more.

Publishers

Submit data citations as part of the Crossref metadata deposit; see Crossref’s Data and software citation deposit guide.

STM, DataCite and Crossref’s joint statement on research data lists recommended best practices for data sharing and data citation.

Institutions and funders

We are keen to learn more about potential uses for the corpus as part of institutional processes. If you are interested in providing feedback and becoming a pilot partner for the project, please email Iratxe Puebla, Director of Make Data Count.

Watch the webinar for the first release of the Data Citation Corpus to learn more about the project, the contribution by Chan Zuckerberg Initiative, and what’s included in the first release of the corpus.

Publisher Data Citation Resources