A key learning from our initiative is that the community needs a clear understanding of data reuse to monitor impact, inform future funding, and improve the dissemination of research. The development of a trusted central aggregate of all references to research data across articles, preprints, government documents, and other outputs will help achieve our goal of building responsible, meaningful data metrics.

In 2023, the Wellcome Trust awarded funds to build the Open Global Data Citation Corpus to dramatically transform the data citation landscape. Through this award, DataCite has partnered with the Chan Zuckerberg Initiative, EMBL-EBI, and other organizations that scrape and assert data citations.

Why the corpus

Information about the use and reuse of data is currently stored in disparate locations, often in third-party systems, which limits our understanding of the reuse and impact of open data. The Open Global Data Citation Corpus addresses this challenge by providing a comprehensive, centralized resource that compiles data citations from a variety of sources and makes them accessible to the community.

The corpus aims to enable different stakeholders, including funders and institutions, to evaluate the reach of open datasets produced and shared by researchers, and to enable large-scale analyses that build evidence on data reuse practices across institutions and disciplines. The corpus will be made available as an open CC0 community resource.

Data Sources

The corpus will include citations collected via persistent identifier metadata as well as data citations identified by third-party sources through techniques such as machine learning and curation.

  • Persistent identifier authorities: Sources that collect citations as part of their DOI registration workflow, such as DataCite and Crossref.
  • Third-party aggregators: Sources that aggregate or discover citations through various techniques, such as full-text mining and curation. For example, the Chan Zuckerberg Initiative (CZI) is contributing data citations identified through the CZI Knowledge Graph, which mines the text of publications via a machine-learning algorithm.

The citations in the corpus will be available for download via a data dump and retrievable via an API. The corpus will also include a user interface that supports visualizations according to different parameters (e.g. institution or data repository).

To support the trustworthiness of the information stored, the Data Citation Corpus will expose where multiple sources have provided the same citation and indicate the sources for each citation. Citations will be deduplicated for aggregation in interfaces and APIs, but users will be able to access, and filter by, the provenance of records.
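
As an illustration of deduplication that preserves provenance, here is a minimal Python sketch. It is not the corpus implementation: the record fields (`dataset`, `publication`, `source`), the source labels, and the DOIs are all hypothetical, and matching on lower-cased identifier pairs is an assumed, simplified notion of "the same citation".

```python
from collections import defaultdict


def merge_citations(records):
    """Collapse duplicate citations while keeping every source that asserted them."""
    merged = defaultdict(set)
    for rec in records:
        # Normalize identifiers so case or whitespace differences do not hide duplicates.
        key = (rec["dataset"].strip().lower(), rec["publication"].strip().lower())
        merged[key].add(rec["source"])
    return [
        {"dataset": dataset, "publication": publication, "sources": sorted(sources)}
        for (dataset, publication), sources in sorted(merged.items())
    ]


def filter_by_source(citations, source):
    """Keep only the citations asserted by the given source."""
    return [c for c in citations if source in c["sources"]]


# Hypothetical input: two sources assert the same dataset-publication link.
records = [
    {"dataset": "10.1234/DATA.1", "publication": "10.5555/article.1", "source": "datacite"},
    {"dataset": "10.1234/data.1", "publication": "10.5555/article.1", "source": "czi"},
    {"dataset": "10.1234/data.2", "publication": "10.5555/article.2", "source": "czi"},
]

citations = merge_citations(records)
for citation in citations:
    print(citation)
```

Each merged record carries a `sources` list, so an interface can count the citation once while still showing, and filtering by, who provided it.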

The Corpus Prototype

The corpus prototype will provide an initial demonstration to showcase the value of incorporating data citations from different sources and the possible ways in which users will be able to interact with the corpus.

The prototype is based on a seed file that includes data citations from the following sources:

  • Data citations from DataCite and Crossref DOI metadata, via Event Data.
  • Data citations from the Chan Zuckerberg Initiative Science Knowledge Graph, identified via a Named Entity Recognition model that searches for mentions of datasets in the full text of articles in Europe PMC.

The dashboard for the prototype allows users to visualize the full content of the corpus or narrow the results according to specific filters, such as the affiliation associated with the dataset or the repository where the dataset is hosted.
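
The kind of filtering the dashboard offers can be sketched in a few lines of Python. This is only an assumed model of faceted filtering, not the dashboard's code: the field names (`repository`, `affiliation`) and all sample values are hypothetical.

```python
from collections import Counter


def filter_citations(citations, **facets):
    """Return the citations whose fields match every given facet value."""
    return [
        c for c in citations
        if all(c.get(field) == value for field, value in facets.items())
    ]


def count_by(citations, field):
    """Count citations per value of one field, e.g. for a bar chart."""
    return Counter(c.get(field, "unknown") for c in citations)


# Hypothetical sample records.
citations = [
    {"dataset": "10.1234/data.1", "repository": "Repo A", "affiliation": "University X"},
    {"dataset": "10.1234/data.2", "repository": "Repo A", "affiliation": "University Y"},
    {"dataset": "10.1234/data.3", "repository": "Repo B", "affiliation": "University X"},
]

subset = filter_citations(citations, repository="Repo A", affiliation="University X")
per_repository = count_by(citations, "repository")
print(subset)
print(per_repository)
```

Calling `filter_citations` with no facets returns the full content, which mirrors the unfiltered dashboard view.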

Corpus development: Next steps

The prototype is the initial stage of our work toward the development of the data citation corpus. We will next focus on developing a minimum viable product (MVP) for the corpus that will involve:

  • Ingestion of data citations from additional sources
  • Enhancements to the dashboard and corpus visualizations, including the analysis and refinement of the metadata fields (facets) available to filter content in the corpus
  • Enrichment of the subject information for data citations
  • Development of an API

After completion of the MVP, we will pursue additional work to achieve broad coverage of data citations in the corpus and bring it into production. At that stage, the corpus will be available via a data dump, an API, and the dashboard of corpus visualizations.

We will seek community input so that we can continue to add data sources to the corpus and work on strengthening the quality and coverage of data citations included. Ultimately, we hope that the corpus will be a valuable resource as part of processes that evaluate the impact and reach of shared datasets, and facilitate bibliometric analyses to build evidence around data reuse. With tools that enable greater transparency and rigor in the assessment of the use and reach of open datasets, we can incentivize greater data sharing and in turn support the community in delivering faster and more rigorous research discoveries.

Get involved

Community feedback will be key throughout the development of the corpus, and we are keen to collaborate with individuals, groups, and organizations interested in contributing citations to the corpus or exploring the use of the corpus as part of their evaluation processes. If you would like to learn more about the corpus or discuss a possible collaboration, please contact Iratxe Puebla, Director of Make Data Count.

Here are ways in which community members can contribute to the development of the corpus:

Repositories

  • Submit citations via the metadata when registering DOIs. Documentation on contributing citations via DataCite DOI metadata is available at Contributing Citations and References. DOI metadata can be updated to include citations after the initial registration.
  • Track data citations for the hosted datasets and display these and other usage metrics on the landing pages of the dataset records. DataCite provides information on consuming data citations via DataCite Event Data.

Organizations that identify or collect data citations via their own processes

  • Submit citations to the Open Global Data Citation Corpus. We welcome expressions of interest; do get in touch if you are interested in learning more.

Publishers

  • Submit data citations as part of the Crossref metadata deposit. Guidance on submitting data citations is available via Crossref’s Data and software citation deposit guide.

Institutions and funders

  • We are keen to learn more about potential uses for the corpus as part of institutional processes. If you are interested in providing feedback on the corpus, please email Iratxe Puebla, Director of Make Data Count.

Watch the project kick-off webinar to learn more about the data citation corpus, including perspectives from DataCite, Wellcome Trust, Chan Zuckerberg Initiative, EMBL-EBI, COKI, OpenAIRE, and OpenCitations.
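
For repositories contributing citations via DataCite DOI metadata, the following Python sketch shows roughly what a citation looks like as a related identifier. The field names follow the `relatedIdentifiers` property of the DataCite Metadata Schema as we understand it; the DOIs and the helper function are hypothetical, and the Contributing Citations and References documentation remains the authority on which relation types count as citations.

```python
import json


def add_citation(metadata, citing_doi, relation_type="IsCitedBy"):
    """Record on a dataset's metadata that it is cited by another DOI."""
    entry = {
        "relatedIdentifier": citing_doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,
    }
    related = metadata.setdefault("relatedIdentifiers", [])
    # Skip the entry if it is already present, so repeated updates stay clean.
    if entry not in related:
        related.append(entry)
    return metadata


# Hypothetical dataset record and citing article.
dataset_metadata = {"doi": "10.1234/data.1"}
add_citation(dataset_metadata, "10.5555/article.1")
add_citation(dataset_metadata, "10.5555/article.1")  # second call adds nothing

print(json.dumps(dataset_metadata, indent=2))
```

Because DOI metadata can be updated after registration, a repository could run a step like this whenever it learns of a new citing publication and then resubmit the metadata.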
