DataCite launches first release of the Data Citation Corpus

First-of-its-kind aggregation brings together millions of data citations to advance understanding of data usage

https://doi.org/10.60804/r14z-mw10

DataCite, in partnership with the Chan Zuckerberg Initiative (CZI), is delighted to announce the first release of the Data Citation Corpus. A major milestone in the Make Data Count initiative, the release makes eight million data citations openly available and usable for the first time via an interactive dashboard and public data file. We invite the community to engage with the data and provide feedback on this collaborative effort.

As highlighted by Make Data Count, the lack of a centralized resource for citations to datasets has hindered the evaluation of how open data is being used. To address this gap, DataCite, with funding from the Wellcome Trust, has developed an innovative aggregation that brings together for the first time data citations from diverse sources into a comprehensive and publicly accessible resource for the global community.

“There is a pressing need to understand how open data is used, but we have lacked a resource to access this information in a centralized and open manner. The Data Citation Corpus will allow the community to gain access to critical insights on data usage,” said Iratxe Puebla, Director of Make Data Count. “We are thrilled to share the progress from our collaboration with CZI to bring together citations from different sources, and look forward to working with others in the community to expand the breadth and coverage of the corpus.”

The first release of the corpus includes data citations in DataCite and Crossref metadata as well as asserted data citations contributed by CZI, available to the community via a data citation store and dashboard developed by Coko. Leveraging accession numbers from Europe PMC, CZI applied a machine-learning model to a large set of full-text articles and preprints to extract mentions of datasets. This has enabled the first-ever aggregation of citations for datasets with DOIs and accession numbers into a single corpus, giving a more complete picture of data usage.

“As an organization that invests in research data and reference datasets, we believe it is critical to understand how data is shared and reused to enable new scientific discoveries,” said Patricia Brennan, Vice President of Science Technology at the Chan Zuckerberg Initiative. “DataCite has been a leader in this space, providing critical infrastructure for data citation and for tracking its reuse. We’re proud to support them in their vision to build a comprehensive global corpus of actionable data citations.”

The interactive dashboard of the corpus allows users to visualize and report on citations by a variety of facets, such as funder, data repository, or the journal where the article citing the data is published.

A complete data file of all of the citations is also available for additional analysis and evaluation. Request the data file via this form.

Forthcoming releases will focus on addressing existing metadata gaps, such as gaps in the disciplinary information for the datasets, and on incorporating feedback from early adopters. DataCite will also pursue new collaborations with additional citation aggregators to expand the breadth and scale of data citations in the corpus.

Community input is an integral part of this project, and DataCite invites researchers, institutions, funders and infrastructure providers to provide feedback on the first release of the corpus and future development work. Please join us for a webinar on February 22 to learn more about the first release of the corpus and how to use it. Register now to participate in this interactive session.

About DataCite

DataCite is a global community that shares a common interest: to ensure that research outputs and resources are openly available and connected so that their reuse can advance knowledge across and between disciplines, now and in the future. 

About Chan Zuckerberg Initiative

The Chan Zuckerberg Initiative was founded in 2015 to help solve some of society’s toughest challenges — from eradicating disease and improving education, to addressing the needs of our communities. Through collaboration, providing resources and building technology, our mission is to help build a more inclusive, just and healthy future for everyone. 

Make Data Count Summit: The Time Is Now to Advance Data Metrics

This post has been cross-posted on the DataCite blog.

A critical piece of open data infrastructure that has received insufficient attention is the evaluation of data usage. We still lack a clear understanding and a body of evidence on how data are being accessed, utilized, and incorporated into research activities.

While interest in this topic is increasing, there has so far been no dedicated event for discussing the evaluation of data usage and the development of the data metrics required to support such evaluation across both research and government. To address this need, we hosted the Make Data Count Summit on 12-13 September 2023 as a forum to bring diverse stakeholders together to tackle nuanced issues about the importance of open data metrics. The event brought together over 120 attendees in Washington DC, including representatives from research institutions, funders and government, researchers, publishers, and infrastructure providers, with the goal of homing in on actionable items for agencies and institutions to advance data metrics and the evaluation of data usage.

Strong Foundations for Data Metrics

The first day of the event included presentations from ongoing efforts toward data metrics. Daniella Lowenberg (University of California, Office of the President) provided an overview of the work of Make Data Count since 2014. The lessons learnt from the initiative’s work on developing standards and engaging the community have paved the way for a renewed focus on supporting open infrastructure and on building evidence on data usage practices to continue to refine data metrics for diverse uses. Her takeaway: it’s time for a new focus in the open data world, and that focus is undivided attention to the development of open data metrics.

Matt Buys from DataCite and Carly Strasser from the Chan Zuckerberg Initiative provided an update on the ongoing collaborative project to build an Open Global Data Citation Corpus. This momentous project seeks to aggregate citations to data from a variety of sources, including citations from DataCite metadata as well as those from other sources, such as data mentions extracted from full-text articles through machine learning and identifiers from EMBL-EBI. The goal is that once completed, the corpus will provide data usage information to the community, openly, and at a scale not possible before.

Matt Buys (DataCite) and Carly Strasser (Chan Zuckerberg Initiative) present the collaborative project to build an Open Global Data Citation Corpus.

Julia Lane from New York University presented her vision for data as a public asset and her work on the ‘Democratizing Data’ project, which has developed algorithms to identify mentions of data in full-text articles. This project is mining the content of articles in Scopus to surface mentions of data and visualize them through dashboards for stakeholders to explore.

In the panel discussion ‘Policy and Administrative Priorities for Data Metrics in the US’, panelists from different agencies discussed recent developments in the United States seeking to open data, such as the Evidence Act and last year’s OSTP memo. These policies have provided important impetus not only for opening up administrative and research data but also for agencies to consider what their priorities should be for understanding and evaluating the use of data that has been opened up.

Data Metrics Must Be Embedded Across the Ecosystem and Supported by Evidence

On the second day, Nancy Potok, CEO at NAPx Consulting and former Chief Statistician of the United States, provided an overview of the foundations of, and lessons learnt in the five years since, the US Evidence Act, which promoted the release of data to make it more accessible to the public, as well as the use of data to inform policy development.

The subsequent sessions explored the needs for data metrics across different areas of the research process, including funding agencies, institutions, and scholarly communications. Institutional processes for tenure and promotion were highlighted as an area of particular importance in order to drive awareness among researchers and adoption of data evaluation and data metrics.

We also heard the latest evidence on data usage practices and data citation from a panel of bibliometricians, who highlighted the discoverability of datasets and metadata completeness as areas for improvement. The panelists called for further research to inform meaningful data metrics so that we avoid the pitfalls of defaulting to oversimplified and opaque metrics.

Dr Stefanie Haustein (University of Ottawa and ScholCommLab) introduces the session dedicated to bibliometrics studies on data reuse

Prioritizing Data Metrics Now

During the Summit, we invited attendees to share their experiences and suggestions in two breakout group discussions. While these discussions highlighted that data usage evaluation is a complex subject that will require many nuanced conversations across sectors, the message was also clear that we must iterate in incremental steps, and not let perfect be the enemy of good as we drive the conversations forward.

A few of the topics highlighted during the breakout conversations include:

  • Data metrics are nuanced, so we will need to provide clear information and resources for a wide range of stakeholders, giving them what they need to join the conversation and also to lead it within their communities.
  • There is a need to raise awareness and engagement on data metrics across all institutional levels, from individual researchers and administrators to institutional leaders and program managers.
  • Data metrics must be anchored on transparent information and built upon consistent practices, while also being mindful of domain-specific needs.

We thank all the attendees for their engagement during the different sessions. It is clear that there is a shared interest in driving the evaluation of data usage and an understanding that we should collectively work towards meaningful evidence-based metrics. The Make Data Count initiative will be taking forward these conversations and we invite everyone interested to collaborate with us, as we all move forward to advance meaningful data metrics.

You can access the slides for the talks presented at the Summit on Zenodo.