Join us for the Make Data Count Summit 2024

Make Data Count is pleased to announce the Make Data Count Summit 2024, a two-day meeting dedicated to responsible data metrics and the evaluation of open data usage and impact.

The Make Data Count Summit 2024 will be held in London, UK on September 5-6 and will provide a forum for discussions about evidence-based data metrics and responsible assessment of open data. The Summit will bring together representatives of the research and policy communities, data administrators, funders, policymakers, publishers and infrastructure providers.

Our Make Data Count Summit in Washington last year highlighted that there are strong foundations for data metrics, but also a need to raise awareness and pursue conversations about data evaluation across sectors. By hosting this second Make Data Count Summit this September, we want to provide a dedicated venue for continued engagement with this topic.

The event will incorporate facilitated discussions, panels and brainstorming sessions. The Make Data Count Summit will foster a common understanding of existing data metrics and their use, showcase developments in infrastructure and practices supporting data metrics, and engage groups across policy, research, and research-supporting organizations in the next steps toward evidence-based metrics and the meaningful assessment of research data. 

For more information about the event, as well as registration details, visit

Note this is an in-person-only event. If you are interested in keeping up to date with the discussions but cannot attend in person, do sign up for the Make Data Count newsletter.

Bringing the COUNTER Code of Practice for Research Data and COUNTER Release 5.1 together to enhance normalization

This blog post is cross-posted on the COUNTER site.

The COUNTER Code of Practice for Research Data marked a key milestone in data evaluation practices by making it possible to report comparable usage counts across platforms. Over the past few months, Make Data Count and COUNTER have collaborated to explore a suitable direction for an update to the Code of Practice for Research Data, and we are now sharing our proposal to merge the Code of Practice for Research Data with COUNTER R5.1 for public consultation.

Read on for the context and motivation for this update, and details on how you can share feedback on the proposal.

The COUNTER Code of Practice for Research Data

The COUNTER Code of Practice for Research Data was released in 2017 as a framework to enable repositories and data-publishing platforms to report the usage of datasets in a standardized way. By providing a common framework to process and report counts for data views and downloads (e.g. noting whether usage originated from humans or machines, filtering out activity from web robots and spiders), the Code of Practice made it possible to report usage counts comparable across platforms. This framework has been implemented by a number of repositories, including Zenodo, Dryad and the Environmental Protection Agency repository.
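The kind of processing the Code of Practice describes can be illustrated with a small sketch. This is not the normative algorithm from the Code: the robot list is a made-up fragment, and the 30-second double-click window is an assumption modeled on COUNTER-style double-click filtering.

```python
from datetime import datetime, timedelta

# Illustrative robot/spider user-agent fragments. The Code of Practice
# relies on a maintained exclusion list; this short set is only for the sketch.
ROBOT_AGENTS = {"googlebot", "bingbot", "ahrefsbot"}

# Assumed COUNTER-style double-click window for this sketch.
DOUBLE_CLICK_WINDOW = timedelta(seconds=30)


def countable_requests(log_entries):
    """Filter raw download events into countable requests.

    Each entry is a dict with 'user', 'agent', and 'time' (a datetime).
    Robot traffic is dropped, and repeat requests by the same user within
    the double-click window collapse into a single countable request.
    """
    last_seen = {}
    counted = []
    for entry in sorted(log_entries, key=lambda e: e["time"]):
        if any(bot in entry["agent"].lower() for bot in ROBOT_AGENTS):
            continue  # exclude robots and spiders
        previous = last_seen.get(entry["user"])
        last_seen[entry["user"]] = entry["time"]
        if previous is not None and entry["time"] - previous <= DOUBLE_CLICK_WINDOW:
            continue  # double-click: count only once
        counted.append(entry)
    return counted
```

The point of the sketch is the two processing steps the Code standardizes (robot exclusion and de-duplication), which are what make counts comparable across platforms.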

The development of the Code of Practice for Research Data drew from COUNTER’s experience with standards for usage metrics for scholarly resources, and its recommendations aligned as much as possible with Release 5 of the COUNTER Code of Practice. At the time, the main COUNTER Code of Practice was tailored towards publisher platforms, and the data community felt it was important to have a dedicated framework for reporting on the use of datasets. However, there was also interest in maintaining close communication about updates to both Codes, and in exploring further alignment across the codes in future releases.

Aligning the Codes

In the six years since the release of the Code of Practice for Research Data, repository infrastructure has developed substantially, and COUNTER has completed Release 5.1 of the main Code (R5.1), which extends the types of outputs for which usage can be reported, including datasets. The Make Data Count and COUNTER teams have resumed discussions about an update to the Code of Practice for Research Data, and based on our exploration of the current status of both Codes, we are proposing to merge the Code of Practice for Research Data with R5.1. We base this recommendation on the following factors:

  • Many repositories host diverse research outputs, including articles, data, theses and many others. In order to report usage counts for the variety of resources hosted, these repositories would need to maintain two Codes, resulting in duplication of effort and resources, and a risk of data usage counts being processed differently over time depending on the Code the repository is applying. A single Code simplifies implementation for mixed-content repositories and ensures consistency in usage reports.
  • One of the items that arose early on as part of discussions for a revision to the Code of Practice for Research Data was how to accommodate additional granularity to report usage of datasets that included multiple files. R5.1 now includes Components, which allows reporting usage for both an item and for files nested within the item. This structure allows repositories to report on usage of the dataset as a whole, or to report usage of individual files within the data record at a more granular level. 
  • Key aspects of the Code of Practice for Research Data and R5.1 overlap or have matching attributes. While some differences between the Codes exist, we feel that these can be addressed by providing dedicated guidance to repositories that report data usage on how to complete those fields (e.g. instances where the fields do not apply to data and may be left blank).

This update will require some changes to how usage reports for data are created, but it is important to note that the process for submitting those reports does not change. Processed data usage reports can still be sent to DataCite for aggregation, and are made accessible via DataCite’s API.
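As a rough illustration of what a processed dataset usage report looks like, the sketch below builds a SUSHI-style JSON payload. The field names approximate the shape of COUNTER JSON reports; the platform name, DOI and counts are hypothetical, so treat this as illustrative rather than a normative example of either Code.

```python
import json

# Illustrative, SUSHI-style dataset usage report. Field names follow the
# general shape of COUNTER JSON reports but are not a normative example;
# the platform, DOI and counts are invented for the sketch.
report = {
    "report-header": {
        "report-name": "dataset report",
        "release": "rd1",
        "created": "2024-05-01",
        "created-by": "example-repository",  # hypothetical platform
    },
    "report-datasets": [
        {
            "dataset-title": "Example dataset",
            "dataset-id": [{"type": "doi", "value": "10.1234/example"}],  # hypothetical DOI
            "performance": [
                {
                    "period": {"begin-date": "2024-04-01", "end-date": "2024-04-30"},
                    "instance": [
                        {"metric-type": "total-dataset-requests", "count": 120},
                        {"metric-type": "unique-dataset-requests", "count": 85},
                    ],
                }
            ],
        }
    ],
}

# Serialize as it would be sent to an aggregator such as DataCite.
payload = json.dumps(report, indent=2)
```

The key structural idea carried over from the Code of Practice for Research Data is the separation between a report header and per-dataset performance instances, each carrying a metric type and a count for a reporting period.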

As part of our work on this proposal, we have consulted with members of repository teams for their input. We thank Zach Crockett (KBase), Alex Ioannidis (Zenodo), Pablo Saiz (Zenodo) and Ana van Gulick (Figshare) for their input and suggestions.

Share your feedback on the proposal

We are confident that this proposal to merge the Code of Practice for Research Data with R5.1 will bring efficiencies to repository teams and ensure consistent and robust reporting of data usage, but of course, we want your feedback!

Please check the details of the proposal and share your feedback on the suggested implementation by repositories hosting data via this form, or the COUNTER Github repository. You can also email us with any comments, queries or concerns: Iratxe Puebla (Make Data Count), Tasha Mellins-Cohen (COUNTER). The proposal is open for public consultation until 31 July 2024. We’ll present the proposal at the COUNTER conference on May 16; do join the session to hear more.

Job Posting: Make Data Count Communications Consultant

Make Data Count is seeking a part-time Communications Consultant to support the initiative’s outreach and the dissemination of its work on data metrics to the community.

Make Data Count is a global, community-driven initiative focused on establishing standardized metrics for the evaluation and reward of research data reuse and impact. Make Data Count facilitates the recognition of data as a primary research output, promoting data sharing and reuse across the community.

The Communications Consultant will work with the Make Data Count Director and the Make Data Count Advisory Group to communicate events, resources and activities related to Make Data Count and data metrics.


The Communications Consultant will handle a diverse set of communication tasks based on the strategic direction set by the Make Data Count Director, including: 

Drafting and editing

  • Blog posts
  • Make Data Count newsletter
  • Social media posts for X/Twitter, Mastodon and LinkedIn

Community engagement

  • Post on X/Twitter, Mastodon and LinkedIn, as well as in relevant Slack groups, responding to replies and engaging the community
  • Develop Make Data Count’s presence on these and other relevant networks

Events coordination

  • Support the organization of webinars or other virtual events
  • Communications and coordination of preparations for the Make Data Count Summit 

This is a remote contractor position. The work is expected to be around 14 hrs/week. Compensation is expected to range from 35 to 50 euros/hour, or the equivalent in another currency, depending on experience. The expected start date can be as early as 3 June 2024.

Desired skills

The ideal candidate will be:

  • Experienced in outreach, event organization and/or community management
  • Proficient in a variety of communications tools, including WordPress, Mailchimp, etc.
  • Experienced managing social media accounts
  • Passionate about open science and interested in driving positive change in research culture and research evaluation practices
  • Familiar with the open research data ecosystem, including recent trends and developments 
  • A strong writer and verbal communicator. Other proficiencies (e.g. video, audio, illustration, etc.) will be an advantage

How to apply

Please send a short cover letter, resume/CV, and three work samples (these could include blog posts or other similar articles, videos, illustrations, etc) to with the words “Communications Consultant” in the subject line. We will review applications on a rolling basis.

‘I would like to better identify citations for data’: Community feedback and use cases for the first release of the Data Citation Corpus

Kicking off the year, we had the pleasure of announcing the first release of the Data Citation Corpus. In response to the multiple expressions of interest in learning more about the project and in using the data file for the corpus, we held a webinar dedicated to the Data Citation Corpus.

With over 300 attendees and plenty of follow-up, the discussions we are having with community members are helping us shape the next steps for the project and prioritize areas that the community has signaled as requiring further work. Drawing on these conversations, we wanted to highlight some of the themes that have arisen and address questions posed during the webinar. We also share the link to the webinar recording below for those who missed it or want to review details.

Scope of the first release

The data file for the first release of the Data Citation Corpus includes 1.3 million data citations from DataCite Event Data, and eight million data mentions identified by the Chan Zuckerberg Initiative (CZI) by mining the full text of articles. Anyone interested in the data file can contact us via this form; we are happy to share the data file, and we simply use the form so that we can follow up with users with further information or regarding possible collaborations. There is also a public dashboard that provides a high-level visualization of the contents of the data file, accessible at

The data file covers data-paper links identified through 1) DataCite Event Data, based on the metadata relationships that designate data citations, and 2) CZI’s machine-learning approach, where a mention of a dataset identifier found in the text of an article is designated as a citation.
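The mention-mining idea can be illustrated with a toy pattern match. The regular expression below recognizes only one GenBank-like accession shape (one letter plus five digits, or two letters plus six digits); it is an illustrative assumption, far simpler than the curated, repository-specific patterns and machine-learning model CZI used.

```python
import re

# Toy GenBank-like accession pattern: one letter + five digits, or two
# letters + six digits (e.g. U12345, AF231982). Real pipelines use curated
# per-repository patterns plus model-based disambiguation of candidates.
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")


def find_candidate_mentions(text):
    """Return accession-shaped strings found in an article passage."""
    return ACCESSION.findall(text)
```

Note that a bare pattern like this happily matches accession-shaped strings that are not dataset identifiers at all (a grant number such as "AB123456", say), which is exactly why the production approach needs curation and disambiguation on top of pattern matching.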

By design, the Data Citation Corpus is focused on citations to data. We recognize the importance of understanding usage of and citations to all kinds of open outputs (including samples, software, protocols and others), but the scope of the corpus lies with datasets.

The relationship between the dataset and the article relies on the identifier for the dataset. We recognize that open datasets can vary in how they are created and shared: some comprise a single file and others multiple components. The community has raised questions about how best to handle citations to individual datasets and collections of data, and how best to propagate citations from collections to their individual components. While the corpus does not address this use case at this time, it is something we will continue to explore as part of the corpus development.

Coverage & quality

The text mining completed by CZI involved a list of over 40 repositories (see slide 8 here for the repository list). The repositories were selected due to their standing in the community and because curated terms exist for their accession numbers, but it is important to bear in mind that the group focuses on life sciences disciplines. CZI completed text mining on five million open access papers available via Europe PMC (as the open licenses for those articles permit content mining), and thus, the literature included also falls mostly within the life sciences. We therefore acknowledge that the text mining completed covers only a slice of the literature. One of the challenges in data citation expressed over the years was the separate handling of DOIs and accession numbers; the collaboration between CZI and DataCite provides a proof of principle for addressing this challenge and, for the first time, brings together citations to data with accession numbers and DOIs. As we move ahead, expanding the disciplinary coverage of the Data Citation Corpus is in our pipeline.

During the webinar, Ana-Maria Istrate (CZI) also touched on the fact that the machine-learning model employed to search the article text would have picked up some false positives, for example, if a string matching an accession number was also used to designate a grant number or another research entity. This is another aspect of the data citations that we will continue to work on, and we are happy to collaborate with groups interested in looking at improvements to the model, or in completing further curation of the citations included.

We also discussed metadata elements for the data citations, and some known metadata gaps. While we took steps to identify metadata for affiliation, funder and subject information where available, those metadata are not yet available for many citations, and subject area in particular is only recorded for under 1% of the data-paper links. This is not a problem specific to the Data Citation Corpus, but rather relates to the broader challenge around metadata completeness for both datasets and articles; we nonetheless recognize it as a priority area for additional work. We would like to explore approaches to infer discipline information for the datasets, or to leverage AI or other approaches to enrich the discipline-level categorization for the citations.

Uses for the Data Citation Corpus

We have received over 70 requests for the data file of the corpus, from groups and individuals in a variety of roles. Those requesting the file have expressed a common interest in better understanding open data practices and the reach of open data, but their specific use cases are tailored to their professional needs:

  • Researchers: Researchers are interested in using the corpus for bibliometric analyses, to study practices around data sharing and reuse, and to examine correlations between specific parameters of datasets (e.g. metadata quality, whether the data are associated with an article) and the level of reuse.
  • Librarians: Many of the librarians seek to identify citations for data hosted by their institutions and are exploring ways to integrate data citations into scholarship assessment reports.
  • Infrastructure Providers: A key focus for this group is to improve data citation coverage in existing platforms and services, and they are looking to check for additional citations in the corpus to complement those they already store or expose. Infrastructure providers also seek to leverage data citations to enhance their search and discovery features for digital objects. 
  • Institutional Administrators: This group’s main interest is to analyze data citations from their institutions to evaluate research impact, and in turn incentivize open science practices.
  • Publishers: Publisher representatives are looking to identify citations for data, compare those to their existing indices, and analyze bibliographic patterns, especially in the context of open access publications.
  • Data Repositories: These seek to compare citations for their datasets with those found in other sources, to showcase the impact and relevance of their datasets.

There are also additional, more nuanced use cases from other community members, and we are keen to hear about the different ways in which the Data Citation Corpus can serve different data evaluation needs. While it is still early days, it is encouraging to see that the value of the Data Citation Corpus is understood across so many sectors of the scholarly community.

We are still at an early stage and much work lies ahead to enhance the Data Citation Corpus as a tool that addresses the needs of these diverse communities, but we are encouraged by the response we have received so far. We are confident that the corpus aligns with a recognized need to better understand the use and reach of datasets. As we work to improve the current data file and incorporate additional data sources, we will continue to actively engage with the community and report on the many ways the Data Citation Corpus is being used.

If you would like to request access to the current data file (and be on the mailing list for further iterations of the corpus as they become available), please complete this form.  

You can find the slides from the talks at Zenodo (here & here) and watch the recording of the webinar on YouTube.

GREI recommendations to support consistent practices to collect, expose and aggregate citations to open data

The Generalist Repository Ecosystem Initiative (GREI) has as one of its objectives the implementation of open metrics. A consistent approach to data citations is an important step to drive meaningful metrics that provide visibility on data usage, signal the added value of data repositories and enable reporting on the reach of NIH-funded research data. Make Data Count has engaged with the GREI repositories to review their existing approaches to data citations and develop a common resource on best practices for handling data citations at repositories.

Why data citations? 

Data citations are a useful measure for understanding the use of research data. Data citations recognize the individual(s) or organization(s) that collected and shared the data, and researcher surveys regularly show that researchers value receiving citations to their datasets (see for example The State of Open Data report, or research by Kathleen Gregory and colleagues).

Kathleen Gregory et al. ‘A survey investigating disciplinary differences in data citation’. Figure 14 showing preferences for how respondents would like others to refer to their own data.

Many repositories have taken steps to implement workflows to collect and expose citations, and importantly, recent developments in machine learning have opened up new ways to identify citations to data and scale the data citations available to the community.

GREI repositories best practices

All of the GREI repositories (Dataverse, Dryad, Figshare, Open Science Framework, Mendeley Data, Vivli, and Zenodo) already collect data citations or have it on their roadmap to add this feature. Building on their practices and experience in this area, this group of generalist repositories has developed a set of recommendations for handling data citations in repositories. The recommendations include information for how repositories can handle different aspects of data citations:

Workflows to collect, store and expose data citations

  • Collecting data citations: This can take place through self-reporting by authors as part of the data deposit process, or by harvesting data citations from external sources such as DataCite, Crossref, Dimensions, Europe PMC or NASA ADS.
  • Storing data citations: Repositories collect data citations via the metadata for the datasets they host; the recommendations for handling citations are based on the metadata fields recommended by GREI to establish a relationship between the dataset and the citing object.
  • Exposing data citations: Repositories should expose the citations on the landing page for the dataset record, indicating the provenance (i.e. source) for the data citation. 
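In DataCite metadata, the dataset-to-article relationship is typically expressed through the relatedIdentifiers field. The sketch below builds a minimal fragment of that shape; the DOI is hypothetical, and the exact serialization a repository submits depends on its integration with DataCite.

```python
# Minimal sketch of the relatedIdentifiers fragment of a DataCite metadata
# record for a dataset, asserting that a (hypothetical) article cites it.
def citation_fragment(citing_article_doi):
    """Build a relatedIdentifiers entry recording an incoming citation."""
    return {
        "relatedIdentifier": citing_article_doi,
        "relatedIdentifierType": "DOI",
        "relationType": "IsCitedBy",  # the dataset is cited by the article
    }


fragment = citation_fragment("10.5555/example-article")
```

Registering this relationship in the dataset's metadata is what allows the citation to be picked up downstream, since aggregation services read these typed links rather than free-text references.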

‘Cite As’ template

The recommendations advise data repositories to provide a citation template on the landing page of the dataset, in order to encourage researchers and other parties to cite datasets they use. 

‘Cite As’ template example from Dryad.

Aggregation & discoverability of data citations

In order to enable aggregation and discoverability of the connections between datasets and other scholarly objects, data citations should be submitted to DataCite. In addition to making data citations available to the community via its API services, DataCite also exposes citations via the DataCite Commons portal, which enables searches for resources with persistent identifiers and connections to metadata provided by DataCite, Crossref, ORCID, ROR and re3data.
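As a sketch of how a client might retrieve these aggregated links, the function below builds a query against DataCite's public Event Data service. The endpoint and parameter names here are assumptions based on that public API, so check the current DataCite API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed DataCite Event Data endpoint; verify against current API docs.
BASE_URL = "https://api.datacite.org/events"


def citation_events_url(dataset_doi, page_size=100):
    """Build a URL requesting 'is-cited-by' events for a dataset DOI.

    The parameter names ('doi', 'relation-type-id', 'page[size]') are
    assumptions modeled on DataCite's public Event Data service.
    """
    params = {
        "doi": dataset_doi,
        "relation-type-id": "is-cited-by",
        "page[size]": page_size,
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

A repository or aggregator would page through the returned events to assemble the set of citing articles for its datasets.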

Data Citation Corpus

The community has so far lacked a straightforward way of obtaining information about data citations from different repositories and across the literature. To address this challenge, DataCite is working on the development of the Data Citation Corpus, which will provide a centralized resource that compiles data citations from a variety of sources, and make data citation information readily and openly available to the community.

The Data Citation Corpus will include data citations in DataCite; this incorporates citations deposited by DataCite-member repositories, including the GREI repositories. We invite all repositories to contribute their data citations to DataCite so that those citations can be integrated into the Data Citation Corpus.

A need for further community discussion 

The discussions within the GREI group leading to these recommendations have highlighted many areas of alignment across these generalist repositories for the handling of data citations. At the same time, our discussions also made it clear that there are areas where additional community discussion is needed in order to develop further consensus and guidance. A particular theme that sparked interest relates to the designation of provenance of data citations, and the level of detail that repositories and DataCite should provide on this. The group felt that signaling the level of validation for a data citation (i.e. whether the citation was verified by an independent curator, self-reported by the author, harvested from another source, etc.) could help increase trust in this information, but we feel that guidance on this would require a broader community conversation. We welcome input from the community on this and other topics that may be useful to explore for future updates to the GREI recommendations.

We hope that this GREI resource for data citations encourages other repositories to adopt workflows to collect, store and expose citations to data in an open and consistent manner. We welcome feedback from the community. Do you have comments or suggestions on these recommendations? Please contact GREI.

About GREI

The Generalist Repository Ecosystem Initiative (GREI) is a U.S. National Institutes of Health (NIH) program sponsored by the Office of Data Science Strategy that has brought together seven generalist repositories to collaborate on establishing “a common set of cohesive and consistent capabilities, services, metrics, and social infrastructure” and increasing awareness and adoption of the FAIR principles.

DataCite launches first release of the Data Citation Corpus

First-of-its-kind aggregation brings together millions of data citations to advance understanding of data usage

DataCite, in partnership with the Chan Zuckerberg Initiative (CZI), is delighted to announce the first release of the Data Citation Corpus. A major milestone in the Make Data Count initiative, the release makes eight million data citations openly available and usable for the first time via an interactive dashboard and public data file. We invite the community to engage with the data and provide feedback on this collaborative effort.

As highlighted by Make Data Count, the lack of a centralized resource for citations to datasets has hindered the evaluation of how open data is being used. To address this gap, DataCite, with funding from the Wellcome Trust, has developed an innovative aggregation that brings together for the first time data citations from diverse sources into a comprehensive and publicly accessible resource for the global community.

“There is a pressing need to understand how open data is used, but we have lacked a resource to access this information in a centralized and open manner. The Data Citation Corpus will allow the community to gain access to critical insights on data usage,” said Iratxe Puebla, Director of Make Data Count. “We are thrilled to share the progress from our collaboration with CZI to bring together citations from different sources, and look forward to working with others in the community to expand the breadth and coverage of the corpus.”

The first release of the corpus includes data citations in DataCite and Crossref metadata as well as asserted data citations contributed by CZI, available to the community via a data citation store and dashboard developed by Coko. Leveraging accession numbers from Europe PMC, CZI applied a machine-learning model to a large set of full-text articles and preprints to extract mentions to datasets. This has enabled the first-ever aggregation of citations for datasets with DOIs and accession numbers into a single corpus, enabling a more complete picture of data usage.

“As an organization that invests in research data and reference datasets, we believe it is critical to understand how data is shared and reused to enable new scientific discoveries,” said Patricia Brennan, Vice President of Science Technology at the Chan Zuckerberg Initiative. “DataCite has been a leader in this space, providing critical infrastructure for data citation and for tracking its reuse. We’re proud to support them in their vision to build a comprehensive global corpus of actionable data citations.”

The interactive dashboard of the corpus allows users to visualize and report on citations by a variety of facets, such as funder, data repository, or the journal where the article citing the data is published.

A complete data file of all of the citations is also available for additional analysis and evaluation. Request the data file via this form.

Forthcoming releases will focus on addressing existing metadata gaps, for example, related to the disciplinary information for the datasets, and on incorporating feedback from early adopters. DataCite will also pursue new collaborations with additional citation aggregators to expand the breadth and scale of data citations in the corpus. 

Community input is an integral part of this project and DataCite invites researchers, institutions, funders and infrastructure providers to provide feedback on the first release of the corpus and future development work. Please join us for an online webinar on February 22 to learn more about the first release of the corpus and how to use it. Register now to participate in this interactive session.

About DataCite

DataCite is a global community that shares a common interest: to ensure that research outputs and resources are openly available and connected so that their reuse can advance knowledge across and between disciplines, now and in the future. 

About Chan Zuckerberg Initiative

The Chan Zuckerberg Initiative was founded in 2015 to help solve some of society’s toughest challenges — from eradicating disease and improving education, to addressing the needs of our communities. Through collaboration, providing resources and building technology, our mission is to help build a more inclusive, just and healthy future for everyone. 

Make Data Count Summit: The Time Is Now to Advance Data Metrics

This post has been cross-posted on the DataCite blog.

A critical piece of open data infrastructure that has received insufficient attention is the evaluation of data usage. We still lack a clear understanding and a body of evidence on how data are being accessed, utilized, and incorporated into research activities.

While interest in this topic is increasing, there has so far not been a dedicated event for discussions on the evaluation of data usage, and on the development of the data metrics required to support such evaluation across both research and government. To address this need, we hosted the Make Data Count Summit on 12-13 September 2023, as a forum to bring diverse stakeholders together to tackle nuanced issues about the importance of open data metrics. The event brought together over 120 attendees in Washington DC, including representatives from research institutions, funders and government, researchers, publishers, and infrastructure providers, with the goal of homing in on actionable items for agencies and institutions to advance data metrics and the evaluation of data usage.

Strong Foundations for Data Metrics

The first day of the event included presentations from ongoing efforts toward data metrics. Daniella Lowenberg (University of California, Office of the President) provided an overview of the work of Make Data Count since 2014. The lessons learnt from the initiative’s work on developing standards and engaging the community have paved the way for a renewed focus on supporting open infrastructure and on building evidence on data usage practices to continue to refine data metrics for diverse uses. Her takeaway: it’s time for a new focus in the open data world, and that focus is undivided attention to the development of open data metrics.

Matt Buys from DataCite and Carly Strasser from the Chan Zuckerberg Initiative provided an update on the ongoing collaborative project to build an Open Global Data Citation Corpus. This momentous project seeks to aggregate citations to data from a variety of sources, including citations from DataCite metadata as well as those from other sources, such as data mentions extracted from full-text articles through machine learning and identifiers from EMBL-EBI. The goal is that once completed, the corpus will provide data usage information to the community, openly, and at a scale not possible before.

Matt Buys (DataCite) and Carly Strasser (Chan Zuckerberg Initiative) present the collaborative project to build an Open Global Data Citation Corpus.

Julia Lane from New York University presented her vision for data as a public asset and her work on the ‘Democratizing Data’ project, which has developed algorithms to identify mentions of data in full-text articles. This project is mining the content of articles in Scopus to surface mentions of data and visualize those through dashboards for stakeholders to explore.

In the panel discussion ‘Policy and Administrative Priorities for Data Metrics in the US’, panelists from different agencies discussed recent developments in the United States seeking to open up data, such as the Evidence Act and last year’s OSTP memo. These policies have provided important impetus not only for opening up administrative and research data but also for agencies to consider what their priorities should be for understanding and evaluating the use of the data that has been opened up.

Data Metrics Must Be Embedded Across the Ecosystem and Supported by Evidence

On the second day, Nancy Potok, CEO at NAPx Consulting and former Chief Statistician of the United States, provided an overview of the foundations and lessons learnt in the five years since the US Evidence Act, which promoted the release of data to make it more accessible to the public, as well as the use of data to inform policy development.

The subsequent sessions explored the need for data metrics across different parts of the research ecosystem, including funding agencies, institutions, and scholarly communications. Institutional tenure and promotion processes were highlighted as particularly important for driving researcher awareness and adoption of data evaluation and data metrics.

We also heard the latest evidence on data usage practices and data citation from a panel of bibliometricians, who highlighted the discoverability of datasets and metadata completeness as areas for improvement. The panelists called for further research to inform meaningful data metrics so that we avoid the pitfalls of defaulting to oversimplified and opaque metrics.

Dr Stefanie Haustein (University of Ottawa and ScholCommLab) introduces the session dedicated to bibliometrics studies on data reuse

Prioritizing Data Metrics Now

During the Summit, we invited attendees to share their experiences and suggestions in two breakout group discussions. While these discussions highlighted that data usage evaluation is a complex subject that will require many nuanced conversations across sectors, the message was also clear that we must iterate in incremental steps, and not let perfect be the enemy of good as we drive the conversations forward.

A few of the topics highlighted during the breakout conversations include:

  • Data metrics are nuanced; we will need to provide clear information and resources for a wide range of stakeholders so that they have the information they need to join the conversation and also to lead it within their communities.
  • There is a need to raise awareness and engagement on data metrics across all institutional levels – from individual researchers and administrators to institutional leaders and program managers.
  • Data metrics must be anchored in transparent information and built upon consistent practices, while also being mindful of domain-specific needs.

We thank all the attendees for their engagement during the different sessions. It is clear that there is a shared interest in driving the evaluation of data usage and an understanding that we should collectively work towards meaningful evidence-based metrics. The Make Data Count initiative will carry these conversations forward, and we invite everyone interested to collaborate with us as we advance meaningful data metrics.

You can access the slides for the talks presented at the Summit on Zenodo.

Announcing the Inaugural MDC Summit

Make Data Count (MDC) is convening a two-day summit dedicated to the evaluation of open data usage, reach, and impact. 

Our inaugural Make Data Count Summit, taking place in Washington, DC, on September 12 and 13, will bring together representatives of the research community, government data administrators, funders, policymakers, publishers, and infrastructure providers to discuss and solve the diverse and complex challenges of implementing open data assessment metrics and the infrastructures that support them.

For years the MDC initiative has been focused on bringing together the research ecosystem to prioritize open data metrics through the development of social and technical infrastructure for data citation and data usage. By bringing together groups from across the research and policy landscape, the Summit’s goal is to evaluate and highlight key success stories and use cases from the past decades of investment in open data, and to establish a collective vision for evidence-based data metrics.

Incorporating facilitated discussions, panels, and brainstorming sessions, the event will focus on identifying concrete next steps to drive adoption and recognition of data metrics. The meeting will especially examine how to draw on existing data metrics initiatives to develop solutions for improving the academic and governmental infrastructures that support data impact evaluation globally.

For more details and to register for the Summit:

Data citations in context: We need disciplinary metadata to move forward

By Kathleen Gregory and Anton Ninkov

Data citations hold great promise for a variety of stakeholders. Unfortunately, due in part to a lack of metadata, particularly metadata about disciplinary domains, many of those promises remain out of reach. Metadata providers – repositories, publishers and researchers – play a key role in improving the current situation.

The potential uses of data citations are many. From the research perspective, citations to data can help researchers discover existing datasets and understand or verify claims made in the academic literature. Citations are also seen as a way to give credit for producing, managing and sharing data, as well as to provide legal attribution. Researchers, funders and repository managers also hope that data citations can provide a mechanism for tracking and understanding the use and ‘impact’ of research data [1]. Bibliometricians, who study patterns in scholarly communication by tracing publications, citations and related metadata, are also interested in using data citations to understand engagement with data and the relationships between data and other forms of research output.

Figure 1. Perspectives about the potentials of data citation [2]

Realizing the potential of data citations relies on having complete, detailed and standardized metadata describing the who, what, when, where and how of data and their associated work. As we are discovering in the Meaningful Data Counts project, which brings together bibliometricians and members of the research data community as part of the broader Make Data Count initiative, the metadata needed to provide context for both data and data citations are often not provided in standardized ways…if they are provided at all. 

As a first step in this project, we have been mapping the current state of metadata, shared data, and data citations available in the DataCite corpus. Our openly available Jupyter notebook pulls real-time metadata about data in DataCite [3] and demonstrates both the evolving nature of the corpus and the lack of available metadata. In particular, our work highlights the current lack of information about a critical metadata element for providing context about data citations – the disciplinary domain where data were created.
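The kind of coverage check the notebook performs can be sketched in a few lines of Python. The helper below computes the share of DOI records carrying at least one subject entry; the record shape mirrors the `data` array returned by the public DataCite REST API (e.g. `GET https://api.datacite.org/dois?resource-type-id=dataset`), though the sample records here are invented for illustration and this is not the notebook's actual code.

```python
def subject_coverage(records):
    """Fraction of DOI records carrying at least one subject/discipline entry.

    `records` follows the shape of the `data` array returned by the
    DataCite REST API; an empty `subjects` list counts as missing metadata.
    """
    if not records:
        return 0.0
    tagged = sum(
        1 for r in records
        if r.get("attributes", {}).get("subjects")
    )
    return tagged / len(records)

# Two toy records mimicking the API response shape: one with an OECD
# Fields of Science subject, one with no subject metadata at all.
sample = [
    {"attributes": {"subjects": [
        {"subject": "Earth and related environmental sciences",
         "subjectScheme": "Fields of Science and Technology (FOS)"}]}},
    {"attributes": {"subjects": []}},
]
print(f"coverage: {subject_coverage(sample):.0%}")  # → coverage: 50%
```

Run against the full corpus rather than a toy sample, this is the calculation behind the coverage percentages reported below.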

For example, we find that the amount of data available in DataCite increased by more than 1.5 million individual datasets over a seven-month period from January to July 2021, when the corpus grew from 8,243,204 to 9,930,000 datasets. In January, only 5.7% of the available datasets had metadata describing their disciplinary domain according to the most commonly used subject classification system (see the treemap in Figure 2). In July, despite the increased number of datasets overall, the percentage with a disciplinary domain dropped slightly to 5.63%.

Figure 2. Data with metadata describing disciplinary domain, according to the OECD Fields of Science classification, retrieved on July 9th, 2021. For an interactive version of this tree map, with the most current data, please see our Jupyter Notebook [3]

These low percentages reflect the fact that the subject or disciplinary domain of data is not a required field in the DataCite metadata schema. For the nearly 6% of data that do have subject information, the corpus contains multiple classification schemes of differing granularity, ranging from the more general to the more specific. DataCite is currently working to automatically map these classifications to each other in order to improve disciplinary metadata. Organizations that submit their data to DataCite also have a role to play in improving these disciplinary descriptions, as this information underlies many of these mapping efforts.

Subject or disciplinary classifications for data are typically created using three methods:

  • Intellectually, where researchers, data creators or data curators use their expertise to assign a relevant subject classification.
  • Automatically, where automated techniques are used to extract subject information from other data descriptions, e.g. the title or abstract (if available).
  • By proxy, where data are assigned the same subject classification as a related entity, e.g. when data are given the same subject classification as the repository where they are stored. This can be done either automatically or manually.

Of these three methods, the intellectual method tends to be the most common, and also the most accurate and time-consuming approach. This method is often carried out by those closest to the data, i.e. researchers/data creators or data curators, who have expert knowledge about the data’s subject or disciplinary context which may be difficult to determine either automatically or by proxy.
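To make the "by proxy" method concrete, here is a minimal sketch in which every dataset simply inherits the subject classification of its host repository. The repository names and subject mapping are hypothetical, chosen only to illustrate the lookup.

```python
# Hypothetical repository-to-subject mapping; in practice such a table
# would be curated per repository or derived from repository registries.
REPOSITORY_SUBJECTS = {
    "pangaea.de": "Earth and related environmental sciences",
    "clinicaltrials.gov": "Health sciences",
}

def subject_by_proxy(dataset):
    """Assign a subject "by proxy": return the subject of the dataset's
    host repository, or None if the repository is unmapped."""
    return REPOSITORY_SUBJECTS.get(dataset.get("repository"))

print(subject_by_proxy({"repository": "pangaea.de"}))
# → Earth and related environmental sciences
```

The simplicity of the lookup is exactly the method's trade-off: it scales effortlessly, but a repository-level label can never be more specific than the repository itself.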

While our work also exposes other examples of missing or incomplete metadata [4], we highlight here the current lack of information about disciplinary domains, as disciplinary information is important across all the perspectives shown in Figure 1. For example, disciplinary norms influence how data are shared, how they are made available, how they are understood and how they are reused. Information about disciplines is important for discovering data and is typically used by funders and research evaluators to place academic work in context. Disciplinary analyses are also a critical step in contextualizing citation practices in bibliometric studies, as citation behaviours have repeatedly been shown to follow discipline-specific patterns. Without disciplinary metadata, placing data citations into context will remain elusive and meaningful data metrics cannot be developed. 

In order to move forward with understanding data citations in context, we need better metadata – metadata about disciplinary domains, but also metadata describing other aspects of data creation and use. Metadata providers, from publishers to researchers to data repositories, can help to improve the current situation by working to create complete metadata records describing their data. Only with such metadata can the potentials of data citation be achieved. 


[1] These perspectives are visible, e.g. in the Joint Declaration of Data Citation Principles:

Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014

[2] Gregory, K. (2021, July). Bringing data in sight: Data citations in research. Presentation. Presented at Forum Bibliometrie 2021, Technical University of Munich, online. 

[3] Ninkov, A. (2021). antonninkov/ISSI2021: Datasets on DataCite – an Initial Bibliometric Investigation (1.0) [Computer software]. Zenodo.

[4] Ninkov, A., Gregory, K.; Peters, I., Haustein, S. (2021). Datasets on DataCite – An initial bibliometric investigation. Proceedings of the 18th International Conference of the International Society for Scientometrics and Informetrics, Leuven, Belgium (virtual). Preprint: