Bringing the COUNTER Code of Practice for Research Data and COUNTER Release 5.1 together to enhance normalization

https://doi.org/10.60804/8hvp-e561

This blog post is cross-posted on the COUNTER site.

The COUNTER Code of Practice for Research Data marked a key milestone in data evaluation practices by making it possible to report comparable usage counts across platforms. Over the last few months, Make Data Count and COUNTER have collaborated to explore a suitable direction for an update to the Code of Practice for Research Data, and we are now sharing our proposal to merge the Code of Practice for Research Data with COUNTER R5.1 for public consultation.

Read on for the context and motivation for this update, and details on how you can share feedback on the proposal.

The COUNTER Code of Practice for Research Data

The COUNTER Code of Practice for Research Data was released in 2017 as a framework to enable repositories and data-publishing platforms to report the usage of datasets in a standardized way. By providing a common framework to process and report counts for data views and downloads (e.g. noting whether usage originated from humans or machines, and filtering out activity from web robots and spiders), the Code of Practice made it possible to report usage counts that are comparable across platforms. This framework has been implemented by a number of repositories, including Zenodo, Dryad and the Environmental Protection Agency repository.

The development of the Code of Practice for Research Data drew from COUNTER’s experience with standards for usage metrics for scholarly resources, and its recommendations aligned as much as possible with Release 5 of the COUNTER Code of Practice. At the time, the main COUNTER Code of Practice was tailored towards publisher platforms, and the data community felt it was important to have a dedicated framework for reporting on the use of datasets. However, there was also interest in maintaining close communication about updates to both Codes, and in exploring further alignment across the codes in future releases.

Aligning the Codes

In the six years since the release of the Code of Practice for Research Data, repository infrastructure has developed substantially, and COUNTER completed Release 5.1 of the main Code (R5.1), which extends the types of output for which usage can be reported, including datasets. The Make Data Count and COUNTER teams have resumed discussions about an update to the Code of Practice for Research Data, and based on our exploration of the current status of both Codes, we are proposing to merge the Code of Practice for Research Data with R5.1. We base this recommendation on the following factors:

  • Many repositories host diverse research outputs, including articles, data, theses and many others. In order to report usage counts for this variety of resources, such repositories would need to implement and maintain two Codes, resulting in duplication of effort and resources, and a risk of data usage counts being processed differently over time depending on which Code the repository applies. A single Code simplifies adoption by mixed-content repositories and ensures consistency in usage reports.
  • One of the items that arose early in discussions of a revision to the Code of Practice for Research Data was how to accommodate additional granularity when reporting usage of datasets that include multiple files. R5.1 now includes Components, which allow usage to be reported both for an item and for the files nested within it. This structure lets repositories report usage of the dataset as a whole, or usage of individual files within the data record at a more granular level (see the sketch after this list).
  • Key aspects of the Code of Practice for Research Data and R5.1 overlap or have matching attributes. While some differences between the Codes exist, we feel these can be addressed by providing dedicated guidance to repositories that report data usage on how to complete those fields (e.g. instances where the fields do not apply to data and may be left blank).
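
To make the Components idea more concrete, here is a minimal, hypothetical Python sketch of how usage for a multi-file dataset might be represented and rolled up from file level to item level. The field names and the simple summation are illustrative assumptions only; they are not the COUNTER_SUSHI 5.1 report schema or its processing rules (which apply additional filtering, such as double-click detection).

```python
# Illustrative sketch only: a simplified, hypothetical structure showing how
# usage for a multi-file dataset could be reported for the parent item as a
# whole and for each nested file. This is NOT the official COUNTER_SUSHI 5.1
# schema; names and the roll-up are assumptions for illustration.

dataset_usage = {
    "item": "10.1234/example-dataset",  # hypothetical dataset DOI
    "components": [
        {"file": "temperature_2023.csv", "total_item_requests": 40},
        {"file": "README.txt", "total_item_requests": 12},
    ],
}

# Report usage for individual files (Components) at a granular level...
for component in dataset_usage["components"]:
    print(f"{component['file']}: {component['total_item_requests']} requests")

# ...and a simple roll-up for the dataset as a whole (real COUNTER processing
# applies its own rules rather than a plain sum).
item_total = sum(c["total_item_requests"] for c in dataset_usage["components"])
print(f"{dataset_usage['item']}: {item_total} requests in total")
```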

This update will require some changes to how usage reports for data are created, but it is important to note that the process for submitting usage reports does not change. Processed data usage reports can be sent to DataCite for aggregation, and are made accessible via DataCite’s API.
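
As an illustration of the consumer side of that flow, the sketch below fetches aggregated usage counts for a dataset DOI from DataCite’s REST API. The endpoint and the viewCount/downloadCount/citationCount attributes are assumptions based on our understanding of DataCite’s public API, and the DOI is hypothetical; please check DataCite’s current API documentation before relying on this.

```python
# Hedged sketch: retrieve aggregated usage counts for a dataset DOI from the
# DataCite REST API. Endpoint and attribute names are assumptions; verify
# against DataCite's current API documentation.
import requests

DOI = "10.5061/dryad.example"  # hypothetical dataset DOI

response = requests.get(f"https://api.datacite.org/dois/{DOI}", timeout=30)
response.raise_for_status()
attributes = response.json()["data"]["attributes"]

print("Views:", attributes.get("viewCount"))
print("Downloads:", attributes.get("downloadCount"))
print("Citations:", attributes.get("citationCount"))
```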

As part of our work on this proposal, we have consulted with members of repository teams for their input. We thank Zach Crockett (KBase), Alex Ioannidis (Zenodo), Pablo Saiz (Zenodo) and Ana van Gulick (Figshare) for their input and suggestions.

Share your feedback on the proposal

We are confident that this proposal to merge the Code of Practice for Research Data with R5.1 will bring efficiencies to repository teams and ensure consistent and robust reporting of data usage, but of course, we want your feedback!

Please check the details of the proposal and share your feedback on the suggested implementation by repositories hosting data via this form, or via the COUNTER GitHub repository. You can also email us with any comments, queries or concerns: Iratxe Puebla (Make Data Count), Tasha Mellins-Cohen (COUNTER). The proposal is open for public consultation until 31 July 2024. We will present the proposal at the COUNTER conference on 16 May; do join the session to hear more.

‘I would like to better identify citations for data’: Community feedback and use cases for the first release of the Data Citation Corpus

Kicking off the year, we had the pleasure of announcing the first release of the Data Citation Corpus. In response to the multiple expressions of interest in learning more about the project and in using the data file for the corpus, we held a webinar dedicated to the Data Citation Corpus.

With over 300 attendees and plenty of follow-up, the discussions we are having with community members are helping to inform the next steps for the project and to prioritize areas that the community has signaled as requiring further work. Drawing on these conversations, we want to highlight some of the themes that have arisen and address questions posed during the webinar. We also share the link to the webinar recording below for those who missed it or want to review the details.

Scope of the first release

The data file for the first release of the Data Citation Corpus includes 1.3 million data citations from DataCite Event Data, and eight million data mentions identified by the Chan Zuckerberg Initiative (CZI) by mining the full text of articles. Anyone interested in the data file can contact us via this form; we are happy to share it, and we are simply using the form so that we can follow up with users with further information or about possible collaborations. There is also a public dashboard that provides a high-level visualization of the contents of the data file, accessible at http://corpus.datacite.org/dashboard.

The data file covers data-paper links identified through 1) DataCite Event Data, based on the metadata relationships that designate data citations, and 2) CZI’s machine-learning approach, where a mention of a dataset identifier found in the text of an article is designated as a citation.
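
For readers less familiar with the first route, the sketch below shows the kind of metadata relationship that underlies a DataCite Event Data citation link, expressed in the JSON-style payload shape used by DataCite’s REST API. The DOIs are hypothetical, and which relation types are treated as data citations in the corpus should be checked against the corpus documentation.

```python
# Illustrative sketch of a dataset-to-article relationship in DataCite
# metadata. The payload shape follows DataCite's REST API
# (attributes.relatedIdentifiers); DOIs are hypothetical.
dataset_metadata = {
    "data": {
        "type": "dois",
        "attributes": {
            "doi": "10.1234/example-dataset",
            "relatedIdentifiers": [
                {
                    # the dataset declares that it is cited by an article
                    "relationType": "IsCitedBy",
                    "relatedIdentifier": "10.5678/example-article",
                    "relatedIdentifierType": "DOI",
                }
            ],
        },
    }
}
```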

By design, the Data Citation Corpus is focused on citations to data. We recognize the importance of understanding usage of and citations to all kinds of open outputs (including samples, software, protocols and others), but the scope of the corpus lies with datasets.

The relationship between the dataset and the article relies on the identifier for the dataset. We recognize that open datasets vary in how they are created and shared: some include a single file and some multiple components. The community has raised questions about how best to handle citations to individual datasets and to collections of data, and how best to propagate citations from collections to their individual components. While the corpus does not address this use case at this time, it is something we will continue to explore as part of the corpus development.

Coverage & quality

The text mining completed by CZI involved a list of over 40 repositories (see slide 8 here for the repository list). The repositories were selected due to their standing in the community and because curated terms exist for their accession numbers, but it is important to bear in mind that the group focuses on life sciences disciplines. CZI completed text mining on five million open access papers available via Europe PMC (as the open licenses for those articles permit content mining), and thus the literature included also falls mostly within the life sciences. We therefore acknowledge that the text mining completed covers only a slice of the literature. One of the challenges in data citation expressed over the years has been the separate handling of DOIs and accession numbers; the collaboration between CZI and DataCite provides a proof of principle for addressing this challenge and, for the first time, brings together citations to data with accession numbers and with DOIs. As we move ahead, expanding the coverage of the disciplines represented in the Data Citation Corpus is in our pipeline.

During the webinar, Ana-Maria Istrate (CZI) also touched on the fact that the machine-learning model employed to search the article text will have picked up some false positives, for example where a string matching an accession number was also used to designate a grant number or another research entity. This is another aspect of the data citations that we will continue to work on, and we are happy to collaborate with groups interested in improving the model or in further curating the citations included.

We also discussed metadata elements for the data citations, and some known metadata gaps. While we took steps to identify metadata for affiliation, funder and subject information where available, those metadata are not yet available for many citations, and subject area in particular is recorded for under 1% of the data-paper links. This is not a problem specific to the Data Citation Corpus, but rather relates to the broader challenge of metadata completeness for both datasets and articles; nonetheless, we recognize it as a priority area for additional work. We would like to explore approaches to infer discipline information for the datasets, or to leverage AI or other approaches to enrich the discipline-level categorization of the citations.

Uses for the Data Citation Corpus

We have received over 70 requests for the data file of the corpus, from groups and individuals in a variety of roles. Those requesting the file have expressed a common interest in better understanding open data practices and the reach of open data, but their specific use cases are tailored to their professional needs:

  • Researchers: Researchers are interested in using the corpus for bibliometric analyses, to study practices around data sharing and reuse, and to examine correlations between specific parameters of datasets (e.g. metadata quality, whether the data are associated with an article) and the level of reuse.
  • Librarians: Many of the librarians seek to identify citations for data hosted by their institutions and are exploring ways to integrate data citations into scholarship assessment reports.
  • Infrastructure Providers: A key focus for this group is to improve data citation coverage in existing platforms and services, and they are looking to check for additional citations in the corpus to complement those they already store or expose. Infrastructure providers also seek to leverage data citations to enhance their search and discovery features for digital objects. 
  • Institutional Administrators: This group’s main interest is to analyze data citations from their institutions to evaluate research impact, and in turn incentivize open science practices.
  • Publishers: Publisher representatives are looking to identify citations for data, compare those to their existing indices, and analyze bibliographic patterns, especially in the context of open access publications.
  • Data Repositories: These seek to compare citations for their datasets with those found in other sources, to showcase the impact and relevance of their datasets.

There are also additional, more nuanced use cases from other community members, and we are keen to hear about the ways in which the Data Citation Corpus can serve different data evaluation needs. While it is still early days, it is encouraging to see that the value of the Data Citation Corpus is understood across so many sectors of the scholarly community.

We are still at an early stage and much work lies ahead to enhance the Data Citation Corpus as a tool that addresses the needs of these diverse communities, but we are encouraged by the response we have received so far. We are confident that the corpus aligns with a recognized need to better understand the use and reach of datasets. As we work to improve the current data file and incorporate additional data sources, we will continue to actively engage with the community and report on the many ways the Data Citation Corpus is being used.

If you would like to request access to the current data file (and be on the mailing list for further iterations of the corpus as they become available), please complete this form.  

You can find the slides from the talks at Zenodo (here & here) and watch the recording of the webinar on YouTube.

Open Metrics Require Open Infrastructure

By: John Chodacki, Martin Fenner, Daniella Lowenberg

Today, Zenodo announced their intention to remove the altmetrics.com badges from their landing pages, and we couldn’t be more energized by their commitment to open infrastructure, which supports their mission to make scientific information open and free.

“We strongly believe that metadata about records including citation data & other data used for computing metrics should be freely available without barriers” – Zenodo Leadership

In the scholarly communications space, many organizations rally around the idea that we want the world’s knowledge to be discoverable, accessible, and auditable. However, we are not all playing by the same rules. While some groups work to build shared infrastructure, others work to build walls. This can be seen in the building of barriers to entry around information that is freely open, or that should be open and free but isn’t.

In light of emerging needs for metrics and our work at Make Data Count (MDC) to build open infrastructure for data metrics, we believe that it is necessary for corporations or entities that provide analytics and researcher tools to share the raw data sources behind their work. In short, if we trust these metrics enough to display on our websites or add to our CVs, then we should also demand that they be available for us to audit. 

This isn’t a new idea. The original movements to build Article Level Metrics (ALMs) and alternative metrics were founded on this principle. The challenge is that while infrastructure groups have continued to work to capture these raw metrics, the lopsided ecosystem has allowed corporations to productize and sell them, regardless of whether they add true value on top of the open information.

We believe that the open metrics space should be supported, through contributions and usage, by everyone: non-profits, corporations, and community initiatives alike. In supporting open metrics, though, it is particularly important to acknowledge the projects and membership organizations that have moved the needle by networking research outputs through PIDs and rich metadata. We can acknowledge these organizations by advocating for open science graphs and bibliometrics research to be based on their data, so that others can reproduce and audit the assumptions made. Other ideals that we believe should guide the development of the open metrics space include:

  • Publishers and/or products that deal in building connections between research outputs should supply these assertions to community projects under a fully permissive CC0 license.
  • Companies, projects, and products that collect and clean metrics data are doing hard work, and we should applaud them. But we should also recognize that when metrics are factual assertions (e.g., counts, citations), they should be openly accessible.
  • Innovation must continue, and productization can and should help drive innovation, but only as a value add. Aggregating, reporting, making data consumption easier, building analysis tools and creating impact indicators from open data can all be valuable. But we should not reward any project that provides these services at the expense of the underlying data being closed to auditing and reuse.
  • Show our work. We ask researchers to explain their methods and protocols and to publish the data that underlies their research. We can and must do the same for the metrics we use to judge them, and we must hold all actors in this space accountable in this regard as we work toward full transparency.

These principles are core to our mission to build the infrastructure for open data metrics. As emphasis in scholarly communication shifts toward “other research outputs” beyond the journal article, we believe it is important to build intentionally open infrastructure rather than repeat the mistakes made in the metrics systems developed for articles. We know that it is possible for the community to come together and develop the future of open metrics in a non-prescriptive manner and, importantly, built on completely open and reproducible infrastructure.

Publishers: Make Your Data Citations Count!

Many publishers have implemented open data policies and have publicly declared their support of data as a valuable component of the research process. But to give credit to researchers and incentivize data publishing, the community needs to promote proper citation of data. Many publishers have also endorsed the FORCE11 Data Citation Principles, Scholix, and other data citation initiatives, yet we still have not seen the implementation or benefits of proper data citation indexing at the journal level. Make Data Count provides incentives and aims to show researchers the value of their research data by displaying data usage and citation metrics. However, to be able to expose citations, publishers need to promote and index data citations with Crossref so that repositories utilizing the Make Data Count infrastructure can pull citations, evaluate use patterns, and display them publicly.

So, as a publisher, how can you support open research data and incentivize researchers to think about data in the same way they think about articles?

  1. Implement policies that advise researchers to deposit data in a stable repository that provides a persistent, citable identifier for the dataset
  2. Guide researchers to cite their own data, or other data related to their article, in their reference list
  3. Acknowledge data citations in the article, data availability statement, and/or reference list, tag them as data citations, and send them to Crossref in XML via the reference list or the relationships metadata (see the sketch after this list). Crossref has put together a simple guide here.
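
For step 3, the snippet below sketches the kind of relationship markup a deposit might carry to assert a data citation. The element and attribute names are based on our reading of Crossref’s relations schema, and the dataset DOI is hypothetical; treat this as a sketch and follow Crossref’s own guide for the authoritative deposit format.

```python
# Hedged sketch of Crossref relationship markup asserting a data citation.
# Element/attribute names follow our reading of Crossref's relations schema;
# the dataset DOI is hypothetical. Consult Crossref's guide before depositing.
data_citation_relation = """
<rel:program xmlns:rel="http://www.crossref.org/relations.xsd">
  <rel:related_item>
    <rel:description>Dataset analysed in this article</rel:description>
    <rel:inter_work_relation relationship-type="references" identifier-type="doi">
      10.5061/dryad.example
    </rel:inter_work_relation>
  </rel:related_item>
</rel:program>
"""

print(data_citation_relation)
```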

How to engage with MDC?

Make Data Count (MDC) team members are at the center of many initiatives that focus on aspects of metrics, including data-level metrics (DLM). We leverage existing channels to build a new data usage standard and to promote integration and adoption amongst data centers and data consumers.

If you want to get in contact and start collaborating with us, please:

  • Join our mailing list
  • Follow us on Twitter
  • Contact us directly!

Understanding the problem

Journal articles are the currency of scholarly research. As a result, we as a community use sophisticated methods to gauge the impact of research and measure the attention it receives by analyzing article citations, article page views and downloads, and social media metrics. While imprecise, these metrics offer us a way to identify relationships and better understand relative impact. One of the many challenges with these efforts is that scholarly research is made up of a much larger and richer set of outputs beyond traditional publications. Foremost among them is research data. In order to track and report the reach of research data, we must build and maintain new, unique methods for collecting metrics on complex research data. Our project will build the metrics infrastructure required to elevate data to a first-class research output.

In 2014, members of this proposal group were involved in an NSF EAGER research grant entitled Making Data Count: Developing a Data Metrics Pilot. That effort surveyed scholars and publishers and determined which metrics and approaches would offer the most value to the research community. We spent one year researching the priorities of the community, exploring how ideas common to article-level metrics (ALM) could be translated into conventions for data-level metrics (DLM), and building a prototype DLM service. We determined that the community values data citation, data usage, and data download statistics more than it values metrics focused on social media. Based on this research, the project partners went a step further and isolated the gaps in existing data metrics efforts:

  • there are no community-driven standards for data usage stats;
  • no open source tools to collect usage stats according to standards;
  • and no central place to store, index and access data usage stats, together with other DLM, in particular data citations.

This project proposes to fill these gaps by engaging in the following activities:

  1. We will work with COUNTER to develop and publish code of practice recommendations for how data usage should be measured and reported
  2. We will deploy a central online DLM hub based on the Lagotto software for acquiring, managing, and presenting these metrics
  3. We will integrate new data sources and clients of aggregated metrics to serve as exemplars for the integration of data repositories and discovery platforms into a robust DLM ecosystem
  4. We will encourage the growth and uptake of DLMs through an engaged stakeholder community that will advocate, grow, and help sustain DLM services

Throughout each of these activities, an engaged stakeholder community will help us advocate for, grow, and sustain these services. As a result, the community will finally have the infrastructure needed to build relationships and better understand the relative impact of research data.