‘I would like to better identify citations for data’: Community feedback and use cases for the first release of the Data Citation Corpus

Kicking off the year, we had the pleasure of announcing the first release of the Data Citation Corpus. In response to the many expressions of interest in learning more about the project and in using the corpus data file, we held a webinar dedicated to the Data Citation Corpus.

The webinar drew over 300 attendees and plenty of follow-up, and the discussions we are having with community members are informing the next steps for the project and helping us prioritize the areas the community has signaled as requiring further work. Drawing on these conversations, we want to highlight some of the themes that have arisen and address questions posed during the webinar. We also share the link to the webinar recording below for those who missed it or want to review the details.

Scope of the first release

The data file for the first release of the Data Citation Corpus includes 1.3 million data citations from DataCite Event Data and eight million data mentions identified by the Chan Zuckerberg Initiative (CZI) by mining the full text of articles. Anyone interested in the data file can contact us via this form; we are happy to share it, and we use the form simply so that we can follow up with users with further information or about possible collaborations. There is also a public dashboard that provides a high-level visualization of the contents of the data file, accessible at http://corpus.datacite.org/dashboard.

The data file covers data-paper links identified through 1) DataCite Event Data, based on the metadata relationships that designate data citations, and 2) CZI’s machine-learning approach, in which a mention of a dataset identifier found in the text of an article is designated as a citation.
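For readers who want to explore the file programmatically, here is a minimal sketch of tallying data-paper links by the source that identified them. The JSON Lines layout and the field names (dataset_id, publication_id, source) are illustrative assumptions for this example, not the actual schema of the data file.

```python
import json
from collections import Counter

# Hypothetical record layout, for illustration only -- the real schema
# of the corpus data file may differ:
# {"dataset_id": "10.5061/dryad.example",
#  "publication_id": "10.1234/example.article",
#  "source": "datacite"}

def count_by_source(path):
    """Tally data-paper links by the pipeline that identified them."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            counts[record["source"]] += 1  # e.g. "datacite" or "czi"
    return counts

if __name__ == "__main__":
    print(count_by_source("data-citation-corpus.jsonl"))
```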

By design, the Data Citation Corpus is focused on citations to data. We recognize the importance of understanding usage of and citations to all kinds of open outputs (including samples, software, protocols, and others), but the scope of the corpus is datasets.

The relationship between the dataset and the article relies on the identifier for the dataset. We recognize that open datasets vary in how they are created and shared: some consist of a single file, others of multiple components. The community has raised questions about how best to handle citations to individual datasets and to collections of data, and how best to propagate citations from collections to their individual components. While the corpus does not address this use case at this time, it is something we will continue to explore as part of the corpus development.

Coverage & quality

The text mining completed by CZI covered a list of over 40 repositories (see slide 8 here for the repository list). The repositories were selected for their standing in the community and because curated terms exist for their accession numbers, but it is important to bear in mind that the group is concentrated in the life sciences. CZI completed text mining on five million open access papers available via Europe PMC (whose open licenses permit content mining), so the literature included also falls mostly within the life sciences. We acknowledge, then, that the text mining covers only a slice of the literature. One challenge in data citation expressed over the years has been the separate handling of DOIs and accession numbers; the collaboration between CZI and DataCite provides a proof of principle for addressing this challenge and, for the first time, brings together citations to data identified by accession numbers and by DOIs. As we move ahead, expanding the disciplinary coverage of the Data Citation Corpus is in our pipeline.

During the webinar, Ana-Maria Istrate (CZI) also noted that the machine-learning model used to search the article text will have picked up some false positives, for example where a string matching an accession number was actually a grant number or another research entity. This is another aspect of the data citations that we will continue to work on, and we are happy to collaborate with groups interested in improving the model or in further curating the citations included.
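To illustrate how such false positives arise, consider the toy pattern below: a single regular expression standing in for the much richer machine-learning model CZI actually uses. Any string shaped like an accession number matches, whether it identifies a dataset or a grant.

```python
import re

# Toy pattern resembling a GenBank-style accession number (illustrative
# only; CZI's pipeline is a machine-learning model, not a single regex).
ACCESSION_LIKE = re.compile(r"\b[A-Z]{2}\d{6}\b")

text = (
    "Sequences were deposited in GenBank under accession AB123456. "
    "This work was supported by grant DE987654 from the agency."
)

# Both strings match the pattern, but only the first is a data citation;
# the grant number is a false positive that downstream curation must catch.
print(ACCESSION_LIKE.findall(text))  # ['AB123456', 'DE987654']
```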

We also discussed metadata elements for the data citations and some known metadata gaps. While we took steps to identify affiliation, funder, and subject information where available, those metadata are not yet present for many citations, and subject area in particular is recorded for under 1% of the data-paper links. This problem is not specific to the Data Citation Corpus; it reflects the broader challenge of metadata completeness for both datasets and articles. But we recognize it as a priority area for additional work. We would like to explore ways to infer discipline information for the datasets, including AI-based approaches, to enrich the discipline-level categorization of the citations.
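As one illustration of the kind of enrichment we have in mind, a simple heuristic is to fall back on the host repository when a dataset carries no subject metadata. The repository-to-discipline mapping below is entirely hypothetical and is not how the corpus currently assigns subjects.

```python
# Illustrative heuristic: infer a dataset's discipline from its host
# repository when subject metadata is missing. The repository names and
# labels here are examples, not the corpus's actual enrichment method.
REPOSITORY_DISCIPLINE = {
    "GenBank": "Biological sciences",
    "Protein Data Bank": "Biological sciences",
    "PANGAEA": "Earth and environmental sciences",
}

def infer_discipline(record):
    """Use explicit subject metadata if present, else the repository label."""
    return record.get("subject") or REPOSITORY_DISCIPLINE.get(
        record.get("repository"), "Unknown"
    )

print(infer_discipline({"repository": "GenBank"}))  # Biological sciences
```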

Uses for the Data Citation Corpus

We have received over 70 requests for the data file of the corpus, from groups and individuals in a variety of roles. Those requesting the file have expressed a common interest in better understanding open data practices and the reach of open data, but their specific use cases are tailored to their professional needs:

  • Researchers: Researchers are interested in using the corpus for bibliometric analyses, to study practices around data sharing and reuse, and to examine correlations between specific parameters of datasets (e.g., metadata quality, or whether the data are associated with an article) and the level of reuse.
  • Librarians: Many of the librarians seek to identify citations for data hosted by their institutions and are exploring ways to integrate data citations into scholarship assessment reports.
  • Infrastructure Providers: A key focus for this group is to improve data citation coverage in existing platforms and services, and they are looking to check for additional citations in the corpus to complement those they already store or expose. Infrastructure providers also seek to leverage data citations to enhance their search and discovery features for digital objects. 
  • Institutional Administrators: This group’s main interest is to analyze data citations from their institutions to evaluate research impact, and in turn incentivize open science practices.
  • Publishers: Publisher representatives are looking to identify citations for data, compare those to their existing indices, and analyze bibliographic patterns, especially in the context of open access publications.
  • Data Repositories: Repositories seek to compare citations for their datasets with those found in other sources, to showcase the impact and relevance of their datasets.

Other community members have additional, more nuanced use cases, and we are keen to hear about the different ways in which the Data Citation Corpus can serve different data evaluation needs. While it is still early days, it is encouraging to see that the value of the Data Citation Corpus is understood across so many sectors of the scholarly community.

We are still at an early stage and much work lies ahead to enhance the Data Citation Corpus as a tool that addresses the needs of these diverse communities, but we are encouraged by the response we have received so far. We are confident that the corpus aligns with a recognized need to better understand the use and reach of datasets. As we work to improve the current data file and incorporate additional data sources, we will continue to actively engage with the community and report on the many ways the Data Citation Corpus is being used.

If you would like to request access to the current data file (and be on the mailing list for further iterations of the corpus as they become available), please complete this form.  

You can find the slides from the talks at Zenodo (here & here) and watch the recording of the webinar on YouTube.

Open Metrics Require Open Infrastructure

By: John Chodacki, Martin Fenner, Daniella Lowenberg

Today, Zenodo announced their intention to remove the altmetrics.com badges from their landing pages, and we couldn’t be more energized by their commitment to open infrastructure, supporting their mission to make scientific information open and free.

“We strongly believe that metadata about records including citation data & other data used for computing metrics should be freely available without barriers” – Zenodo Leadership

In the scholarly communications space, many organizations rally around the idea that we want the world’s knowledge to be discoverable, accessible, and auditable. However, we are not all playing by the same rules. While some groups work to build shared infrastructure, others work to build walls, erecting barriers to entry around information that is openly available, or that should be open and free but isn’t.

In light of emerging needs for metrics and our work at Make Data Count (MDC) to build open infrastructure for data metrics, we believe that it is necessary for corporations or entities that provide analytics and researcher tools to share the raw data sources behind their work. In short, if we trust these metrics enough to display on our websites or add to our CVs, then we should also demand that they be available for us to audit. 

This isn’t a new idea. The original movement to build Article-Level Metrics (ALMs) and alternative metrics was founded on this principle. The challenge is that while infrastructure groups have continued to work to capture these raw metrics, the lopsided ecosystem has allowed corporations to productize and sell them, regardless of whether they add true value on top of the open information.

We believe that the open metrics space should be supported, through contributions and usage, by everyone: non-profits, corporations, and community initiatives alike. In supporting open metrics, though, it is particularly important to acknowledge the projects and membership organizations that have moved the needle by networking research outputs through PIDs and rich metadata. We can acknowledge these organizations by advocating for open science graphs and bibliometrics research to be based on their data, so that others can reproduce and audit the assumptions made. Other ideals that we believe should guide the development of the open metrics space include:

  • Publishers and/or products that deal in building connections between research outputs should supply these assertions to community projects under a fully permissive CC0 license.
  • Companies, projects, and products that collect and clean metrics data are doing hard work. We should applaud them. But we should also recognize that when metrics are factual assertions (e.g., counts, citations), they should be openly accessible.
  • Innovation must continue, and productization can and should help drive innovation, but only as a value add. Aggregating, reporting, making data consumption easier, building analysis tools, and creating impact indicators from open data can all be valuable. But we should not reward any project that provides these services at the expense of the underlying data being closed to auditing and reuse.
  • Show our work. We ask researchers to explain their methods and protocols and to publish the data that underlies their research. We can and must do the same for the metrics we use to judge them, and we must hold all actors in this space accountable in this regard as we work toward full transparency.

These principles are core to our mission to build the infrastructure for open data metrics. As emphasis in scholarly communication shifts toward “other research outputs” beyond the journal article, we believe it is important to build intentionally open infrastructure and not to repeat the mistakes made in the metrics systems developed for articles. We know that the community can come together and develop the future of open metrics in a non-prescriptive manner, built, importantly, on completely open and reproducible infrastructure.

Publishers: Make Your Data Citations Count!

Many publishers have implemented open data policies and have publicly declared their support of data as a valuable component of the research process. But to give credit to researchers and incentivize data publishing, the community needs to promote proper citation of data. Many publishers have also endorsed the FORCE11 Data Citation Principles, Scholix, and other data citation initiatives, but we have still not seen implementation, or the benefits, of proper data citation indexing at the journal level. Make Data Count provides incentives and aims to show researchers the value of their research data by displaying data usage and citation metrics. However, to be able to expose citations, publishers need to promote and index data citations with Crossref so that repositories utilizing the Make Data Count infrastructure can pull citations, evaluate use patterns, and display them publicly.

So, how, as a publisher, can you support open research data and incentivize researchers to think about data like articles?

  1. Implement policies that advise researchers to deposit data in a stable repository that provides a persistent, citable identifier for the dataset
  2. Guide researchers to cite their own data, and other data related to their article, in their reference list
  3. Acknowledge data citations in the article, data availability statement, and/or reference list; tag them as data citations; and send this in XML to Crossref via the reference list or the relationships metadata (a sketch of the latter follows this list). Crossref has put together a simple guide here.
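As a rough illustration of step 3, the sketch below assembles a dataset citation as a Crossref relationship using Python’s standard library. The element names follow our reading of Crossref’s relations schema, and the DOI is hypothetical; Crossref’s own guide remains the authoritative reference for deposit formats.

```python
import xml.etree.ElementTree as ET

# Sketch of a data citation expressed as a Crossref relationship.
# Element names follow our reading of Crossref's relations schema;
# defer to Crossref's documentation for the authoritative format.
REL_NS = "http://www.crossref.org/relations.xsd"
ET.register_namespace("rel", REL_NS)

program = ET.Element(f"{{{REL_NS}}}program")
item = ET.SubElement(program, f"{{{REL_NS}}}related_item")
relation = ET.SubElement(
    item,
    f"{{{REL_NS}}}inter_work_relation",
    {"relationship-type": "references", "identifier-type": "doi"},
)
relation.text = "10.5061/dryad.example"  # hypothetical dataset DOI

print(ET.tostring(program, encoding="unicode"))
```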

How to engage with MDC?

Make Data Count (MDC) team members are at the center of many initiatives that focus on aspects of metrics, including data-level metrics (DLM). We leverage existing channels to build a new data usage standard and to promote integration and adoption among data centers and data consumers.

If you want to get in contact and start collaborating with us, please:

  • Join our mailing list
  • Follow us on Twitter
  • Contact us directly!

Understanding the problem

Journal articles are the currency of scholarly research. As a result, we as a community use sophisticated methods to gauge the impact of research and measure the attention it receives, analyzing article citations, article page views and downloads, and social media metrics. While imprecise, these metrics offer us a way to identify relationships and better understand relative impact. One of the many challenges with these efforts is that scholarly research comprises a much larger and richer set of outputs beyond traditional publications. Foremost among them is research data. In order to track and report the reach of research data, we must build and maintain new, unique methods for collecting metrics on complex research data. Our project will build the metrics infrastructure required to elevate data to a first-class research output.

In 2014, members of this proposal group were involved in an NSF EAGER research grant entitled Making Data Count: Developing a Data Metrics Pilot. That effort surveyed scholars and publishers to determine which metrics and approaches would offer the most value to the research community. We spent one year researching the community’s priorities, exploring how ideas common to article-level metrics (ALM) could be translated into conventions for data-level metrics (DLM), and building a prototype DLM service. We determined that the community values data citation, data usage, and data download statistics more than metrics focused on social media. Based on this research, the project partners went a step further and isolated the gaps in existing data metrics efforts:

  • there are no community-driven standards for data usage stats;
  • there are no open-source tools to collect usage stats according to such standards;
  • and there is no central place to store, index, and access data usage stats together with other DLM, in particular data citations.

This project proposes to fill these gaps by engaging in the following activities:

  1. We will work with COUNTER to develop and publish code-of-practice recommendations for how data usage should be measured and reported (a usage-counting sketch follows this list)
  2. We will deploy a central online DLM hub based on the Lagotto software for acquiring, managing, and presenting these metrics
  3. We will integrate new data sources and clients of aggregated metrics to serve as exemplars for the integration of data repositories and discovery platforms into a robust DLM ecosystem
  4. We will encourage the growth and uptake of DLMs through an engaged stakeholder community that will advocate, grow, and help sustain DLM services
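To make activity 1 concrete, the sketch below shows the flavor of rule a code of practice might standardize: filtering “double clicks” so that rapid repeat requests for the same dataset count once. The 30-second window is an assumption borrowed from COUNTER’s article-level rules, not a statement of the eventual recommendations for data.

```python
from datetime import datetime, timedelta

# Illustrative double-click filtering in the spirit of COUNTER-style
# usage counting: repeat requests for the same dataset by the same user
# within a short window count once. The 30-second window is an assumption,
# not the final code-of-practice rule for data.
WINDOW = timedelta(seconds=30)

def count_unique_requests(events):
    """events: list of (user_id, dataset_id, datetime), pre-sorted by time."""
    last_seen = {}
    total = 0
    for user, dataset, ts in events:
        key = (user, dataset)
        if key not in last_seen or ts - last_seen[key] > WINDOW:
            total += 1
        last_seen[key] = ts
    return total

events = [
    ("u1", "10.5061/dryad.example", datetime(2024, 1, 1, 12, 0, 0)),
    ("u1", "10.5061/dryad.example", datetime(2024, 1, 1, 12, 0, 10)),  # filtered
    ("u1", "10.5061/dryad.example", datetime(2024, 1, 1, 12, 5, 0)),
]
print(count_unique_requests(events))  # 2
```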

As a result of these activities, the community will finally have the infrastructure needed to build relationships and better understand the relative impact of research data.