One of the objectives of the Generalist Repository Ecosystem Initiative (GREI) is the implementation of open metrics. A consistent approach to data citations is an important step toward meaningful metrics that provide visibility into data usage, signal the added value of data repositories and enable reporting on the reach of NIH-funded research data. Make Data Count has engaged with the GREI repositories to review their existing approaches to data citations and develop a common resource on best practices for handling data citations at repositories.

Why data citations? 

Data citations are a useful measure for understanding how research data are used. Data citations recognize the individual(s) or organization(s) that collected and shared the data, and researcher surveys regularly show that researchers value receiving citations to their datasets (see for example The State of Open Data report, or research by Kathleen Gregory and colleagues).

Kathleen Gregory et al. ‘A survey investigating disciplinary differences in data citation’. Figure 14 showing preferences for how respondents would like others to refer to their own data.

Many repositories have taken steps to implement workflows to collect and expose citations, and importantly, recent developments in machine learning have opened up new ways to identify citations to data and scale the data citations available to the community.

GREI repositories best practices

All of the GREI repositories (Dataverse, Dryad, Figshare, Open Science Framework, Mendeley Data, Vivli, and Zenodo) either already collect data citations or plan to add this feature as part of their roadmap. Building on their practices and experience in this area, this group of generalist repositories has developed a set of recommendations for handling data citations in repositories. The recommendations include guidance on how repositories can handle different aspects of data citations:

Workflows to collect, store and expose data citations

  • Collecting data citations: This can take place through self-reporting by authors as part of the data deposit process, or by harvesting data citations from external sources such as DataCite, Crossref, Dimensions, Europe PMC or NASA ADS.
  • Storing data citations: Repositories store data citations in the metadata for the datasets they host. The recommendations for handling citations are based on the metadata fields recommended by GREI to establish a relationship between the dataset and the citing object.
  • Exposing data citations: Repositories should expose the citations on the landing page for the dataset record, indicating the provenance (i.e. source) for the data citation. 
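
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GREI recommendation itself: the DataCitation class and its helper functions are hypothetical names, the DOIs are made up, and the stored form assumes a DataCite-style related identifier with the relation type IsCitedBy. Repositories should check the GREI metadata recommendations for the exact fields to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """One citation of a dataset by another scholarly object (hypothetical model)."""
    dataset_doi: str  # DOI of the dataset hosted by the repository
    citing_doi: str   # DOI of the article, preprint, etc. that cites the dataset
    source: str       # provenance, e.g. "author self-report", "Crossref", "Europe PMC"


def merge_citations(*batches):
    """Collect: combine batches from several sources, keeping the first report of each pair."""
    seen = {}
    for batch in batches:
        for citation in batch:
            key = (citation.dataset_doi.lower(), citation.citing_doi.lower())
            seen.setdefault(key, citation)
    return list(seen.values())


def to_related_identifier(citation):
    """Store: express the citation as a DataCite-style related identifier (assumed layout)."""
    return {
        "relatedIdentifier": citation.citing_doi,
        "relatedIdentifierType": "DOI",
        "relationType": "IsCitedBy",
    }


def landing_page_line(citation):
    """Expose: format the citation for the landing page, keeping its provenance visible."""
    return f"Cited by https://doi.org/{citation.citing_doi} (source: {citation.source})"


# Made-up example data: one self-reported citation, two harvested ones (one a duplicate).
self_reported = [DataCitation("10.1234/dataset.1", "10.5678/article.9", "author self-report")]
harvested = [
    DataCitation("10.1234/dataset.1", "10.5678/ARTICLE.9", "Crossref"),
    DataCitation("10.1234/dataset.1", "10.5678/article.10", "Europe PMC"),
]
citations = merge_citations(self_reported, harvested)
for citation in citations:
    print(landing_page_line(citation))
```

Keeping the source alongside each citation from the moment it is collected means the provenance is still available when the citation is displayed on the landing page.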

‘Cite As’ template

The recommendations advise data repositories to provide a citation template on the landing page of the dataset, in order to encourage researchers and other parties to cite datasets they use. 

‘Cite As’ template example from Dryad.
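
As a rough illustration of how such a template could be filled in from dataset metadata, here is a minimal Python sketch. The cite_as function, the layout of the citation string and the example values are all assumptions made for illustration; they do not reproduce the format used by Dryad or any other GREI repository.

```python
def cite_as(creators, year, title, publisher, doi):
    """Build a 'Cite As' string from basic dataset metadata (hypothetical layout)."""
    names = "; ".join(creators)
    return f"{names} ({year}). {title} [Dataset]. {publisher}. https://doi.org/{doi}"


# Made-up example record.
example_citation = cite_as(
    creators=["Doe, Jane", "Roe, Richard"],
    year=2023,
    title="Example survey responses",
    publisher="Example Repository",
    doi="10.1234/dataset.1",
)
print(example_citation)
```

Including the DOI as a full link in the template makes it more likely that the citation can later be matched back to the dataset.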

Aggregation & discoverability of data citations

To enable aggregation and discoverability of the connections between datasets and other scholarly objects, data citations should be submitted to DataCite. In addition to making data citations available to the community via its API services, DataCite also exposes citations via the DataCite Commons portal, which enables searches for resources with persistent identifiers and connections to metadata provided by DataCite, Crossref, ORCID, ROR and re3data.
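
For readers who want to look up citations programmatically, the sketch below shows one way a DOI record might be read from the DataCite REST API. It is written under assumptions that should be checked against the current DataCite documentation: that a record is available at https://api.datacite.org/dois/ followed by the DOI, and that citing objects appear in the record's relatedIdentifiers with the relation type IsCitedBy. The function names and the sample record are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the DataCite REST API; confirm against the current documentation.
API_ROOT = "https://api.datacite.org/dois/"


def doi_record_url(doi):
    """Build the URL for a single DOI record."""
    return API_ROOT + quote(doi, safe="/")


def fetch_record(doi):
    """Download the JSON record for a DOI (performs a network request)."""
    with urlopen(doi_record_url(doi), timeout=30) as response:
        return json.load(response)


def citing_identifiers(record):
    """List identifiers of objects that the record says cite the dataset."""
    attributes = record.get("data", {}).get("attributes", {})
    related = attributes.get("relatedIdentifiers") or []
    return [
        item["relatedIdentifier"]
        for item in related
        if item.get("relationType") == "IsCitedBy"
    ]


# Made-up record in the assumed response layout, so the parsing can be tried offline.
sample_record = {
    "data": {
        "id": "10.1234/dataset.1",
        "attributes": {
            "relatedIdentifiers": [
                {"relatedIdentifier": "10.5678/article.9",
                 "relatedIdentifierType": "DOI", "relationType": "IsCitedBy"},
                {"relatedIdentifier": "10.5678/article.2",
                 "relatedIdentifierType": "DOI", "relationType": "IsSupplementTo"},
            ]
        },
    }
}
print(citing_identifiers(sample_record))
```

To query a live record, pass the result of fetch_record for a real dataset DOI to citing_identifiers in place of the sample record.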

Data Citation Corpus

The community has so far lacked a straightforward way of obtaining information about data citations from different repositories and across the literature. To address this challenge, DataCite is working on the development of the Data Citation Corpus, which will provide a centralized resource that compiles data citations from a variety of sources, and make data citation information readily and openly available to the community.

The Data Citation Corpus will include data citations in DataCite; this incorporates citations deposited by DataCite-member repositories, including the GREI repositories. We invite all repositories to contribute their data citations to DataCite so that those citations can be integrated into the Data Citation Corpus.

A need for further community discussion 

The discussions within the GREI group leading to these recommendations have highlighted many areas of alignment across these generalist repositories for the handling of data citations. At the same time, our discussions also made it clear that there are areas where additional community discussion is needed in order to develop further consensus and guidance. A particular theme that sparked interest relates to the designation of provenance of data citations, and the level of detail that repositories and DataCite should provide on this. The group felt that signaling the level of validation for a data citation (i.e. whether the citation was verified by an independent curator, self-reported by the author, harvested from another source, etc.) could help increase trust in this information, but that guidance on this would require a broader community conversation. We welcome input from the community on this and other topics that may be useful to explore for future updates to the GREI recommendations.

We hope that this GREI resource for data citations encourages other repositories to adopt workflows to collect, store and expose citations to data in an open and consistent manner. We welcome feedback from the community on these recommendations. If you have feedback or suggestions, please contact GREI.

About GREI

The Generalist Repository Ecosystem Initiative (GREI) is a U.S. National Institutes of Health (NIH) program sponsored by the Office of Data Science Strategy that has brought together seven generalist repositories to collaborate on establishing “a common set of cohesive and consistent capabilities, services, metrics, and social infrastructure” and increasing awareness and adoption of the FAIR principles.