Overview

The Data Citation Corpus is a project by DataCite and Make Data Count, funded by the Wellcome Trust, that focuses on developing a comprehensive, centralized and publicly available resource of data citations from a variety of sources.

The first release of the Data Citation Corpus was delivered on January 30, 2024. It consisted of a data file of 10 million data citations and a dashboard for visualizing the contents of the data file.

Each data citation record comprises:

  1. A pair of identifiers: an identifier for the dataset (a DOI or an accession number) and the DOI of the publication object (journal article or preprint) in which the dataset is cited.
  2. Various metadata for the dataset and for the citing object.

An example record:

{
    "id": "84edcfb0-60f2-4384-bab9-cfad6fd46f18",
    "created": "2023-06-07T11:40:02.55+00:00",
    "updated": "2023-06-07T11:40:02.55+00:00",
    "repository": {
        "title": "PANGAEA",
        "external_id": null
    },
    "publisher": {
        "title": "Elsevier BV",
        "external_id": null
    },
    "journal": {
        "title": "Science of The Total Environment",
        "external_id": null
    },
    "title": "Masses of individual polychlorinated biphenyl congeners in gas phases in air in Chicago, Il, USA in 2009",
    "objId": "https://doi.org/10.1016/j.scitotenv.2021.151505",
    "subjId": "https://doi.org/10.1594/pangaea.935233",
    "publishedDate": "2022-03-01T00:00:00+00:00",
    "accessionNumber": null,
    "doi": "10.1594/pangaea.935233",
    "relationTypeId": "is-supplement-to",
    "source": "datacite",
    "affiliations": [
        {
            "title": "University of Iowa",
            "external_id": "https://ror.org/036jqmy94"
        }
    ],
    "funders": [
        {
            "title": "National Institute of Environmental Health Sciences",
            "external_id": "https://doi.org/10.13039/100000066"
        }
    ],
    "subjects": [
        "Environmental Engineering"
    ]
}
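
As a minimal sketch of how the identifier pair can be read out of a record, the Python below parses a trimmed copy of the example record. The helper name `identifier_pair` and its fallback order (`doi`, then `accessionNumber`, then `subjId`) are assumptions made for illustration, not a documented part of the corpus.

```python
import json

# A trimmed copy of the example record above, written with the straight
# quotes that JSON requires.
RECORD_JSON = """
{
    "id": "84edcfb0-60f2-4384-bab9-cfad6fd46f18",
    "objId": "https://doi.org/10.1016/j.scitotenv.2021.151505",
    "subjId": "https://doi.org/10.1594/pangaea.935233",
    "accessionNumber": null,
    "doi": "10.1594/pangaea.935233",
    "relationTypeId": "is-supplement-to",
    "source": "datacite"
}
"""


def identifier_pair(record: dict) -> tuple[str, str]:
    """Return (dataset identifier, DOI of the citing publication).

    Assumption: prefer `doi`, fall back to `accessionNumber`, and use
    `subjId` only when both are empty.
    """
    dataset_id = record.get("doi") or record.get("accessionNumber") or record["subjId"]
    return dataset_id, record["objId"]


record = json.loads(RECORD_JSON)
pair = identifier_pair(record)
print(pair)
```

In this record the dataset is identified by a DOI, so `accessionNumber` is null and the pair is built from `doi` and `objId`.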

Data file

The data file is available in JSON and CSV formats. The JSON file is the version of record.

Version 1.0 of the corpus data file was released on January 30, 2024. Release v1.1 is an optimized version of v1.0 designed to make the original citation records more usable. No citations have been added to or removed from the dataset in v1.1. The data file is available on Zenodo: https://zenodo.org/records/11216814.

Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.

Data Sources

The file includes two sources for data citations:

| Source | Dataset-article relationship | Documentation |
| --- | --- | --- |
| DataCite Event Data | Citations are determined based on the resource type and the relation type designated in the metadata for the dataset or the article: ResourceType = Dataset with relationType = IsReferencedBy / IsCitedBy / IsSupplementTo; ResourceType = Text with relationType = References / Cites / IsSupplementedBy | DataCite documentation on contributing citations; DataCite event data model |
| Chan Zuckerberg Initiative (CZI) Science Knowledge Graph | Mention of a dataset identifier (accession number or DOI) identified in the text of an article by an NER model (SciBERT model) | More information about the open-sourced algorithm is forthcoming |
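
The DataCite Event Data rule in the table above can be sketched as a small predicate. This is an illustration only: the function name `is_data_citation` and the normalization of spelling variants (for example `is-supplement-to` versus `IsSupplementTo`) are assumptions, not DataCite's implementation.

```python
# Relation types that mark a citation when declared on the dataset side.
DATASET_SIDE = {"isreferencedby", "iscitedby", "issupplementto"}
# Relation types that mark a citation when declared on the article side.
TEXT_SIDE = {"references", "cites", "issupplementedby"}


def normalize(value: str) -> str:
    """Lower-case and drop separators so 'is-cited-by' matches 'IsCitedBy'."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def is_data_citation(resource_type: str, relation_type: str) -> bool:
    """Apply the resource type / relation type rule from the table above."""
    resource = normalize(resource_type)
    relation = normalize(relation_type)
    if resource == "dataset":
        return relation in DATASET_SIDE
    if resource == "text":
        return relation in TEXT_SIDE
    return False


print(is_data_citation("Dataset", "IsSupplementTo"))
print(is_data_citation("Text", "Cites"))
print(is_data_citation("Software", "IsCitedBy"))
```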

The scope of the data file covers dataset-article pairs, i.e. pairs where the citing object is a journal article or a preprint and the cited object is a dataset. DataCite Event Data includes records where the citing object spans a range of resource types (e.g. datasets, software); for the purposes of the Corpus data file, the only citations included from DataCite Event Data are those where the cited object is a dataset and the citing object is an article.

In addition to the identifiers for the dataset and the citing object, each record includes metadata fields for the journal, publisher and publication date of the citing object (from Crossref metadata) and for the repository where the dataset is hosted (via DataCite or EMBL-EBI). Where additional metadata fields are available (e.g. affiliation or subject), they are included in the data citation record. Coverage of these additional metadata fields varies across citations.

Data Structure

Each data citation record includes the following fields:

| Field | Description | Required? |
| --- | --- | --- |
| id | Internal identifier for the citation | Yes |
| created | Date of item’s incorporation into the corpus | Yes |
| updated | Date of item’s most recent update in corpus | Yes |
| repository | Repository where cited data is stored | No |
| publisher | Publisher for the article citing the data | No |
| journal | Journal for the article citing the data | No |
| title | Title of cited data | No |
| objId | DOI of article where data is cited | Yes |
| subjId | DOI or accession number of cited data | Yes |
| publishedDate | Date when citing article was published | No |
| accessionNumber | Accession number of cited data | No |
| DOI | DOI of cited data | No |
| relationTypeId | Relation type in metadata between citation object and subject | No |
| sourceId | Source where citation was harvested | Yes |
| subjects | Subject information for dataset | No |
| affiliations | Affiliation information for creator of cited data | No |
| funders | Funding information for cited data | No |
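
A consumer of the data file may want to check that the required fields are present before analysis. The sketch below is one hedged way to do that. `missing_required` is a hypothetical helper, and because the field list names the source field `sourceId` while the example record uses `source`, the code accepts either name as an assumption.

```python
# Required fields according to the field list above.
REQUIRED_FIELDS = ("id", "created", "updated", "objId", "subjId")
# Assumption: the source may appear under either of these names.
SOURCE_FIELD_NAMES = ("sourceId", "source")


def missing_required(record: dict) -> list[str]:
    """Return the names of required fields that are absent or empty."""
    missing = [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]
    if not any(record.get(name) for name in SOURCE_FIELD_NAMES):
        missing.append("sourceId")
    return missing


complete = {
    "id": "84edcfb0-60f2-4384-bab9-cfad6fd46f18",
    "created": "2023-06-07T11:40:02.55+00:00",
    "updated": "2023-06-07T11:40:02.55+00:00",
    "objId": "https://doi.org/10.1016/j.scitotenv.2021.151505",
    "subjId": "https://doi.org/10.1594/pangaea.935233",
    "source": "datacite",
}
incomplete = {k: v for k, v in complete.items() if k not in ("objId", "source")}

print(missing_required(complete))
print(missing_required(incomplete))
```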

Dashboard

The dashboard at http://corpus.datacite.org/dashboard provides an overview of the content of the data file. It includes the six visualizations below, with filtering options according to different facets (e.g. affiliation, repository, journal, etc.):

  • Citation counts over time: Count of data citations spanning the time frame of currently available corpus data, from 2013 to 2023.
  • Citation counts by publisher: Count of data citations by publisher.
  • Counts of unique repositories, journals, subjects, affiliations, funders: Breakdown of the current coverage in the corpus for journals, affiliations, repositories and subjects.
  • Citation counts by subject: Count of data citations per dataset subject. Note that this displays the distribution of records that contain this metadata field and is not representative of the full set of records in the data file.
  • Citation counts by source of citation: Counts for citations ingested from DataCite Event Data and CZI Science Knowledge Graph.
  • Data citations corpus growth: Citation counts and ingest date (into the corpus) by identifier type over time.
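
Several of the counts listed above can be approximated directly from the data file. The sketch below is illustrative only: the three sample records are invented, the source value `"czi"` is a hypothetical label, and the function names are not part of the dashboard.

```python
from collections import Counter

# Invented sample records using field names from the example record.
# The source value "czi" is a hypothetical label.
records = [
    {"publishedDate": "2022-03-01T00:00:00+00:00",
     "publisher": {"title": "Elsevier BV"}, "source": "datacite"},
    {"publishedDate": "2021-07-15T00:00:00+00:00",
     "publisher": {"title": "Elsevier BV"}, "source": "czi"},
    {"publishedDate": None, "publisher": None, "source": "czi"},
]


def count_by_year(items):
    """Count citations per publication year, skipping records without a date."""
    years = Counter()
    for item in items:
        date = item.get("publishedDate")
        if date:
            years[date[:4]] += 1
    return years


def count_by_publisher(items):
    """Count citations per publisher title; missing publishers become 'unknown'."""
    return Counter((item.get("publisher") or {}).get("title") or "unknown" for item in items)


def count_by_source(items):
    """Count citations per source of the citation."""
    return Counter(item.get("source") or "unknown" for item in items)


print(count_by_year(records))
print(count_by_publisher(records))
print(count_by_source(records))
```

Records without a publication date are left out of the yearly count instead of being assigned a placeholder year, which mirrors the caveat that some counts only reflect records carrying the relevant metadata field.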

Existing Limitations & Planned Enhancements

The initial release of the Data Citation Corpus brings together, for the first time, data citations for datasets identified by DOIs and by accession numbers. There are a few limitations and considerations that users should bear in mind when analyzing data in the first release. We plan to address these limitations in the course of future development work.

Relation types

The relation types that DataCite Event Data uses to designate citations are: IsReferencedBy, IsCitedBy, IsSupplementTo (see DataCite documentation on contributing citations). The data file includes a small number of records with relation types beyond these three; we will filter out those extraneous citations in the next version of the corpus.
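
Until such filtering is applied upstream, users can flag the affected records themselves. The following is a minimal sketch with a hypothetical helper, `find_extraneous`. It assumes relation types may be spelled in kebab-case (as in the example record, `is-supplement-to`) and that records with no relation type are left alone.

```python
# The three relation types named above, in normalized form.
EXPECTED = {"isreferencedby", "iscitedby", "issupplementto"}


def normalize_relation(value):
    """Lower-case and strip separators; None becomes an empty string."""
    return "".join(ch for ch in (value or "").lower() if ch.isalnum())


def find_extraneous(items):
    """Return records whose relation type is set but not one of the three."""
    flagged = []
    for item in items:
        relation = normalize_relation(item.get("relationTypeId"))
        if relation and relation not in EXPECTED:
            flagged.append(item)
    return flagged


# Invented sample records for illustration.
sample = [
    {"id": "a", "relationTypeId": "is-supplement-to"},
    {"id": "b", "relationTypeId": "IsCitedBy"},
    {"id": "c", "relationTypeId": "is-part-of"},
    {"id": "d", "relationTypeId": None},
]

print([item["id"] for item in find_extraneous(sample)])
```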

Event Data also includes some citations originating from article metadata registered at Crossref. Data citations originating from Crossref may carry relation types different from the three listed above for DataCite metadata.

Metadata coverage

As noted above, coverage of metadata fields varies across records. We will work to add additional information to the existing data citations, particularly regarding:

  • Affiliation details with ROR ID
  • Funder details with ROR ID or Crossref Funder ID
  • Subject information

Note that information for metadata fields has been included in the data file as originally available; no deduplication or disambiguation has been completed (e.g. for affiliation information).
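
Because coverage varies, it can be useful to measure it before relying on a field. The sketch below computes, for a list of records, the share that carry a non-empty value for each optional field. The function name `coverage` and the sample records are assumptions made for illustration.

```python
def coverage(items, fields=("affiliations", "funders", "subjects")):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(items)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for item in items if item.get(field)) / total
        for field in fields
    }


# Invented sample: the second record lacks affiliation and funder details.
sample_records = [
    {
        "affiliations": [{"title": "University of Iowa",
                          "external_id": "https://ror.org/036jqmy94"}],
        "funders": [{"title": "National Institute of Environmental Health Sciences",
                     "external_id": "https://doi.org/10.13039/100000066"}],
        "subjects": ["Environmental Engineering"],
    },
    {"affiliations": [], "funders": None, "subjects": ["Environmental Engineering"]},
]

print(coverage(sample_records))
```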

NER model output

The data citations identified by the NER model have not been curated. The methodology to identify these citations, as provided by CZI, is outlined in the Appendix below. CZI completed several steps to minimize false positives coming from the NER model:

  • Mentions were included in the output only where the model probability was >= 0.8
  • Links were validated by checking the URL responses; the final file only contains URLs that returned a status_code of 200
  • Reference lists were excluded from the text mining of articles, to avoid false positives from selecting DOIs for articles rather than datasets

In spite of these steps, we know that false positives arose. For example, the model identified some grant numbers similar to accession numbers, and attributed some accession numbers to the incorrect repository. We will take steps to identify and remove false positives from the data file as part of our further work.
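
The three filtering steps listed above can be sketched as a single predicate. This is not CZI's code: the `Mention` class, its attribute names other than `extracted_word` and `status_code`, and the `in_reference_list` flag are assumptions (CZI excluded reference lists before mining, whereas a flag is used here for simplicity). The accession-like strings are invented.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    """A hypothetical NER output row."""
    extracted_word: str      # the mined identifier text
    probability: float       # model probability for the mention
    status_code: int         # HTTP status returned by the resolved link
    in_reference_list: bool  # whether the mention came from a reference list


def keep_mention(mention: Mention, min_probability: float = 0.8) -> bool:
    """Keep a mention only if it passes all three checks described above."""
    return (
        mention.probability >= min_probability
        and mention.status_code == 200
        and not mention.in_reference_list
    )


# Invented mentions for illustration.
mentions = [
    Mention("GSE00001", 0.93, 200, False),
    Mention("GSE00002", 0.79, 200, False),   # below the probability threshold
    Mention("GSE00003", 0.95, 404, False),   # link did not resolve
    Mention("GSE00004", 0.99, 200, True),    # found in a reference list
]

kept = [m.extracted_word for m in mentions if keep_mention(m)]
print(kept)
```

None of these checks can catch a grant number that resolves successfully as if it were an accession number, which is consistent with the residual false positives described above.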

Disciplinary coverage

The data citations identified via CZI’s NER model came from mining a set of articles indexed in Europe PMC, which has a biomedical scope. The accession numbers were associated with repositories that also focus on life sciences disciplines. As a result, the data citations identified originate mostly from disciplines in the life sciences. The repositories included are well established in their fields and widely used in those disciplines, so they provided a good starting point for identifying citations to accession number IDs. We will seek to extend the disciplinary coverage of the Data Citation Corpus as we ingest data citations from additional sources.

Appendix: CZI Science methodology

Full-text articles included

The set of articles used for text mining comprised 5.3 million articles whose full text was available open access in Europe PMC.

Repositories mined

The repositories that the NER model mined for are listed below, along with a row entry for DOIs.

  • The list of terms mined for comes from https://europepmc.org/pub/databases/pmc/TextMinedTerms/
  • All but three repositories were linked through identifiers.org; those not linked are ebisc, gisaid and hipsci
  • For the purposes of the data file, mentions of eudract and nct identifiers were excluded, as those are clinical trial registries. Per trial best practices, clinical trials are generally registered prior to the recruitment of patients; as a result, there is no guarantee that the clinical trial record will include a dataset from the trial.

| Repository name (as it appears in Corpus file) | Identifier prefix | Linking methodology (where "dataset" is the extracted_word, or data mention) | Linked through identifiers.org? |
| --- | --- | --- | --- |
| ArrayExpress | arrayexpress | https://identifiers.org/arrayexpress:dataset | Y |
| BioModels | biomodels.db | https://identifiers.org/biomodels.db:dataset | Y |
| BioProject | bioproject | https://identifiers.org/bioproject:dataset | Y |
| biosample | biosample | https://identifiers.org/biosample:dataset | Y |
| BioStudies | biostudies | https://identifiers.org/biostudies:dataset | Y |
| CATH | cath | https://identifiers.org/cath:dataset | Y |
| chebi | chebi | https://identifiers.org/chebi:dataset | Y |
| ChEMBL | chembl | https://identifiers.org/chembl:dataset | Y |
| Complex Portal (CP) | complexportal | https://identifiers.org/complexportal:dataset | Y |
| dbgap | dbgap | https://identifiers.org/dbgap:dataset | Y |
| | doi | https://dx.doi.org/:dataset | sometimes |
| EBiSC Catalogue (European Bank for induced pluripotent Stem Cells catalogue) | ebisc | https://cells.ebisc.org/dataset | N |
| Experimental Factor Ontology | efo | https://identifiers.org/efo:dataset | Y |
| The European Genome-phenome Archive (EGA) | ega | https://identifiers.org/ega.dataset:dataset | Y |
| The Electron Microscopy Data Bank (EMDB) | emdb | https://identifiers.org/emdb:dataset | Y |
| Electron Microscopy Public Image Archive (EMPIAR) | empiar | https://identifiers.org/empiar:dataset | Y |
| Ensembl | ensembl | https://identifiers.org/ensembl:dataset | Y |
| EU Clinical Trial Register (EudraCT) | eudract | https://identifiers.org/euclinicaltrials:dataset | Y |
| Genome assembly database | gca | https://identifiers.org/insdc.gca:dataset | Y |
| European Nucleotide Archive | gen | https://identifiers.org/ena.embl:dataset | Y |
| Gene Expression Omnibus (GEO) | geo | https://identifiers.org/geo:dataset | Y |
| GISAID | gisaid | http://gisaid.org/EPI/dataset | N |
| Gene Ontology | go | https://identifiers.org/go:dataset | Y |
| HUGO Gene Nomenclature Committee | hgnc | https://identifiers.org/hgnc:dataset | Y |
| Human induced pluripotent stem cell initiative | hipsci | http://www.hipsci.org/lines/#/lines/dataset | N |
| The Human Protein Atlas | hpa | https://identifiers.org/hpa:dataset | Y |
| The International Genome Sample Resource | igsr | https://identifiers.org/coriell:dataset | Y |
| IntAct | intact | https://identifiers.org/intact:dataset | Y |
| InterPro | interpro | https://identifiers.org/interpro:dataset | Y |
| MetaboLights | metabolights | https://identifiers.org/metabolights:dataset | Y |
| MGnify | metagenomics | https://identifiers.org/mgnify.samp:dataset | Y |
| mint | mint | https://identifiers.org/mint:dataset | Y |
| ClinicalTrials.gov | nct | https://identifiers.org/clinicaltrials:dataset | Y |
| OMIM | omim | https://identifiers.org/mim:dataset | Y |
| Orphadata | orphadata | https://identifiers.org/orphanet:dataset | Y |
| The Protein Data Bank | pdb | https://identifiers.org/pdb:dataset | Y |
| Pfam Protein Families | pfam | https://identifiers.org/pfam:dataset | Y |
| PRIDE Proteomics Identification Database* | pxd | https://identifiers.org/pride:dataset | Y |
| Reactome | reactome | https://identifiers.org/reactome:dataset | Y |
| NCBI Reference Sequence Database | refseq | https://identifiers.org/refseq:dataset | Y |
| dbSNP Reference SNP | refsnp | https://identifiers.org/dbsnp:dataset | Y |
| Rfam | rfam | https://identifiers.org/rfam:dataset | Y |
| RNAcentral | rnacentral | https://identifiers.org/rnacentral:dataset | Y |
| Research Resource Identifiers | rrid | https://identifiers.org/rrid:dataset | Y |
| TreeFam | treefam | https://identifiers.org/treefam:dataset | Y |
| uniparc | uniparc | https://identifiers.org/uniparc:dataset | Y |
| UniProt | uniprot | https://identifiers.org/uniprot:dataset | Y |

*Listed as deactivated in identifiers.org as of 22 Mar 2024
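
To illustrate the linking methodology in the table above, the sketch below builds a link by substituting an extracted mention for the trailing "dataset" placeholder in a template. It covers only a handful of prefixes copied from the table. The function name `build_link` and the example identifiers are invented, and this is not CZI's implementation.

```python
# A subset of the link templates from the table above, keyed by identifier prefix.
LINK_TEMPLATES = {
    "arrayexpress": "https://identifiers.org/arrayexpress:dataset",
    "ega": "https://identifiers.org/ega.dataset:dataset",
    "gen": "https://identifiers.org/ena.embl:dataset",
    "pdb": "https://identifiers.org/pdb:dataset",
    "ebisc": "https://cells.ebisc.org/dataset",
}

PLACEHOLDER = "dataset"


def build_link(prefix: str, extracted_word: str) -> str:
    """Replace the trailing placeholder in the template with the mention.

    Only the final occurrence is replaced, because some templates (for
    example the EGA one) also contain "dataset" inside the namespace.
    """
    template = LINK_TEMPLATES[prefix]
    if not template.endswith(PLACEHOLDER):
        raise ValueError(f"template for {prefix!r} has no trailing placeholder")
    return template[: -len(PLACEHOLDER)] + extracted_word


# Invented identifiers, for illustration only.
print(build_link("pdb", "1ABC"))
print(build_link("ega", "EGAD00000000001"))
print(build_link("ebisc", "EXAMPLE-LINE-1"))
```

A plain `str.replace` on "dataset" would corrupt the EGA template, which is why the sketch trims the suffix instead.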

DOIs

DOIs are particularly noisy, as the model does not always distinguish between data and non-data DOIs. The primary reason is that reference lists contain many article-article citations, which the model could not easily tell apart from data citations. Two steps were taken to try to address this:

  • Excluded reference lists from the text mining of articles
  • Completed an additional content negotiation step to verify that the DOI corresponds to a dataset

We intend to work with the community to better address this in future iterations.
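
As a rough illustration of a content negotiation check, the sketch below requests CSL JSON for a DOI from doi.org and tests whether the returned `type` is `dataset`. The choice of the CSL JSON media type, the function names and the decision rule are all assumptions. CZI's actual procedure is not documented here and may differ.

```python
import json
import urllib.request

# Media type for CSL JSON, requested from the DOI resolver via the Accept header.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def fetch_csl_metadata(doi: str, timeout: float = 10.0) -> dict:
    """Fetch CSL JSON metadata for a DOI via content negotiation.

    Requires network access; it is defined here but not called.
    """
    request = urllib.request.Request(
        f"https://doi.org/{doi}", headers={"Accept": CSL_JSON}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def looks_like_dataset(metadata: dict) -> bool:
    """Assumed rule: treat the DOI as a dataset when the CSL type is 'dataset'."""
    return str(metadata.get("type", "")).lower() == "dataset"


# Offline examples using hand-written metadata dictionaries.
print(looks_like_dataset({"type": "dataset"}))
print(looks_like_dataset({"type": "article-journal"}))
```

With network access, `looks_like_dataset(fetch_csl_metadata("10.1594/pangaea.935233"))` would apply the same rule to the dataset DOI from the example record. The result depends on the metadata registered for that DOI.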