The Generalist Repository Ecosystem Initiative (GREI) has as one of its objectives the implementation of open metrics. A consistent approach to data citations is an important step to drive meaningful metrics that provide visibility on data usage, signal the added value of data repositories and enable reporting on the reach of NIH-funded research data. Make Data Count has engaged with the GREI repositories to review their existing approaches to data citations and develop a common resource on best practices for handling data citations at repositories.
Why data citations?
Data citations are a useful measure for understanding the use of research data. Data citations recognize the individual(s) or organization(s) that collected and shared the data, and researcher surveys regularly show that researchers value receiving citations to their datasets (see for example The State of Open Data report, or research by Kathleen Gregory and colleagues).
Many repositories have taken steps to implement workflows to collect and expose citations, and importantly, recent developments in machine learning have opened up new ways to identify citations to data and scale the data citations available to the community.
GREI repositories best practices
All of the GREI repositories (Dataverse, Dryad, Figshare, Open Science Framework, Mendeley Data, Vivli, and Zenodo) already collect data citations or have it on their roadmap to add this feature. Building on their practices and experience in this area, this group of generalist repositories has developed a set of recommendations for handling data citations in repositories. The recommendations include information for how repositories can handle different aspects of data citations:
Workflows to collect, store and expose data citations
Collecting data citations: This can take place through self report by authors as part of the data deposit process, or by harvesting data citations from external sources such as DataCite, Crossref, Dimensions, Europe PMC or NASA ADS.
Storing data citations: Repositories record data citations in the metadata for the datasets they host. The recommendations for handling citations are based on the metadata fields recommended by GREI to establish a relationship between the dataset and the citing object.
Exposing data citations: Repositories should expose the citations on the landing page for the dataset record, indicating the provenance (i.e. source) for the data citation.
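As a rough illustration of the three steps above, the following Python sketch records a citation as a related identifier on a dataset record and then lists the stored citations with their source for display on a landing page. The keys relatedIdentifier, relatedIdentifierType and relationType follow the DataCite Metadata Schema. The source key, the class and function names, and the DOIs are hypothetical placeholders and are not taken from the GREI recommendations.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Minimal stand-in for a repository's dataset metadata record."""
    doi: str
    related_identifiers: list = field(default_factory=list)


def add_citation(record, citing_doi, source):
    """Store a citation as a related identifier on the dataset record.

    `source` is a hypothetical provenance label (e.g. "author" or "Crossref").
    """
    entry = {
        "relatedIdentifier": citing_doi,
        "relatedIdentifierType": "DOI",
        "relationType": "IsCitedBy",
        "source": source,  # assumption: not a DataCite schema property
    }
    # Skip duplicates so that re-harvesting the same citation is harmless.
    already_stored = any(
        item["relatedIdentifier"].lower() == citing_doi.lower()
        for item in record.related_identifiers
    )
    if not already_stored:
        record.related_identifiers.append(entry)
    return record


def landing_page_citations(record):
    """Return display strings for the citations, each with its provenance."""
    return [
        f"Cited by https://doi.org/{item['relatedIdentifier']} (source: {item['source']})"
        for item in record.related_identifiers
        if item["relationType"] == "IsCitedBy"
    ]


record = DatasetRecord(doi="10.1234/example-dataset")  # placeholder DOI
add_citation(record, "10.5555/example-article", source="author")
add_citation(record, "10.5555/EXAMPLE-ARTICLE", source="Crossref")  # duplicate, ignored
print(landing_page_citations(record))
```

In this sketch the first reported source wins when the same citation arrives twice. A repository might instead prefer to keep every source it has seen for a citation.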
‘Cite As’ template
The recommendations advise data repositories to provide a citation template on the landing page of the dataset, in order to encourage researchers and other parties to cite datasets they use.
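A citation template of this kind can be generated from metadata the repository already holds. The Python sketch below assembles a 'Cite As' string from creators, year, title, publisher and DOI. The element order is an assumption loosely based on common dataset citation patterns, and the function name and example values are hypothetical. Repositories may well prefer a different style.

```python
def cite_as(creators, year, title, publisher, doi, version=None):
    """Build a human-readable 'Cite As' string for a dataset landing page."""
    authors = "; ".join(creators)
    parts = [f"{authors} ({year}).", f"{title}."]
    if version:
        # The version is optional and only shown when the record has one.
        parts.append(f"Version {version}.")
    parts.append(f"{publisher}.")
    # Present the DOI as a resolvable link.
    parts.append(f"https://doi.org/{doi}")
    return " ".join(parts)


example = cite_as(
    creators=["Doe, Jane", "Roe, Richard"],  # placeholder names
    year=2024,
    title="Example survey responses",
    publisher="Example Repository",
    doi="10.1234/example-dataset",  # placeholder DOI
    version="2",
)
print(example)
```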
Aggregation & discoverability of data citations
In order to enable aggregation and discoverability of the connections between datasets and other scholarly objects, data citations should be submitted to DataCite. In addition to making data citations available to the community via its API services, DataCite also exposes citations via the DataCite Commons portal, which enables searches for resources with persistent identifiers and connections to metadata provided by DataCite, Crossref, ORCID, ROR and re3data.
Data Citation Corpus
The community has so far lacked a straightforward way of obtaining information about data citations from different repositories and across the literature. To address this challenge, DataCite is working on the development of the Data Citation Corpus, which will provide a centralized resource that compiles data citations from a variety of sources, and make data citation information readily and openly available to the community.
The Data Citation Corpus will include data citations in DataCite; this incorporates citations deposited by DataCite-member repositories, including the GREI repositories. We invite all repositories to contribute their data citations to DataCite so that those citations can be integrated into the Data Citation Corpus.
A need for further community discussion
The discussions within the GREI group leading to these recommendations have highlighted many areas of alignment across these generalist repositories for the handling of data citations. At the same time, our discussions also made it clear that there are areas where additional community discussion is needed in order to develop further consensus and guidance. A particular theme that sparked interest relates to the designation of provenance of data citations, and the level of detail that repositories and DataCite should provide on this. The group felt that signaling the level of validation for the data citation (i.e. whether the citation is verified by an independent curator, it is self reported by the author, harvested from another source etc.) could help increase trust in this information, but we feel that guidance on this would require a broader community conversation. We welcome input from the community on this and other topics that may be useful to explore for future updates to the GREI recommendations.
We hope that this GREI resource for data citations encourages other repositories to adopt workflows to collect, store and expose citations to data in an open and consistent manner. We welcome feedback from the community on these recommendations. If you have feedback or suggestions, please contact GREI.
The Generalist Repository Ecosystem Initiative (GREI) is a U.S. National Institutes of Health (NIH) program sponsored by the Office of Data Science Strategy that has brought together seven generalist repositories to collaborate on establishing “a common set of cohesive and consistent capabilities, services, metrics, and social infrastructure” and increasing awareness and adoption of the FAIR principles.
DataCite, in partnership with the Chan Zuckerberg Initiative (CZI), is delighted to announce the first release of the Data Citation Corpus. A major milestone in the Make Data Count initiative, the release makes eight million data citations openly available and usable for the first time via an interactive dashboard and public data file. We invite the community to engage with the data and provide feedback on this collaborative effort.
As highlighted by Make Data Count, the lack of a centralized resource for citations to datasets has hindered the evaluation of how open data is being used. To address this gap, DataCite, with funding from the Wellcome Trust, has developed an innovative aggregation that brings together for the first time data citations from diverse sources into a comprehensive and publicly accessible resource for the global community.
“There is a pressing need to understand how open data is used, but we have lacked a resource to access this information in a centralized and open manner. The Data Citation Corpus will allow the community to gain access to critical insights on data usage.” said Iratxe Puebla, Director of Make Data Count. “We are thrilled to share the progress from our collaboration with CZI to bring together citations from different sources, and look forward to working with others in the community to expand the breadth and coverage of the corpus.”
The first release of the corpus includes data citations in DataCite and Crossref metadata as well as asserted data citations contributed by CZI, available to the community via a data citation store and dashboard developed by Coko. Leveraging accession numbers from Europe PMC, CZI applied a machine-learning model to a large set of full-text articles and preprints to extract mentions to datasets. This has enabled the first-ever aggregation of citations for datasets with DOIs and accession numbers into a single corpus, enabling a more complete picture of data usage.
“As an organization that invests in research data and reference datasets, we believe it is critical to understand how data is shared and reused to enable new scientific discoveries,” said Patricia Brennan, Vice President of Science Technology at the Chan Zuckerberg Initiative. “DataCite has been a leader in this space, providing critical infrastructure for data citation and for tracking its reuse. We’re proud to support them in their vision to build a comprehensive global corpus of actionable data citations.”
The interactive dashboard of the corpus allows users to visualize and report on citations by a variety of facets, such as funder, data repository, or the journal where the article citing the data is published.
A complete data file of all of the citations is also available for additional analysis and evaluation. Request the data file via this form.
Forthcoming releases will focus on addressing existing metadata gaps, for example, related to the disciplinary information for the datasets, and on incorporating feedback from early adopters. DataCite will also pursue new collaborations with additional citation aggregators to expand the breadth and scale of data citations in the corpus.
Community input is an integral part of this project and DataCite invites researchers, institutions, funders and infrastructure providers to provide feedback on the first release of the corpus and future development work. Please join us for an online webinar on February 22 to learn more about the first release of the corpus and how to use it. Register now to participate in this interactive session.
DataCite is a global community that shares a common interest: to ensure that research outputs and resources are openly available and connected so that their reuse can advance knowledge across and between disciplines, now and in the future.
About Chan Zuckerberg Initiative
The Chan Zuckerberg Initiative was founded in 2015 to help solve some of society’s toughest challenges — from eradicating disease and improving education, to addressing the needs of our communities. Through collaboration, providing resources and building technology, our mission is to help build a more inclusive, just and healthy future for everyone.
A critical piece of open data infrastructure that has received insufficient attention is the evaluation of data usage. We still lack a clear understanding and a body of evidence on how data are being accessed, utilized, and incorporated into research activities.
While interest in this topic is increasing, there has so far not been a dedicated event for discussions on the evaluation of data usage, and on the development of the data metrics required to support such evaluation across both research and government. To address this need, we hosted the Make Data Count Summit on 12-13 September 2023, as a forum to bring diverse stakeholders together to tackle nuanced issues about the importance of open data metrics. The event brought together over 120 attendees in Washington DC, including representatives from research institutions, funders and government, researchers, publishers, and infrastructure providers, with the goal of homing in on actionable items for agencies and institutions to advance data metrics and the evaluation of data usage.
Strong Foundations for Data Metrics
The first day of the event included presentations from ongoing efforts toward data metrics. Daniella Lowenberg (University of California, Office of the President) provided an overview of the work of Make Data Count since 2014. The lessons learnt from the initiative’s work on developing standards and engaging the community have paved the way for a renewed focus on supporting open infrastructure and on building evidence on data usage practices to continue to refine data metrics for diverse uses. Her takeaway: it’s time for a new focus in the open data world, and that focus is undivided attention to the development of open data metrics.
Matt Buys from DataCite and Carly Strasser from the Chan Zuckerberg Initiative provided an update on the ongoing collaborative project to build an Open Global Data Citation Corpus. This momentous project seeks to aggregate citations to data from a variety of sources, including citations from DataCite metadata as well as those from other sources, such as data mentions extracted from full-text articles through machine learning and identifiers from EMBL-EBI. The goal is that once completed, the corpus will provide data usage information to the community, openly, and at a scale not possible before.
Julia Lane from New York University presented her vision for data as a public asset and her work on the ‘Democratizing Data’ project, which has developed algorithms to identify mentions to data as part of full-text articles. This project is mining the content of articles in Scopus to surface mentions to data and visualize those through dashboards for stakeholders to explore.
In the panel discussion ‘Policy and Administrative Priorities for Data Metrics in the US’, panelists from different agencies discussed recent developments in the United States seeking to open data, such as the Evidence Act and last year’s OSTP memo. These policies have provided important impetus not only for opening up administrative and research data but also for agencies to consider what their priorities should be for understanding and evaluating the use of data that has been opened up.
Data Metrics Must Be Embedded Across the Ecosystem and Supported by Evidence
On the second day, Nancy Potok, CEO at NAPx Consulting and former Chief Statistician of the United States, provided an overview of the foundations and lessons learnt from the five years since the US Evidence Act, which promoted the release of data to make it more accessible to the public, as well as the use of data to inform policy development.
The subsequent sessions explored the needs for data metrics across different areas of the research process, including funding agencies, institutions, and scholarly communications. Institutional processes for tenure and promotion were highlighted as an area of particular importance in order to drive awareness among researchers and adoption of data evaluation and data metrics.
We also heard the latest evidence on data usage practices and data citation from a panel of bibliometricians, who highlighted the discoverability of datasets and metadata completeness as areas of improvement. The panelists called for further research to inform meaningful data metrics so that we avoid the pitfalls of defaulting to oversimplified and opaque metrics.
Prioritizing Data Metrics Now
During the Summit, we invited attendees to share their experiences and suggestions in two breakout group discussions. While these discussions highlighted that data usage evaluation is a complex subject that will require many nuanced conversations across sectors, the message was also clear that we must iterate in incremental steps, and not let perfect be the enemy of good as we drive the conversations forward.
A few of the topics highlighted during the breakout conversations include:
Data metrics are nuanced. We will need to provide clear information and resources for a wide range of stakeholders so that they have the information they need to join the conversation and also to lead it within their communities.
There is a need to raise awareness and engagement on data metrics across all institutional levels – from individual researchers and administrators to institutional leaders and program managers.
Data metrics must be anchored on transparent information and built upon consistent practices, while also being mindful of domain-specific needs.
We thank all the attendees for their engagement during the different sessions. It is clear that there is a shared interest in driving the evaluation of data usage and an understanding that we should collectively work towards meaningful evidence-based metrics. The Make Data Count initiative will be taking forward these conversations and we invite everyone interested to collaborate with us, as we all move forward to advance meaningful data metrics.
You can access the slides for the talks presented at the Summit on Zenodo.
Make Data Count (MDC) is convening a two-day summit dedicated to the evaluation of open data usage, reach, and impact.
Our inaugural Make Data Count Summit, taking place in Washington, DC, on September 12 and 13, will bring together representatives of the research community, government data administrators, funders, policymakers, publishers, and infrastructure providers to discuss and solve the diverse and complex challenges of implementing open data assessment metrics and the infrastructures that support them.
For years the MDC initiative has been focused on bringing together the research ecosystem to prioritize open data metrics through the development of social and technical infrastructure for data citation and data usage. By bringing together groups from across the research and policy landscape, the Summit’s goal is to evaluate and highlight key success stories and use cases from the last decades of investments made into open data, and establish a collective vision for evidence-based data metrics.
Incorporating facilitated discussions, panels, and brainstorming sessions, the event will be focused on identifying concrete next steps to drive adoption and recognition of data metrics. The meeting will be focused especially on how to draw on existing data metrics initiatives to develop solutions for improving academic and governmental infrastructures that support data impact evaluation globally.
For more details and to register for the Summit: summit.makedatacount.org
Data citations hold great promise for a variety of stakeholders. Unfortunately, due in part to a lack of metadata, e.g. about disciplinary domains, many of those promises remain out of reach. Metadata providers – repositories, publishers and researchers – play a key role in improving the current situation.
The potentials of data citations are many. From the research perspective, citations to data can help researchers discover existing datasets and understand or verify claims made in the academic literature. Citations are also seen as a way to give credit for producing, managing and sharing data, as well as to provide legal attribution. Researchers, funders and repository managers also hope that data citations can provide a mechanism for tracking and understanding the use and ‘impact’ of research data. Bibliometricians, who study patterns in scholarly communication by tracing publications, citations and related metadata, are also interested in using data citations to understand engagements and relationships between data and other forms of research output.
Figure 1. Perspectives about the potentials of data citation 
Realizing the potential of data citations relies on having complete, detailed and standardized metadata describing the who, what, when, where and how of data and their associated work. As we are discovering in the Meaningful Data Counts project, which brings together bibliometricians and members of the research data community as part of the broader Make Data Count initiative, the metadata needed to provide context for both data and data citations are often not provided in standardized ways…if they are provided at all.
As a first step in this project, we have been mapping the current state of metadata, shared data, and data citations available in the DataCite corpus. Our openly available Jupyter notebook pulls real-time metadata about data in DataCite and demonstrates both the evolving nature of the corpus and the lack of available metadata. In particular, our work highlights the current lack of information about a critical metadata element for providing context about data citations – the disciplinary domain where data were created.
For example, we find that the amount of data available in DataCite increased by more than 1.5 million individual datasets over the seven-month period from January to July 2021, when the corpus grew from 8,243,204 to 9,930,000 datasets. In January, as few as 5.7% of the available datasets had metadata describing their disciplinary domain according to the most commonly used subject classification system (see the treemap in Figure 2). In July, despite the increased number of datasets overall, the percentage with a disciplinary domain dropped slightly to 5.63%.
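The arithmetic behind these figures can be checked with a few lines of Python. The dataset totals and percentages are the ones quoted above. The counts of datasets with a disciplinary domain are estimates derived here from the rounded percentages and are not figures reported by the project.

```python
# Figures quoted in the text above.
datasets_january = 8_243_204
datasets_july = 9_930_000
share_january = 0.057   # 5.7% with a disciplinary domain
share_july = 0.0563     # 5.63% with a disciplinary domain

# Growth of the corpus over the period.
growth = datasets_july - datasets_january

# Approximate counts: the shares are rounded, so these are estimates only.
with_domain_january = round(datasets_january * share_january)
with_domain_july = round(datasets_july * share_july)

print(f"Growth in datasets: {growth:,}")
print(f"With a disciplinary domain, January (approx.): {with_domain_january:,}")
print(f"With a disciplinary domain, July (approx.): {with_domain_july:,}")
```

On these estimates, the number of datasets with a disciplinary domain still rose in absolute terms. The share fell slightly because the corpus as a whole grew faster.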
Figure 2. Data with metadata describing disciplinary domain, according to the OECD Fields of Science classification, retrieved on July 9th, 2021. For an interactive version of this tree map, with the most current data, please see our Jupyter Notebook 
These low percentages reflect the fact that providing information about the subject or disciplinary domain of data is not a required field in the DataCite metadata schema. For the nearly 6% of data that do have subject information, the corpus contains multiple classification schemes of differing granularity levels, ranging from the more general to the more specific. DataCite currently works to automatically map these classifications to each other in order to improve disciplinary metadata. Organizations which submit their data to DataCite also have a role to play in improving these disciplinary descriptions, as this information underlies many of these mapping efforts.
Subject or disciplinary classifications for data are typically created using three methods:
Intellectually, where researchers, data creators or data curators use their expertise to assign a relevant subject classification.
Automatically, where automated techniques are used to extract subject information from other data descriptions, e.g. the title or abstract (if available)
By proxy, where data are assigned the same subject classification as a related entity, e.g. when data are given the same subject classification as the repository where they are stored. This can be done either automatically or manually.
Of these three methods, the intellectual method tends to be the most common, and also the most accurate and time-consuming approach. This method is often carried out by those closest to the data, i.e. researchers/data creators or data curators, who have expert knowledge about the data’s subject or disciplinary context which may be difficult to determine either automatically or by proxy.
While our work also exposes other examples of missing or incomplete metadata, we highlight here the current lack of information about disciplinary domains, as disciplinary information is important across all the perspectives shown in Figure 1. For example, disciplinary norms influence how data are shared, how they are made available, how they are understood and how they are reused. Information about disciplines is important for discovering data and is typically used by funders and research evaluators to place academic work in context. Disciplinary analyses are also a critical step in contextualizing citation practices in bibliometric studies, as citation behaviours have repeatedly been shown to follow discipline-specific patterns. Without disciplinary metadata, placing data citations into context will remain elusive and meaningful data metrics cannot be developed.
In order to move forward with understanding data citations in context, we need better metadata – metadata about disciplinary domains, but also metadata describing other aspects of data creation and use. Metadata providers, from publishers to researchers to data repositories, can help to improve the current situation by working to create complete metadata records describing their data. Only with such metadata can the potentials of data citation be achieved.
 These perspectives are visible, e.g. in the Joint Declaration of Data Citation Principles:
Ninkov, A., Gregory, K., Peters, I., & Haustein, S. (2021). Datasets on DataCite – An initial bibliometric investigation. Proceedings of the 18th International Conference of the International Society for Scientometrics and Informetrics, Leuven, Belgium (virtual). Preprint: https://doi.org/10.5281/zenodo.4730857
Daniella Lowenberg, Rachael Lammey, Matthew B. Jones, John Chodacki, Martin Fenner
In the last decade, attitudes towards open data publishing have continued to shift, including a rising interest in data citation as well as incorporating open data in research assessment (see Parsons et al. for an overview). This growing emphasis on data citation is driving incentives and evaluation systems for researchers publishing their data. While increased efforts and interest in data citation are a move in the right direction for understanding research data impact and assessment, there are clear difficulties and roadblocks in having universal and accessible data citation across all research disciplines. But these roadblocks can be mitigated and do not need to keep us in a consistent limbo.
The unique properties of data as a citable object have attracted much-needed attention, although this has also created an unhelpful perception that data citation is a challenge that requires uniquely burdensome processes to implement. This perception of difficulty begins with defining a ‘citation’ for data. The reality is that all citations are relationships between scholarly objects. A ‘data citation’ can be as simple as a journal article or other dataset declaring that a dataset was important to the creation of that work. This is not a unique challenge. However, many publishers and funders have elevated the relationship of data that “underlies the research” into a Data Availability Statement (DAS). This has helped address some issues publishers have found with typesetting or production techniques that stripped non-articles from citations. However, because of this segmentation of data from typical citation lists, and the exclusion of data citations in article metadata, many communities have felt they are in a stalemate about how to move forward.
Data citations are targeted as an area to explore in terms of research assessment. However, we do not have a clear understanding of how many data citations exist or how often data are reused. In the last few years, the majority of data citation conversations have been facilitated through groups at Research Data Alliance (via Scholix), Earth Science Information Partners (ESIP), EMBL’s European Bioinformatics Institute (EMBL-EBI), American Geophysical Union (AGU), and FORCE11. These conversations have focused primarily on datasets and articles that have DOIs from DataCite and Crossref, respectively, emphasizing the relationship between datasets and published articles. While those relationships are areas that need broad uptake from repositories and publishers alike, they do not illustrate the full picture. Many citations are not being accounted for, namely those to biomedical datasets with accession numbers and compact identifiers that are not registered through DataCite but are readily accessible through resolvers like identifiers.org. There is also a lack of understanding around the citations of datasets in other scholarly and non-scholarly (e.g., government documents, policy papers) outputs.
For these reasons, we have tried to ensure that conversations about data citation are not framed solely around the notion of assigning “credit” or around assigning any specific meaning to citations, for that matter. Without a full picture of how many citations exist, how datasets are composed across disciplines, how citation behavior varies across disciplines, and what context the citations are used in, it is impossible and inappropriate to use citations as a shorthand for credit. The community is working towards a better understanding of citation behavior—and we believe we will get there—but we need to be careful and considered in doing so to avoid repeating previous mistakes (e.g., creating another impact factor).
Why data citation is perceived as difficult
Data are complex. As mentioned in our 2019 book, data are nuanced. This means data citations are complex, and understanding these nuances is essential for understanding true measures of data reuse. For instance, there is work to be done to understand the role of provenance, dataset to dataset re-usage, data aggregation and derivation, and other ways for measuring usage of datasets without a formal “citation”.
Data citations are complex. There is a well-established concept of scholarly citations, using reference lists of citations formatted in a defined citation style. The main challenges with the current approach center on making citations machine-readable using standardized metadata instead of citation styles meant for human readers, as well as on making citations machine-accessible using open APIs and aggregators. These are general challenges to be addressed with citation, but there are additional questions specific to handling data citations: is there a DOI or other persistent identifier and basic citation metadata for the dataset, is there tooling to bring this information into the citation style used by the publisher, should data citations go into the article reference list, what to do when the number of datasets cited in a publication goes into the 1000s or more, should datasets in reference lists be called out as data, etc.
There’s a lack of consistency in guidance. Despite the growing interest among various stakeholders (researchers, journals, repositories, preprint servers, and others) in supporting data citation, there is no consistency in guidance for and across these groups. These inconsistencies concern how citations should be formatted, how citations should be marked up and indexed, and what the role for each stakeholder should be (especially repositories). Some of this can be attributed to a constant reinvent-the-wheel approach as well as to the wide variety of stakeholder groups and hubs for this information. Understandably, people are confused about how Scholix fits with OpenAIRE, Crossref, and DataCite, never mind the profusion of other overlapping initiatives and projects in this space that can make it even more difficult to navigate. It is clear that our best way forward is not to keep reinventing the wheel, spawning new groups and initiatives, but rather to build on the existing work, leveraging the successes of the last decade of investment in data citations, and finding solutions for the more advanced issues at hand. In short: let’s focus on developing the most basic, clear guidance, and work upwards from there.
There’s a tension between data availability statements and data citations. In the last decade, publishers and funders have heavily focused on requiring data availability statements and ensuring that is the way to designate when articles have an associated dataset published. These data availability statements are rarely marked up as a relation in the article metadata or a note of re-use (outside of self citation). If we continue to focus solely on data availability statements as a required first step, which have yet to solve the “machine readability problem”, we will lose slim resources that would be better used to think about how each journal publisher can designate data reuse and citations.
Guidance and decision points
Understanding the many intricacies of data, citations, and data citation, we propose the following path forward for our communities to work effectively towards achieving widespread implementation of data citation and data reuse. This path forward begins with making decisions around clear guidance that needs to be provided, shifting focus away from a “decision-pending” attitude and moving forward with clear recommendations on the following:
Best practices for citing datasets in articles, preprints, and books. We have multiple sets of best practices. We don’t need more guidance documents; we need consolidation and rationalization of the guidance that already exists.
Simplifying relationship type complexity. The complexity of ontologies for relationships is causing unnecessary churn and delays in implementation. Providers should simplify this; however, the community shouldn’t wait. We can and should implement viable solutions now. We should be promoting datasets in reference lists as a first viable solution.
How non-DOIs are cited. We have too many conversations happening about DOIs and not enough happening about citation in other identifier communities. These communities need to reach some simple conventions around putting data citations into reference lists with globally unique PIDs and citation metadata, in order to avoid requiring massive text mining efforts looking for string matches to, for example, “PDB:6VXX”, the identifier for the spike protein for COVID-19.
Publisher support for those who are not working with Crossref. Not all publishers use Crossref services or have the ability to implement Crossref’s approaches to data citations. We need to focus attention on accessible methods for reference extraction (e.g., from PDFs) and larger support for smaller publishers that do not have the resources to retool to fit current guidance.
The role for data repositories. Publishers are key to implementing data citation but data repositories must also focus on declaring relationships to articles and other outputs in their metadata. Data repositories should focus on making their datasets citable through PIDs and declaring robust metadata as well as reporting all known citations and linkages publicly so they can be used for aggregation.
Researchers should cite data despite these infrastructure hold-ups. Regardless of the hurdles to implementing all of the established best practices, the basic fact remains that researchers can currently cite data and they should, using approaches available today.
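To illustrate the text mining burden mentioned under “How non-DOIs are cited”, here is a minimal Python sketch of the kind of string matching that becomes necessary when accession numbers appear only in running text. The pattern, the function name find_pdb_mentions, and the sample sentence are all assumptions made for illustration; real-world mentions vary far more than this pattern allows, which is the argument for putting such identifiers in reference lists instead.

```python
import re
from typing import List

# Assumption: a classic PDB accession is one digit (1-9) followed by three
# alphanumeric characters, written with a "PDB:" prefix as in the example above.
PDB_MENTION = re.compile(r"\bPDB:\s?([1-9][A-Za-z0-9]{3})\b", re.IGNORECASE)


def find_pdb_mentions(text: str) -> List[str]:
    """Return unique PDB accessions found in free text, upper-cased,
    in order of first appearance. Hypothetical helper for illustration."""
    seen: List[str] = []
    for match in PDB_MENTION.finditer(text):
        accession = match.group(1).upper()
        if accession not in seen:
            seen.append(accession)
    return seen


# Made-up sentence: two spellings of the same accession and one string that
# looks similar but does not fit the assumed accession pattern.
sample = (
    "We reused the spike structure (PDB:6VXX); see also pdb: 6vxx "
    "and the unrelated string PDB:ABCD."
)
print(find_pdb_mentions(sample))
```

Even this small example has to guess at prefixes, spacing, and capitalization. A reference-list entry with a globally unique PID and citation metadata removes that guesswork.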
Choosing adoption over perfection
Perfection is the enemy of good, and finding solutions for every complexity of data citation does not need to be a roadblock to getting started. We can use a phased approach to begin implementing best practices for data citations right now:
Phase I: basic implementation
Align as much as possible with existing community practices and workflows (e.g., using reference lists)
Phase II: advanced implementation
Address special use cases (e.g., relation types, machine-readable data availability statements, dynamic data, dataset-dataset provenance)
Phase III: beyond data citation
Build infrastructure for other indicators assessing data reuse
While we have dabbled in all three of these phases already, we are still largely stuck in Phase I, constantly reinventing the same basic wheel that keeps spinning around the same place.
Our focus should be on how to scale these best practices across all publishers and repositories, supporting the diverse research landscape. This includes advancing the conversation beyond the DOI-based focus. Once that happens we can really move forward with building mechanisms for credit and understanding data re-use for research assessment.
Despite the agenda ahead, there are many steps that can be taken right now to continue towards the dream state. The community should not wait for infrastructure to be perfect before engaging in data citation support.
This is important, so let’s say it again! The community should not wait for infrastructure to be perfect before engaging in data citation support.
Data citations are harder when we act like the adoption hurdles are insurmountable, so let’s simplify. Our infrastructure for data citations will continue to improve, use cases will continue to be defined and evolve, and we need as many broad stakeholders as possible to hop on board now and work with us towards comprehensive support for data citation.
By: John Chodacki, Martin Fenner, Daniella Lowenberg
Today, Zenodo announced their intention to remove the Altmetric.com badges from their landing pages, and we couldn’t be more energized by their commitment to open infrastructure, supporting their mission to make scientific information open and free.
“We strongly believe that metadata about records including citation data & other data used for computing metrics should be freely available without barriers” – Zenodo Leadership
In the scholarly communications space, many organizations rally around the idea that we want the world’s knowledge to be discoverable, accessible, and auditable. However, we are not all playing by the same rules. While some groups work to build shared infrastructure, others work to build walls. This can be seen in the barriers to entry built around freely open information, or around information that should be open and free but isn’t.
In light of emerging needs for metrics and our work at Make Data Count (MDC) to build open infrastructure for data metrics, we believe that it is necessary for corporations or entities that provide analytics and researcher tools to share the raw data sources behind their work. In short, if we trust these metrics enough to display on our websites or add to our CVs, then we should also demand that they be available for us to audit.
This isn’t a new idea. The original movement to build Article Level Metrics (ALMs) and alternative metrics was founded on this principle. The challenge is that while infrastructure groups have continued to work to capture these raw metrics, the lopsided ecosystem has allowed corporations to productize and sell them, regardless of whether there is a true value-add on top of the open information.
We believe that the open metrics space should be supported, through contributions and usage, by everyone: non-profits, corporations, and community initiatives alike. In supporting open metrics, though, it is particularly important to acknowledge the projects and membership organizations that have moved the needle by networking research outputs through PIDs and rich metadata. We can acknowledge these organizations by advocating for open science graphs and bibliometrics research to be based on their data, so that others can reproduce and audit the assumptions made. Other ideals that we believe should guide the development of the open metrics space include:
Publishers and/or products that deal in building connections between research outputs should supply these assertions to community projects under a fully permissive CC0 license.
Companies, projects, and products that collect and clean metrics data are doing hard work. We should applaud them. But we should also recognize that when metrics are factual assertions (e.g., counts, citations), they should be openly accessible.
Innovation must continue and, similarly, productization can and should help drive innovation. However, only as a value add. Aggregating, reporting, making data consumption easier, building analysis tools and creating impact indicators from open data can all be valuable. But, we should not reward any project that provides these services at the expense of the underlying data being closed to auditing and reuse.
Show our work. We ask researchers to explain their methods and protocols and publish the data that underlies their research. We can and must do the same for the metrics we use to judge them, and we must hold all actors in this space accountable in this regard as we work toward full transparency.
These principles are core to our mission to build the infrastructure for open data metrics. As emphasis shifts in scholarly communication toward “other research outputs” beyond the journal article, we believe it is important to build intentionally open infrastructure, not repeating mistakes made in the metrics systems developed for articles. We know that it is possible for the community to come together and develop the future of open metrics, in a non-prescriptive manner, and importantly built on completely open and reproducible infrastructure.
Since 2014, the Make Data Count (MDC) initiative has focused on building the social and technical infrastructure for the development of research data metrics. With funding from the National Science Foundation, Gordon and Betty Moore Foundation, and Alfred P. Sloan Foundation, the initiative has transformed from a research project with an aim to understand what researchers value about their data, to an infrastructure development project, and now into a full-fledged adoption initiative. The team is proud to announce additional funding from the Sloan Foundation to focus on widespread adoption of standardized data usage and data citation practices, the building blocks for open research data metrics.
Expanded team & expanded scope
In broadening our scope and refining our adoption efforts, we are thrilled to announce new MDC team members. By including key community players in the adoption and research landscapes, we can look beyond infrastructure development and more effectively reach our publisher and repository stakeholders.
Crossref: We welcome Crossref, who will help guide our data citation work at publishers in conjunction with existing data citation initiatives (e.g., Scholix). By having an increased presence at publisher meetings and building up support in the Crossref member community, we aim to see many more journals properly contributing to the data citation landscape.
Bibliometricians: With increased pressure from research stakeholders to have data metrics at the ready, we are pleased to be working with a group of expert bibliometricians who will begin studies into researcher behavior around data reuse. It is essential that our driving motives for the development of data metrics are evidence based, and we welcome Dr. Stefanie Haustein (University of Ottawa, Co-Director ScholCommLab) and Dr. Isabella Peters (ZBW – Leibniz Information Centre for Economics) and their labs to our team.
“I am excited to join and work closely together with the MDC team on the development of data metrics. Our team at the ScholCommLab in Canada and Isabella’s research group in Germany will use a mixed-methods approach and apply bibliometric as well as qualitative methods to analyze discipline-specific data citation and reuse patterns. We hope to provide much-needed evidence to develop meaningful data metrics that can help researchers showcase the importance of data sharing.” – Dr. Stefanie Haustein
Our goals for the MDC initiative going forward are three-fold:
Increased adoption of standardized data usage across repositories through enhanced processing and reporting services
Increased implementations of proper data citation practices at publishers by working in conjunction with publisher advocacy groups and societies
Promotion of qualitative and quantitative bibliometric studies around data usage and citation behaviors
“The responsible use and application of data metrics and data citation must become a community norm across all disciplines if data creation, curation, stewardship, reuse and discovery are to be properly valued. By partnering with key infrastructure providers and researchers, Make Data Count is ensuring that the adoption of data metrics and data citation are researcher led, discipline specific and evidence based. This is crucial if we are to avoid the perverse consequences created by the misuse of article citations and metrics, such as those based on journal rank and impact factor.” – Dr. Catriona MacCallum, Director of Open Science, Hindawi
“MDC has put data metrics at the center of the debate on data sharing. Now, it is time to make data metrics a reality. The development of an ambitious infrastructure for data metrics, supported by the research of Stefanie Haustein, Isabella Peters and colleagues, creates the unique environment to turn data metrics into a tangible reality; expanding the analytical toolset for scientometric research and science policy making. Such transformation is meant to contribute not only to increase the importance of data sharing in scientific practice, but also to radically transform how science is being currently developed, measured and evaluated.” – Dr. Rodrigo Costas, Senior Researcher, CWTS, Leiden University
Driven by two separate grant funds, one focused on the deployment of data usage services, a bibliometrics dashboard, and publisher data citation campaigns (PI Lowenberg) and the other on understanding what is meaningful for data metrics (PI Haustein), the MDC team is moving full steam ahead on these adoption goals. The MDC initiative can only be effective with broad and diverse community participation. Follow along for announcements of webinars and events for community involvement and check out our announcement at the ScholCommLab blog for more details on the bibliometrics work ahead.
The Make Data Count team has been working on various infrastructure and outreach projects focused on how to measure the reach and impact of research data. While busy driving adoption of these frameworks and services, we have yet to discuss where we stand in terms of high-level challenges and where we believe we need to go.
To clarify to the community what our opinions and approaches are in terms of open data metrics, members from the Make Data Count team (Daniella Lowenberg, John Chodacki, Martin Fenner, Matt Jones) sat down and wrote a book that we hope will jump start a community conversation. We would love to hear your feedback and look forward to engaging with you on the topic.
Research data is at the center of science, and to date it has been difficult to understand its impact. To assess the reach of open data, and to advance data-driven discovery, the research and research-supporting communities need open, trusted data metrics.
In Open Data Metrics: Lighting the Fire, the authors propose a path forward for the development of data metrics. They acknowledge historic players and milestones in the process and demonstrate the need for standardized, transparent, community-led approaches to establish open data metrics as the new normal.