Make Data Count Update: November 2017

The Make Data Count (MDC) project is moving ahead with full force and the team wanted to take a moment to update the research stakeholder community on our project resources and roadmap.

In September, the MDC team sat down and mapped out the project plan for our two-year grant. Working in an agile method, we defined a “minimum viable product” (MVP): a full ecosystem of data usage and citation metrics flowing in and out of the technical hub and displayed on the DataONE repositories, Dash (the California Digital Library data publishing platform), and DataCite by summer 2018.

This fall the MDC team also spent time traveling to several conferences to gather early adopters and gauge interest in data usage metrics. Many energetic and thoughtful discussions occurred regarding what the MDC-envisioned full ecosystem of data usage metrics will look like and how various stakeholders can contribute. The main takeaway: there is a need for a comprehensive and standardized way to count and display data level metrics.

Coming up, representatives from the MDC team will be at the following events, and we hope you can join us:

So, what is MDC working on outside of these presentations?

All of the MDC project work can be tracked on GitHub, and we encourage you to follow along.

  • MDC and COUNTER are gathering community feedback on the COUNTER Code of Practice for Research Data Draft and turning this outline into a full narrative to be posted as a preprint in December.
  • DataCite is working to build out a Data Level Metrics Hub that will ingest data citations and data usage metrics, use the COUNTER recommendation as the standard for processing logs, and push out standardized usage metrics for display on repository interfaces (see the sketch after this list).
  • Our first repositories, listed above, will be working to process their usage logs against the COUNTER recommendation and to integrate with the technical hub.
  • Designs for displaying data metrics on repository interfaces will be created and tested.
  • Conversations with any groups that may want to be involved will continue; the more community feedback and support, the better!
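
As a concrete illustration of that hub flow, here is a minimal sketch under stated assumptions: processed log events are aggregated per dataset and emitted as standardized usage records a repository could display. The event format, field names, and function are illustrative assumptions, not the hub’s actual schema.

```python
# A minimal sketch of the hub flow: ingest processed events, aggregate
# per dataset, and emit standardized usage records. All field names are
# illustrative assumptions, not the hub's actual schema.
from collections import defaultdict

def build_usage_records(events, period):
    """events: iterable of (dataset_doi, metric) pairs, where metric is
    'investigation' (a metadata access) or 'request' (a data retrieval)."""
    tallies = defaultdict(lambda: {"investigations": 0, "requests": 0})
    for doi, metric in events:
        tallies[doi][metric + "s"] += 1
    return [
        {"dataset_id": doi, "period": period, **counts}
        for doi, counts in tallies.items()
    ]

records = build_usage_records(
    [("10.5061/example", "investigation"), ("10.5061/example", "request")],
    period="2018-06",
)
print(records)
# [{'dataset_id': '10.5061/example', 'period': '2018-06',
#   'investigations': 1, 'requests': 1}]
```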

How can you help?

Everyone: We put out a COUNTER Code of Practice for Data Usage Draft and would appreciate community feedback. As stated above, this recommendation is the standard against which the usage metrics ecosystem will be built. We also need help with outreach about our project, so please help us spread the word!

Repositories: We are collecting the names of repositories interested in processing their log files against our COUNTER recommendation, connecting to the hub, and becoming early adopters of data level metrics; please get in touch if your repository assigns DOIs and would like to take part.

Publishers: Support data citations! Data citation information comes from Crossref Event Data and DataCite, and the more publishers support data citations in article publications, the more data can be fed into our hub.

Researchers: We want to give you credit for your research data. We are always looking for beta testers of our system and would appreciate your input. Please get in touch if you or your labs are interested in getting involved.

Join our mailing list & follow us on Twitter (@MakeDataCount)

COUNTER Code of Practice for Research Data Draft 1

Following our draft update and executive summary, Make Data Count and COUNTER are proud to release our first draft of a Code of Practice for Research Data.

This first iteration is meant to be a draft, and our goal is to receive input and feedback from the community. We ask that you please comment on and mark up the document with questions, suggestions, and/or overall feedback. Our intention is to build on this feedback and iterate in future versions of this Code of Practice.

COUNTER Code of Practice for Research Data Draft 1

COUNTER Code of Practice for Research Data Draft Update

As a research and scholarly communications community, we value methods to gauge the impact of research outputs, and we do this in the form of citations and downloads. But until now, this has been limited to traditional journal publications, and scholarly research is much more than an article. Foremost, data play a major role in the research process and deserve to be valued as a first-class scholarly output. To do so, an infrastructure for measuring the impact of research data needs to be developed that responds to the community’s needs. The first step is developing a measurement standard for data usage.

Counting data usage is much harder than counting PDF downloads for journal articles. Data are complex objects with a variety of file formats and numerous versions, and one dataset can be part of, or derived from, another dataset. Earlier this year, COUNTER (represented by Paul Needham) and Make Data Count team members (Martin Fenner (DataCite), Matt Jones (DataONE), John Chodacki and Daniella Lowenberg (CDL)) met for two days to talk through the use cases, definitions, and hurdles in properly counting data usage and to put together the first iteration of the COUNTER Code of Practice for Research Data.

Both days were filled with discussions about what stakeholders would value in data usage. For instance, funders may want statistics about all data from a specific program, or an institution may want aggregated statistics about all of its researchers’ data. Discussions continued about how data can be dynamic, can be partially downloaded, and can vary in volume. Concerns about geolocation and IP addresses across different country standards were also discussed.

We talked through puzzling definitions, reminiscent of word problems: “If someone accesses and downloads package A, they get a copy of a granule; if that granule is part of another download, package B, they get a copy of it there as well. If the granule is downloaded without either package, does it count toward the total for package A or package B? Both? Neither?”

Hint: The answer is both.
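
A minimal sketch of that rule, assuming a simple mapping from packages to the granules they contain (all identifiers here are hypothetical): a granule downloaded on its own credits every package that includes it.

```python
# A minimal sketch of the granule-counting rule described above. The
# package-to-granule mapping and identifiers are hypothetical.
from collections import Counter

PACKAGES = {
    "package_A": {"granule_1", "granule_2"},
    "package_B": {"granule_1", "granule_3"},
}

def count_download(item_id, counts):
    """Credit a download to every package containing the item.

    A direct download of a granule counts toward all packages that
    include it ("the answer is both"); a download of a package counts
    toward that package alone.
    """
    if item_id in PACKAGES:
        counts[item_id] += 1
    else:
        for package, granules in PACKAGES.items():
            if item_id in granules:
                counts[package] += 1

counts = Counter()
count_download("package_A", counts)   # credits package_A
count_download("granule_1", counts)   # credits package_A and package_B
print(counts)  # Counter({'package_A': 2, 'package_B': 1})
```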

As the first draft takes shape, we want to share a summary of our recommendation. Make Data Count would appreciate your input and feedback once we release the first version at the beginning of September. Stay tuned for the first release, and follow our website for announcements of virtual and in-person sessions. Additionally, if you are working in the data metrics or data usage space, please get in touch with us!

COUNTER Code of Practice for Data Usage Executive Summary:

The Code of Practice for Research Data enables data publishers and vendors to produce consistent and credible usage data for research data. This allows libraries, funders and other stakeholders to compare data received from different vendors and data publishers, and to understand and demonstrate the value of research data.

This is the first draft release of a Code of Practice for Research Data specifically targeting research data usage. The recommendations are aligned as much as possible with the COUNTER Code of Practice Release 5 for the major categories of e-resources (journals, databases, books, reference works, and multimedia databases). Many definitions, processing rules and reporting recommendations apply to research data in the same way as they apply to other resources. For example, this applies to data granularity, i.e. how to report usage for datasets that are available as single files, datasets consisting of multiple files, and/or collections of datasets.

While there is much more heterogeneity in this granularity for research data compared to other e-resources, the basic approach to data processing and reporting is fundamentally the same. The Dataset (a collection of data, published or curated by a single agent) is the content item for which we report usage, and this can be in the form of investigations (e.g. how many times metadata are accessed) and requests (how many times data are retrieved). Investigations and requests for components of a dataset can be reported in the same way as for other e-resources under COUNTER Code of Practice Release 5, in that the total number of requests is summed across the components of a given dataset. Sessions allow the differentiation between total investigations and requests of a dataset (in which all accesses are summed) and unique investigations and requests (in which accesses within the same user session are counted only once), similar to the reporting for other content items.
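
A minimal sketch of the total-versus-unique distinction, assuming a simple event log of (session, dataset, action) triples; the log format is an illustrative assumption, not a prescribed schema.

```python
# A minimal sketch of total vs. unique counting under user sessions,
# in the Release 5 style described above. The log format is an
# illustrative assumption.
from collections import defaultdict

events = [
    ("session-1", "dataset-X", "investigation"),  # metadata view
    ("session-1", "dataset-X", "investigation"),  # same session, again
    ("session-1", "dataset-X", "request"),        # data download
    ("session-2", "dataset-X", "investigation"),
]

totals = defaultdict(int)
uniques = defaultdict(int)
seen = set()  # (session, dataset, action) triples already counted once

for session, dataset, action in events:
    totals[(dataset, action)] += 1
    key = (session, dataset, action)
    if key not in seen:
        seen.add(key)
        uniques[(dataset, action)] += 1

print(totals[("dataset-X", "investigation")])   # 3 total investigations
print(uniques[("dataset-X", "investigation")])  # 2 unique investigations
```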

Some aspects of the processing and reporting of usage data are unique to research data, and the Code of Practice for Research Data thus needs to deviate from the COUNTER Code of Practice Release 5 and specifically address them. This starts with the main use cases for data usage reporting: subscription access to research data is uncommon, and thus breaking down the usage data by the institution accessing the research data is less relevant. There is interest in understanding the geographic distribution of investigations and requests for research data, but these usage data can have a lower granularity (by country rather than by institution) and can be aggregated and openly shared.

COUNTER Code of Practice Release 5 limits the usage data to human users and filters out all known robots, crawlers and spiders. While the same exclusion list can be applied to research data, there is legitimate non-human usage by scripts and other tools used by researchers, and these usage data should be included in the reporting.
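
A minimal sketch of that classification, with illustrative patterns that are assumptions rather than the actual COUNTER exclusion list: known crawlers are dropped, while scripted access by research tools is kept and flagged as machine usage.

```python
# A minimal sketch of the filtering rule described above. The pattern
# lists are illustrative assumptions, not the COUNTER exclusion list.
ROBOT_PATTERNS = ("googlebot", "bingbot", "crawler", "spider")
MACHINE_PATTERNS = ("python-requests", "curl", "wget", "rstudio")

def classify_user_agent(user_agent):
    ua = user_agent.lower()
    if any(p in ua for p in ROBOT_PATTERNS):
        return "robot"    # excluded from reports
    if any(p in ua for p in MACHINE_PATTERNS):
        return "machine"  # legitimate scripted usage, kept in reports
    return "human"

assert classify_user_agent("Googlebot/2.1") == "robot"
assert classify_user_agent("python-requests/2.18.4") == "machine"
assert classify_user_agent("Mozilla/5.0 (X11; Linux x86_64)") == "human"
```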

Versioning is much more common and complex with research data compared to other e-resources, and the Code of Practice for Research Data addresses this. We recommend reporting both the usage data for each specific version and the combined usage across all versions. The Code of Practice for Research Data Draft 1 will not fully address the particular challenges associated with reporting usage for dynamically generated datasets.
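
A minimal sketch of version-level reporting, assuming a hypothetical identifier scheme of a base dataset ID plus a version suffix: counts are kept per version and also rolled up across all versions.

```python
# A minimal sketch of per-version and combined usage reporting. The
# "base-id/version" identifier scheme is a hypothetical assumption.
from collections import Counter

requests = ["ds-42/v1", "ds-42/v1", "ds-42/v2", "ds-42/v3"]

per_version = Counter(requests)
combined = Counter()
for versioned_id, n in per_version.items():
    base_id = versioned_id.split("/")[0]
    combined[base_id] += n

print(per_version)  # Counter({'ds-42/v1': 2, 'ds-42/v2': 1, 'ds-42/v3': 1})
print(combined)     # Counter({'ds-42': 4})
```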

Research data can be retrieved in a wide variety of file formats, different from text-based e-resources. For the Code of Practice for Research Data Draft 1, we will not break down requests by file format. We do include the volume of data transferred as part of the reporting, as again the variations are much greater than for other e-resources. Reporting the volume of data transferred in addition to the number of requests and investigations also helps with understanding differences between data repositories in how data are packaged and made available for retrieval.
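
A minimal sketch of reporting transfer volume alongside request counts, assuming a hypothetical event format of (dataset ID, bytes transferred) pairs.

```python
# A minimal sketch of volume reporting as described above. The event
# format (dataset_id, bytes) is a hypothetical assumption.
events = [("ds-42", 1_500_000), ("ds-42", 250_000), ("ds-7", 4_000_000)]

report = {}
for dataset_id, nbytes in events:
    entry = report.setdefault(dataset_id, {"requests": 0, "bytes": 0})
    entry["requests"] += 1
    entry["bytes"] += nbytes

print(report["ds-42"])  # {'requests': 2, 'bytes': 1750000}
```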

The Code of Practice for Research Data will enable the reporting of usage stats by different data repositories following common best practices, and thus is an essential step towards realizing usage stats as a metric available to the community to better understand how publicly available datasets are being reused. This complements ongoing work on establishing best practices and services for data citation.

COUNTER Code of Practice for Research Data

COUNTER is a non-profit organization supported by a global community of library, publisher and vendor members, who contribute to the development of the Code of Practice through working groups and outreach.

COUNTER and MDC are collaborating to develop and publish a Code of Practice for Research Data. This set of recommendations will focus on how data usage should be measured and reported.

COUNTER Code of Practice for Data Usage Update (August 18, 2017) 

COUNTER Code of Practice for Research Data Draft 1 (September 8, 2017)

Make Data Count: Building a System to Support Recognition of Data as a First Class Research Output

The Alfred P. Sloan Foundation has made a 2-year, $747K award to the California Digital Library, DataCite and DataONE to support collection of usage and citation metrics for data objects. Building on pilot work, this award will result in the launch of a new service that will collate and expose data level metrics.

The impact of research has traditionally been measured by citations to journal publications: journal articles are the currency of scholarly research. However, scholarly research is made up of a much larger and richer set of outputs beyond traditional publications, including research data. In order to track and report the reach of research data, methods for collecting metrics on complex research data are needed. In this way, data can receive the same credit and recognition that is assigned to journal articles.

“Recognition of data as valuable output from the research process is increasing and this project will greatly enhance awareness around the value of data and enable researchers to gain credit for the creation and publication of data” – Ed Pentz, Crossref.

This project will work with the community to create a clear set of guidelines on how to define data usage. In addition, the project will develop a central hub for the collection of data level metrics. These metrics will include data views, downloads, citations, saves, and social media mentions, and they will be exposed through customized user interfaces deployed at partner organizations. Built in an open-source environment, with extensive user experience testing and community engagement, the products of this project will be available to data repositories, libraries, and other organizations to deploy within their own environments, serving their communities of data authors.

Are you working in the data metrics space? Let’s collaborate.

Find out more and follow us at: www.makedatacount.org, @makedatacount

About the Partners

California Digital Library was founded by the University of California in 1997 to take advantage of emerging technologies that were transforming the way digital information was being published and accessed. The University of California Curation Center (UC3), one of four main programs within the CDL, helps researchers and the UC libraries manage, preserve, and provide access to their important digital assets, and develops tools and services that serve the community throughout the research and data life cycles.

DataCite is a leading global non-profit organization that provides persistent identifiers (DOIs) for research data. Our goal is to help the research community locate, identify, and cite research data with confidence. Through collaboration, DataCite supports researchers by helping them to find, identify, and cite research data; data centres by providing persistent identifiers, workflows and standards; and journal publishers by enabling research articles to be linked to the underlying data/objects.

DataONE (Data Observation Network for Earth) is an NSF DataNet project which is developing a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.