Make Data Count Summer 2018 Update

It’s been two exciting months since we released the first iteration of our data-level-metrics infrastructure.  We are energized by the interest garnered and questions we’ve received and we wanted to share a couple of highlights!

July Webinar

Soon after launch we hosted a webinar on “How-To” make your data count. Thank you to the 100 attendees who joined us and asked such thoughtful questions. For those who could not make it, or who would like a recap, we have made all resources available on the “Resources” tab of the website. Check out the July 10th webinar recording, webinar slide deck, and a transcript of the Q&A session.

If you still have questions, we encourage you to get in touch with us directly so that we can set up a group call with our team and yours. We have found our meetings with repositories and institutions to talk through the code of practice and log processing steps have been very helpful.

Zenodo Implemented the Code of Practice

A big congratulations and thank you go out to the Zenodo team for their implementation of standardized data usage metrics. Our project is only successful if as many repositories as possible standardize their data usage metrics, so that we can truly have a comparable data metrics landscape. Zenodo is a popular, global repository that was able to follow the Code of Practice for Research Data that we authored and to standardize and display its views and downloads. We are looking forward to Zenodo displaying citations and contributing its usage metrics to our open hub.

In-Person Team Meeting

Last week members from the DataCite, DataONE, and CDL teams were able to meet for a full day of planning the next quarter of the project. Prioritizing by project component, we were able to agree on where we would like to be by RDA Botswana. In broad terms, we would like to have citations integrated into the DataCite open hub (instead of as a separate entity in Event Data), we plan to gather user feedback on valued metrics, and we would like to spend time analyzing the citation landscape and the reasons why citations are not making it to the hub. Follow along at our Github here.

Our biggest goal is to get as many repositories as possible to make their data count. But beyond repositories, there is a role for all of us here:

Repositories:

  • If you are on a home-grown platform, follow along with our How-To guide. Let us know if you are implementing, and share your experience with us. The more that we can publicize repositories’ experiences and resources, the easier it will be for the community to adopt.
  • If you are a part of a larger platform community (fedora, dataverse, bepress), help us advocate for implementation!
  • Send your data citations through DataCite metadata. DataCite collects citation metadata as part of the DOI registration process. Enrich your metadata with links between literature (related resources) and data using the relatedIdentifier property.
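
For the relatedIdentifier point above, a dataset's metadata might record its link to an article roughly as follows. This is only a minimal sketch: both DOIs are made-up placeholders, the helper name `add_citation_link` is hypothetical, and the field names (`relatedIdentifier`, `relatedIdentifierType`, `relationType`) follow the DataCite Metadata Schema as we understand it, so please check the current schema documentation before relying on them.

```python
import json


def add_citation_link(metadata, article_doi, relation_type="IsCitedBy"):
    """Return a copy of `metadata` with one more related-identifier entry.

    `metadata` is a plain dict standing in for a dataset's DataCite record.
    """
    links = list(metadata.get("relatedIdentifiers", []))
    links.append({
        "relatedIdentifier": article_doi,   # the article that cites the dataset
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,
    })
    return {**metadata, "relatedIdentifiers": links}


# Placeholder DOIs, for illustration only.
dataset = {"doi": "10.1234/example-dataset"}
enriched = add_citation_link(dataset, "10.1234/example-article")
print(json.dumps(enriched, indent=2))
```

The helper returns a new dict rather than editing the record in place, which keeps the original metadata available for comparison before anything is sent to a registration service.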

Publishers:

  • Index your data citations with Crossref. When we first implemented MDC at our repositories we noticed that some known data citations were not appearing, and when looking in the Crossref API found that even when researchers added data citations they were in some cases stripped in the XML. When depositing article metadata, please ensure data citations are included as references (in the case of DataCite DOIs) or as relationships (in the case of other PIDs).

Funders, Librarians, and other Scholarly Communications Stakeholders:

  • Help us advocate for the implementation of data level metrics! Catch us at 5AM Conference, ICSTI, FORCE2018, or at RDA Botswana/International Data Week to learn more about our project and better equip yourself as an advocate.

Follow us on Twitter, join our Newsletter, or contact us directly here.

It’s Time to Make Your Data Count!

One year into our Sloan funded Make Data Count project, we are proud to release Version 1 of standardized data usage and citation metrics!

As a community that values research data, it is important for us to have a standard and fair way to compare metrics for data sharing. We know of and are involved in a variety of initiatives around data citation infrastructure and best practices, including Scholix, Crossref, and DataCite Event Data. But data usage metrics are tricky, and before now there had not been a group focused on processes for evaluating and standardizing data usage. Last June, members from the MDC team and COUNTER began talking through what a recommended standard could look like for research data.

Since the development of our COUNTER Code of Practice for Research Data we have implemented comparable, standardized data usage and citation metrics at Dash (CDL) and DataONE*, two project team repositories.

*DataONE UI coming soon

The repository pages above show how we Make Data Count:

  • Views and Downloads: Internal logs are processed against the Code of Practice, and standard-formatted usage logs are sent to a DataCite hub for public use and, eventually, aggregation.
  • Citations: Citation information is pulled from Crossref Event Data.

The Make Data Count project team works in an agile “minimum viable product” methodology. This first release has focused on developing a standard recommendation, processing our logs against that Code of Practice to develop comparable data usage metrics, and displaying both usage and citation metrics at the repository level. We know from work done in the prototype NSF-funded Making Data Count project that the community values additional metrics. Hence future versions will include features such as:

  • details about where the data are being accessed
  • volume of data being accessed
  • citation details
  • social media activity

We just released our first iteration of data-level metrics infrastructure, what next?

1) Get Repositories Involved

For this project to be effective and for us to compare and utilize data-level metrics we need as many repositories as possible to join the effort. This is an open call for every repository with research data to Make Data Count. A couple of important resources to do so:

  • Check out our How-To Guide as described by the California Digital Library implementation of Make Data Count. Tips and tools (e.g. a Python Log Processor) are detailed in this guide and available on our public Github. Links in this guide also point to the DataCite documentation necessary for implementation.
  • Join our project team for a webinar on how to implement Make Data Count at your repository and learn more about the project on Tuesday, July 10th at 8am PST/11am EST. Webinar link: http://bit.ly/2xJEA4n.

2) Build Advocacy for Data-Level Metrics

Publishers:

When implementing this infrastructure in our repositories we became aware of how few publishers are indexing data citations properly. Very few datasets are correctly receiving citation credit in articles. If you are a publisher or are interested in advocating for proper data citation practices, check out the Scholix initiative and our brief guide here as well as DataCite’s recent blog on the current state of data citations.

Researchers & the research stakeholder community:

For the academic research community to value research data we need to talk about data-level metrics. This is a call out to researchers to utilize data-level metrics as they would with articles, and for academic governance to value these metrics as they do with articles.

With the first version of our data-level-metrics infrastructure released, we are excited to work as a community to further drive adoption of data metrics. For further updates, follow our twitter @makedatacount.

Publishers: Make Your Data Citations Count!

Many publishers have implemented open data policies and have publicly declared  their support of data as a valuable component of the research process. But to give credit to researchers and incentivize behavior for data publishing, the community needs to promote proper citation of data. Many publishers have also endorsed the FORCE Data Citation Principles, Scholix, and other data citation initiatives, but still we have not seen implementation or benefits of proper data citation indexing at the journal level. Make Data Count provides incentives and aims to show researchers the value of their research data by displaying data usage and citation metrics. However, to be able to expose citations, publishers need to promote and index data citations with Crossref so that repositories utilizing the Make Data Count infrastructure can pull citations, evaluate use patterns, and display them publicly.

So how, as a publisher, can you support open research data and incentivize researchers to think about data like articles?

  1. Implement policies that advise researchers to deposit data to a stable repository that gives a persistent, citable identifier for the dataset
  2. Guide researchers to cite their own data or other data related to their article in their references list
  3. Acknowledge data citations in the article, data availability statement, and/or reference list, tag it as a data citation, and send this in XML to Crossref via the references list or in the relationships type. Crossref has put together a simple guide here.

Make Data Count Update: Spring, 2018

The Make Data Count team is rapidly approaching the first release of standardized and comparable data level metrics (DLMs) on California Digital Library’s Dash and DataONE repositories. Resources on this release will be available shortly, but in the meantime the team would like to share updates on work completed in winter and our spring roadmap.

Berlin, March 2018

Before the Research Data Alliance (RDA) 11th Plenary in Berlin, MDC team members met for a day to map the work towards reaching our minimum viable product (MVP) by May. The focus of Fall 2017 – Winter 2018 was releasing a recommendation for counting data usage metrics. Now that this standard has been released, the team is utilizing it as guidance for processing logs at the repository level and sending these reports to a centralized open hub (at DataCite for access and aggregation).

Discussions for log processing, hub aggregation, and display at the repository level centered around the interactions between and roles of a repository, DataCite hub, and Crossref Event Data (architecture map to be released this summer). The team also discussed how to tackle publication date (when repositories allow for delayed publication for peer review reasons), how Scholix and Event Data work together, how citations will be pulled, and what resources should be produced for the community to implement MDC at their own repositories. For more information please check out our public Github.

Our goal is to release the first iteration of DLMs in May. With this release in Dash and DataONE we will also be providing:

  • A How-To Guide for repositories
  • Webinar (recorded) for repository implementation
  • Log Processor (in Python) for repositories that would like to utilize this built tool
  • Explanatory architecture diagram of the push and pulls from repositories to the DataCite hub

We have been collecting a list of repositories that have expressed interest in processing and displaying comparable DLMs, but if you have not yet been in touch with us please do contact us.

While at RDA, members from the MDC team presented at the Scholix (Kristian Garza and Martin Fenner pictured below) and Data Usage Metrics Working Group sessions.

The Data Usage Metrics WG was recently formed to engage the community in discussions around needs and priorities for usage metrics (and not just following the path of article level metrics). At this first session Make Data Count was presented by John Chodacki (CDL), Kristian Garza (DataCite), and Dave Vieglais (DataONE) pictured below. Wouter Haak presented ongoing initiatives at Elsevier and members from the audience shared out their work in this space.

Full notes and a recording of this session are available. While the focus of the group is broader than the Make Data Count use case, we encourage anyone interested in this space to join the working group.

April, 2018 and Roadmap Forward

Beyond RDA, members from Make Data Count were invited to the AAMC/NEJM/Harvard MRCT meeting on “Credit for Data Sharing” where MDC was briefly presented as an example of infrastructure for credit. Throughout Spring and Summer, MDC plans to direct outreach at repositories on how and why to get involved, and how and why publishers should index data citations. For repositories to be able to display citations, they need to be indexed by publishers with Crossref. A priority for the MDC project is to elevate the number of publishers doing this. To do so we will be releasing a series of resources for repositories and publishers virtually and in person at conferences.

Catch us this summer at:

Stay tuned for more updates, webinar dates, and resources!

Code of practice for research data usage metrics release 1

Kicking off Love Data Week 2018, the Make Data Count (MDC) team is pleased to announce that the first iteration of our Code of Practice for Research Data Usage Metrics Release 1 has now been posted as a preprint.

Beginning in June, members from the MDC team and COUNTER began conversations around what a standard for data usage metrics may look like. By September we were able to release an initial draft outline for community feedback. Comments from the community and further drafting spurred discussions around how data are different than articles and where the code of practice for data needed to deviate from the COUNTER Code of Practice Release 5.

Our first release has been posted as a preprint so that we can continue to receive community feedback and input. This Code of Practice will act as the framework for the MDC project goals of having comparable data usage metrics across the repository landscape. As we begin to implement this standard in our own (CDL and DataONE) data repositories we will be adapting the Code of Practice based on our experiences. We hope that repositories interested in being early adopters of displaying standardized data level metrics in accordance with this recommendation will also contribute to future iterations of the code of practice.

We look forward to utilizing this first release as a starting point for community discussion around data level metrics, and urge anyone interested to get in touch with us or join our RDA Data Usage Metrics Working Group.

Make Data Count Winter 2018 Update

For the past few months, we have worked to garner interest and facilitate discussion about data usage metrics within the community. Internally, we are working to drive development toward comparable, standardized data usage metrics and data citations on repository interfaces. We are excited to share our progress and we want to thank those who have given us feedback or have gotten involved along the way!

December, 2017

Early in December we had a series of webinars for the DataONE and NISO communities. Although the recordings are not available, the discussions between the MDC team and institutions and repository communities were engaging and productive. Thank you to those who joined us. Slides for these webinars can be found here.

During the week of webinars we also had DataONE’s Matt Jones at AGU talking on the “Receiving Credit Where It’s Due” panel about the MDC project.

Matt Jones (DataONE) presenting at AGU, 2017

January, 2018

In the new year we gathered an Advisory Group of community members that have expertise in driving adoption of open data, open source, and open access initiatives. Our first Advisory Group meeting proved to be energizing for our MDC team as we brainstormed our best path towards mass adoption, various initiatives we should be working in conjunction with, and projects that could expand on the MDC work.

We are also pleased to report that we have launched an RDA Working Group “Data Usage Metrics” led by CDL’s Daniella Lowenberg, DataONE’s Dave Vieglais, and Scopus Product Manager Eleonora Presani. We will be at the RDA11 Berlin Meeting and would love for you to join the group and help us spread the word. The focus will be research data usage metrics implementation, adoption strategies, and future metrics to be considered.

Closing out January, members from MDC met at PIDapalooza and had a meeting focused on mapping implementation of log processing in the CDL Dash Data Publication platform and DataONE repositories.

At the conference we gave a brief presentation on the progress of Make Data Count and spent much of our session time discussing what constitutes “usage metrics”. Questions arose around how stakeholders may differ in how they would benefit from data usage metrics (e.g. institutions versus funders), how to understand impact from usage metrics, and how citations are indicators of usage.

Martin Fenner (DataCite), Daniella Lowenberg (CDL), and Trisha Cruse (DataCite) presenting at PIDapalooza

What’s in store for the rest of winter?

Instead of hibernating, we have one major priority: implementation in the Dash and DataONE repositories. Coordinating efforts between the building of an open and public hub (hosted at DataCite) and implementation in the repositories, we are documenting our questions, answers, and experiences to develop a “how-to” guide for the repository community. We are continually looking for early adopter repositories that would like to log process and display standardized data usage metrics and citations. Please get in touch with us if your repository would like to be a part of this. To follow on our implementation work, check out our public Github.

And lastly, we have been working to formalize our recommendation for research data usage metrics. Stay tuned next week for the release of our COUNTER Code of Practice for Research Data preprint.

Make Data Count Update: November, 2017

The Make Data Count (MDC) project is moving ahead with full force and the team wanted to take a moment to update the research stakeholder community on our project resources and roadmap.

In September, the MDC team sat down and mapped out the project plan for our two-year grant. Working in an agile method, we defined a “minimum viable product” (mvp) that comprises a full ecosystem of data usage and citation metrics flowing in and out of the technical hub and displayed on the DataONE repositories, Dash (California Digital Library Data Publishing Platform), and DataCite by summer of 2018.

This fall the MDC team also spent time traveling to several conferences to gather early adopters and gauge interest in data usage metrics. Many energetic and thoughtful discussions occurred regarding what the MDC-envisioned full ecosystem of data usage metrics will look like and how various stakeholders can contribute. The main takeaway: there is a need for a comprehensive and standardized way to count and display data level metrics.

Coming up, representatives from the MDC team will be (and hope you can join us) at:

So, what is MDC working on outside of these presentations?

All of the MDC project work can be tracked on Github, and we encourage you to follow along.

  • MDC and COUNTER are gathering community feedback from the COUNTER Code of Practice for Research Data Draft and turning this outline into a full narrative to be posted as a preprint in December.
  • DataCite is working to build out a Data Level Metrics Hub that will ingest data citations and data usage metrics, use the COUNTER recommendation as a standard to log crunch, and push out standardized usage metrics for display on repository interfaces.
  • Our first repositories, listed above, will be working to log process usage metrics against the COUNTER recommendation and technical hub for implementation.
  • Designs for displayed data metrics on repository interfaces will be created and tested.
  • Conversations with any groups that may want to be involved will continue; the more community feedback & support, the better!

How can you help?

Everyone: We put out a COUNTER Code of Practice for Data Usage Draft and would appreciate community feedback. As stated above, this recommendation is what the usage metrics ecosystem will be standardized against. We also need help with mass outreach about our project, so please help us spread the word!

Repositories: We are collecting the names of those who would be interested in log file crunching against our COUNTER recommendation and hub and be early adopters of data level metrics; please get in touch if your repository supplies DOIs and would be interested.

Publishers: Support data citations! The data citation information is coming from CrossRef Event Data and DataCite, and the more that publishers support data citations in article publication, the more data can be fed into our hub.

Researchers: We want to give you credit for your research data. We are always looking for beta testers of our system and would appreciate your input. Please get in touch if you or your labs are interested in getting involved.

Join our mailing list & follow us on Twitter (@MakeDataCount)

COUNTER Code of Practice for Research Data Draft 1

Following our draft update and executive summary, Make Data Count and COUNTER are proud to release our first draft of a Code of Practice for Research Data.

This first iteration is meant to be a draft, and our goal is to receive input and feedback from the community. We ask that you please comment on and mark up the document with questions, suggestions, and/or overall feedback. Our intention is to build off of this feedback and iterate on future versions of this Code of Practice.

COUNTER Code of Practice for Research Data Draft 1

COUNTER Code of Practice for Research Data Draft Update

As a research and scholarly communications community, we value methods to gauge the impact of research outputs, and we do this in the forms of citations and downloads. But until now this has been limited to traditional journal publications, and scholarly research is much more than an article. Foremost, data play a major role in the research process and deserve to be valued as a first-class scholarly output. But to do so, an infrastructure for measuring the impact of research data needs to be developed that responds to the community’s needs. The first step is developing a measurement standard for data usage.

Counting data is much harder than counting journal PDF articles. Data are complex objects with a variety of file formats and numerous versions, and one dataset can be part of or derived from another dataset. Earlier this year, COUNTER (represented by Paul Needham) and Make Data Count team members (Martin Fenner (DataCite), Matt Jones (DataONE), John Chodacki and Daniella Lowenberg (CDL)) met for two days to talk through the use cases, definitions, and hurdles in properly counting data usage and to put together the first iteration of the COUNTER Code of Practice for Research Data.

Both days were filled with discussions about what stakeholders would value in data usage. For instance, funders may want to know statistics about all data from a specific program, or an institution may want to know statistics about all of their researchers’ data aggregated. Discussions continued about how data can be dynamic, have partial downloads, and data can vary in volume. Concerns about geolocation and IP addresses across different country standards were also discussed.

We talked through puzzling definitions, reminiscent of word problems: “If someone accesses and downloads package A, they get a copy of granule, but if that granule is in another download, package B, they get a copy of it. If granule was downloaded without either package, does it count for package A or B for the total count? Both? Neither?”.

Hint: The answer is both.
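
To make the “both” answer concrete, here is a minimal sketch of crediting a granule download to every package that contains it. The granule and package names and the `PACKAGES_BY_GRANULE` lookup are invented for illustration; a real repository would derive this mapping from its own catalogue.

```python
from collections import Counter

# Hypothetical catalogue: which packages contain which granule.
PACKAGES_BY_GRANULE = {
    "granule-1": ["package-A", "package-B"],
    "granule-2": ["package-A"],
}


def count_requests(requested_granules):
    """Credit each granule request to every package that contains it."""
    counts = Counter()
    for granule in requested_granules:
        for package in PACKAGES_BY_GRANULE.get(granule, []):
            counts[package] += 1
    return counts


# One direct download of a granule shared by packages A and B.
counts = count_requests(["granule-1"])
print(dict(counts))  # {'package-A': 1, 'package-B': 1}
```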

As the first draft takes shape, we want to share a summary of our recommendation. Make Data Count would appreciate your community input and feedback as soon as we release the first version at the beginning of September. Stay tuned for the first release, and follow us on our website for where we will be doing virtual and in-person sessions. Additionally, if you are working in the data metrics or data usage space, please get in touch with us!

COUNTER Code of Practice for Data Usage Executive Summary:

The Code of Practice for Research Data enables data publishers and vendors to produce consistent and credible usage data for research data. This allows libraries, funders and other stakeholders to compare data received from different vendors and data publishers, and to understand and demonstrate the value of research data.

This is the first draft release of a Code of Practice for Research Data specifically targeting research data usage. The recommendations are aligned as much as possible with the COUNTER Code of Practice Release 5 for the major categories of e-resources (journals, databases, books, reference works, and multimedia databases). Many definitions, processing rules and reporting recommendations apply to research data in the same way as they apply to  other resources. For example, this applies to data granularity, i.e. how to report usage for datasets that are available as single files, datasets consisting of multiple files, and/or collections of datasets.

While there is much more heterogeneity in this granularity for research data compared to other e-resources, the basic approach to data processing and reporting is fundamentally the same. The Dataset (a collection of data, published or curated by a single agent), is the content item for which we report usage and this can be in the form of investigations (e.g. how many times metadata are accessed) and requests (how many times data are retrieved). Investigations and requests for components of the dataset can be reported in the same way as other e-resources under COUNTER Code of Practice Release 5, in that the total number of requests are summed across the components of a given dataset. Sessions allow the differentiation between total investigations and requests of a dataset (in which all accesses are summed) and unique investigations and requests (in which accesses are only counted once if they are within a unique user session), similar to the reporting for other content items.
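
The difference between total and unique counts can be sketched in a few lines. This assumes the log has already been parsed into (session, dataset, kind) events and that session identifiers were assigned upstream; the event format and the `summarize` function are illustrative assumptions, not part of the Code of Practice itself.

```python
from collections import defaultdict

# Hypothetical, already-parsed log events: (session_id, dataset_id, kind),
# where kind is "investigation" (metadata accessed) or "request" (data retrieved).
events = [
    ("s1", "dataset-1", "investigation"),
    ("s1", "dataset-1", "investigation"),
    ("s1", "dataset-1", "request"),
    ("s2", "dataset-1", "request"),
    ("s2", "dataset-1", "request"),
]


def summarize(events):
    """Return (totals, uniques), each keyed by (dataset_id, kind)."""
    totals = defaultdict(int)
    uniques = defaultdict(int)
    seen = set()
    for session_id, dataset_id, kind in events:
        totals[(dataset_id, kind)] += 1          # every access is summed
        key = (session_id, dataset_id, kind)
        if key not in seen:                      # counted once per session
            seen.add(key)
            uniques[(dataset_id, kind)] += 1
    return dict(totals), dict(uniques)


totals, uniques = summarize(events)
print(totals)
print(uniques)
```

Here the two sessions produce three total requests but only two unique requests, because repeated requests within one session collapse to a single count.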

Some aspects of the processing and reporting of usage data are unique to research data, and the Code of Practice for Research Data thus needs to deviate from the COUNTER Code of Practice Release 5 and specifically address them. This starts with the main use cases for data usage reporting; subscription access to research data is uncommon, and thus breaking down the usage data by institution accessing the research data is less relevant. There is interest in understanding the geographic distribution of investigations and requests to research data, but these usage data can have a lower granularity (by country rather than by institution), and can be aggregated and openly shared.

COUNTER Code of Practice Release 5 limits the usage data to human users and filters out all known robots, crawlers and spiders. While the same exclusion list can be applied to research data, there is legitimate non-human usage by scripts and other tools used by researchers, and these usage data should be included in the reporting.
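
One way to picture this rule is a three-way classification of user agents: robots are dropped, while both human and scripted usage are kept. The marker tuples below are placeholders made up for this sketch and are not the actual COUNTER exclusion list; the function names are likewise hypothetical.

```python
# Placeholder markers for illustration only; a real implementation would load
# a maintained robots/crawlers list and a list of recognized research tools.
ROBOT_MARKERS = ("bot", "crawler", "spider")
MACHINE_MARKERS = ("python-requests", "curl", "wget")


def classify(user_agent):
    """Label a user agent string as 'robot', 'machine', or 'human'."""
    ua = user_agent.lower()
    if any(marker in ua for marker in ROBOT_MARKERS):
        return "robot"
    if any(marker in ua for marker in MACHINE_MARKERS):
        return "machine"
    return "human"


def reportable(user_agents):
    """Keep human and legitimate machine usage; drop known robots."""
    return [ua for ua in user_agents if classify(ua) != "robot"]


agents = ["Googlebot/2.1", "python-requests/2.19", "Mozilla/5.0 (X11; Linux)"]
print([classify(ua) for ua in agents])
print(reportable(agents))
```

Keeping "machine" as its own label, rather than folding it into "human", leaves room to report the two kinds of access separately.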

Versioning is much more common and complex with research data compared to other e-resources, and the Code of Practice for Research Data addresses this. We recommend reporting both the usage data for each specific version and the combined usage for all versions. The Code of Practice for Research Data Draft 1 will not fully address the particular challenges associated with reporting usage for dynamically generated datasets.
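
As a small illustration of reporting per-version and combined usage side by side, assuming each request has already been resolved to a (dataset, version) pair (the identifiers below are invented):

```python
from collections import Counter

# Hypothetical requests, each resolved to (dataset_id, version).
version_requests = [
    ("dataset-1", "v1"),
    ("dataset-1", "v2"),
    ("dataset-1", "v2"),
]

# Usage for each specific version.
per_version = Counter(version_requests)

# Combined usage across all versions of the same dataset.
combined = Counter(dataset_id for dataset_id, _version in version_requests)

print(dict(per_version))
print(dict(combined))
```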

Research data can be retrieved in a wide variety of file formats, different from text-based e-resources. For the Code of Practice for Research Data Draft 1 we will not break down requests by file format. We do include the volume of data transferred as part of the reporting, as again the variations are much greater than for other e-resources. Reporting request data transfer volume in addition to the number of requests and investigations also helps with understanding differences between data repositories in how data are packaged and made available for retrieval.
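
The packaging point can be seen in a toy example: one repository serves a dataset as a single archive, another as five separate files. The repository names, dataset identifiers, and byte sizes are invented for illustration, and `usage_with_volume` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical requests: (dataset_id, bytes_transferred).
# repo-A serves one 5 MB archive; repo-B serves five 1 MB files.
volume_requests = [("repo-A/dataset-1", 5_000_000)] + [("repo-B/dataset-2", 1_000_000)] * 5


def usage_with_volume(requests):
    """Report request count and total bytes transferred per dataset."""
    report = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for dataset_id, size in requests:
        report[dataset_id]["requests"] += 1
        report[dataset_id]["bytes"] += size
    return dict(report)


report = usage_with_volume(volume_requests)
for dataset_id, row in report.items():
    print(dataset_id, row)
```

The request counts differ by a factor of five while the transferred volume is identical, which is the kind of difference that reporting volume alongside counts helps to explain.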

The Code of Practice for Research Data will enable the reporting of usage stats by different data repositories following common best practices, and thus is an essential step towards realizing usage stats as a metric available to the community to better understand how publicly available datasets are being reused. This complements ongoing work on establishing best practices and services for data citation.