Mapping the Impact of Software in Science
Notes from the 2023 hackathon hosted by the Chan Zuckerberg Initiative
Over the last few decades, a scientific culture has emerged that defines success as publication in high-prestige journals and relies on citation-based metrics as the currency of scientific excellence. Creators of critical resources — software, data, methods, hardware — that empower scientists across the globe often struggle to demonstrate their impact in a publication-centric culture. They face the dilemma of either packaging these resources in the format of a journal publication to make them “count,” or venturing into the uncharted territory of demonstrating their impact using other data sources. How do we measure the scientific impact of outputs that are not papers?
Thankfully, institutions and funders are realizing we need better ways to measure the impact of these resources in order to sustain them. The open data community has been promoting the use of standardized metrics to assess the impact and reuse of research data. A global coalition of funders has identified sustaining research software, in particular, as a critical priority for science funding, and yet characterizing its usage and adoption in science remains a challenge. Despite the exponential growth of computational methods in biomedicine and science at large, the lack of canonical data, infrastructure, and widespread practices for citing software means that it’s very hard to quantify impact based on the scientific literature. It’s difficult for maintainers, funders, institutions, and individual labs to answer questions like the following:
- Which software tools are most frequently used by scientists in any given field?
- How does the use of open source software compare to that of proprietary software in any given field?
- Are emerging tools replacing legacy ones?
- What is the prevalent programming language in any given field?
- Which software tools should be part of a student’s computational curriculum?
- Which software projects should funders prioritize as critical infrastructure for science?
In recent years, there have been several attempts to answer these questions by mining the scientific literature and analyzing electronic notebooks or research code repositories. In 2022, our team at CZI released the CZ Software Mentions dataset, which represents the most comprehensive account (to our knowledge) of research software mentions extracted from the literature. And yet, there’s a long way to go to bridge the gap between software and publications and draw a complete and reliable map of software usage across scientific disciplines.
In October 2023, we hosted a hackathon that aimed to develop comprehensive datasets, methods, approaches, and resources to map the adoption and impact of research software in science (specifically scientific open source software). The hackathon convened 60 practitioners from 13 countries in different areas of computer science, data science, and machine learning, as well as organizations active in this space. Over the course of three days, we worked to tackle this challenge together using a broad range of resources and publicly available datasets — including the CZ Software Mentions dataset, the SoftCite dataset, SoMeSci, SoftwareKG, OpenAlex, the Research Organization Registry, Crossref, and OpenCitations.
Nine projects were identified as high priority. Check out the main repository of the hackathon for more details.
- How do scientific disciplines differ in the type of software authors cite? An analysis of disciplinary differences in software mention and usage sheds some light on this question.
- One project developed a taxonomy and algorithmic methods for classifying an author’s intent when referencing software in scholarly articles, helping to distinguish papers that introduce new software, papers that use existing software, and papers that mention software for other, unrelated reasons.
- Characterizing software impact by looking only at mentions in the literature risks missing the critical contribution of core software libraries: the dependencies and infrastructure on top of which scientific software is built. A new analysis conducted at the hackathon looked into the dependency trees of mentioned software.
- Large Language Models (LLMs) seem able to answer questions that used to require significant model-training effort. This project assessed the possibility of leveraging GPT-3.5, with some fine-tuning, to process a paragraph from a scholarly manuscript, identify software mentions that lack citations, and suggest a canonical citation for each.
- The absence of bidirectional links between academic papers and data/code repositories makes it challenging to discover one from the other. This project developed a tool to connect scientific papers to their relevant GitHub repositories and vice versa.
- While multiple datasets of software mentions have become available, they often use inconsistent data models. This project aims to construct a gold-standard dataset that can significantly advance the automated extraction of software citations from the scientific literature.
- Software names can be challenging to disambiguate or resolve. This project developed a benchmark to validate clustering approaches for software name disambiguation.
- Linking software to research organizations can help better monitor the production and consumption of software in academic institutions. This project demonstrates the possibility of establishing these links using research organization identifiers from ROR.
- A data visualization project aimed to visualize trends in software usage over time by leveraging software mention data.
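The dependency analysis mentioned above rests on a simple idea: credit should flow from a software package mentioned in a paper down through the libraries it depends on. A minimal sketch of that traversal, using a hypothetical toy graph rather than real package metadata:

```python
from collections import deque

# Toy dependency graph (hypothetical package names): package -> direct dependencies.
DEPS = {
    "analysis-tool": ["scipy", "pandas"],
    "scipy": ["numpy"],
    "pandas": ["numpy", "python-dateutil"],
    "numpy": [],
    "python-dateutil": [],
}

def transitive_deps(package, graph):
    """Breadth-first walk collecting every direct and indirect dependency."""
    seen, queue = set(), deque(graph.get(package, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(graph.get(dep, []))
    return seen

# Even though a paper mentions only "analysis-tool", numpy shares in the credit.
print(transitive_deps("analysis-tool", DEPS))
```

In a real analysis, the toy graph would be replaced by dependency metadata from package registries such as PyPI or CRAN.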
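To give a flavor of the name disambiguation problem, here is a deliberately simple sketch that groups raw mention strings by a normalization key. This is an illustrative baseline of our own, not the benchmark or clustering method developed at the hackathon:

```python
import re
from collections import defaultdict

def normalize(name: str) -> str:
    """Crude normalization key: lowercase, drop version tokens and punctuation."""
    key = name.lower()
    key = re.sub(r"\bv?\d+(\.\d+)*\b", "", key)  # drop versions like "1.7" or "v2.1"
    key = re.sub(r"[^a-z0-9]+", "", key)         # drop punctuation and whitespace
    return key

def cluster_mentions(mentions):
    """Group raw mention strings whose normalization keys coincide."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[normalize(mention)].append(mention)
    return dict(clusters)

mentions = ["SciPy", "scipy 1.7", "Sci-Py", "NumPy", "numpy v1.21"]
print(cluster_mentions(mentions))
```

Real mention data is far messier (abbreviations, shared names across tools, misspellings), which is exactly why a shared benchmark for validating clustering approaches is valuable.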
All these projects were developed using open data resources and open source software, and their contributors agreed to make the output of their work at the hackathon openly available. We are excited about the momentum created by these three days together, and we expect these projects to continue and develop into new datasets, code repositories, and preprints.
We expect these projects will help lay the foundation for reliable and comprehensive data sources that connect scientific discovery to the research software that enables it.
We wish to thank all participants in the hackathon, the creators and providers of the public datasets and resources used during the event, and the CZI Infrastructure, Security and Legal teams for their help in supporting the computing needs of the participants.