The FAIR Guiding Principles, published in 2016, aim to improve the findability, accessibility, interoperability and reusability of digital research objects for both humans and machines. Until now the FAIR principles have been mostly applied to research data. The ideas behind these principles are, however, also directly relevant to research software. Hence there is a distinct need to explore how the FAIR principles can be applied to software. In this work, we aim to summarize the current status of the debate around FAIR and software, as a basis for the development of definite community-agreed principles for FAIR research software in the future. We discuss what makes software different from data with regard to the application of the FAIR principles, present an analysis of where the existing principles can directly be applied to software, where they need to be adapted or reinterpreted, and where the definition of additional principles is required. Furthermore, we discuss desired characteristics of research software that go beyond FAIR.
The FAIR Guiding Principles were published and promoted to improve the reuse of scholarly data through making it more findable, accessible, interoperable and reusable by humans and machines. Implementing FAIR helps researchers demonstrate the impact of their work by enabling the reuse and citation of the data they produce, and can promote collaboration among them. It also helps publishers and funders to define policies for data sharing, and to promote discoverability and reuse. In addition, it helps data stewards and managers to provide guidance on quality criteria for deposits in institutional, community, and/or public digital repositories.
The intention of Wilkinson et al. was that the principles not only apply to data, but also to other digital objects, e.g. algorithms, tools, and workflows, that led to that data, as all these elements must be available to ensure transparency, reproducibility and reusability. At the policy level, software is indeed seen as part of FAIR, with the European Commission expert group on FAIR data stating that “Central to the realisation of FAIR are FAIR Digital Objects, which may represent data, software or other research resources.” Applying the FAIR principles in a useful way to research software will provide similar benefits of enabling transparency, reproducibility and reusability of research, making it easier for industry, science, education and society to have effective access to software-based knowledge. In particular, FAIR software should facilitate making FAIR data.
However, software is not data. Over the last three years, numerous discussions have taken place that aimed to understand how the FAIR principles relate to software. Table 1 in the appendix provides an overview of recent events that included open discussions around FAIR software, with links to associated working documents and publications. It is clear that the four foundational principles are intended to apply to software, but can we apply them in a practical and useful way? The terminology and detail used in the 15 FAIR Guiding Principles are focused on their application to data - particularly in the life sciences - and can be confusing if applied to software without translation. The drivers, stakeholders and incentives, whilst overlapping, are not identical. In addition, the variety covered by software and its distribution channels poses a challenge when adapting the current FAIR principles.
In this work, we aim to summarize the current state of the debate around FAIR and software, as a basis for the development of specific principles for FAIR research software in the future. First, we discuss what makes software different from data with regard to the application of the FAIR principles, and describe desired characteristics of research software that go beyond FAIR (Section Software is not data). We then present an analysis of where the existing principles can directly be applied to software, where they need to be adapted or reinterpreted, and where the definition of additional principles is required (Section FAIR principles applied to research software). “Interoperability” turned out to be the most challenging principle, as we needed to take into account the complexities imposed by the executable, composite and multi-form nature of software. The conclusions provide a summary and directives for future work on FAIR for research software.
It may seem obvious that software and data are not the same, but in the discussion around the FAIR principles, software is often regarded as a special kind of data. This is not always helpful. Software and data are indeed both digital research objects and as such do share particular characteristics, such as the possibility of having a Digital Object Identifier (DOI) assigned, that allow them to be treated alike for certain aspects of FAIR. However, as elaborated by Katz et al., there are also several significant differences that need to be taken into account when treating data and software as research objects. Data are facts or observations that provide evidence. In contrast, software is the result of a creative process that provides a tool for doing something, for example with data. As such, software is executable, while data is not. Software is often built using other software. This is especially obvious for software that implements multi-step processes to coordinate multiple tasks and their data dependencies, which are usually referred to as workflows. However, generally all software applications that are not written completely from scratch are of a composite nature that easily leads to complex dependencies. The lifetime of software is generally shorter than that of data, as versioning is applied more frequently and regularly leads to a change in behaviour and interface. Consequently, dependencies as well as dependent software packages are subject to frequent changes.
Naturally, the work on FAIR principles for software is focused on research software. Research software is defined as “software that is used to generate, process or analyse results that you intend to appear in a publication (either in a journal, conference paper, monograph, book or thesis)”. Importantly, for the purpose of having a reference definition, software that does not generate, process or analyse results - such as word processing software, or the use of a web search - is not considered research software. Research software is also a research object, and provenance around software usage plays a key role in the transparency, reproducibility and reusability of scientific activities, spanning from academic to industrial research. Research software includes but is not limited to source code, binaries and web services, and covers a broad spectrum from short scripts written ad hoc by researchers to produce results for a publication, to software rigorously developed for a mission-critical process. Accordingly, research software can be distributed in many ways: through code repositories (e.g., GitHub, BitBucket, GitLab); archives like the Software Heritage project; project websites; FTP folders; and language-specific archive networks such as the Comprehensive R Archive Network (CRAN), the Python Package Index (PyPI), Maven, the Comprehensive Perl Archive Network (CPAN), Node Package Manager (NPM), and others.
Traditionally, research software has been created and maintained as Free and/or Open Source Software (FOSS). However, while there is a clear overlap between the objectives of FAIR and FOSS with regard to accessibility and reusability, they are not necessarily the same. FOSS is mostly concerned with source code being open and licensed under an open license. Open source and permissive licenses are desirable for FAIR software, but although FAIR has its roots in the “FOSS-loving” research software community, they are not a requirement as such. Required access control for certain data sets (e.g., patients' electronic health records, genomic sequences) has prevented open data from becoming a FAIR principle (see go-fair FAQ). However, such privacy and sensitivity concerns are not valid in the same way for research software that relates to published research, where there is an expectation that the methodology is made available. It remains to be discussed how open research software should be in order to meet the intentions behind FAIR.
Another much-debated relationship is the one between FAIR and software quality. Ultimately, the quality of the content of digital resources is crucial for obtaining valid research results. However, the FAIR Guiding Principles do not explicitly cover content-related quality aspects, and it is an ongoing discussion whether software quality considerations are part of FAIR. We think that it is important here to distinguish between form (that is, how software is provided, the code itself) and function (that is, what the software actually does, how it behaves, the algorithm encoded), as different quality considerations apply. This is also in line with how the FAIR principles are interpreted for data: they address the form of providing data sets to the scientific community, but are not concerned with the functional content or quality of the data themselves.
Quality aspects concerning the form of software can be considered as covered by FAIR, in particular by the interoperability and the (re)usability principles. It is important to realise that unlike data, software is not static and can only be (re)used if it is sustainable and evolves along with the continuous development of the entire software ecosystem. The quality of its codebase is decisive for a software's ability to evolve sustainably. This characteristic is often also referred to as maintainability, and includes aspects like modularity, understandability, changeability, analysability and testability. Following guidelines for good scientific software development, as well as language- and/or community-specific coding standards, is an effective means of making and keeping the code base maintainable. Many of these qualities are measurable/quantifiable and could thus be covered with additional FAIR principles and metrics.
Quality aspects that concern the functionality of software, on the other hand, go beyond what is covered by the FAIR Principles. Arguably, the most important quality criterion for research software is functional correctness, i.e., the production of the correct results every time the software is run. Thorough validation of the functional correctness of research software can however be significantly more difficult than the testing that is required for code maintainability as discussed above. For example, testing the software might require specific resources such as access to high performance computing; validated input/output data pairs to test the implementation of an algorithm might not be available yet (as the purpose of the software is to create them); or testing might require the execution of very long computations. Other important quality criteria related to functionality of research software are security measures (guaranteeing privacy and integrity of research data) and computational efficiency (striving to optimise use of resources and runtime performance). The latter cannot be measured statically and may require systematic scientific benchmarking in order to arrive at meaningful performance estimates. Discussion is ongoing to see whether workable principles and metrics can be developed for these criteria, but specific training and adequate attention in the development process are certainly key to high functional quality of research software.
We interpret the original 15 FAIR principles in the context of research software, and discuss below how they apply to software. We also suggest additional principles when necessary. Table 1 provides an overview of the principles in their original and software-specific formulations. It is important to keep in mind that software comes in different forms, for example as source code, executable binaries, and interactive notebooks. These forms have a direct implication on how the FAIR principles can be interpreted, especially for the principles about interoperability and reusability. In fact, interoperability is the most challenging FAIR principle for data and equally for research software. The static nature of data, compared with the dynamic nature of research software, greatly sets the two kinds of digital objects apart. Executability of research software has profound implications on how it can be used. Research software needs to work closely with other digital components at build time (that is, when the research software is developed), at execution/invocation time (that is, when the research software is run) and when, as part of a software workflow, it completes a specific task.
| Principle | FAIR for data | FAIR for software | Operation |
|---|---|---|---|
| F1 | (meta)data are assigned a globally unique and persistent identifier. | Software and associated metadata have a global, unique and persistent identifier for each released version | Rephrased |
| F2 | Data are described with rich metadata | Software is described with rich metadata | Rephrased |
| F3 | Metadata clearly and explicitly include the identifier of the data it describes | Metadata clearly and explicitly include identifiers for all the versions of the software it describes | Rephrased and extended |
| F4 | (meta)data are registered or indexed in a searchable resource | Software is included in a searchable software registry | Rephrased |
| A1 | (meta)data are retrievable by their identifier using a standardized communications protocol | Software and associated metadata are accessible by their identifier using a standardized communications protocol | Rephrased |
| A1.1 | the protocol is open, free, and universally implementable | the protocol is open, free, and universally implementable | Remain the same |
| A1.2 | the protocol allows for an authentication and authorization procedure, where necessary | the protocol allows for an authentication and authorization procedure, where necessary | Remain the same |
| A2 | metadata are accessible, even when the data are no longer available | Software metadata are accessible, even when the software is no longer available | Rephrased |
| I1 | (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation | Software and its associated metadata use a formal, accessible, shared and broadly applicable language to facilitate machine readability and data exchange | Rephrased and extended |
| I2 | (meta)data use vocabularies that follow FAIR principles | - | Reinterpreted, extended and split |
| I2S.1 | - | Software and its associated metadata are formally described using controlled vocabularies that follow the FAIR principles. | |
| I2S.2 | - | Software use and produce data types and formats that are formally described using controlled vocabularies that follow the FAIR principles. | |
| I3 | (meta)data include qualified references to other (meta)data | - | Discarded |
| I4S | - | Software dependencies are documented and mechanisms to access them exist. | Newly proposed |
| R1 | meta(data) are richly described with a plurality of accurate and relevant attributes | Software metadata are richly described with a plurality of accurate and relevant attributes | Rephrased and extended |
| R1.1 | (meta)data are released with a clear and accessible data usage license | Software and its associated metadata have independent, clear and accessible usage licenses compatible with the software dependencies. | Rephrased and extended |
| R1.2 | (meta)data are associated with detailed provenance | Software metadata include detailed provenance, detail level should be community agreed | Rephrased |
| R1.3 | (meta)data meet domain-relevant community standards | Software metadata and documentation meet domain-relevant community standards | Rephrased |
Findability is a fundamental principle, since it is necessary to find a resource before any other consideration. The main concern of findability for research software is to ensure software can be identified unambiguously when looking for it using common search strategies. Such strategies include the use of keywords in general-purpose search engines like Google, as well as specialised registries (websites hosting software metadata) and repositories (websites hosting software source code and binaries). Findability can be improved by registering the software in a relevant registry, along with the provision of appropriate metadata, providing contextual information about the software. Registries typically render metadata in a web-findable way and can provide a DOI. Some registries and repositories allow annotating software using domain-agnostic or domain-specific controlled vocabularies, increasing findability via search engines further. In the following we discuss how the Findability principles apply to the findability of research software in this context.
Different platforms address the identification of digital objects in different ways. As analysed by the FREYA project, there are multiple alternatives when it comes to persistent identifiers (PIDs). Generally, PIDs are long-lasting references to documents, web pages, or any other digital objects. Global uniqueness makes PIDs a tool that allows unambiguous identification of the referenced resources, and thus research software should have its own PIDs. However, it is not enough to assign a PID to a piece of software as a whole; PIDs should also be assigned to its versions and, for web applications, to its specific deployments. Software versions should be assigned different PIDs as they represent specific developmental stages of the software. This is important as it helps to guarantee data provenance and reproducible research processes. Source code management systems make it easier to track software versions. For example, Git, currently the most popular technology for source code version control, works with commit hashes (SHA-1), which uniquely point to a specific snapshot of the source code. However, this identifier is not globally resolvable, and GitHub does not make any guarantees about the accessibility or sustainability (persistence) of code on the platform, and thereby of the software published therein. A common community solution to this problem is depositing software releases from GitHub in Zenodo, an open publishing platform funded by the European Commission, and developed and hosted by the European Organization for Nuclear Research (CERN). Zenodo mints a DOI for each released version of the software and also creates a concept DOI which refers to all versions of a given software.
To the best of our knowledge, there is no unified mechanism to automatically assign PIDs for research software. Thus, software authors need to actively register their software, and associated versions, in at least one registry and/or repository.
We suggest rephrasing this principle as "Software and associated metadata have a global, unique and persistent identifier for each released version".
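The point that a Git hash identifies content uniquely but is not, by itself, a resolvable or persistent identifier can be illustrated with a minimal Python sketch. It reproduces how Git derives the SHA-1 identifier of a single file (a "blob" object); commit identifiers are computed in the same manner over a commit object. The function name `git_blob_id` is introduced here for illustration only.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a file ("blob") with this content.

    Git hashes a small header ("blob <size>" followed by a NUL byte)
    concatenated with the raw file content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Identical content always yields the same id, on any machine ...
first = git_blob_id(b"print('analysis v1')\n")
again = git_blob_id(b"print('analysis v1')\n")
# ... while any change to the content yields a different id.
changed = git_blob_id(b"print('analysis v2')\n")

print(first == again, first == changed)
```

The identifier is derived purely from the content, which is why it pins down a snapshot exactly; nothing in it says where the content can be retrieved, which is the gap that a registered PID such as a DOI is meant to fill.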
A software name alone does not reveal much about it. In order for others to find and use that software, they need information about what it does and how it works. Metadata include elements related to provenance and should follow community agreements (discussed later as part of the (re)usability principles).
Metadata comes in many different forms, and its richness depends on how much information it provides about the software in order to describe, find and use it. There are multiple projects working on solutions to add structured metadata annotations to software that could be considered as reference. Examples include the biotoolsSchema, a formalised schema (XSD) used by the bio.tools project; the CodeMeta project; and the Bioschemas Tool profile. The latter two work on top of schema.org, a project aiming to make it easier to add structured markup to web pages, and help search engines to index them. Additionally, some programming languages provide a way to add metadata to software sources, i.e., packages, and often require them to be in a specific format and/or adhere to some guidelines. For instance, R packages must include metadata in the DESCRIPTION file, while PEP 566 describes metadata for Python software packages.
Regardless of the metadata description approach used, it is recommended to use controlled vocabularies provided by community-approved ontologies, which will vary depending on the research domain. The Software Ontology is a resource that can be used to describe software, including types, tasks, versions, provenance and associated data. In the case of Life Sciences, we advise using elements from ontologies such as EDAM. EDAM provides unambiguously defined terms for describing the types of data and data identifiers, data formats, operations and topics commonly used in bioinformatics. In the geosciences, OntoSoft is an ontology designed to facilitate the annotation and publication of software with rich metadata.
In order for the principle to better reflect this definition, our proposal is to rephrase it as "Software is described with rich metadata".
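As a rough illustration of what such structured metadata can look like, the following Python sketch assembles a small record in the style of a CodeMeta `codemeta.json` file. The property names are schema.org terms reused by CodeMeta; the helper `build_codemeta_record` and all values (tool name, DOI, keywords) are hypothetical placeholders, and a real record would typically carry many more fields.

```python
import json
from typing import Dict, List


def build_codemeta_record(
    name: str,
    version: str,
    identifier: str,
    description: str,
    license_url: str,
    keywords: List[str],
) -> Dict[str, object]:
    """Assemble a minimal CodeMeta-style (JSON-LD) description of a software release."""
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "version": version,
        "identifier": identifier,
        "description": description,
        "license": license_url,
        "keywords": keywords,
    }


# All values below are placeholders, not a real tool or DOI.
record = build_codemeta_record(
    name="example-aligner",
    version="1.2.0",
    identifier="https://doi.org/10.1234/example.v1.2.0",
    description="Aligns sequencing reads to a reference genome.",
    license_url="https://spdx.org/licenses/MIT",
    keywords=["bioinformatics", "sequence alignment"],
)
codemeta_json = json.dumps(record, indent=2)
print(codemeta_json)
```

Keywords would ideally be drawn from a controlled vocabulary rather than free text, in line with the recommendation above.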
For reproducibility and reusability purposes, any person and/or system examining the metadata needs to be able to identify which version of the software is described by it. F3 extends the focus of F1 on the precise identification of versions and/or reference deployments beyond the software itself, by including the metadata associated with each version and/or reference deployment of the software. This enables the exact version of a given software to be found when reusing and/or reproducing previously generated scientific results. For example, release metadata files on Zenodo should point to specific releases on software source code repositories such as GitHub, BitBucket or GitLab.
In order for the principle to better reflect this definition, our proposal is to rephrase and extend it as “Metadata clearly and explicitly include identifiers for all the versions of the software it describes”.
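One way to read the rephrased F3 operationally is as a check that every released version appears in the metadata with its own identifier, alongside an identifier for the software as a whole (comparable to Zenodo's concept DOI). The Python sketch below is an assumption-laden illustration: the class `SoftwareMetadata`, the function `versions_missing_identifier` and the DOIs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SoftwareMetadata:
    """Metadata for one piece of software across all of its released versions."""
    name: str
    concept_id: str  # identifier referring to the software as a whole
    version_ids: Dict[str, str] = field(default_factory=dict)  # version -> identifier


def versions_missing_identifier(
    metadata: SoftwareMetadata, released_versions: List[str]
) -> List[str]:
    """Return the released versions for which the metadata lists no identifier."""
    return [v for v in released_versions if not metadata.version_ids.get(v)]


# Placeholder identifiers, for illustration only.
meta = SoftwareMetadata(
    name="example-aligner",
    concept_id="10.1234/example",
    version_ids={"1.0.0": "10.1234/example.v1.0.0", "1.1.0": "10.1234/example.v1.1.0"},
)
missing = versions_missing_identifier(meta, ["1.0.0", "1.1.0", "1.2.0"])
print(missing)
```

Here release 1.2.0 exists but has no identifier recorded, so a reader of the metadata could not tell which exact version a result was produced with.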
Software and associated metadata should be registered in a suitable, searchable software registry or repository. There are chiefly three classes of registries and repositories: (i) general ones such as Zenodo, GitHub itself, and comprehensive software archives as run by the Software Heritage project, (ii) language-specific ones such as CRAN, Bioconductor, and PyPI, and (iii) domain-specific ones such as the bio.tools registry for bioinformatics software, the BioContainers registry of bioinformatics containers and workflows, the Dockstore platform for sharing Docker-based scientific tools and workflows, the Astrophysics Source Code Library (ASCL) for software source code for astronomers and astrophysicists, swMath for mathematical software, CLARIN for digital humanities software source code and the different science gateways based on the HUBzero open source software platform.
The choice of the registry/repository may be influenced by the programming language used and/or the operating system most used by the respective community. For example, most of the Python packages are registered in PyPI and/or one of the Conda Channels. R packages use CRAN, Bioconductor.org and/or source code repositories like GitHub. Linux distributions have their own package managers with software repositories.
In order for the principle to better reflect this definition, our proposal is to rephrase it as “Software is included in a searchable software registry”.
In the original formulation of the FAIR principles, accessibility translates into retrievability through a standardized communication protocol (A1) and accessibility of metadata even when the original resource is no longer accessible (A2). These principles clearly also apply to software. However, when accessibility is also interpreted as the ability to actually use the software (access its functionality), we found mere retrievability not to be enough. In order for anyone to use any research software, a working version of the software needs to be available. This is different from just archiving source code, even in comprehensive and long-term collections like the Software Heritage archive. To use software, a working version (binary or code) has to be downloadable and/or accessible, e.g., via a web interface, along with the required documentation and licensing information. Accessibility requirements depend on the software type, e.g., web applications, command-line tools, etc. For example, software containers allow use across different operating systems and environments, e.g., local computers, remote servers, and high-performance computing (HPC) installations. Cloud-based servers can execute existing pieces of code as a service, as software made available through a web interface or via Jupyter Notebooks. Notebooks allow others to see the results and the narrative alongside the code used to generate them.
Furthermore, even for software that can be downloaded or accessed without restrictions, being able to run it might also depend on, for example, (paid) registration, other (proprietary) software packages, or a non-free operating system like Windows or macOS. For data, the FAIR principles demand that “(Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation” (I1) and in that sense discourage the use of proprietary data formats. This is in our view, however, different from transparent dependencies for running software.
It is worth re-emphasizing that a piece of research software is not a single, isolated digital object. As further discussed in the Interoperability section, research software interoperates at different levels with other digital objects, including other software. Interoperability for research software can be understood as being part of software workflows as well as building, executing and/or invoking the software itself. It is also important to note that research software might have different available versions and/or web-based deployments. However, all implementations should be considered as part of a single entity for the considerations on accessibility, with metadata ensuring appropriate links among them (see F1, F3).
Since accessibility, interoperability and (re)usability are intrinsically connected for research software, we consider aspects of installation instructions (R1.3), software dependencies (I4S), and licensing (R1.1) as part of other FAIR principles here, rather than adding another Accessibility principle.
Access to research software and its metadata can be achieved by depositing it in an appropriate repository, such as GitHub or BitBucket for the code, and Zenodo or FigShare for the metadata. We will discuss later in this paper that retrieving software source code and/or binaries is however only the first step towards being able to execute or invoke it.
Usually software (and its metadata) can be downloaded directly from the repository and/or website via standard protocols (HTTP/SSH).
There is no need to rephrase this specific item as it generally applies to both data and software, or any other digital resource exposed via the web.
Authentication and authorization are relevant for accessing research software source code, binaries and/or web applications. There are different considerations that can justify software not being available as open source code via collaborative platforms like GitHub, BitBucket or GitLab. In such cases, those platforms implement mechanisms to support authentication and authorisation, and control the access to the code. Similarly, users might need to register and/or authenticate before downloading or, in the case of web applications, using the software. In both cases, access conditions should be justified and documented.
There is no need to rephrase this specific item as it generally applies to both data and software, or any other digital resource exposed via the web.
In order for the main A1 principle to better reflect its two specific items, our proposal is to rephrase it as “Software and associated metadata are accessible by their identifier using a standardized communications protocol”.
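Retrieval by identifier over a standard protocol can be sketched for the case of DOIs, which resolve over HTTPS through the doi.org resolver. The Python example below only constructs the request and does not contact the network. The helper names and the DOI are hypothetical, and the `Accept` media type assumes that the registration agency behind the DOI supports content negotiation for machine-readable metadata.

```python
from urllib.parse import quote
from urllib.request import Request

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a URL that the DOI resolver can redirect."""
    return DOI_RESOLVER + quote(doi, safe="/")


def metadata_request(
    doi: str, media_type: str = "application/vnd.citationstyles.csl+json"
) -> Request:
    """Prepare (but do not send) an HTTPS request asking for metadata about a DOI.

    Whether the media type is honoured depends on the registration agency;
    this is an assumption of the sketch.
    """
    return Request(doi_to_url(doi), headers={"Accept": media_type})


# Placeholder DOI, for illustration only.
request = metadata_request("10.1234/example.v1.2.0")
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual retrieval; where access is restricted (A1.2), credentials would be supplied through additional headers.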
Metadata provides the context for understanding research software, and it would be ideal to have it available even long after a given piece of research software is gone. To achieve that, metadata should be available separately from the research software objects, here understood as the source code, binaries and/or web servers hosting the deployed software. For example, GitHub can host the software source code and can be connected to Zenodo, FigShare, bio.tools or FAIRsharing.org for hosting additional copies of the research software metadata. Zenodo promises metadata, and a snapshot of the software release, to be available for the upcoming 20 years, even when the versioned source code on GitHub may not be accessible any more. Metadata should follow community agreements. In this way, it will contribute towards the findability of the metadata as well as the software it refers to, and describe how the software interoperates with other digital objects and how it can be (re)used.
In order for the principle to better reflect this definition, our proposal is to rephrase it as “Software metadata are accessible, even when the software is no longer available”.
When examining the FAIR data principles from a research software perspective, interoperability turns out to be the most challenging among the four high-level principles. Data, and its surrounding metadata, are digital objects of static nature and yet interoperability for data is “the most challenging of the four FAIR principles. This, in part, is due to interoperability not being well understood”. In contrast, research software is a live digital object that interacts at different levels with others, e.g., other software, managed data, execution environments; either directly and/or indirectly, as scripts or as part of a workflow (see Figure 1). The interoperability principles (designed for data) are therefore even more challenging to apply to software: some are not directly applicable, others need to be rephrased, and new ones need to be defined to properly address the dynamic nature of software.
Software interoperability can be defined from three different angles:
(i) as a set of independent but interoperable objects to produce a runnable version of the software, including libraries, software source code, APIs and data formats, and any other resources for facilitating that task;
(ii) as a stack of digital objects that should work together for being able to execute a given task including the operating system, the execution environment, dependencies, and the software itself; and
(iii) as part of workflows, which interconnect different standalone software tools for transforming one or more data sets into one or more output data sets through agreed protocols and standards.
Thus, interoperability for software can be considered both for individual objects, which are the final product of a digital stack, and as part of broader digital ecosystems, which includes workflows. When considering workflows, different pieces of software can also work together independent of programming languages, operating systems and specific hardware requirements through the use of APIs and/or other communication protocols.
Software metadata are a necessity for interoperability, regardless of the aforementioned perspective. Metadata provides the context in which the software is used and contributes towards provenance, reproducibility and reusability. However, a balance is needed between the level of detail and the cost of generating it. Depending on whether research software is considered as an individual product or as part of an ecosystem, the associated metadata might differ, with workflows having specific mechanisms to capture it through their specifications, e.g., using Common Workflow Language (CWL) and/or Workflow Description Language (WDL), among others. This metadata should include software version, dependencies including their versions, input and output data types and formats (preferably using an ontology and/or controlled vocabulary), communication interfaces (specified using standards like OpenAPI), and/or deployment options.
Another aspect associated with interoperability is the ability to run the software in different operating systems, i.e., software portability. Software portability strongly depends on the availability of the full execution stack in other operating systems (vertical axis in Figure 1), which might not always be the case. That dependency on other digital objects to obtain working software is further elaborated in the newly introduced FAIR principle I4S. The present tendency to package software and its dependencies in software containers (e.g., Docker, Singularity, Rocket) contributes to enhanced software portability. Although the differences between interoperability and portability are not negligible, the two terms are often used interchangeably; we therefore consider both under the FAIR principle of interoperability, highlighting any issues that arise due to this divergence.
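The list of items that interoperability metadata should include can be turned into a simple completeness check. The following Python sketch is only one possible reading of that list; the class `InteropMetadata`, the function `missing_items` and the example tool are hypothetical and do not correspond to any existing schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InteropMetadata:
    """Interoperability-related metadata for one software release."""
    name: str
    version: str = ""
    dependencies: Dict[str, str] = field(default_factory=dict)  # name -> version
    input_formats: List[str] = field(default_factory=list)
    output_formats: List[str] = field(default_factory=list)
    interface: str = ""  # e.g. location of an OpenAPI description


def missing_items(meta: InteropMetadata) -> List[str]:
    """List the recommended items that this metadata record does not provide."""
    problems = []
    if not meta.version:
        problems.append("software version")
    for dep, dep_version in sorted(meta.dependencies.items()):
        if not dep_version:
            problems.append(f"version of dependency '{dep}'")
    if not meta.input_formats:
        problems.append("input formats")
    if not meta.output_formats:
        problems.append("output formats")
    return problems


# Hypothetical record: one dependency is unpinned and no output format is declared.
tool = InteropMetadata(
    name="example-aligner",
    version="1.2.0",
    dependencies={"numpy": "1.26.4", "samtools": ""},
    input_formats=["format_1930"],
)
report = missing_items(tool)
print(report)
```

The communication interface and deployment options are left optional here, mirroring the "and/or" in the text; a stricter community profile could make them mandatory.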
Considering those aspects, below we examine which interoperability principles can be translated and reworked as well as which ones are new and specific for software.
Considering the differences between data, which are static in nature, and software, which is dynamic, we propose up-front to rephrase and extend the scope of I1 as “Software and its associated metadata use a formal, accessible, shared and broadly applicable language to facilitate machine readability and data exchange”.
For software to be considered interoperable, the dimensions already mentioned (figure 1) should be taken into account. Source code should make use of formal languages, including programming languages and other mechanisms to express algorithmic solutions. When invoking research software, the use of formal languages will facilitate the needed interactions with the execution environment to perform the expected tasks. When considering research software as part of a workflow, software should be able to share input and/or output data sets with other software. The proper specification of data types, models and formats for the data consumed and/or produced is key to enabling machines to interconnect different pieces of software via their associated metadata.
Following on from the previous interoperability principle, and considering the differences between data and software, we consider two different cases here: the software itself and the data that it operates on. In both cases, ontologies and controlled vocabularies that are by design FAIR should be used for the formal description. FAIR software should operate on FAIR data, and not undermine the principles by, e.g., producing outputs only in proprietary data formats. Without FAIR vocabularies, it might become impossible for both machines and humans to understand what is described (software) and/or what it refers to (software metadata). Whenever possible, those descriptions should be agreed and maintained by communities as a mechanism towards the sustainability of such resources, keeping metadata understandable even if the resources disappear for unforeseen reasons. A registry of the available data types, formats and schemas that software may use can be found at FAIRsharing.org.
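To illustrate how machine-readable format metadata could let a workflow engine check whether two tools can be connected, the following minimal sketch compares the declared output formats of one step with the declared input formats of the next. The `ToolDescription` schema, the tool names and the format labels are all hypothetical placeholders; in practice the labels would come from a controlled vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDescription:
    """Minimal description of one workflow step (hypothetical schema)."""
    name: str
    input_formats: frozenset   # formats the tool can consume
    output_formats: frozenset  # formats the tool can produce


def shared_formats(producer, consumer):
    """Return the formats through which producer output can feed consumer input."""
    return producer.output_formats & consumer.input_formats


def check_chain(tools):
    """Return (producer, consumer) name pairs of adjacent steps with no common format."""
    broken = []
    for producer, consumer in zip(tools, tools[1:]):
        if not shared_formats(producer, consumer):
            broken.append((producer.name, consumer.name))
    return broken


# Hypothetical tools and placeholder format labels, for illustration only.
aligner = ToolDescription("aligner", frozenset({"fastq"}), frozenset({"sam", "bam"}))
caller = ToolDescription("variant_caller", frozenset({"bam"}), frozenset({"vcf"}))
plotter = ToolDescription("plotter", frozenset({"csv"}), frozenset({"png"}))

print(check_chain([aligner, caller, plotter]))
```

A check of this kind only works if every tool declares its formats with the same vocabulary, which is the point made by the rephrased I1.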
Thus, we propose to reinterpret and extend I2 by splitting it into two sub-principles to account for such differences:
I2S.1 “Software and its associated metadata are formally described using controlled vocabularies that follow the FAIR principles.”
I2S.2 “Software uses and produces data types and formats that are formally described using controlled vocabularies that follow the FAIR principles.”
I3 aims to interconnect data sets for better use by using semantically meaningful relationships across data sets. This approach is useful to prevent silos and to facilitate machine interpretability of existing relationships between data sets. However, such relationships are context- and domain-dependent and therefore difficult to translate for research software. Software dependencies are the closest item to this FAIR principle. Despite all the complexity associated with software dependencies, there is no semantically meaningful information about them. This leads us to propose a new FAIR principle I4S for research software.
When considering research software, dependencies are a key element for building working software. Building software usually requires a number of additional modules, libraries and/or other research software that are not included in the original software distribution (figure 1). The latter case refers to software workflows. Unfortunately, such dependencies include not only those modules directly used within the software, but also dependencies that may arise from additional libraries used by the imported modules. This often produces a complex network of interconnected modules that can prevent the software from being built. Fortunately, the present tendency to package software and its dependencies, either in virtual environments or software containers, alleviates the practical concerns for the final user, but simply moves the issue to the generation of those packages. Software deployment systems (PyPI, Conda, CRAN, …) provide solutions for this, and this information can be aggregated by services such as Libraries.io. In order to address this principle, software dependencies need to be clearly documented in a formal, accessible, machine-readable, and shared way, and formally described in the language-specific format.
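The way indirect dependencies enlarge the set of objects needed to build a piece of software can be sketched with a small graph traversal. The package names and the `DIRECT_DEPS` mapping below are hypothetical; a real tool would read this information from the language-specific dependency declarations mentioned above.

```python
def transitive_dependencies(package, direct_deps):
    """Return every package reachable from `package`, excluding `package` itself.

    `direct_deps` maps a package name to the list of packages it uses directly.
    """
    seen = set()
    stack = list(direct_deps.get(package, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        stack.extend(direct_deps.get(dep, []))
    seen.discard(package)  # a cycle back to the starting package is not a dependency
    return seen


# Hypothetical dependency declarations, for illustration only.
DIRECT_DEPS = {
    "my-analysis": ["plotlib", "statslib"],
    "plotlib": ["arraylib"],
    "statslib": ["arraylib", "linalg"],
    "arraylib": [],
}

all_deps = transitive_dependencies("my-analysis", DIRECT_DEPS)
indirect_only = all_deps - set(DIRECT_DEPS["my-analysis"])
print(sorted(all_deps), sorted(indirect_only))
```

In this toy example two of the four required packages never appear in the declarations of `my-analysis` itself, which is why the full set has to be documented in a machine-readable way.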
Reusability in the context of software has many dimensions. At its core, reusability aims for someone to be able to re-use software reproducibly, as described by Benureau and Rougier (2018). The context of this usage can vary and should cover different scenarios: (i) reproducing the same outputs reported by the research supported by the software, (ii) (re)using the code with data other than the test data provided to obtain compatible outputs, (iii) (re)using the software for additional cases other than those stated as supported, or (iv) extending the software in order to add to its functionality.
Software reusability depends to a high degree on software maintainability (see also Section Software is not data), including having proper documentation at various levels of detail. The legal framework, e.g., software licenses, is also important in terms of reusability as it establishes how software can be built, modified, used, accessed and distributed. Furthermore, as research software is an integral part of the scientific process, credit attribution (citation) is another important aspect to consider with regard to (re)usability.
Licenses are useful to protect intellectual property. Software licenses let others know what they are allowed to do, e.g., use the software for free with their own data, and how they are restricted, e.g., do not modify or redistribute it. Without a license, others cannot legally use software in any way as the usage rules are not even defined. Metadata should have separate data usage licenses. Licenses for metadata are mainly needed to establish mechanisms on how the software is referenced via its metadata by third parties. A clear example is the indexing of software metadata by registries.
Proper management of software licenses is a challenging task considering the multi-faceted nature of research software, which is often the product of combining libraries, modules, and execution environments with the software itself. The legal implications of misusing software by not considering dependencies and incompatibilities between the associated licenses can be severe. Therefore, it is necessary that licenses for research software are included as part of the available documentation and are structured to facilitate their machine-readability. For example, the Software Package Data Exchange (SPDX) standard (Odence 2010) makes software licenses machine-readable. This is important because licenses go beyond the software itself and have to take into account limitations established by the licenses of all of its dependencies. If every piece of software makes its license information available, it should be possible to automatically detect potential incompatibilities in software usage, as well as to establish at the build stage whether the licenses of software components and their dependencies are compatible.
Metadata usage licenses are independent of software licenses and should be constrained to the software itself. Data usage licenses for metadata establish how the metadata can be consumed by third parties for purposes of indexing, citing and/or referencing software. As there are no dependencies between pieces of software in terms of metadata, there is no need to propagate the data usage licenses among them. Similarly to metadata, if any data is distributed with the software, it should have its own data usage license, in which the terms of use by third parties are clearly stated.
We suggest rephrasing and extending this principle as “Software and its associated metadata have independent, clear and accessible usage licenses compatible with the software dependencies”.
Provenance refers to the origin, source and history of software and its metadata. It is recommended to use well-known provenance vocabularies that are FAIR themselves, for instance PROV-O. Some elements are commonly present in any provenance data, including the person or organization providing the resource and how to contact them, the publication date, the location, and other resources used to produce the one described.
Regarding software provenance, it is possible to use standards to capture how software is being used while transforming a given dataset. To guarantee software provenance independently of the data, the specific software version is the minimum required information. Software versions refer to specific algorithmic implementations, which might change over time and/or be included or removed between major software releases. This aspect connects with principles F1 and F3 on identifying specific software versions and their associated metadata. Software provenance also incorporates aspects of how the software is produced. For executable compiled software specifically, provenance should include information on how the software has been compiled and which dependencies it incorporates. This complements the newly proposed interoperability principle I4S.
Furthermore, information on how to cite software and how to contribute to it is related to provenance, as it identifies the people involved in creating the software. Citation information should be included in the metadata, since it makes it easier for others (re)using the software to acknowledge the developers. Although there is currently no standard way to cite software, the Software Sustainability Institute provides more information and discussions on this topic, and there are guidelines developed for particular domains, e.g. earth sciences and mathematics, as well as generic guidelines defined by the FORCE11 Software Citation Working Group based on the Software Citation Principles.
We suggest rephrasing this principle as “Software metadata include detailed provenance information”.
Community standards are important as they provide guidelines, which might later influence a certification process, on the minimum that is needed for research software. Given the multi-faceted nature of software and its strong dependencies on other pieces of software, non-compliance with community standards might render software impossible to reuse. Indeed, non-compliance with standards will also prevent research software from being integrated within other applications.
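An automated check at the build stage could look roughly like the sketch below, which assumes that each dependency declares its license with an SPDX identifier. The `ALLOWED` policy table and the dependency names are purely illustrative assumptions: they are neither legal advice nor a complete compatibility matrix, and a real project would need to supply its own policy.

```python
def license_issues(project_license, dependency_licenses, allowed):
    """List problems found when combining dependencies into one project.

    dependency_licenses: maps a dependency name to its SPDX identifier,
        or to None when no license is declared.
    allowed: maps a project license to the set of dependency licenses
        that the project's own policy accepts for it.
    """
    issues = []
    accepted = allowed.get(project_license, set())
    for name, spdx_id in sorted(dependency_licenses.items()):
        if spdx_id is None:
            issues.append(f"{name}: no license declared")
        elif spdx_id not in accepted:
            issues.append(f"{name}: {spdx_id} not accepted for {project_license}")
    return issues


# Illustrative policy only; not legal advice and not a complete matrix.
ALLOWED = {
    "MIT": {"MIT", "BSD-3-Clause"},
    "GPL-3.0-only": {"MIT", "BSD-3-Clause", "GPL-3.0-only"},
}

# Hypothetical dependencies of a hypothetical project.
DEPENDENCIES = {"parser": "MIT", "solver": "GPL-3.0-only", "helper": None}

for issue in license_issues("MIT", DEPENDENCIES, ALLOWED):
    print(issue)
```

The sketch also flags dependencies without any declared license, since missing license information makes an automated decision impossible.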
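As a rough illustration of what such a provenance record might contain, the sketch below assembles a small JSON-LD document that uses PROV-O classes and properties (`prov:SoftwareAgent`, `prov:Activity`, `prov:Entity`, `prov:wasAssociatedWith`, `prov:wasGeneratedBy`, `prov:startedAtTime`). The `ex:` terms for version and build information, the identifiers and all values are hypothetical placeholders rather than part of any standard.

```python
import json

PROV = "http://www.w3.org/ns/prov#"


def provenance_record(output_id, run_id, software_id, version, build_info, started_at):
    """Describe which software version produced an output, as a JSON-LD style dict."""
    return {
        "@context": {"prov": PROV, "ex": "https://example.org/terms#"},
        "@graph": [
            {
                "@id": software_id,
                "@type": "prov:SoftwareAgent",
                "ex:version": version,   # hypothetical term, not defined by PROV-O
                "ex:build": build_info,  # hypothetical term, not defined by PROV-O
            },
            {
                "@id": run_id,
                "@type": "prov:Activity",
                "prov:wasAssociatedWith": {"@id": software_id},
                "prov:startedAtTime": started_at,
            },
            {
                "@id": output_id,
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": run_id},
            },
        ],
    }


# Hypothetical identifiers and values, for illustration only.
record = provenance_record(
    output_id="https://example.org/results/42",
    run_id="https://example.org/runs/7",
    software_id="https://example.org/software/mytool",
    version="1.2.0",
    build_info={"compiler": "example-cc 9.1", "dependencies": ["arraylib 2.0"]},
    started_at="2019-08-16T12:00:00Z",
)
print(json.dumps(record, indent=2))
```

Recording the version and build details on the software agent keeps the software provenance usable even when the data it processed are not available.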
Metadata describing software should use community-agreed descriptors, for example, those provided by CodeMeta (CodeMeta community 2019) or the Bioschemas Tool profile (Bioschemas community 2018). Developers should find out which one is more widely accepted within their community and also include software citation information as part of the metadata.
Software documentation should include information on how to install, run and use the software. In order to make it easier for users, it should include examples with inputs and expected outputs. Additionally, any dependency should be clearly stated as it contributes not only to (re)usability but also to accessibility and interoperability.
More work in this area is still required on the minimum recommended metadata needed for software. Initiatives such as CodeMeta, Bioschemas and the RDA Research Metadata Schemas Working Group have made preliminary recommendations but broader work, including mapping across different vocabularies, is still needed.
We suggest rephrasing this principle as “Software metadata and documentation meet domain-relevant community standards”.
Following the previous sections, we propose to rephrase and extend the R1 FAIR principle as “Software metadata are richly described with a plurality of accurate and relevant attributes”. It captures the similarity between data and software as digital research objects and also the specific aspects of software as a composite, live digital object.
Software has become an essential constituent of scientific research. It is therefore desirable to apply the FAIR Guiding Principles, which have so far mostly been interpreted as principles for scientific data management and stewardship, also to research software. As we have discussed in this work, many of the FAIR principles can be directly applied to research software, where software and data can be treated as the same kind of digital research objects. However, when specific characteristics of software are involved, such as its executability, composite nature, and continuous evolution accompanied by frequent versioning, it is necessary to introduce additional principles. Furthermore, it can be argued that considerations about the functionality of software (as opposed to the form in which it is provided) are by definition out of the scope of FAIR, and thus need to be addressed by other guiding principles, for example based on best practices for (research) software development.
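A software description using CodeMeta terms might look like the following sketch, written here as a Python dictionary that can be serialised to a `codemeta.json` file. The property names (`name`, `version`, `license`, `codeRepository`, `programmingLanguage`, `softwareRequirements`, `referencePublication`) are CodeMeta terms, whereas the tool, its values and the `RECOMMENDED` list used for the completeness check are assumptions made for illustration, not requirements of CodeMeta.

```python
import json

# Hypothetical tool described with CodeMeta terms; all values are placeholders.
record = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "mytool",
    "version": "1.2.0",
    "license": "https://spdx.org/licenses/MIT",
    "codeRepository": "https://example.org/mytool.git",
    "programmingLanguage": "Python",
    "softwareRequirements": ["arraylib 2.0", "statslib 0.9"],
    "referencePublication": "https://doi.org/10.0000/example",
}

# Fields that this sketch treats as a minimum; a community would define its own list.
RECOMMENDED = ("name", "version", "license", "softwareRequirements", "referencePublication")


def missing_fields(metadata, recommended=RECOMMENDED):
    """Return the recommended fields that are absent or empty in `metadata`."""
    return [field for field in recommended if not metadata.get(field)]


print(json.dumps(record, indent=2))
print(missing_fields(record))
```

A check such as `missing_fields` is one simple way in which a registry or a continuous integration job could verify that a deposit carries the descriptors a community has agreed on.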
This work aims to become the starting point for further community-led discussions and proposals on how to effectively apply FAIR principles to research software, and eventually the development of specific FAIR principles for research software. In addition to the work on the principles, the development of community-specific metadata schemes for software has to play an important role, as defined metadata standards are key to the successful application of many of the principles. There are groups within the wider research software community beginning to address these issues. For example, recently the Software Source Code Identification Working Group has been initiated in the scope of the Research Data Alliance (RDA) and FORCE11 to produce an initial collection of software identification use cases and corresponding schemas as well as to give an overview of the different contexts in which software artifact identification is relevant. Results from this working group can assist in the definition of principles related to software annotation.
Another important aspect discussed during the work on this paper is the need for a governance model for the FAIR principles. A governance model is crucial to enable an open and transparent process for updating the FAIR principles and should be defined in the scope of the community discussions for each of the domains where they are applied, e.g., research data, workflows, software, etc.
Finally, the aim of this work is to set the foundations to develop metrics and associated maturity models that can ultimately inform software users and developers how FAIR their software is. Making software FAIR comes with a cost due to the required efforts. Hence, software developed to be used by others, such as libraries, can be expected to reach a higher degree of FAIRness than software that has not been implemented with reuse as a primary goal, for example a script that has been created as a side-effect of demonstrating an algorithm. Based on FAIR software metrics, communities will be able to agree on degrees of FAIRness that the different kinds of software should comply with, in order to reflect their Open Science ideals.
We are grateful to the numerous people who contributed to the discussions around FAIR research software at different occasions preceding the work on this paper. Making no claims to completeness, these include Michel Dumontier, Chris Erdmann, Rafael C. Jimenez, Mark Wilkinson and Amrapali Zaveri. We also thank the Japan BioHackathon for sponsoring a FAIR software related project for the 2018 edition. Furthermore, we would like to thank Stian Soiland-Reyes for his valuable comments on earlier versions of this manuscript.
NCH and CAG were supported by EP/N006410/1 and EP/S021779/1 for the UK Software Sustainability Institute. AV, JLG, SCG and CAG were supported by ELIXIR-EXCELERATE 676559. EM, AV, JLG and SCG have been additionally supported by PT17/0009/0001. EM and SCG have been additionally supported by IMI2 FAIRplus 802750.
Event | Publications |
“FAIR principles for Software” at 2019 Workshop on Sustainable Software Sustainability (WOSSS19) | |
“FAIR Software” Birds of a Feather meeting at deRSE 2019 | |
Top 10 FAIR Data & Software Global Sprint, including “10 easy things to make your software FAIR” | |
“Sharing Your Software - What is FAIR?” at the 2018 American Geophysical Union (AGU) Fall Meeting | |
“FAIRness assessment for software” at the ELIXIR 2018 BioHackathon | |
“Making Software FAIR” at the DTL Communities@Work 2018 Conference | |
TIB Training workshops on FAIR Data and Software | |
“Applying FAIR Principles to Software” at the 2017 Workshop on Sustainable Software Sustainability (WOSSS17) | |
CodeMeta Workshop 2016 on The Future of Software Metadata | |
Abramatic, Jean-François, Roberto Di Cosmo, and Stefano Zacchiroli. 2018. 'Building the Universal Archive of Source Code.' Communications of the ACM 61 (10): 29–31. https://doi.org/10.1145/3183558.
Aerts, Patrick J.C., Cees Hof, Shoaib Sufi, and Carlos Martinez-Ortiz. 2019. 'Sustainable Software Sustainability - Workshop Report.' DANS, SSI, Netherlands eScience Center. [In preparation]
Aerts, Patrick J.C. 2017. 'Sustainable Software Sustainability - Workshop Report.' DANS. https://doi.org/10.17026/dans-xfe-rn2w.
Allen, Robert, and David Hartland. 2018. 'FAIR in Practice - Jisc Report on the Findable Accessible Interoperable and Reuseable Data Principles.' Zenodo, May. https://doi.org/10.5281/zenodo.1245568.
Anaconda Inc. 2017. 'Conda Documentation.' Conda. 2017. https://docs.conda.io/en/latest/.
ASCL. n.d. 'Astrophysics Source Code Library.' ASCL.net. Accessed August 16, 2019. http://ascl.net.
Atkinson, Malcolm, Sandra Gesing, Johan Montagnat, and Ian Taylor. 2017. 'Scientific Workflows: Past, Present and Future.' Future Generation Computer Systems 75: 216–27. https://doi.org/10.1016/j.future.2017.05.041.
Benureau, Fabien C. Y., and Nicolas P. Rougier. 2018. 'Re-Run, Repeat, Reproduce, Reuse, Replicate: Transforming Code into Scientific Contributions.' Frontiers in Neuroinformatics 11. https://doi.org/10.3389/fninf.2017.00069.
Bioconductor. n.d. 'Bioconductor - Home.' Bioconductor. Accessed August 16, 2019. https://www.bioconductor.org/.
Bioschemas community. n.d. 'Bioschemas - Tools.' Bioschemas. Accessed August 16, 2019. https://bioschemas.org/specifications/Tool/.
bio-tools. 2019. 'Bio-tools/biotoolsSchema.' GitHub Bio-tools/biotoolsSchema. August 9, 2019. https://github.com/bio-tools/biotoolsSchema.
Capella-Gutierrez, Salvador, Diana de la Iglesia, Juergen Haas, Analia Lourenco, José María Fernández, Dmitry Repchevsky, Christophe Dessimoz, et al. 2017. 'Lessons Learned: Recommendations for Establishing Critical Periodic Scientific Benchmarking.' Bioinformatics. bioRxiv. https://www.biorxiv.org/content/10.1101/181677v1.full.
Chue Hong, Neil, and Daniel S. Katz. 2018. 'FAIR Enough? Can We (Already) Benefit from Applying the FAIR Data Principles to Software?' figshare. https://doi.org/10.6084/m9.figshare.7449239.v2.
CLARIN-NL. 2019. 'CLARIN NL Resource List.' CLARIN-NL. 2019. https://dev.clarin.nl/clarin-resource-list-fs.
codemeta. n.d. 'CodeMeta Project User Guide.' The CodeMeta Project. Accessed August 16, 2019. https://codemeta.github.io/user-guide/.
Common Workflow Language. 2016. 'Common Workflow Language, v1.0.' figshare. July 8, 2016. https://doi.org/10.6084/m9.figshare.3115156.v2.
Contributors to Wikimedia projects. 2012. 'Persistent Identifier - Wikipedia.' Wikimedia Foundation, Inc. March 21, 2012. https://en.wikipedia.org/wiki/Persistent_identifier.
CRAN. n.d. 'The Comprehensive R Archive Network.' CRAN. Accessed August 16, 2019. https://cran.r-project.org/.
'Dockstore.' n.d. Accessed August 16, 2019. https://dockstore.org/.
Doorn, Peter. Mar 7-9, 2017. 'Does It Make Sense to Apply the FAIR Data Principles to Software?' SlidePlayer. Mar 7-9, 2017. https://slideplayer.com/slide/12849777/.
Erdmann, Christopher, Natasha Simons, Reid Otsuji, et al. 2019. 'Top 10 FAIR Data & Software Things.' Zenodo. http://doi.org/10.5281/zenodo.2555498.
Ferguson, Christine, Jo McEntyre, Vasily Bunakov, Simon Lambert, Stephanie van der Sandt, Rachael Kotarski, Sarah Stewart, et al. 2018. 'D3.1 Survey of Current PID Services Landscape.' Zenodo. https://doi.org/10.5281/zenodo.1324296.
Gil, Yolanda, and Varun Ratnakar. 2015. 'OntoSoft: Capturing Scientific Software Metadata.' http://www.ontosoft.org/.
GitHub. 2016. 'Making Your Code Citable.' GitHub Guides. October 2016. https://guides.github.com/activities/citable-code/.
Goble, Carole, Sarah Cohen-Boulakia, Stian Soiland-Reyes, Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, and Daniel Schober. 2019. 'FAIR Computational Workflows.' Zenodo, July. https://doi.org/10.5281/zenodo.3268653.
GO FAIR. n.d. 'What Is the Difference between ‘FAIR Data’ and ‘Open Data’ If There Is One? - GO FAIR.' GO FAIR. Accessed August 16, 2019. https://www.go-fair.org/faq/ask-question-difference-fair-data-open-data/.
Hausman, Jessica, Shelley Stall, James Gallagher, and Mingfang Wu. 2019. 'Software and Services Citation Guidelines and Examples Ver 1.' ESIP. https://doi.org/10.6084/m9.figshare.7640426.v4.
Hettrick, Simon, Mario Antonioletti, Les Carr, Neil Chue Hong, Stephen Crouch, David De Roure, Iain Emsley, et al. 2014. 'UK Research Software Survey 2014.' Zenodo, December. https://doi.org/10.5281/zenodo.14809.
Higman, Rosie, Daniel Bangert, and Sarah Jones. 2019. 'Three Camps, One Destination: The Intersections of Research Data Management, FAIR and Open.' Insights 32 (1): 18. https://doi.org/10.1629/uksg.468.
Heaton, Dustin, and Jeffrey C. Carver. 2015. 'Claims about the Use of Software Engineering Practices in Science: A Systematic Literature Review.' Information and Software Technology 67: 207–19. https://doi.org/10.1016/j.infsof.2015.07.011.
HUBzero. 2019. 'HUBzero | Home.' HUBzero. 2019. https://hubzero.org.
Ison, Jon, Hans Ienasescu, Piotr Chmura, Emil Rydza, Hervé Ménager, Matúš Kalaš, Veit Schwämmle, et al. 2019. 'The Bio.tools Registry of Software Tools and Data Resources for the Life Sciences.' Genome Biology 20 (1): 1–4. https://doi.org/10.1186/s13059-019-1772-6.
Ison, Jon, Matúš Kalaš, Inge Jonassen, Dan Bolser, Mahmut Uludag, Hamish McWilliam, James Malone, Rodrigo Lopez, Steve Pettifer, and Peter Rice. 2013. 'EDAM: An Ontology of Bioinformatics Operations, Types of Data and Identifiers, Topics and Formats.' Bioinformatics 29 (10): 1325–32. https://doi.org/10.1093/bioinformatics/btt113.
Jupyter Project and Community. 2019. 'Project Jupyter.' Project Jupyter. July 18, 2019. https://www.jupyter.org.
Kanewala, Upulee, and James M. Bieman. 2014. 'Testing Scientific Software: A Systematic Literature Review.' Information and Software Technology 56 (10): 1219–32. https://doi.org/10.1016/j.infsof.2014.05.006.
Katz, Daniel S., and Neil P. Chue Hong. 2018. 'Software Citation in Theory and Practice.' arXiv.org. https://doi.org/10.1007/978-3-319-96418-8_34.
Katz, Daniel S., Kyle E. Niemeyer, Arfon M. Smith, William L. Anderson, Carl Boettiger, Konrad Hinsen, Rob Hooft, et al. 2016. 'Software vs. Data in the Context of Citation.' e2630v1. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.2630v1.
Khan, F., S. Soiland-Reyes, R.O. Sinnott, A. Lonie, C. Goble, and M.R. Crusoe. 2018. 'Sharing Interoperable Workflow Provenance: A Review of Best Practices and Their Practical Application in CWLProv.' Zenodo, December. https://doi.org/10.5281/zenodo.1966881.
Lebo, T., S. Sahoo, and D. McGuinness. 2013. 'PROV-O: The PROV Ontology.' W3C Recommendation. https://www.w3.org/TR/prov-o/.
Mangul, Serghei, Lana S. Martin, Brian L. Hill, Angela Ka-Mei Lam, Margaret G. Distler, Alex Zelikovsky, Eleazar Eskin, and Jonathan Flint. 2019. 'Systematic Benchmarking of Omics Computational Tools.' Nature Communications 10 (1): 1393. https://doi.org/10.1038/s41467-019-09406-4.
Object Management Group. 2016. 'Automated Source Code Maintainability Measure.' January 2016. https://www.omg.org/spec/ASCMM/1.0/PDF.
'OntoSoft.' n.d. Accessed August 16, 2019. http://www.ontosoft.org/.
'Open WDL.' n.d. Accessed August 16, 2019. http://www.openwdl.org/.
RDA Research Metadata Schemas WG. 2019. 'Research Metadata Schemas WG.' RDA. March 20, 2019. https://www.rd-alliance.org/groups/research-metadata-schemas-wg.
RTD:Directorate-General for Research. 2018. Turning FAIR into Reality : Final Report and Action Plan from the European Commission Expert Group on FAIR Data. Publications Office of the European Union. https://publications.europa.eu/en/publication-detail/-/publication/7769a148-f1f6-11e8-9982-01aa75ed71a1/language-en.
Sansone, Susanna-Assunta, Peter McQuilton, Philippe Rocca-Serra, Alejandra Gonzalez-Beltran, Massimiliano Izzo, Allyson L. Lister, and Milo Thurston. 2019. 'FAIRsharing as a Community Approach to Standards, Repositories and Policies.' Nature Biotechnology 37 (4): 358–67. https://doi.org/10.1038/s41587-019-0080-8.
Smith, Arfon M., Daniel S. Katz, and Kyle E. Niemeyer. 2016. 'Software Citation Principles.' PeerJ Computer Science 2 (September): e86. https://doi.org/10.7717/peerj-cs.86.
Software Heritage. n.d. 'The Software Heritage Archive.' Software Heritage. Accessed August 16, 2019. https://www.softwareheritage.org/archive/.
Software Source Code Identification WG. 2018. 'Software Source Code Identification WG.' RDA. June 14, 2018. https://rd-alliance.org/groups/software-source-code-identification-wg.
swMATH. n.d. 'An Information Service for Mathematical Software.' swMATH. Accessed August 16, 2019. https://swmath.org/.
'The Software Ontology.' n.d. Accessed August 16, 2019. http://theswo.sourceforge.net/.
Veiga Leprevost, Felipe da, Björn A. Grüning, Saulo Alves Aflitos, Hannes L. Röst, Julian Uszkoreit, Harald Barsnes, Marc Vaudel, et al. 2017. 'BioContainers: An Open-Source and Community-Driven Framework for Software Standardization.' Bioinformatics 33 (16): 2580–82. https://doi.org/10.1093/bioinformatics/btx192.
Warehouse Project. n.d. 'PyPI · The Python Package Index.' PyPI. Accessed August 16, 2019. https://pypi.org/.
Wickham, Hadley. n.d. 'Package Metadata.' R Packages. Accessed August 16, 2019. http://r-pkgs.had.co.nz/description.html.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. 'The FAIR Guiding Principles for Scientific Data Management and Stewardship.' Scientific Data 3 (March): 160018. https://doi.org/10.1038/sdata.2016.18.
Wilkinson, Mark D., Ruben Verborgh, Luiz Olavo Bonino da Silva Santos, Tim Clark, Morris A. Swertz, Fleur D. L. Kelpin, Alasdair J. G. Gray, et al. 2017. 'Interoperability and FAIRness through a Novel Combination of Web Technologies.' PeerJ Computer Science 3 (April): e110. https://doi.org/10.7717/peerj-cs.110.
Wroe, C., C. Goble, M. Greenwood, P. Lord, S. Miles, J. Papay, T. Payne, and L. Moreau. 2004. 'Automating Experiments Using Semantic Data in a Bioinformatics Grid.' IEEE Intelligent Systems 19 (1): 48–55. https://doi.org/10.1109/MIS.2004.1265885.
Zenodo. n.d. 'Frequently Asked Questions | DOI Versioning.' Zenodo. Accessed August 16, 2019. https://help.zenodo.org/.