Introduction

Leveraging the experience of Software Heritage, this task aims to design and pilot a solution for preserving massive collections of software source code (billions of source code files with links to publications) in the EOSC eTDR service (European Trusted Digital Repository). More specifically, as part of this use case EOSC-Pillar will:

  • develop an API that allows platforms such as open access paper archives to deposit research software, in source code form, into Software Heritage,
  • develop an API that allows anyone to retrieve and access archived source code artifacts,
  • standardize a schema of persistent identifiers (PIDs) that can reference billions of source code artifacts,
  • integrate the above APIs and Services with EOSC eTDR,
  • develop a pilot to fully replicate the software archive onto existing EOSC infrastructure.

Challenges addressed

A large part of the technical and scientific knowledge that is being developed today resides in software. The preservation of this universal body of knowledge has become as essential as preserving research articles and data sets. Software preservation is a pillar of reproducibility, because software is used in essential ways during all phases of research in all fields of science. To be able to reproduce an experiment, knowing the exact version of the software used is essential.

Software Heritage will ensure availability and traceability of software, providing the missing vertex in the triangle of scientific preservation, together with open data and open access.

The first challenge addressed is to offer the research community a way to persistently and uniquely identify any piece of software source code. The second is to allow EOSC members to archive source code artifacts for the long term, thus supporting reproducibility.

Long-term preservation guarantees will be achieved by replication across the main Software Heritage archive and its network of mirrors, and by depositing a copy of the Software Heritage Archive in the CINES long-term archiving solution (Vitam).

Benefits through EOSC-Pillar

Thanks to EOSC-Pillar, users of the archival services offered by Software Heritage will benefit from integration with other common services and infrastructure offered by EOSC-Pillar.

One example of this is authentication and authorization (AAI). There will be no need to create dedicated accounts on the Software Heritage infrastructure to access the provided APIs: any identity provider integrated into EOSC-Pillar could be used.

Second, thanks to EOSC-Pillar, Software Heritage gained access to collaboration opportunities with partners interested in replicating the archive on their infrastructure, such as CINES with Vitam. In the future, other interested partners will be able to do the same, building on the experience gained.

In the future, it will also become possible to partner with providers of computing resources, enabling research groups to conduct large-scale experiments on the source code artifacts archived by Software Heritage.

Highlights

Results as of May 2021

  • A persistent and intrinsic identifier schema (SWHID) for software source code artifacts has been specified, and its adoption in the software research community is growing.
  • The Deposit API, which allows partners to deposit their source code artifacts, is available and ready to use. It can be coupled with the AAI integration provided by EOSC-Pillar Work Package 7 for easy integration of the deposit with services provided by EOSC partners.
  • The Web and Vault APIs allow anyone to retrieve a source code artifact from the Software Heritage Archive starting from its SWHID.
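
To illustrate what "intrinsic" means for an identifier, here is a minimal sketch in Python. It assumes the content-object form of SWHID version 1, in which the identifier is `swh:1:cnt:` followed by the SHA-1 of the file bytes prefixed with a Git-style blob header, so the identifier can be recomputed from the file itself. The function names `content_swhid` and `resolve_url` are hypothetical, and the `/api/1/resolve/` path is an assumption about the public Web API, not something stated on this page; check the Software Heritage API documentation before relying on it.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID (version 1, content object) for raw file bytes.

    Assumption: the hash is SHA-1 over a Git-style blob header
    ("blob <length>\\0") followed by the file content.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def resolve_url(swhid: str,
                base: str = "https://archive.softwareheritage.org/api/1") -> str:
    """Build the URL used to look up a SWHID (assumed endpoint layout)."""
    return f"{base}/resolve/{swhid}/"


if __name__ == "__main__":
    swhid = content_swhid(b"print('hello')\n")
    print(swhid)
    print(resolve_url(swhid))
```

Because the identifier is derived from the content alone, two parties holding the same file compute the same SWHID without contacting any registry; the archive is only needed to retrieve the artifact.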

Work in progress

  • Replication of the Software Heritage Archive into CINES’ Vitam archiving service is under development. It will ensure that every source code artifact present in the Software Heritage Archive, including those deposited by EOSC partners, is periodically archived with high reliability and long-term availability.