Introduction

Earth Environment Sciences & Geosciences require a wide range and large volume of data from satellites, in-situ observations and climate models. These data are managed and preserved separately in domain-specific repositories operated by national or European infrastructures. Because the data sources are widely distributed, it is difficult to process them together and use them in an integrated way for comprehensive studies at domain or cross-domain level.

In this context, the goal is to offer services that facilitate and speed up access to the distributed data sources, and to provide a web-based processing environment for Data Science notebooks supporting the Pangeo community platform. This use case also aims to enhance data discovery and data access by relying on existing services provided by the Earth Sciences communities, such as the Research Infrastructure Data Terra in France and, in Europe, the consortium of Environment Research Infrastructures (ENVRI), the climate community, and beyond.

Challenges addressed

The web-based processing environment proposed by the use case must address the needs of two types of users of Data Science notebooks working on Earth environment & Geosciences data:

  • Scientists, who want to run ready-to-use data processing “Pangeo ecosystem-based” scripts on their time periods and geographical areas of interest;
  • Data analysts, who want to develop, test and run their own data processing “Pangeo ecosystem-based” scripts adapted to their needs.
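As a rough illustration of what a ready-to-use script parameterised by a geographical area and a time period might look like, here is a minimal sketch using only the Python standard library. The `Observation` record, the `subset` and `mean_value` helpers and the sample values are all hypothetical and are not taken from the use case; in an actual Pangeo notebook this kind of selection would more likely be done with label-based indexing on a data cube (for example with xarray).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One hypothetical gridded value: position, date and measurement."""
    lat: float
    lon: float
    day: date
    value: float


def subset(observations, lat_range, lon_range, period):
    """Keep observations inside the bounding box and the inclusive period."""
    (lat_min, lat_max), (lon_min, lon_max) = lat_range, lon_range
    start, end = period
    return [
        obs for obs in observations
        if lat_min <= obs.lat <= lat_max
        and lon_min <= obs.lon <= lon_max
        and start <= obs.day <= end
    ]


def mean_value(observations):
    """Average the measurement over the selected observations."""
    if not observations:
        raise ValueError("no observation in the requested area and period")
    return sum(obs.value for obs in observations) / len(observations)


# Made-up sample data standing in for a remote data collection.
SAMPLE = [
    Observation(43.6, 1.4, date(2021, 6, 1), 21.0),
    Observation(43.6, 1.4, date(2021, 7, 1), 25.0),
    Observation(48.9, 2.3, date(2021, 6, 1), 18.0),
    Observation(52.5, 13.4, date(2021, 6, 1), 17.0),
]

# The area and period of interest are the only inputs a scientist changes.
selection = subset(
    SAMPLE,
    lat_range=(42.0, 50.0),
    lon_range=(0.0, 5.0),
    period=(date(2021, 6, 1), date(2021, 6, 30)),
)
print(len(selection), mean_value(selection))
```

A scientist would only edit the area and period arguments, whereas a data analyst would replace `mean_value` with their own processing step.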

The first challenge is to design and implement the IT infrastructure (services and resources) that underlies this virtual environment and that speeds up and facilitates access to data, taking into account the specific characteristics of the data sources, i.e. large volumes of data from distributed repositories.

The second challenge is to enable users to easily discover cross-domain data collections and services, even though these resources are distributed and described with different metadata schemas and ontologies.
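One common way to approach discovery across heterogeneous metadata is to map each repository's fields onto a shared vocabulary before searching. The sketch below illustrates that idea only; the repository names, field names, `FIELD_MAPS` table and sample records are hypothetical and do not describe the actual EOSC-Pillar catalogue or its metadata model.

```python
# Hypothetical mappings from each repository's field names to common ones.
FIELD_MAPS = {
    "repo_a": {"title": "title", "keywords": "keywords", "domain": "domain"},
    "repo_b": {"name": "title", "tags": "keywords", "discipline": "domain"},
}


def harmonise(record, source):
    """Translate one source record into the common vocabulary."""
    mapping = FIELD_MAPS[source]
    common = {
        target: record[field]
        for field, target in mapping.items()
        if field in record
    }
    common["source"] = source  # keep track of where the record came from
    return common


def search(catalogue, keyword):
    """Return harmonised records tagged with the keyword (case-insensitive)."""
    wanted = keyword.lower()
    return [
        record for record in catalogue
        if wanted in (k.lower() for k in record.get("keywords", []))
    ]


# Made-up records as they might be exposed by two different repositories.
RAW_RECORDS = [
    ("repo_a", {"title": "Sea surface temperature",
                "keywords": ["ocean", "temperature"], "domain": "marine"}),
    ("repo_b", {"name": "Air temperature reanalysis",
                "tags": ["Temperature", "atmosphere"], "discipline": "climate"}),
    ("repo_b", {"name": "Soil moisture",
                "tags": ["land"], "discipline": "solid earth"}),
]

catalogue = [harmonise(record, source) for source, record in RAW_RECORDS]
hits = search(catalogue, "temperature")
print([(hit["source"], hit["title"]) for hit in hits])
```

Real catalogues usually rely on richer crosswalks between metadata standards and on ontology alignment rather than a flat field table; the flat mapping is kept here only to make the principle visible.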

Benefits through EOSC-Pillar

The use case is built on top of IT services and resources provided by the national infrastructures involved in the project:

  • HPC and cloud computing resources from use case and Work Package 7 partners
  • iRODS services from EOSC-Pillar partners, to ease and speed up data access from the web-based processing environment by synchronizing data from the distributed data repositories to the computing resources
  • D4Science Virtual Research Environment to run Data Science Notebooks
  • Software Heritage solution from the dedicated EOSC-Pillar use case, to archive and sustain the source code of this use case's Data Science notebooks
  • Connection to Work Package 5 data layer services to demonstrate the cross-domain metadata catalogue that enables data discovery. The first scenario will connect to D4Science; further scenarios are expected to connect to the Federated FAIR Data Space.

EOSC-Pillar also gives the opportunity to test cross-domain and transnational interoperability, as this use case gathers multi-domain data repositories from France, Germany and Italy.

    Highlights

      The overarching goal is to demonstrate what EOSC could offer to the Earth environment & Geosciences communities, i.e. a cloud platform to run big data analysis in Europe as an alternative to using large private providers for storage and computing. To this end, the use case focuses on Pangeo, a community platform for big data (geo)science aimed at Python script developers, which (1) fosters collaboration around the open-source Scientific Python ecosystem and (2) involves many relevant technologies: HPC, containers, notebooks, advanced data structures (“Data Cubes”) for efficient access, and remote access to data.

      Because the use case is ambitious to design and implement, the partners chose to split it into three sub-use cases, in order of priority:

      • Data Science Notebooks, to offer a web-based processing environment for scientists and data analysts within the ecosystem of Pangeo
      • Data services, to speed up and facilitate access to data in repositories from several domains
      • Discovery services, to provide a cross-domain catalogue

      In addition, this use case aims to take the proof of concept as far as possible within the EOSC-Pillar project, in collaboration with the technical Work Packages 7 and 5.

      Demonstration videos