Given the complexity of data exploration, provenance management is a key component for guaranteeing scientific data discovery, reproducibility and the interpretation of results. Provenance management should define a set of metadata capable of capturing the derivation history of any stored data item, including the original raw data sources, the intermediate data products and the steps applied to produce them.
To enable well-defined data provenance across the scientific experiment workflow, from data production to data usage, this task addresses the challenges described below.
The initial challenge is to address data provenance within the material science and climate science communities.
For the material science community, we focused on the broad community behind the NFFA Europe project, where scientific data are collected from more than 150 experimental techniques. We planned to define general procedures providing specific guidelines and recommendations on how to manage the important aspect of data provenance for the NFFA Europe user community and beyond.
Climate research relies on large volumes of data from the modeling and observational climate communities. In this domain, provenance management plays a key role both for numerical end-to-end simulations at the data center level and within the inner data analytics workflows.
Provenance enforcement procedures were identified, which will contribute to the climate data science notebooks of Use Case 2 (Agile FAIR data for environment and earth system communities). The provenance procedures can later be generalised for other domains. In addition, the approach will include a discussion of possible use cases of PID collections in this context, thus providing a link to Use Case 8 (Virtual definition of big datasets at seismological data centres according to RDA recommendations).
To provide in-depth provenance support within the climate processing services, a second-level provenance management layer complying with the W3C PROV standard has been developed, addressing open science challenges (reproducibility, re-usability, etc.) at a finer granularity.
Through EOSC-Pillar, the material science use case benefits from the fact that the NEP user community and the project itself can establish a strong connection with the EOSC network, favouring a continuous exchange between the two communities. Such an exchange is of mutual interest: on one side, it allows the project (NEP) and the data services built within it to be EOSC-compliant; on the other, EOSC benefits from a large and committed user community that can provide useful suggestions and case studies for the overall EOSC implementation.
The climate use cases are built on top of the ENES Climate Analytics Service (ECAS), the server-side compute infrastructure exploited in Use Case 2. ECAS, one of the EOSC-hub thematic services, allows data analysis experiments to be performed on large volumes of multidimensional data through a PID-enabled, server-side and parallel approach. In this way, scientific users can run their experiments and obtain fine-grained provenance information, captured at the second level of a data analytics workflow and based on the W3C PROV specifications. This makes it possible to retrieve the data lineage of an object, including the entire analytics workflow associated with it, which is particularly valuable for data discovery and the FAIR reproducibility principle.
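The structure of such a fine-grained provenance record can be illustrated with the W3C PROV-JSON serialization. The sketch below, using only the Python standard library, builds a minimal document for a single analytics task (input entity, activity, output entity); all identifiers are illustrative, not actual ECAS or Ophidia names, and a real deployment would typically use a dedicated library such as `prov`.

```python
import json

def make_prov_document(input_pid, activity_name, output_pid):
    """Build a minimal W3C PROV-JSON document describing one analytics
    task: an input entity used by an activity that generates an output.
    All identifiers are illustrative placeholders."""
    return {
        "prefix": {"ex": "http://example.org/ecas/"},
        "entity": {
            f"ex:{input_pid}": {"prov:label": "input dataset"},
            f"ex:{output_pid}": {"prov:label": "analysis result"},
        },
        "activity": {f"ex:{activity_name}": {}},
        "used": {
            "_:u1": {"prov:activity": f"ex:{activity_name}",
                     "prov:entity": f"ex:{input_pid}"}
        },
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": f"ex:{output_pid}",
                     "prov:activity": f"ex:{activity_name}"}
        },
    }

doc = make_prov_document("cmip5_pr_dataset", "trend_analysis", "pr_trend_map")
print(json.dumps(doc, indent=2))
```

Chaining such records task by task is what allows the full lineage of a final data product to be reconstructed.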
The case study within the material science/nanoscience community concerns the creation of a database to store scientific image metadata and the development of a web service to explore these metadata visually through interactive tools. The objective was to organize the original STM (Scanning Tunneling Microscopy) dataset into a more structured and convenient dataset that allows researchers to reconstruct the provenance of the data. Instrument metadata were extracted for every image in the dataset. The website was then built around the selected metadata fields, which are searchable and presented through graphs. Finally, the website was refined by linking the metadata to the underlying images so that researchers can visualize them directly in the browser without needing custom software. This tool, named STM Explorer, is integrated into the Trieste Advanced Data Services (TriDAS) website.
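The pattern of storing extracted instrument metadata in a searchable database can be sketched as follows. The field names (acquisition date, scan size, bias voltage) are hypothetical examples of typical STM metadata, not the actual STM Explorer schema, and an in-memory SQLite database stands in for the real backing store.

```python
import sqlite3

# Hypothetical subset of instrument metadata extracted from STM images;
# the actual STM Explorer schema is not specified in this document.
records = [
    ("img_001.sxm", "2019-03-12", 100.0, 0.5),
    ("img_002.sxm", "2019-03-12", 50.0, 1.2),
    ("img_003.sxm", "2019-04-02", 100.0, 0.8),
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE stm_images (
    filename     TEXT PRIMARY KEY,
    acquired     TEXT,   -- acquisition date
    scan_size_nm REAL,   -- scan frame size in nanometres
    bias_v       REAL    -- tip-sample bias voltage in volts
)""")
conn.executemany("INSERT INTO stm_images VALUES (?, ?, ?, ?)", records)

# A metadata filter such as the website's search box then maps to a query:
rows = conn.execute(
    "SELECT filename FROM stm_images WHERE scan_size_nm = ? ORDER BY filename",
    (100.0,)).fetchall()
print(rows)  # [('img_001.sxm',), ('img_003.sxm',)]
```

Linking each row back to its image file is then a matter of serving the file referenced by `filename` alongside the query results.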
In the climate science domain, the multi-model Precipitation Trend Analysis (PTA) was selected as a pilot case. It has been implemented as an ECAS analytics workflow and executed on 11 climate models from the CMIP5 experiment, for a total of about 200 tasks. Based on the information tracked by the Ophidia analytics engine about the executed tasks, scientific users running the analytics workflow can retrieve the corresponding provenance documents (complying with the W3C PROV standard) in XML, JSON or graphical format through a Python application developed in this task as an extension to the existing ECAS modules, fully addressing FAIR-oriented provenance management.
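Retrieving the provenance of a workflow output from tracked task information amounts to a backwards traversal over used/wasGeneratedBy relations. The sketch below illustrates this on a toy two-model version of the scenario; the entity and activity names are invented for illustration and do not reflect the actual Ophidia task records.

```python
# Relations a workflow engine might track for each task (illustrative names).
used = {                      # activity -> entities it used
    "trend_model_a": ["pr_model_a"],
    "trend_model_b": ["pr_model_b"],
    "ensemble_mean": ["trend_a", "trend_b"],
}
generated = {                 # entity -> activity that generated it
    "trend_a": "trend_model_a",
    "trend_b": "trend_model_b",
    "pta_result": "ensemble_mean",
}

def lineage(entity):
    """Walk backwards from an entity through the generated/used relations,
    returning every ancestor activity and entity in its derivation history."""
    seen, stack = set(), [entity]
    while stack:
        e = stack.pop()
        act = generated.get(e)
        if act and act not in seen:
            seen.add(act)
            for parent in used.get(act, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
    return seen

print(sorted(lineage("pta_result")))
```

The resulting ancestor set is exactly what a PROV document for `pta_result` would encode, ready to be serialized to XML or JSON or rendered as a graph.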
An additional use case was identified in collaboration with Use Case 2. It focuses on providing simple libraries for use in a JupyterHub environment, enabling scientists to generate W3C-PROV-compliant provenance descriptions that accompany their Jupyter-notebook-based analysis results. The use of pre-defined provenance templates is being evaluated as a way to simplify this process for scientists.
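The template idea, introduced by the Moreau et al. paper cited below, is that a provenance document with `var:`-prefixed placeholders plus a bindings dictionary yields a concrete document. The following is a deliberately simplified illustration of that substitution step over a PROV-JSON-like structure; real templating systems support richer features such as repeated groups.

```python
import json

# A simplified PROV-JSON template: "var:"-prefixed strings are placeholders
# to be filled in per notebook run. Identifiers are illustrative.
template = {
    "entity": {"var:notebook": {}, "var:result": {}},
    "wasDerivedFrom": {
        "_:d1": {"prov:generatedEntity": "var:result",
                 "prov:usedEntity": "var:notebook"}
    },
}

def expand(node, bindings):
    """Recursively replace 'var:' placeholders, whether they occur as
    dictionary keys or as string values, using the bindings dictionary."""
    if isinstance(node, dict):
        return {bindings.get(k, k): expand(v, bindings)
                for k, v in node.items()}
    if isinstance(node, str):
        return bindings.get(node, node)
    return node

bindings = {"var:notebook": "ex:analysis.ipynb", "var:result": "ex:fig1.png"}
doc = expand(template, bindings)
print(json.dumps(doc, indent=2))
```

With such a template, a notebook library only needs to collect the bindings at run time, which is what makes the approach attractive for scientists who should not have to author PROV documents by hand.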
Luc Moreau et al., "A Templating System to Generate Provenance", IEEE Transactions on Software Engineering, https://nms.kcl.ac.uk/luc.moreau/papers/provtemplate-tse17.pdf