Virtual definition of data sets according to RDA recommendations
The management of digital objects remains an area of interest that crosses disciplines, institutions and infrastructures. In this context, building aggregations or collections of such objects has become essential. Research data management practice requires not only describing collections, but also making them actionable by automated processes in order to cope with ever-increasing amounts and volumes of data.
For many data centres, curators and dataset creators, it is difficult to assemble collections if they have to be instantiated by replicating all of their contents in one place. Sometimes this is because of storage limitations, or because some parts of a dataset appear in many collections.
The main limitation of existing solutions for the management of data collections, however, is that they focus on describing collections and their semantics with metadata but do not offer a full set of generic, machine-actionable CRUD operations on them.
A Working Group from RDA prepared specifications for a Research Data Collection system to overcome these issues.
This use case aims to make the Data Collections System, designed within the activities of RDA, generic enough to be adopted by different communities within EOSC.
The system aims to:
- Provide CRUD operations (create, read, update, delete) for aggregations or data collections of different data objects as an essential element in the research data management practice,
- Describe the research data collections in a standardised way to make them actionable by automated processes in order to be able to cope with the increasing amounts and volumes of data.
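As a rough illustration of what generic CRUD operations on collections could look like, the following minimal sketch keeps collections in memory and stores only member identifiers, never the data objects themselves. It is not the API of the RDA specification or of the Data Collections System; the names `Collection`, `CollectionStore` and `add_member` are hypothetical and chosen only for this example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Collection:
    """A collection holds references (identifiers) to members, not their data."""
    id: str
    name: str
    members: List[str] = field(default_factory=list)


class CollectionStore:
    """Hypothetical in-memory store offering create/read/update/delete."""

    def __init__(self) -> None:
        self._collections: Dict[str, Collection] = {}

    def create(self, name: str) -> Collection:
        # Assign a locally generated id; a real service might mint a PID instead.
        collection = Collection(id=str(uuid.uuid4()), name=name)
        self._collections[collection.id] = collection
        return collection

    def read(self, collection_id: str) -> Collection:
        # Raises KeyError if the collection does not exist.
        return self._collections[collection_id]

    def update(self, collection_id: str, name: str) -> Collection:
        collection = self.read(collection_id)
        collection.name = name
        return collection

    def delete(self, collection_id: str) -> None:
        del self._collections[collection_id]

    def add_member(self, collection_id: str, member_id: str) -> None:
        # Members are identifiers only, so the same object can belong
        # to many collections without being copied.
        collection = self.read(collection_id)
        if member_id not in collection.members:
            collection.members.append(member_id)
```

Because a member is just an identifier, adding the same object to several collections costs one short string per collection rather than a copy of the data.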
Benefits through EOSC-Pillar
This use-case relies on – and demonstrates the benefits of – EOSC-Pillar services as follows:
- Build a generic, multi-disciplinary Data Collections System based on the experience and requirements of 13 communities within the RDA Working Group,
- Improve the current system by revisiting its specifications based on the feedback we received from our partners in EOSC-Pillar,
- Allow the interoperation with the Federated FAIR Data Space and other communities involved in the Use Cases,
- Make the system available via containers to communities who would like to run them on their own.
Achievements as of May 2021:
- We revised the requirements from the seismological community for such a system.
- An implementation following this specification had been in use for some time at GEOFON, where more than 6000 collections and 1.5 million members for datasets had been pre-defined by the data centre operators. However, this system was for internal use only.
- After revising the requirements, we modified the system in order to make it completely generic and ready to be tested by other communities.
- One of the best aspects, in terms of the resources needed to put the service into production, is that the members of a collection can be identified by PIDs (DOIs, ePIC), which means that almost no storage space is needed to define them. Within the context of this project, we added the capability to identify resources by URL, making the system independent of a Handle server when some resources (or the collection itself) need to be identified.
- The identification of improvements to the current system by revisiting the RDA specifications is already well advanced.
- A first set of requirements was collected from communities and other use cases of the project, in particular from climatology, taking DKRZ as an example, and from the Federated FAIR Data Space (F2DS).
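To illustrate the point above that members can be referenced by a PID or by a plain URL, the sketch below classifies an identifier string by its shape. This is a heuristic written for this example only, not the logic used by the system: the function name `identifier_kind` is hypothetical, the sample identifiers are made up, and the rule that DOIs start with "10." while other prefix/suffix strings are treated as Handles (such as ePIC PIDs) is a simplifying assumption.

```python
from urllib.parse import urlparse


def identifier_kind(identifier: str) -> str:
    """Return 'url', 'doi' or 'handle' for a member identifier (heuristic)."""
    parsed = urlparse(identifier)
    # A plain web address needs no Handle server to be resolved.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Assumption: DOIs are Handles whose prefix starts with "10.".
    if identifier.startswith("10.") and "/" in identifier:
        return "doi"
    # Assumption: any other prefix/suffix string is treated as a Handle.
    if "/" in identifier:
        return "handle"
    raise ValueError(f"unrecognised identifier: {identifier!r}")
```

For example, `identifier_kind("https://example.org/data/file.mseed")` yields `"url"`, whereas a made-up string such as `"10.1234/example"` is classified as a DOI.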
Next steps:
- Deploy the Data Collection System as a service to open it to more communities,
- Contact other stakeholders to foster usage of the system,
- Implement extensions and improvements to the RDA Recommendations, for instance the automatic export to the Federated FAIR Data Space,
- Put the service into production.