The DEEP Hybrid DataCloud project aims to provide a bridge towards a more flexible exploitation of intensive computing resources by the research community, enabling access to the latest technologies, which also require latest-generation hardware, together with the scalability needed to explore large datasets. This possibility could make a clear difference in terms of innovation for many research teams, enabling them to be competitive with large non-EU teams and companies.
The current e-infrastructure services in production in Europe are based on high-quality, high-performing infrastructure: GÉANT is likely the best academic network in the world, with large, scalable capacity connections and high capillarity across the whole of Europe; the PRACE Tier-0 and Tier-1 centres provide an impressive offer of HPC resources to researchers; EGI nodes gather very large storage capacity and HTC capability; and EUDAT complements the previous services with data management and archiving. Most of these services in production are well suited for a “classical” use by researchers, who enjoy a scheme that they know, in which they are highly efficient, and which they use across many techniques, from simulation to data processing to final statistical analysis. This is particularly true for research teams that handle very large datasets (for example in Astrophysics, Genomics, Meteorology, Earth Observation, and High Energy Physics).
On the other hand, the impact of Cloud computing on research is increasing every day, driven by its flexibility and corresponding economy, its ease of use for simple applications, and its scalability. Additionally, Cloud providers are pushing technology and techniques to the limit, both to attract new clients and because some of them exploit these resources directly; given the scale of their infrastructure, they can easily offer latest-generation hardware and proprietary solutions to exploit it. Up to now the exploration of very large datasets has not been popular due to high costs, both for storage and for data access/transfer, but new offers are reducing this barrier.
We propose to reduce the gap between these two areas with a bridge in both directions:
We want to make intensive computing techniques available in a friendly and flexible way to our researchers, as services of the e-infrastructure that will support the evolution of the EOSC and will include HPC resources, such as the future European Data Infrastructure (EDI). This means we aim to deploy such services, which require access to bare metal, using Cloud-based techniques, in such a way that researchers can access them in a flexible and elastic manner.
We aim to integrate those services, as well as other services supporting intensive computing techniques over specialized hardware from commercial providers, under a Hybrid Cloud approach, so that a research community can scale up from its private cloud resources.
The technical ambition of the project is high. Despite the great sophistication of the functionalities delivered by the services developed in current projects, these are not sufficient to address the requirements of the user communities identified within the DEEP project. The implementation of the basic Hybrid Cloud approach using orchestration services over Open Source cloud management solutions, such as OpenStack and OpenNebula, is already a service in production proposed for the EOSC catalogue of services. However, the required improvement of the networking implementation, and the support for bare metal, require significant developments that are the target of this project. Similarly, the use of containers in HPC systems opens new possibilities for a flexible use of these resources, but significant services still need to be evolved and further integrated, while assuring performance over low-latency interconnects and coordinating the deployment of services with the underlying batch system.
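The coordination between containerized services and the underlying batch system can be illustrated with a minimal sketch. Assuming a Slurm-managed cluster with Singularity available on the compute nodes (both are assumptions for illustration; the partition name, container image, and resource values below are placeholders, not part of any DEEP service), a deployment layer could generate batch scripts along these lines:

```python
# Hypothetical sketch: wrapping a containerized workload for submission
# to an HPC batch system. All names and values below are illustrative.

def build_slurm_script(job_name, image, command,
                       partition="gpu", nodes=1, gpus_per_node=1,
                       time_limit="02:00:00"):
    """Generate a Slurm batch script that runs `command` inside a
    Singularity container, so that the same container image can be used
    on cloud resources and on a bare-metal HPC node."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        # `singularity exec --nv` exposes the host GPU driver inside the
        # container without sacrificing bare-metal performance.
        f"singularity exec --nv {image} {command}",
    ]
    return "\n".join(lines)

script = build_slurm_script(
    job_name="deep-learning-train",
    image="docker://tensorflow/tensorflow:latest-gpu",
    command="python train.py",
)
print(script)
```

The point of the sketch is that the container image, not the host environment, defines the software stack, while resource selection (GPUs, node count, wall time) remains a negotiation with the batch system.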
The competitive advantage of the project is provided by the consortium's ideas, initiatives, and expertise in Cloud services development and deployment, and by its experience in the daily management of large computing and data resources and the corresponding support to user communities.
Another key ingredient in this challenge is specific support for very large datasets: management, transfer, sharing, and efficient access from intensive computing e-infrastructures. Here, the project relies on existing services in production that are becoming part of the EOSC, in particular OneData, as well as B2STAGE/B2SHARE, and will also be open to considering new services developed in other projects supporting the exploitation of very large datasets that may be funded under the EINFRA21 call.
These data management services rely in turn on storage solutions deployed at the data centre scale, including parallel and distributed filesystems such as Ceph, Lustre, GPFS/Spectrum Scale, or parallel NFS. In our project, oriented to intensive computing techniques, we assume that the final performance of the storage solution or the data management layer will not be the limiting factor.
Before turning to a detailed analysis of the state of the art and the proposed advances that the DEEP Hybrid DataCloud project targets, we would like to introduce another major technical challenge.
Another relevant challenge: how to improve the analysis/post-processing of very large datasets produced in simulations in HPC systems?
The capacity of HPC systems to produce data in simulations has increased by orders of magnitude in recent years. However, this potential has not been matched by the implementation of agile technical solutions for the analysis/post-processing of those data.
Post-processing/analysis benefits from access to specialized hardware: in some cases GPUs, and more frequently HPC nodes interconnected over InfiniBand, 10GbE, or Omni-Path. Access to multicore nodes with significant memory, above 1 TB of RAM, is not an uncommon request for data analysis/post-processing when handling large images.
Accessing/transferring those data can be achieved by using well-known protocols/services.
Idea: leverage the developments in containers technology and cloud resource provisioning to define a solution enabling researchers to post-process/analyse their data in a more flexible way.
Could this analysis/post-processing take place in an HPC system that offers intensive computing services in cloud mode? Could this HPC system be, transparently, the same HPC system where the data was generated, and/or, as required, scale as part of a Hybrid DataCloud platform?
Eventually, could this approach promote a more flexible/scalable shared use of the resources?
This is not a new idea, but the challenge lies in the approach proposed for integration under a Hybrid Cloud.