HPC environments are typically characterized by software and hardware stacks optimized for maximum performance at the cost of flexibility in the OS, system software and hardware configuration. This close-to-metal approach creates a steep learning curve for new users and makes integration with external services, especially cloud-oriented ones, difficult. In exchange for this lost flexibility, users get access to tens of thousands of CPUs and to high performance networks and storage, but with little isolation between jobs and little or no possibility for applications to interact with services outside a particular cluster.
The DEEP-Hybrid-DataCloud project aims at developing a distributed architecture that leverages intensive computing techniques for deep learning. One of the objectives of the project is to promote the integration of existing HPC resources under a Hybrid Cloud approach, so that they can be used on demand by researchers from different communities.
The abstraction offered by our solution simplifies the interaction for end users thanks to the following key features:
- Promoting container technologies for application development, delivery and execution: this approach enables easier application development, integration and delivery following CI/CD practices. It also makes applications portable: they can be deployed and executed on any platform, independently of the OS, libraries and software installed on the host. Such containerized applications can be used on both Cloud and HPC platforms without modification.
- Using a portable container execution tool in user space on HPC platforms: udocker is a basic user tool to execute simple Docker containers in user space without requiring root privileges. It enables non-privileged users to download and execute Docker containers on Linux systems where Docker is not available, including Linux batch systems and interactive clusters managed by other entities, such as grid infrastructures or externally managed batch or interactive systems (a usage sketch follows this list).
- Using standard interfaces to manage different workloads and environments, both Cloud- and HPC-based: the TOSCA language is used to model the jobs, and the PaaS Orchestrator provides a single point of access for the submission of processing requests. The DEEP PaaS layer features advanced federation and scheduling capabilities that ensure transparent access to different IaaS back-ends, including OpenStack, OpenNebula, Amazon Web Services, Microsoft Azure, Apache Mesos, Kubernetes and HPC environments. The user request is expressed in the TOSCA templating language and submitted to the PaaS Orchestrator; depending on the type of request, a specific plugin is activated to dispatch the task to the best compute service (a submission sketch follows this list).
- Adopting a unified AAI throughout the whole stack, from the PaaS down to the data and compute layers: it is implemented by the INDIGO IAM service, which provides federated authentication based on the OpenID Connect/OAuth 2.0 mechanisms. An SSH PAM module has been developed to allow users to log in via SSH using their IAM access token instead of a password. Users are automatically provisioned on the HPC cluster from the list of users registered in IAM and belonging to a specific group, and each IAM user is mapped onto a local account (a token sketch follows this list).
- Providing a REST API gateway for submitting and monitoring jobs from outside the HPC site: QCG-Computing is an open-architecture implementation of a SOAP Web service for multi-user access and policy-based job control on top of the various queuing and batch systems that manage local computational resources. This key QCG service uses the Distributed Resource Management Application API (DRMAA) to communicate with the underlying queuing systems. QCG-Computing has been designed to support a variety of plugins and modules for external communication, and to handle a large number of concurrent requests from external clients and services (a job submission sketch follows this list).
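As referenced above, the following minimal sketch shows the typical udocker workflow, driving the udocker command line from Python via subprocess. The image name `ubuntu:22.04`, the container name `myapp` and the availability of udocker on the user's PATH are assumptions for illustration.

```python
import subprocess

def run(cmd):
    """Echo a command, then execute it, failing loudly on error."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Pull a Docker image from a registry; no root privileges are required.
# "ubuntu:22.04" is a placeholder image name.
run(["udocker", "pull", "ubuntu:22.04"])

# Create a named local container from the pulled image.
run(["udocker", "create", "--name=myapp", "ubuntu:22.04"])

# Execute a command inside the container, entirely in user space.
run(["udocker", "run", "myapp", "cat", "/etc/os-release"])
```

The same three steps (pull, create, run) work unchanged on a batch system login node, which is what makes udocker suitable for externally managed clusters.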
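The sketch below illustrates how a TOSCA-modelled request could be submitted to the PaaS Orchestrator over its REST interface with an IAM bearer token. The Orchestrator URL, the minimal template and the request and response fields (`template`, `parameters`, `uuid`) are assumptions for illustration, not an authoritative description of the Orchestrator API.

```python
import requests

ORCHESTRATOR_URL = "https://paas.example.org/orchestrator"  # assumed endpoint
ACCESS_TOKEN = "<IAM access token>"                          # obtained from IAM

# A deliberately minimal TOSCA template; a real DEEP template would describe
# the container image, resource requirements and the target environment.
tosca_template = """
tosca_definitions_version: tosca_simple_yaml_1_0
topology_template:
  node_templates:
    my_job:
      type: tosca.nodes.SoftwareComponent
"""

resp = requests.post(
    f"{ORCHESTRATOR_URL}/deployments",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"template": tosca_template, "parameters": {}},
)
resp.raise_for_status()
print("Deployment UUID:", resp.json().get("uuid"))
```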
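To make the token-based AAI flow concrete, the next sketch obtains an access token from an assumed IAM token endpoint and then presents it in place of a password when opening an SSH session, as the PAM module described above permits. The IAM URL, the client credentials, the host name and the user name are all placeholders, and the client-credentials grant is used only for brevity; interactive users would typically use the device or authorization-code flow.

```python
import paramiko
import requests

IAM_URL = "https://iam.example.org"  # assumed IAM instance

# Obtain an access token from the IAM token endpoint using the standard
# OAuth 2.0 client-credentials grant (placeholder client id and secret).
token_resp = requests.post(
    f"{IAM_URL}/token",
    data={"grant_type": "client_credentials"},
    auth=("my-client-id", "my-client-secret"),
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]

# With the SSH PAM module in place, the token is presented where the
# password would normally go (host and user name are placeholders).
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("hpc-login.example.org", username="jdoe", password=access_token)
```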
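Finally, a sketch of job submission through a REST gateway such as the one provided around QCG-Computing. The endpoint paths and the job description schema below are hypothetical; they serve only to convey the pattern of token-authenticated submission and status polling from outside the HPC site.

```python
import requests

GATEWAY_URL = "https://qcg.example.org/api"  # hypothetical gateway URL
ACCESS_TOKEN = "<IAM access token>"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# Hypothetical job description: the actual QCG-Computing schema differs;
# here a udocker invocation is wrapped as the batch job payload.
job = {
    "executable": "udocker",
    "arguments": ["run", "myapp", "python", "train.py"],
    "resources": {"nodes": 1, "cores": 4, "walltime": "01:00:00"},
}

# Submit the job through the gateway, which relays it to the local
# queuing system via DRMAA.
resp = requests.post(f"{GATEWAY_URL}/jobs", headers=HEADERS, json=job)
resp.raise_for_status()
job_id = resp.json()["id"]

# Poll the job status reported by the underlying batch system.
status = requests.get(f"{GATEWAY_URL}/jobs/{job_id}", headers=HEADERS).json()
print("Job", job_id, "is", status.get("state"))
```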