Accelerated computing systems play an important role in delivering energy-efficient and powerful computing capabilities for compute-intensive applications. However, supporting accelerated computing in the cloud is not straightforward. Unlike common computing resources (CPU, RAM), accelerators need special treatment and support at every software layer, and the maturity of that support depends strongly on the specific hardware/software combination. A mismatch at any software layer will make the accelerators unavailable to end users.
The DEEP-Hybrid-DataCloud project aims to develop a distributed architecture that leverages intensive computing techniques for deep learning. One of the objectives of the project is to develop innovative services supporting intensive computing techniques that require specialized HPC hardware, such as GPUs or low-latency interconnects, to explore very large datasets. In the project, support for accelerators is carefully addressed at all software layers:
- Support for accelerators at the hypervisor/container level: During the project, GPU support in udocker, the portable tool for executing simple Docker containers in user space, has been significantly improved. The current version of udocker can automatically detect GPU drivers on the host machine and mount them into containers. This improvement allows udocker to execute standard GPU-enabled containers from Docker Hub, such as tensorflow/tensorflow:latest-gpu, without modification (a minimal usage sketch is given after this list). GPU support in other container and hypervisor drivers has also been analyzed, tested and deployed on the project testbed, in combination with higher-level cloud middleware frameworks wherever possible.
- Support for accelerators at the cloud middleware framework level: The project testbed consists of sites running different cloud middleware frameworks, including OpenStack, Apache Mesos and Kubernetes, as well as HPC clusters. All of these platforms are deployed with GPU support. As GPU virtualization is supported only on newer GPU cards, OpenStack sites mostly provide GPU access via the PCI passthrough approach of the KVM hypervisor. Kubernetes sites expose GPUs via the NVIDIA device plugin (see the Kubernetes sketch after this list), and Mesos provides access to GPUs via its own executor, which mimics the nvidia-docker approach. Finally, GPU support on HPC sites is provided by udocker, the portable user-space execution tool mentioned above.
- Support for accelerators at the PaaS orchestrator level: In the project, the information system (Cloud Info Provider + CMDB) has been extended to collect information about the availability of GPUs at the sites. The GPUs can be made available through different services at the IaaS level: for example, through native Cloud Management Framework interfaces (e.g. OpenStack or Amazon-specific flavors) or through Container Orchestration Platforms such as Mesos. The TOSCA model for compute and container nodes has been extended so that users can express requirements for specialized devices like GPUs (a TOSCA sketch closes this section). The Orchestrator bases its scheduling on this information to select the best site where the resources will be allocated.
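As an illustration of the udocker workflow described above, the following minimal sketch drives udocker from Python to run a GPU-enabled image. The container name and the TensorFlow check are illustrative; the pull/create/setup/run subcommands and the --nvidia option follow the udocker command line, though exact options may vary between udocker releases.

```python
# Minimal sketch: running a GPU-enabled Docker Hub image with udocker.
# Assumes udocker is installed and on the user's PATH; subcommand names
# follow the udocker CLI, but options may differ between versions.
import subprocess

IMAGE = "tensorflow/tensorflow:latest-gpu"
NAME = "tf-gpu"  # illustrative container name

def udocker(*args):
    """Run a udocker subcommand and fail loudly on error."""
    subprocess.run(["udocker", *args], check=True)

udocker("pull", IMAGE)                      # fetch the image from Docker Hub
udocker("create", f"--name={NAME}", IMAGE)  # unpack it as a local container
udocker("setup", "--nvidia", NAME)          # expose host GPU drivers inside
# Verify that TensorFlow sees the GPU from inside the container.
udocker("run", NAME, "python", "-c",
        "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))")
```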
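At the Kubernetes sites, once the NVIDIA device plugin is running, a workload requests a GPU through the extended resource nvidia.com/gpu. Below is a minimal sketch using the official Kubernetes Python client; the pod name, image and command are illustrative assumptions, while the resource limit is the mechanism the device plugin actually provides.

```python
# Minimal sketch: requesting one GPU on a Kubernetes cluster that runs the
# NVIDIA device plugin, using the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig credentials

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-test"),  # illustrative name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="cuda",
            image="tensorflow/tensorflow:latest-gpu",
            command=["nvidia-smi"],  # illustrative GPU check
            # The device plugin advertises GPUs as the extended resource
            # nvidia.com/gpu; the scheduler places the pod on a GPU node.
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```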
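Finally, a rough sketch of the extended TOSCA model: the snippet below emits a compute node that declares GPU requirements for the Orchestrator to match against the GPU information published by the sites. The type and property names (tosca.nodes.indigo.Compute, num_gpus, gpu_model) follow the INDIGO custom TOSCA types but should be treated as assumptions, since the exact schema may differ between releases.

```python
# Minimal sketch: emitting a TOSCA compute node with GPU requirements.
# Property names are assumptions based on the INDIGO custom TOSCA types.
import yaml

template = {
    "tosca_definitions_version": "tosca_simple_yaml_1_0",
    "topology_template": {
        "node_templates": {
            "gpu_server": {
                "type": "tosca.nodes.indigo.Compute",
                "capabilities": {
                    "host": {
                        "properties": {
                            "num_cpus": 4,
                            "mem_size": "16 GB",
                            # GPU requirements added by the extended model;
                            # the Orchestrator matches them against the
                            # GPU availability reported by each site.
                            "num_gpus": 1,
                            "gpu_model": "V100",
                        }
                    }
                },
            }
        }
    },
}
print(yaml.safe_dump(template, sort_keys=False))
```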