GPU Cluster – Flexibility, Optimization, Fairness and Robustness

The Research Computing Platform provides access to a GPU cluster through a Container-as-a-Service (CaaS) model, enabling efficient and flexible use of more than 400 GPUs of different types.

Flexibility

Users can build their own Docker images; containers started from these images function like lightweight virtual machines. Users can either rely on pre-built images or customize their own, defining the runtime environment that best fits their requirements. This approach offers significant flexibility, allowing researchers to tailor their environment to their needs.
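
As a minimal sketch of what customizing an image can look like in practice (the base image, packages, and tag below are illustrative examples, not the platform's official images), using the Docker SDK for Python:

    # Build a custom image with the Docker SDK for Python ("pip install docker").
    # Base image, packages, and tag are illustrative examples only.
    import pathlib
    import docker

    # A minimal Dockerfile: start from a CUDA-enabled base and bake in dependencies.
    pathlib.Path("Dockerfile").write_text(
        "FROM nvcr.io/nvidia/pytorch:24.01-py3\n"
        "RUN pip install --no-cache-dir transformers datasets\n"
    )

    client = docker.from_env()                       # connect to the local Docker daemon
    image, _logs = client.images.build(path=".", tag="my-lab/my-experiment:v1")
    print(image.tags)                                # ['my-lab/my-experiment:v1']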

Optimization

The Kubernetes orchestrator manages the deployment of these Docker images across the GPU cluster by running them in “pods.” When a user initiates a pod, Kubernetes identifies available resources, launches the pod from the specified image, and grants the user access. Essentially, users gain on-demand, pay-per-second access to a dedicated remote computing environment.
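
For illustration, a GPU pod like this could be created with the official Kubernetes Python client ("pip install kubernetes"); the namespace, image, and command are hypothetical, and on this platform submissions normally go through the Run:ai scheduler described below:

    # Launch a pod that requests one GPU, using the official Kubernetes Python client.
    # Namespace, image, and command are illustrative.
    from kubernetes import client, config

    config.load_kube_config()   # reads credentials from ~/.kube/config

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="demo-gpu-pod"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="main",
                    image="my-lab/my-experiment:v1",   # the custom image built earlier
                    command=["python", "-c",
                             "import torch; print(torch.cuda.is_available())"],
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"}  # ask the scheduler for one GPU
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="my-lab", body=pod)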

One of the key benefits of containers is the ability to package all necessary resources and dependencies together. This eliminates the need for manual installation, streamlining workflows and enhancing efficiency.

Fairness and Robustness

A custom scheduling system, Run:ai, ensures equitable distribution of resources across labs and projects. Each unit, whether a laboratory or a research project, is allocated a quota of GPUs, which guarantees that the unit can always access its allotted number of GPUs. Conversely, when a unit is not using its allocated GPUs, others can utilize them, providing a dynamic and flexible system.
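
The quota logic can be pictured with a toy model; this is only a sketch of the guaranteed-plus-opportunistic idea, not Run:ai's actual scheduling algorithm, and the unit names and numbers are made up:

    # Toy model of quota-based GPU sharing: within-quota requests are always
    # granted (borrowers are preempted if needed); over-quota requests are
    # granted opportunistically from idle GPUs and run as preemptible jobs.
    TOTAL_GPUS = 400
    QUOTAS = {"lab-a": 40, "lab-b": 24}

    def grant(unit, requested, in_use):
        free = TOTAL_GPUS - sum(in_use.values())
        within_quota = max(QUOTAS[unit] - in_use.get(unit, 0), 0)
        guaranteed = min(requested, within_quota)
        opportunistic = min(requested - guaranteed, max(free - guaranteed, 0))
        return guaranteed, opportunistic

    # lab-a asks for 60 GPUs while other units occupy 300: 40 are guaranteed
    # by its quota and 20 more are borrowed from idle capacity.
    print(grant("lab-a", 60, {"lab-b": 300}))   # (40, 20)
    # Under heavy load, only the quota is guaranteed.
    print(grant("lab-a", 60, {"lab-b": 380}))   # (40, 0)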

Additionally, the platform abstracts the underlying hardware, enhancing reliability. For example, if a GPU encounters an issue during a job, the system can seamlessly preempt the task and resume it on another GPU, ensuring uninterrupted progress.
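
In practice, resuming a job on another GPU relies on the job writing periodic checkpoints to storage that outlives the pod. A minimal sketch in PyTorch (the library, the model, and the checkpoint path are assumptions for illustration, not the platform's built-in mechanism):

    # Checkpointing so a preempted job can resume on another GPU later.
    # Assumes PyTorch; model, step count, and path are illustrative.
    import os
    import torch

    CKPT = "checkpoint.pt"   # on the cluster this would live on shared scratch storage

    model = torch.nn.Linear(1024, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    start_step = 0

    if os.path.exists(CKPT):                        # a restarted pod resumes from here
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_step = state["step"] + 1

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    for step in range(start_step, 1000):
        loss = model(torch.randn(32, 1024, device=device)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 100 == 0:                         # periodic checkpoint
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, CKPT)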

Technical Specifications

  • Kubernetes cluster with Run:ai as the scheduler, supporting both fractional-GPU workloads and distributed multi-node training
  • The cluster comprises 400+ GPUs of different types (a quick way to check which type a pod received is sketched after this list):
    • H100 NVIDIA GPUs
      • Up to 10 nodes
        • 2 × AMD EPYC 9454 (48 cores / 2.75 GHz / 290 W; 192 threads total)
        • Memory: 1.5 TB
        • Network: 2 × 100 Gbit/s
        • 8 × NVIDIA HGX H100-SXM5-80GB (NVLink)
        • The 8-way HGX platform gives large AI applications 640 GB of HBM3 memory, 7.2 TB/s of aggregate NVLink bandwidth, and 900 GB/s of bidirectional GPU-to-GPU bandwidth per GPU
    • A100 NVIDIA GPUs
      • Up to 32 nodes  
        • 2 × AMD EPYC 7543 (32 cores / 64 threads; 128 threads total)
        • Memory: 1 TB
        • Network: 2 × 100 Gbit/s
        • 8 × NVIDIA HGX A100-SXM4-80GB (NVLink)
    • V100 NVIDIA GPUs 
      • Up to 20 nodes  
        • 2 × Intel Xeon Gold 6240 (72 threads total)
        • Memory: 384 GB
        • Network: 2 × 100 Gbit/s
        • 4 × NVIDIA V100-SXM2-32GB (NVLink)
  • The cluster accesses the scratch high-performance storage
    • 2.5 PiB all-flash (~800 TiB used)
    • Dedicated to the compute cluster (NFS only)
    • Scratch storage (no replication)
    • Network: 8 × 100 Gbit/s
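
From inside a running pod, the GPU type and memory actually allocated can be checked in a few lines (assuming PyTorch is available in the image; NVIDIA's pynvml would work similarly):

    # Inspect the GPUs visible inside a pod (assumes PyTorch in the image).
    import torch

    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # e.g. "GPU 0: NVIDIA A100-SXM4-80GB, 80 GiB"
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")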

Please contact us for more details or for onboarding: [email protected]