Building an AI-Optimized High-Performance Computing (HPC) Cluster in 2025

Authored by: Chuck Forsyth, Director of Research Computing, University of California Riverside


Abstract

In the rapidly evolving landscape of artificial intelligence (AI) and high-performance computing (HPC), constructing an AI-optimized HPC cluster necessitates a meticulous selection of hardware and software components. This whitepaper provides a comprehensive guide to building such a cluster in 2025, detailing node configurations, processing units, storage solutions, networking, and essential software tools. The aim is to equip research institutions with the knowledge to develop a robust infrastructure capable of handling complex AI workloads.


1. Introduction

The convergence of AI and HPC has led to unprecedented advancements in computational research. As AI models become more sophisticated, the demand for specialized HPC clusters tailored to AI workloads has intensified. This document serves as a blueprint for constructing an AI-optimized HPC cluster, focusing on the latest technologies and best practices as of 2025.


2. Hardware Components

2.1. Compute Nodes

Compute nodes are the workhorses of an HPC cluster, executing the bulk of processing tasks. Selecting appropriate node configurations is crucial for performance optimization.

Table 1: Recommended Compute Nodes

Model | Processor | Memory | GPU Support | Storage | Best For | Alternatives
--- | --- | --- | --- | --- | --- | ---
Dell PowerEdge C6525 | Dual AMD EPYC | Up to 4TB DDR4 | None | Up to 10x NVMe/SSD/HDD | Memory-intensive workloads, simulations | HPE Apollo 2000, Lenovo ThinkSystem SD650
Dell PowerEdge XE9680 | Dual Intel Xeon Scalable (4th Gen) | Up to 2TB DDR5 | Up to 8x NVIDIA H100/A100 | Up to 12x NVMe SSDs | AI/ML workloads, large-scale HPC simulations | HPE Cray EX, Supermicro SYS-420GP-TNAR
Dell PowerEdge XE8545 | Dual AMD EPYC 7003 Series | Up to 4TB DDR4 | Up to 4x NVIDIA A100 | Up to 10x NVMe SSDs | AI acceleration, deep learning, HPC applications | NVIDIA DGX A100, HPE Apollo 6500 Gen10 Plus
Dell PowerEdge R660 | Intel Xeon Scalable (4th Gen) | Up to 2TB DDR5 | Limited GPU support | Up to 8x NVMe SSDs or HDDs | General HPC workloads, cloud, and virtualization | Lenovo ThinkSystem SR630, HPE ProLiant DL360
Dell PowerEdge R750xa | Dual Intel Xeon Scalable (3rd Gen) | Up to 4TB DDR4 | Up to 4x NVIDIA A100 or AMD Instinct MI210 | Up to 12x NVMe SSDs | AI/ML training, inference, GPU-intensive workloads | HPE Apollo 6500, Supermicro AS-4124GS-TNR

2.2. Graphics Processing Units (GPUs)

GPUs accelerate AI computations, making them indispensable in modern HPC clusters. Selecting the appropriate GPU models ensures efficient processing of AI workloads.

Table 2: Common HPC and AI GPUs

Model | Architecture | Memory | Peak FP64 Performance | Best For | Alternatives
--- | --- | --- | --- | --- | ---
NVIDIA H100 | Hopper | 80GB HBM3 | 34 TFLOPS (67 TFLOPS FP64 Tensor Core) | AI training, HPC simulations | AMD Instinct MI300X, Intel Data Center GPU Max (Ponte Vecchio)
NVIDIA A100 | Ampere | 40GB/80GB HBM2e | 9.7 TFLOPS (19.5 TFLOPS FP64 Tensor Core) | AI/ML training, inference, mixed HPC workloads | AMD Instinct MI250, Intel Data Center GPU Flex Series
AMD Instinct MI300X | CDNA 3 | 192GB HBM3 | 81.7 TFLOPS (163.4 TFLOPS matrix) | Exascale computing, AI/ML workloads | NVIDIA H100, Intel Data Center GPU Max (Ponte Vecchio)
AMD Instinct MI250 | CDNA 2 | 128GB HBM2e | 45.3 TFLOPS (90.5 TFLOPS matrix) | HPC, AI/ML inference, large-scale simulations | NVIDIA A100, Intel Data Center GPU Max (Ponte Vecchio)
Intel Data Center GPU Flex Series | Xe-HPG | Up to 16GB GDDR6 | Not disclosed | Cloud AI inference, media processing | NVIDIA A40, AMD Instinct MI210
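
Whichever accelerators are selected, it is worth verifying that they are actually visible to user software once nodes are imaged and drivers are installed. The following is a minimal sketch, assuming a PyTorch build with CUDA (or ROCm) support is available; it simply enumerates the devices a job can see and their memory.

```python
# A minimal GPU-inventory check, assuming PyTorch with CUDA or ROCm support.
# Device names and memory sizes will vary with the hardware chosen from Table 2.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA/ROCm-capable GPU is visible to this process.")

for idx in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(idx)
    mem_gib = props.total_memory / (1024 ** 3)
    print(f"GPU {idx}: {props.name}, {mem_gib:.0f} GiB")
```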

2.3. Storage Nodes

Efficient storage solutions are vital for managing the vast datasets typical in AI research. High-speed storage nodes ensure quick data retrieval and storage.

Table 3: Recommended Storage Nodes

Model | Processor | Memory | Storage Capacity | Best For | Alternatives
--- | --- | --- | --- | --- | ---
Dell PowerEdge R750 | Dual Intel Xeon Scalable (3rd Gen) | Up to 2TB DDR4 | Up to 24x NVMe SSDs or HDDs | HPC storage nodes (Lustre, BeeGFS, Ceph) | Supermicro 2029U-TN24R4T, HPE Apollo 4200

2.4. Networking Components

High-speed networking is essential for efficient data transfer between nodes and storage systems. Selecting appropriate network cards enhances cluster performance.

Table 4: Network Cards for Storage

Component | Type/Specification | Description | Useful Info | Alternatives
--- | --- | --- | --- | ---
Network Cards for Storage | InfiniBand HDR 200Gbps, 100GbE/400GbE Ethernet, RDMA-enabled NICs | High-speed network cards required for efficient storage access and data transfer in HPC clusters. | InfiniBand is optimal for low-latency, high-bandwidth storage such as Lustre, while 100GbE/400GbE Ethernet works well for general-purpose parallel file systems. | NVIDIA ConnectX-6 (formerly Mellanox; InfiniBand/Ethernet), Broadcom NetXtreme, Intel E810 (100GbE), HPE Slingshot (Cray)

3. Software Components

An AI-optimized HPC cluster requires a robust software ecosystem to manage resources, facilitate development, and ensure efficient operation. Below is a detailed overview of essential software components:

3.1. Operating System

The operating system (OS) serves as the foundation for all software operations within the cluster.

  • Recommended OS: CentOS Stream 9
    • Description: A community-driven distribution that tracks just ahead of Red Hat Enterprise Linux (RHEL), offering a stable and secure environment with early visibility into upcoming RHEL changes.
    • Alternatives: Ubuntu Server 22.04 LTS, Rocky Linux 9 (a downstream rebuild of RHEL)

3.2. Resource Management and Job Scheduling

Efficient resource allocation and job scheduling are critical for maximizing cluster utilization.

  • Slurm (Slurm Workload Manager, originally the Simple Linux Utility for Resource Management):
    • Description: An open-source job scheduler that manages cluster resources and job queues; a minimal batch-script sketch follows this list.
    • Alternatives: PBS Professional, TORQUE, Grid Engine
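
As a concrete illustration, below is a minimal Slurm batch job written as a Python script; Slurm reads the #SBATCH comment directives before handing the file to the interpreter named in the shebang. The partition name and resource requests are hypothetical and must be adapted to the site's configuration.

```python
#!/usr/bin/env python3
#SBATCH --job-name=ai-train-example
#SBATCH --partition=gpu            # hypothetical partition name; site-specific
#SBATCH --gres=gpu:2               # request two GPUs (requires GRES configuration)
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=04:00:00
#SBATCH --output=%x-%j.out

# The directives above are ordinary Python comments, so the same file is both a
# valid Slurm batch script and a valid Python program.
import os
import socket

print("Running on node:", socket.gethostname())
print("Job ID:", os.environ.get("SLURM_JOB_ID"))
print("GPUs assigned:", os.environ.get("CUDA_VISIBLE_DEVICES"))
# A real job would launch training here, e.g. via torch.distributed or mpi4py.
```

The script is submitted with sbatch and monitored with squeue or sacct.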

3.3. Containerization and Virtualization

Containers provide isolated environments for applications, ensuring consistency across different computing environments.

  • Docker:
    • Description: A platform for developing, shipping, and running applications in containers; an HPC-oriented container launch sketch follows this list.
    • Alternatives: Podman, Apptainer (formerly Singularity, designed specifically for HPC environments)
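
On multi-tenant clusters, unprivileged runtimes such as Apptainer are generally preferred over the Docker daemon. The sketch below assumes Apptainer (or Singularity) is installed and that a CUDA-enabled image has already been built or pulled to shared storage; the image path is hypothetical.

```python
# Launch a containerized GPU check from Python, assuming Apptainer/Singularity
# and a hypothetical pre-built image on shared storage. The --nv flag passes the
# host's NVIDIA driver and devices into the container.
import subprocess

IMAGE = "/shared/containers/pytorch.sif"   # hypothetical image path
CMD = ["apptainer", "exec", "--nv", IMAGE, "python", "-c",
       "import torch; print(torch.cuda.is_available())"]

# Run the containerized command and surface its output; identical syntax works
# with the older 'singularity' binary on systems that still ship it.
print(subprocess.run(CMD, capture_output=True, text=True, check=True).stdout)
```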

3.4. Parallel File Systems

High-performance parallel file systems are essential for managing large datasets typical in AI workloads.

  • Lustre:
    • Description: An open-source, highly scalable parallel file system widely deployed in large HPC installations; a dataset-striping sketch follows this list.
    • Alternatives: BeeGFS, IBM Spectrum Scale (GPFS)
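
Striping is the main tuning knob Lustre exposes to users. The sketch below, which assumes a Lustre client mount and the standard lfs utility (the directory path is hypothetical), stripes a dataset directory across all object storage targets so data-parallel training jobs can read it at high aggregate bandwidth.

```python
# Set a wide default stripe layout on a Lustre directory, assuming a Lustre
# client mount and the standard 'lfs' utility are present.
import subprocess

DATASET_DIR = "/lustre/projects/ai/dataset"   # hypothetical Lustre path

# Stripe new files in this directory across all available OSTs (-c -1) with a
# 4 MiB stripe size; existing files keep their current layout.
subprocess.run(["lfs", "setstripe", "-c", "-1", "-S", "4M", DATASET_DIR], check=True)

# Report the resulting layout so the setting can be verified.
print(subprocess.run(["lfs", "getstripe", DATASET_DIR],
                     capture_output=True, text=True, check=True).stdout)
```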

3.5. Data Transfer and Workflow Management

Efficient data transfer and workflow management streamline operations and enhance productivity.

  • Globus:
    • Description: A service for secure, reliable research data management, including transfer and sharing; a transfer sketch using the Globus Python SDK follows this list.
    • Alternatives: rsync, SCP
  • Pegasus Workflow Management System:
    • Description: A framework for mapping complex workflows onto distributed resources.
    • Alternatives: Apache Airflow, Nextflow
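
For scripted transfers, Globus provides a Python SDK (globus-sdk). The sketch below assumes an already-obtained Transfer access token and known endpoint UUIDs; the token, UUIDs, and paths shown are placeholders, and the interactive authentication flow is omitted.

```python
# A minimal Globus transfer sketch using the globus-sdk package, assuming a
# pre-obtained Transfer access token and known endpoint UUIDs (placeholders).
import globus_sdk

TRANSFER_TOKEN = "REPLACE_WITH_ACCESS_TOKEN"   # obtained via a Globus auth flow (omitted)
SRC_ENDPOINT = "REPLACE_WITH_SOURCE_UUID"      # placeholder endpoint IDs
DST_ENDPOINT = "REPLACE_WITH_DEST_UUID"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

# Describe the transfer: recursively copy a dataset directory between endpoints.
task = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT, label="dataset sync")
task.add_item("/projects/ai/dataset/", "/scratch/ai/dataset/", recursive=True)

# Submit and print the task ID so progress can be tracked in the Globus web UI.
result = tc.submit_transfer(task)
print("Submitted Globus transfer, task_id =", result["task_id"])
```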

3.6. Development Environments and Package Management

Providing users with robust development tools and package managers is crucial for productivity.

  • Python Package Management:
    • Tools: pip, conda
    • Description: Tools for installing and managing Python packages; a version-pinning sketch for reproducible environments follows this list.
  • R Package Management:
    • Tools: CRAN, Bioconductor
    • Description: Repositories and tools for managing R packages.
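
Whichever package manager is offered, reproducibility depends on pinning versions. The sketch below checks a user environment against a pinned list; the packages and versions shown are only examples.

```python
# Verify that an environment matches a pinned set of package versions.
# The packages and versions listed here are illustrative examples only.
from importlib import metadata

PINNED = {"numpy": "1.26.4", "torch": "2.3.0", "pandas": "2.2.2"}

for pkg, wanted in PINNED.items():
    try:
        installed = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        print(f"{pkg}: MISSING (expected {wanted})")
        continue
    status = "ok" if installed == wanted else f"MISMATCH (expected {wanted})"
    print(f"{pkg}: {installed} {status}")
```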

3.7. Mathematical Libraries

Optimized mathematical libraries enhance computational efficiency.

  • Intel oneAPI Math Kernel Library (oneMKL):
    • Description: Provides highly optimized BLAS, LAPACK, and FFT routines for science, engineering, and financial applications; a quick backend check follows this list.
    • Alternatives: OpenBLAS, LAPACK, FFTW
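
Because most Python numerical work funnels through NumPy, a quick way to confirm which BLAS backend (oneMKL, OpenBLAS, etc.) is in use, and to sanity-check dense linear-algebra throughput, is the sketch below; the matrix size is arbitrary.

```python
# Report NumPy's BLAS/LAPACK configuration and time a dense matrix multiply.
import time
import numpy as np

np.show_config()   # prints the BLAS/LAPACK libraries NumPy is linked against

n = 4096
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
np.dot(a, b)
elapsed = time.perf_counter() - start

# A dense n x n matrix multiply costs roughly 2*n^3 floating-point operations.
print(f"{n}x{n} matmul: {elapsed:.2f} s, ~{2 * n**3 / elapsed / 1e9:.1f} GFLOP/s")
```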

3.8. Cluster Access and Interface

User-friendly access and monitoring tools improve user experience and system transparency.

  • SSH (Secure Shell):
    • Description: The standard protocol for secure command-line access to login nodes; a scripted-access sketch follows this list.
  • Open OnDemand:
    • Description: A web-based interface for HPC resources, providing access to file systems, job management, and interactive applications.
  • XDMoD (XD Metrics on Demand):
    • Description: A tool for monitoring and analyzing HPC resource utilization.
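
Interactive users typically connect through SSH or Open OnDemand, but scripted access is also common. The sketch below runs a scheduler query on a login node over SSH; the hostname is hypothetical and key-based authentication is assumed.

```python
# Run a quick scheduler query on a login node over SSH and print the result.
# Assumes key-based SSH access; the hostname below is a placeholder.
import subprocess

LOGIN_NODE = "login.hpc.example.edu"   # hypothetical login-node hostname

result = subprocess.run(
    ["ssh", LOGIN_NODE, "sinfo", "--summarize"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```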

4. Networking Considerations

High-speed networking is vital for efficient data transfer between compute nodes and storage systems.

  • Network Cards:
    • Options: InfiniBand HDR 200Gbps, 100GbE/400GbE Ethernet, RDMA-enabled NICs
    • Description: High-speed network cards required for efficient storage access and data transfer in HPC clusters; a simple link-speed check follows this list.
    • Alternatives: NVIDIA ConnectX-6 (formerly Mellanox; InfiniBand/Ethernet), Broadcom NetXtreme, Intel E810 (100GbE), HPE Slingshot (Cray)
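
As a basic sanity check after cabling and driver installation, the sketch below reports negotiated link speeds for Ethernet-class interfaces by reading Linux sysfs; InfiniBand HCAs are better inspected with vendor tools such as ibstat and are not covered here.

```python
# Report negotiated link speeds for Ethernet-class interfaces via sysfs on Linux.
from pathlib import Path

for iface in sorted(Path("/sys/class/net").iterdir()):
    speed_file = iface / "speed"
    try:
        speed_mbps = int(speed_file.read_text().strip())
    except (OSError, ValueError):
        continue   # interface is down or does not report a speed
    if speed_mbps <= 0:
        continue   # unknown or unreported link speed
    print(f"{iface.name}: {speed_mbps / 1000:.0f} Gb/s")
```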

5. Conclusion

Constructing an AI-optimized HPC cluster in 2025 involves careful selection of cutting-edge hardware and software components. By integrating the recommended compute nodes, GPUs, storage solutions, networking components, and software tools detailed in this whitepaper, research institutions can develop a robust infrastructure capable of handling complex AI workloads efficiently.

