Building a multipurpose AI high-performance computing (HPC) cluster requires carefully integrating powerful hardware with a robust open-source software stack. We propose an on-premises cluster design that incorporates high core-count CPUs, GPU accelerators for AI workloads, emerging DPU technology, and large memory capacity – all balanced against a modest budget. On the software side, the cluster will leverage proven HPC tools: Warewulf for cluster provisioning, Open OnDemand for user access, Open XDMoD for monitoring, interactive platforms like RStudio and Jupyter, the Spack package manager, parallel programming frameworks (MPI and OpenMP), a high-speed InfiniBand interconnect, a parallel file system, optimized math libraries, XRootD for scalable data access, and Globus for data transfer. This report outlines the proposed cluster’s hardware and software, and then compares its capabilities to the HPC resources at top R1 universities (AAU institutions) – evaluating performance, software compatibility, scalability, and cost efficiency. We use the latest versions of all software components in our analysis, and provide recommendations on choosing hardware, configuring the software stack, managing costs, and understanding performance trade-offs.
High-Core-Count CPUs: Choose server CPUs with a large number of cores to maximize parallel throughput. Modern AMD EPYC processors (3rd or 4th Gen) offer 64 to 128 cores per chip, which is ideal for HPC workloads that benefit from many CPU threads. For example, a dual-socket node with 64-core CPUs provides 128 cores per node. These CPUs also support high memory bandwidth and capacity (important for data-intensive tasks). By comparison, many leading university supercomputers have adopted AMD EPYC for its core counts – PSC’s Bridges-2 uses 64-core AMD EPYC 7742 CPUs on its nodes (Bridges-2 | PSC). High core counts enable efficient multi-threaded computing (OpenMP) and handling many parallel tasks.
GPUs for AI and HPC: Incorporate GPUs to accelerate AI, machine learning, and data-parallel HPC tasks. NVIDIA’s A100 GPUs (40 GB or 80 GB models) are a popular choice for HPC+AI clusters, offering high double-precision performance (~9.7 teraflops FP64 each) and excellent deep learning throughput. Depending on budget, a reasonable configuration is 4 GPUs per node (common in many HPC systems). For instance, SDSC’s Expanse uses 4 NVIDIA V100 GPUs per GPU node for AI workloads (User Guide), while PSC’s Bridges-2 packs 8 V100s into each of its GPU nodes ([Bridges-2 | PSC](https://www.psc.edu/resources/bridges-2/)). These GPUs connect via NVLink/PCIe and can collectively deliver hundreds of teraflops of performance. Even a modest cluster with a few multi-GPU nodes can provide significant AI capability, although it will not match the sheer scale of top academic supercomputers that host hundreds of GPUs (Bridges-2 has 192 V100 GPUs in total) ([Bridges-2 | PSC](https://www.psc.edu/resources/bridges-2/)).
Data Processing Units (DPUs): To future-proof the cluster, we consider DPUs such as NVIDIA BlueField. DPUs are specialized system-on-chip devices designed to offload networking, storage, and security tasks from the CPU (pDOCA: Offloading Emulation for BlueField DPUs - HPCKP). They combine programmable cores with high-speed network interfaces, enabling tasks like packet processing, encryption, and storage management to be handled on the DPU. In an HPC cluster, DPUs can accelerate data movement (e.g., RDMA networking, NVMe-oF storage access) and improve security isolation, thereby increasing overall efficiency (pDOCA: Offloading Emulation for BlueField DPUs - HPCKP). While DPUs are not yet common in production university clusters – most top systems do not yet deploy them broadly (pDOCA: Offloading Emulation for BlueField DPUs - HPCKP) – including them on critical nodes (login or storage nodes) could offload network communication tasks from the CPUs. This is an optional component given budget constraints, but if affordable, DPUs can enhance performance for I/O-intensive workloads and prepare the cluster for emerging HPC paradigms.
Memory and Storage: Equip each compute node with ample RAM – at least 1–2 GB per CPU core is recommended for general HPC workloads. For example, with 128 cores, ~256 GB RAM per node is a balanced configuration (indeed, Expanse’s standard nodes have 256 GB for 128 cores) (User Guide). Certain nodes could be designated as high-memory nodes, with 1–2 TB of RAM, to support memory-intensive tasks (similar to how some university clusters have large-memory nodes for bioinformatics, in-memory databases, etc.). For cluster-wide storage, deploy a parallel file system that can deliver high throughput to all nodes. Lustre is a popular open-source parallel file system used in many HPC centers, known to scale to thousands of nodes and petabytes of data (Working with the Lustre Filesystem / Articles / HPC / Home - ADMIN Magazine). Lustre’s latest release (2.15 LTS and onward) provides improvements in stability and performance. An alternative is BeeGFS, which is also high-performance and reputed for easier deployment on modest-sized clusters. Both Lustre and BeeGFS stripe data across multiple storage servers, enabling concurrent read/write from many nodes. For our modest cluster, one can start with a few storage servers (e.g. 2 metadata + 2 object/storage servers for Lustre) to provide a scalable file system in the tens of terabytes range, with the option to expand capacity by adding servers. This ensures that IO-heavy workloads (such as training AI models on large datasets or writing simulation output) do not become bottlenecked.
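To make the memory-sizing arithmetic concrete, the short sketch below (plain Python) computes per-node and aggregate RAM for a candidate configuration; the node count and GB-per-core targets are illustrative assumptions taken from the discussion above, not procurement figures.

```python
# Rough node/cluster memory sizing helper. Illustrative only: the node
# count and GB-per-core targets are assumptions, not a validated plan.

def node_ram_gb(cores_per_node: int, gb_per_core: float) -> float:
    """RAM needed per node at a given GB-per-core target."""
    return cores_per_node * gb_per_core

def cluster_ram_gb(nodes: int, cores_per_node: int, gb_per_core: float) -> float:
    """Aggregate RAM across all compute nodes."""
    return nodes * node_ram_gb(cores_per_node, gb_per_core)

if __name__ == "__main__":
    cores = 128  # dual-socket node with two 64-core EPYC CPUs
    for gb_per_core in (1.0, 2.0):
        print(f"{cores} cores @ {gb_per_core} GB/core -> "
              f"{node_ram_gb(cores, gb_per_core):.0f} GB per node")
    # A hypothetical 10-node cluster at 2 GB/core:
    print(f"10 nodes total: {cluster_ram_gb(10, cores, 2.0) / 1024:.1f} TB RAM")
```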
High-Speed Interconnect (InfiniBand): Use an InfiniBand network for low-latency, high-bandwidth communication between nodes. InfiniBand is essential for MPI-based parallel applications that span multiple nodes, as it dramatically outperforms standard Ethernet in both throughput and latency. A 100 Gbps InfiniBand (HDR100) switched fabric is a suitable baseline; if budget allows, 200 Gbps HDR can be chosen (as found in newer top systems like Bridges-2, which uses HDR-200) ([Bridges-2 | PSC](https://www.psc.edu/resources/bridges-2/)). The InfiniBand fabric should be configured with enough spare switch ports to allow cluster expansion. This interconnect enables efficient scaling of MPI jobs and fast access to the storage system via RDMA. It also supports advanced features such as GPUDirect for direct GPU-to-GPU communication across nodes and, potentially, DPU offload in the future ([Bridges-2 | PSC](https://www.psc.edu/resources/bridges-2/)).
Node Configuration and Form Factor: The cluster will include a head/login node for user access and cluster management, plus multiple compute nodes for running jobs. Each compute node contains the high-core-count CPUs, GPUs, and memory described above. The head node should be a robust server with plenty of RAM and storage to handle job scheduling, user sessions, and I/O (it may also host the Open OnDemand and XDMoD web interfaces). If using Warewulf, the head node also acts as the provisioning server (PXE/DHCP and container image distribution). At least one additional node can serve as an I/O or storage node dedicated to the parallel file system (running Lustre’s metadata/OSS services or BeeGFS services). The hardware should be housed in a rack with adequate cooling. Commodity server hardware (e.g., 2U GPU servers for compute nodes and 1U storage servers) can be used to save cost – this is similar to the approach of NSF-funded clusters that use Dell, HPE, or similar commodity servers (for example, TACC’s Frontera is built from Dell EMC PowerEdge servers with Intel CPUs and Mellanox InfiniBand) ([New TACC Supercomputer Will Go Into Production Next Year | TOP500](https://www.top500.org/news/new-tacc-supercomputer-will-go-into-production-next-year/)). Using commodity hardware with open-source management keeps costs down while still achieving high performance.
The cluster’s software environment will consist of open-source, latest-version tools that provide cluster provisioning, job scheduling, user-friendly access, performance monitoring, and a rich library of scientific software. Below are the key components of the software stack and their roles:
Operating System & HPC Distribution: All nodes will run a modern 64-bit Linux OS (e.g., Rocky Linux 9 or Ubuntu 22.04 LTS) for compatibility with HPC software. We can leverage the OpenHPC repository – an integrated HPC software stack – which includes packages for Warewulf, Slurm, OpenMPI, and more. OpenHPC 2.x provides pre-built recipes for CentOS/Rocky 8 with Warewulf provisioning and the Slurm workload manager (Build an Intel®-Based Cluster with OpenHPC* 2.0 on CentOS* 8), and OpenHPC 3.x extends the same recipes to EL9-family distributions such as Rocky Linux 9. Using a stable Linux base ensures we have the latest kernel optimizations for performance and all open-source tools available.
Warewulf (Cluster Provisioning): Warewulf is an operating system-agnostic provisioning and cluster management system for HPC ([Install your HPC Cluster with Warewulf | SUSE Communities](https://www.suse.com/c/install-your-hpc-cluster-with-warewulf/)). We will use the latest Warewulf 4 to manage the cluster nodes. Warewulf allows the head node to boot compute nodes disklessly via PXE, deploying a common OS image or container to all nodes. This ensures all compute nodes have an identical environment – critical for consistency in HPC. Warewulf addresses the administrative scalability problem of managing many nodes by providing a centralized way to configure and update them ([Install your HPC Cluster with Warewulf | SUSE Communities](https://www.suse.com/c/install-your-hpc-cluster-with-warewulf/)). It is lightweight and configurable, and being open source (BSD licensed) it carries no licensing costs. Warewulf also makes it easy to add more nodes in the future: new nodes can PXE boot and automatically join the cluster with the same software environment.
Slurm Workload Manager (Job Scheduler): For resource management and job scheduling, Slurm is recommended. Slurm is the most widely used job scheduler on large HPC systems (used by roughly 60% of the Top500 supercomputers) ([MAAS blog | Open Source in HPC [part 5]](https://maas.io/blog/open-source-in-hpc-part-5)). The latest Slurm 23.x release scales to tens of thousands of cores and provides advanced features such as gang scheduling, burst buffer integration, and a rich plugin ecosystem. Slurm will run on the head node as the controller (slurmctld) and on each compute node as an agent (slurmd). Users will submit batch jobs or request interactive sessions through Slurm (via commands like sbatch and salloc). Slurm’s popularity means our cluster’s scheduling environment will be familiar to researchers coming from other institutions, and it integrates easily with Open OnDemand and XDMoD for job submission and tracking.
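As a minimal illustration of scripted job submission, the following Python sketch shells out to the standard sbatch command; the partition name and resource requests are placeholders that would need to match whatever partitions the site actually defines.

```python
# Minimal sketch: submitting a batch job to Slurm from Python by shelling
# out to sbatch. Assumes Slurm is installed and sbatch is on PATH; the
# partition name and resource requests below are placeholders.
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --time=00:10:00
srun hostname
"""

def submit(script_text: str) -> str:
    """Write the script to a temp file, submit it, and return sbatch's reply."""
    with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
        f.write(script_text)
        path = f.name
    result = subprocess.run(["sbatch", path], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()  # e.g. "Submitted batch job 1234"

if __name__ == "__main__":
    print(submit(JOB_SCRIPT))
```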
Open OnDemand (User Portal): Open OnDemand is an open-source HPC portal developed by the Ohio Supercomputer Center that provides a web-based interface to the cluster (RCAC - Knowledge Base: Anvil User Guide: Open OnDemand). We will deploy the latest Open OnDemand (version 4.0, released January 2025) on the head node. This lets users interact with the cluster via a browser – managing files, submitting jobs, and running graphical applications – without needing complex client software (RCAC - Knowledge Base: Anvil User Guide: Open OnDemand). Open OnDemand integrates with Slurm on the backend ([MAAS blog | Open Source in HPC [part 5]](https://maas.io/blog/open-source-in-hpc-part-5)), so when a user launches an interactive Jupyter notebook or RStudio session from the portal, it actually submits a Slurm job under the hood. This greatly lowers the barrier to entry for new users and improves productivity: a researcher can launch Jupyter Notebooks or RStudio Server with a few clicks through OnDemand, whereas traditionally they would need to set up SSH port forwarding. Many R1 universities have adopted OnDemand to increase HPC accessibility (e.g., Michigan State and others have deployed it to broaden access for students and faculty) ([XDMoD Tool | Ohio Supercomputer Center](https://www.osc.edu/supercomputing/knowledge-base/xdmod_tool)). By providing OnDemand, our cluster matches the usability features of leading centers.
Interactive Applications (Jupyter, RStudio): On top of Open OnDemand, we will configure interactive apps like JupyterLab and RStudio Server (using latest versions – JupyterLab 4.x and RStudio Server 2023/Posit Workbench). This allows users to run notebooks or R analyses on the HPC nodes. The apps will run inside Slurm allocations on compute nodes but be displayed through the user’s web browser via OnDemand (RCAC - Knowledge Base: Anvil User Guide: Open OnDemand). RStudio Server provides a full R IDE in the browser, leveraging HPC compute power for heavy R tasks. JupyterLab supports Python, R, MATLAB (via kernels) and can tap into GPUs for AI model training. These tools are essential for data science and AI workloads and make the cluster “multipurpose” – supporting traditional batch computing as well as interactive research workflows.
Open XDMoD (Monitoring and Metrics): Open XDMoD (XD Metrics on Demand) is an open-source, NSF-funded tool for monitoring and auditing HPC resources ([XDMoD Tool | Ohio Supercomputer Center](https://www.osc.edu/supercomputing/knowledge-base/xdmod_tool)). We will install the latest Open XDMoD 11+, which provides a web dashboard of cluster utilization, job performance, and user metrics. XDMoD can track CPU hours, GPU usage, memory usage, and more, and generate reports, giving both administrators and users insight into how the cluster is performing. For instance, XDMoD can help identify whether the cluster is CPU- or I/O-bound by visualizing job statistics ([XDMoD Tool | Ohio Supercomputer Center](https://www.osc.edu/supercomputing/knowledge-base/xdmod_tool)). Many academic centers deploy XDMoD to justify resources and optimize operations. This aligns with practices at large facilities – it was originally designed for the XSEDE program to collect metrics across centers (ubccr/xdmod: An open framework for collecting and … - GitHub). By having XDMoD, we ensure our cluster’s performance is transparent and tunable.
Spack (HPC Package Management): To manage the plethora of scientific software, we will use Spack – a flexible, Python-based HPC package manager maintained by Lawrence Livermore National Laboratory. Spack makes it easy to install and switch between multiple versions of applications, libraries, compilers, and MPI stacks (Spack). The latest Spack release (v0.20+) will be installed on the head node. Using Spack, one can compile optimized builds of math libraries (such as OpenBLAS and PETSc), AI frameworks (TensorFlow, PyTorch), domain-specific codes, and more, tailored to our hardware (e.g., using CPU-specific flags or CUDA for GPUs). Spack is not tied to any single language – it supports C/C++/Fortran, Python, R, and more, providing a unified way to manage everything from Python packages to large parallel codes (Spack). This is especially useful for a modest cluster with diverse workloads, as we can easily provide users with the latest versions of software on demand. Top universities similarly use environment modules or Spack to offer a wide range of software to their researchers. By adopting Spack, our cluster’s software stack will be as rich and up-to-date as those at larger centers.
MPI and OpenMP (Parallel Programming): The cluster will support both distributed and shared memory parallelism. MPI (Message Passing Interface) libraries (OpenMPI 4.x and MPICH/Intel MPI as needed) will be installed for multi-node parallel applications. MPI is the standard for distributed-memory parallel computing in HPC (Message Passing Interface - Wikipedia), used in applications from computational fluid dynamics to weather modeling. With InfiniBand and multi-core CPUs, our cluster will run MPI codes efficiently. OpenMP (v5.2, via latest GCC and Intel compilers) will be available for node-level parallelism. OpenMP allows multithreading within a single node (or GPU offload), using compiler directives to spawn threads that share memory (Open Multi-Processing (OpenMP) :: High Performance Computing). Together, users can develop hybrid MPI+OpenMP programs to fully exploit each node. We will also include math and communication libraries tuned for our hardware: e.g. Intel oneAPI Math Kernel Library (oneMKL) or OpenBLAS for fast linear algebra, NVIDIA’s CUDA and cuDNN for GPU acceleration, and NCCL for multi-GPU communication. These ensure that scientific codes and AI models achieve high performance on the cluster’s CPUs and GPUs.
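The hybrid model can be illustrated with a short Python sketch, assuming mpi4py and a NumPy linked against a threaded BLAS (OpenBLAS or oneMKL) have been installed, for example through Spack: MPI distributes ranks across nodes, while the BLAS call inside each rank uses multiple threads on that node.

```python
# Hybrid-parallelism sketch: MPI ranks across nodes (mpi4py) with a
# multithreaded BLAS inside each rank (NumPy built against OpenBLAS or
# oneMKL; thread count controlled via OMP_NUM_THREADS). Assumes mpi4py
# and a threaded NumPy are installed, e.g. through Spack. Run with
# something like: mpirun -np 4 python hybrid_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank performs its own large matrix multiply; the BLAS call below
# is multithreaded within the node.
n = 2048
a = np.random.rand(n, n)
b = np.random.rand(n, n)
local = a @ b

# Reduce a scalar summary across ranks over MPI (e.g. over InfiniBand).
local_sum = local.sum()
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print(f"{size} ranks, combined checksum = {total:.3e}")
```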
Math and Analytics Libraries: A variety of optimized math libraries will be provided to support simulation and AI workloads. This includes BLAS/LAPACK (with CPU optimizations from oneMKL or AMD’s AOCL), FFT libraries (FFTW, MKL FFT), and scientific libraries like ScaLAPACK, PETSc, and Trilinos for HPC simulations. For data analytics and AI: latest Python scientific stack (NumPy, SciPy, Pandas), machine learning libraries (Scikit-learn, XGBoost), and deep learning frameworks (TensorFlow 2.x, PyTorch) compiled to utilize GPUs (with CUDA 12 and cuDNN). By building these through Spack or containers, we ensure compatibility. The cluster essentially mirrors the software environment available on larger research systems, so code developed locally can run on bigger machines and vice versa.
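A quick sanity check like the sketch below (assuming the Spack- or conda-provided NumPy and a CUDA-enabled PyTorch build are on the user's path) helps confirm that the optimized BLAS and the GPUs are actually being picked up.

```python
# Environment sanity check: confirm NumPy is linked against an optimized
# BLAS and that PyTorch can see the node's GPUs. Assumes a CUDA-enabled
# torch build is available; both checks are informational only.
import numpy as np

print("NumPy BLAS/LAPACK configuration:")
np.show_config()  # should mention OpenBLAS, MKL, or similar

try:
    import torch
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
    else:
        print("PyTorch found, but no CUDA devices visible")
except ImportError:
    print("PyTorch not installed in this environment")
```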
XRootD (Distributed Data Access): We plan to integrate XRootD for data-heavy workflows that require distributed data access or caching. XRootD is a high-performance, scalable system, originally from high-energy physics, that provides fault-tolerant access to file-based data repositories ([Home Page | XRootD](http://xrootd.org/)). It allows the cluster to serve or fetch data from remote sources efficiently, using a plugin-based architecture and an asynchronous, parallel I/O protocol ([Home Page | XRootD](http://xrootd.org/)). In practical terms, if our researchers collaborate with external experiments (e.g., CERN LHC or large genomic databases), XRootD can stream data to the cluster on demand. The latest XRootD 5.x will be installed, possibly containerized for ease of deployment. While not all HPC centers use XRootD, those that handle “big data” from scientific collaborations often do, and supporting it enables use cases such as federated data analysis or acting as a caching proxy for remote data ([PDF] Paving the Way for HPC: An XRootD-Based Approach for Efficiency …).
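As a rough illustration, and assuming the XRootD 5.x Python bindings (the xrootd package) are installed, a user-side access pattern might look like the sketch below; the server URL and paths are placeholders, not real endpoints.

```python
# Illustrative sketch of remote data access via the XRootD Python bindings
# (the "xrootd" package shipped alongside XRootD 5.x). The server URL and
# paths below are placeholders, not real endpoints.
from XRootD import client
from XRootD.client.flags import OpenFlags

URL = "root://xrootd.example.org:1094"  # hypothetical federation endpoint

# List a remote directory.
fs = client.FileSystem(URL)
status, listing = fs.dirlist("/store/shared/dataset")
if status.ok:
    for entry in listing:
        print(entry.name)

# Stream part of a remote file without copying it locally first.
f = client.File()
status, _ = f.open(URL + "//store/shared/dataset/part-0001.root", OpenFlags.READ)
if status.ok:
    status, data = f.read(offset=0, size=1024 * 1024)  # first 1 MiB
    print(f"read {len(data)} bytes")
    f.close()
```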
All of the above software components are open source (except certain proprietary options like the Intel compilers or MATLAB, which we would include only as needed under academic licenses). Relying on open-source tools keeps software licensing costs minimal and allows flexibility in customization. Using the latest version of each component ensures we have up-to-date features and security patches: for instance, the latest Slurm has better scheduling features than older releases, and Open OnDemand 4.0 improves the interface and adds authentication enhancements ([XDMoD Tool | Ohio Supercomputer Center](https://www.osc.edu/supercomputing/knowledge-base/xdmod_tool)). By building the cluster with this modern stack, we ensure software compatibility with contemporary HPC workflows and make the system easy to maintain in the long term. The entire stack is also container-friendly – applications can be deployed in Singularity/Apptainer containers if needed, a practice seen at some centers (containers complement Spack for user-level portability).
To put the proposed cluster in context, we compare it to the research computing clusters at top R1 AAU universities. For concreteness, we consider leading academic supercomputers such as TACC’s Frontera (UT Austin), PSC’s Bridges-2 (Carnegie Mellon/UPitt), NCSA’s Delta (UIUC), SDSC’s Expanse (UC San Diego), and Purdue’s Anvil – these are among the most capable university-operated systems. We evaluate four key dimensions: performance, software and compatibility, scalability, and cost efficiency.
In terms of raw performance, large R1 university clusters far exceed what a modest-budget cluster can achieve – however, the gap must be viewed in the context of differing scales and usage.
Compute Power: Top academic supercomputers deliver performance on the order of petaflops. For example, TACC’s Frontera offers ~35–38 petaflops peak, making it the world’s most powerful academic supercomputer when it debuted ([New TACC Supercomputer Will Go Into Production Next Year | TOP500](https://www.top500.org/news/new-tacc-supercomputer-will-go-into-production-next-year/)). It comprises more than 8,000 Dell EMC nodes with dual 28-core Intel Xeon Platinum CPUs (roughly 450,000 cores in total) and reached #5 in the Top500 rankings in 2019 ([New TACC Supercomputer Will Go Into Production Next Year | TOP500](https://www.top500.org/news/new-tacc-supercomputer-will-go-into-production-next-year/)). Another example, Purdue’s Anvil, consists of ~1,000 nodes with dual 64-core AMD Milan CPUs (128,000 CPU cores total) and peaks at ~5.1 petaflops (Anvil ranked 143rd on list of world’s most powerful supercomputers). In contrast, our proposed cluster might have on the order of a few dozen CPUs and a handful of GPUs – likely achieving tens to a few hundred teraflops in practice. Even fully loaded with top-end GPUs, we might achieve <1 petaflop of mixed-precision AI performance. For instance, NCSA’s Delta, a recent NSF-funded cluster (~$10M cost), has 124 CPU nodes plus 200 GPU nodes and delivers ~6 petaflops (double precision) (NCSA’s Delta supercomputer gets AI partition - DCD). Our cluster will be an order of magnitude smaller. While it cannot compete on total throughput, it can still handle significant workloads for a single research lab or department. Individual node performance will actually be similar – we plan to use the same class of CPUs/GPUs available to the big centers. A single compute node in our cluster with 128 CPU cores and 4 A100 GPUs is comparable in capability to a standard node on Delta or Bridges-2; the difference is in the number of such nodes. Thus, for embarrassingly parallel workloads (which can be split across many independent jobs), a small cluster may suffice, but for extremely large MPI jobs (needing 1,000+ ranks), the big clusters have a clear advantage.
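The per-node comparison can be made concrete with a back-of-the-envelope calculation; the sketch below reuses the approximate FP64 figure quoted earlier and a hypothetical GPU-node count for our cluster, so the totals are illustrative rather than measured.

```python
# Back-of-the-envelope peak-FLOPS estimate for one GPU node and for the
# whole proposed cluster. Per-device numbers are the approximate FP64
# figures quoted in the text, not benchmark results; the node count is
# a hypothetical assumption.
A100_FP64_TFLOPS = 9.7   # per GPU, non-tensor-core FP64
GPUS_PER_NODE = 4
NODES_WITH_GPUS = 4      # hypothetical count for the modest cluster

node_peak = A100_FP64_TFLOPS * GPUS_PER_NODE
cluster_peak = node_peak * NODES_WITH_GPUS

print(f"per-node GPU peak : {node_peak:.1f} TF FP64")            # ~38.8 TF
print(f"cluster GPU peak  : {cluster_peak / 1000:.3f} PF FP64")  # ~0.155 PF
```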
Accelerators and Specialized Hardware: Large university clusters often have heterogeneous hardware to cater to different workloads. For example, PSC’s Bridges-2 provides GPU nodes (192 NVIDIA V100 GPUs in total) for AI as well as large-memory nodes with 4 TB of RAM for data analytics ([Bridges-2 | PSC](https://www.psc.edu/resources/bridges-2/)). NCSA’s Delta similarly has GPU partitions (with NVIDIA A100 and A40 GPUs) and is adding new NVIDIA H100 GPU nodes (the DeltaAI extension) to reach 600 petaflops of AI tensor performance (NCSA’s Delta supercomputer gets AI partition - DCD). Our cluster will have a much smaller accelerator count – perhaps 8–16 GPUs in total – but importantly, per-GPU performance is similar, since we would use the same NVIDIA A100/H100 technology if budget permits. This means that for a single GPU-heavy job (like training a deep learning model), our cluster can provide performance comparable to running that job on one node of the big system. The difference is that the big systems can run dozens of such jobs concurrently, or scale to multi-node distributed training more effectively. We also note that DPUs and other emerging technologies are not yet mainstream in those big clusters; if we include DPUs, our cluster can experiment with advanced network/storage offloads that even top systems are just beginning to evaluate (pDOCA: Offloading Emulation for BlueField DPUs - HPCKP). In summary, peak FLOPS and aggregate throughput are higher on R1 university clusters by one to two orders of magnitude, but on a per-node basis and for moderate job sizes, a well-chosen modest cluster delivers competitive performance for its scale.
Storage and I/O: Large HPC centers typically operate massive parallel storage systems. For instance, Frontera mounts TACC’s multi-petabyte Lustre filesystems (including the Stockyard global work file system), and Bridges-2 uses a Cray ClusterStor E1000 (Lustre-based) with flash acceleration ([Bridges-2 | PSC](https://www.psc.edu/resources/bridges-2/)). These systems can sustain hundreds of GB/s of throughput. Our cluster’s parallel file system will be smaller (perhaps tens of TBs of capacity) and slower (a few GB/s with a handful of storage servers). This could become a bottleneck for I/O-intensive workloads compared to the near-limitless storage of the big clusters. However, for most AI and HPC tasks within our scope, a smaller Lustre/BeeGFS deployment backed by NVMe drives still provides excellent performance. We can also mitigate the difference by using Globus to stage data to and from larger storage at a national center when needed – effectively leveraging big HPC storage for long-term data and our local storage for active computation.
In summary, the proposed cluster sacrifices peak performance and scale in exchange for cost savings and dedicated availability. It will handle small to medium-sized research projects well, but cannot match the extreme scale computations (e.g., multi-thousand-core climate simulations or national-scale AI training runs) that top R1 university clusters execute routinely. For those extremely large jobs, users would still need access to national HPC resources. Our cluster could be seen as a “feeder” or complement to those: work can be developed and tested on the modest cluster, and only the largest runs need to go to the big systems.
One of the strengths of our proposed design is that it uses a software stack very similar to what top research universities deploy, ensuring a high degree of compatibility.
Operating Environment: Leading HPC centers almost universally run Linux and use similar job schedulers and libraries. Slurm in particular is heavily used (Bridges-2, Expanse, and Anvil all schedule jobs with Slurm). By using Slurm, we match their job submission interface and policies ([MAAS blog | Open Source in HPC [part 5]](https://maas.io/blog/open-source-in-hpc-part-5)). This means job scripts written for our cluster will likely run on those clusters with minimal modification (only account or partition names might differ). We also plan to use the same or newer versions of compilers (GCC, Intel oneAPI) and MPI. Our use of Spack to manage software means we can provide many of the same scientific applications. Top university clusters maintain large module libraries – TACC’s Frontera, for example, offers hundreds of applications via Lmod modules (Frontera - TACC HPC Documentation). With Spack, we can achieve comparable breadth on a smaller scale. Users will find common tools like GROMACS, LAMMPS, MATLAB, Python, and R on both our cluster and the big ones. This compatibility is crucial for researchers who collaborate across institutions.
User Access and Tools: The inclusion of Open OnDemand also aligns with what many centers now offer. Open OnDemand provides a standardized web interface to HPC (RCAC - Knowledge Base: Anvil User Guide: Open OnDemand). If researchers have used OnDemand on another campus or an XSEDE/ACCESS resource, they will feel at home using it on our cluster. It abstracts away the scheduler and the command line, which is why many sites (including some top AAU universities) have deployed it to broaden usability ([XDMoD Tool | Ohio Supercomputer Center](https://www.osc.edu/supercomputing/knowledge-base/xdmod_tool)). For example, the Ohio Supercomputer Center, the Pittsburgh Supercomputing Center, and others run OnDemand portals for their clusters – by offering the same portal, we ensure workflow portability. Launching a Jupyter notebook through OnDemand on our cluster works exactly the same way as on a larger cluster that runs OnDemand.
Monitoring and Accounting: Large centers use accounting and metrics tools (some use XDMoD, others use in-house solutions). By using Open XDMoD, we mirror the approach used in NSF’s XSEDE network for usage tracking (ubccr/xdmod: An open framework for collecting and … - GitHub). This means that, if needed, we could aggregate or compare our usage with national metrics. It also ensures we collect the right statistics to optimize the software environment, just as the big centers do (for example, identifying under-utilized nodes or I/O hotspots).
Scientific Software Versions: Using the latest versions of software in our cluster may actually give us an edge in compatibility moving forward. Sometimes big clusters run slightly older OS or software due to their slower upgrade cycles (for stability). For instance, a center might still have CentOS 7 environment modules for legacy reasons, whereas we might go with Rocky 9 and newest libraries. But with Spack and containers, we can also offer legacy versions if needed. Overall, any code developed on our cluster with standard libraries (MPI, BLAS, etc.) should compile and run on big clusters, because those provide the same or equivalent libraries (e.g., Intel MKL, FFTW, OpenMPI etc. are ubiquitous). The reverse is also largely true: codes from those clusters can run on ours, since we have maintained compatibility with standards.
In short, the software stack we propose is in harmony with those at top research computing centers. We emphasize using common, widely-supported tools (Linux, Slurm, MPI, OpenMP, Jupyter, etc.) – these are essentially industry standards in scientific computing. This ensures that researchers do not face a steep learning curve moving between our cluster and a larger facility. The compatibility also extends to data handling: using Globus for transfers is common at large centers, so our users can directly transfer data between our cluster’s Globus endpoint and, say, Frontera’s Globus endpoint easily, with high speed (Frontera). Overall, by not deviating into any proprietary or unusual software, we ensure maximum portability and interoperability.
Scalability refers to both the ability to scale up the size of the cluster (more nodes, more users) and the ability to scale up particular workloads to more resources. We compare how our cluster and large university clusters fare in this regard:
Cluster Size and Expansion: The proposed cluster starts small (perhaps 5–20 nodes), but thanks to tools like Warewulf and Slurm it is architecturally scalable. We can add compute nodes relatively easily – the provisioning system and scheduler handle additional nodes with minimal reconfiguration, and Warewulf’s design specifically targets managing large clusters consistently ([Install your HPC Cluster with Warewulf | SUSE Communities](https://www.suse.com/c/install-your-hpc-cluster-with-warewulf/)). The limiting factor will be budget, but if more funds become available later, we can double or triple the node count and still use the same head node and network (up to the InfiniBand switch port limit). In contrast, top R1 clusters are built at scale from day one – Bridges-2, for example, started with roughly 560 compute nodes of various types and added 10 new GPU nodes in 2025 as an upgrade ([Bridges-2 | PSC](https://www.psc.edu/resources/bridges-2/)). Those systems have plenty of headroom in infrastructure (cooling, network, etc.) to grow, but their expansions are also gated by major funding infusions. For a modest cluster, expanding from, say, 8 GPUs to 16 GPUs is a big improvement for our users, and far simpler than a large center trying to expand by 2x (which could mean integrating hundreds of new nodes). In summary, our cluster is scalable up to a point – likely to a mid-size cluster – but will not reach the extreme node counts of the big systems, simply due to physical and budget limitations. Crucially, though, the architecture does not inherently bottleneck at small scale; it uses the same technologies that operate at large scale in big clusters.
Workload Scalability: Large HPC centers support jobs that scale to thousands of cores or hundreds of GPUs concurrently (for example, an MPI job using 200 nodes on Frontera, or a distributed training job using 128 GPUs on Delta). Our cluster cannot run a single job at that scale because it will not have that many nodes. Thus, certain highly scalable workloads cannot be fully realized on our cluster. We anticipate most usage will be small to moderate parallel jobs (e.g., 4–16 GPUs at most for an AI run, or 100–200 CPU cores for an MPI job). This covers a wide range of science, but a user whose code can in theory use 1,000+ cores will hit a wall at our cluster’s total core count (a few hundred to a couple of thousand cores, depending on the final node count). In such cases they would have to move to a larger cluster to see further scaling. We can mitigate this by encouraging hybrid usage: do development and scaling tests on our cluster up to its limit, then apply for time on a bigger resource for production runs. This is a common model – even researchers at well-endowed universities often use campus clusters for development and national centers for the largest runs.
Multi-User Scalability: Top university clusters often serve hundreds or thousands of users across many departments. Our cluster will likely have a smaller user base (maybe a single department or a few collaborating groups). This means contention for resources is lower, which is actually a benefit for our users – they might get their jobs scheduled faster on our cluster than in the busy queues of a national supercomputer. However, if our user base grows, Slurm scheduling policies can ensure fairness. We can implement account-level quotas or QoS limits just like big centers do, albeit on a smaller scale. The Open OnDemand portal will scale to multiple simultaneous interactive sessions, though heavy interactive use could be an issue if not enough nodes are available (big centers often have special interactive partitions or many login nodes to handle dozens of simultaneous GUI sessions). With our modest cluster, we’d likely limit how many interactive jobs run at once to avoid overloading.
In summary, scalability is a relative strength of large R1 clusters – they are purpose-built to scale both the system and individual jobs to very high levels. Our cluster is architected using scalable components and can grow, but only within the bounds of being a small-to-medium cluster. For the intended use (a modest budget, departmental resource), this is acceptable. We ensure that we are not using anything that would prevent scaling (thus avoiding proprietary or low-end solutions); if more funding comes, one could imagine our cluster evolving into something larger over time. The biggest difference is that the top R1 clusters can accommodate the world’s most scaling-hungry scientific computations, whereas ours will top out at moderate scales.
Considering cost is paramount given our “modest budget” constraint. We examine how our cluster’s cost efficiency compares to the large university systems, and how to maximize the performance per dollar.
Capital Cost vs. Performance: The top R1 university clusters are funded by multi-million-dollar grants (NSF or institutional). For example, Frontera cost on the order of $60 million to build ([New TACC Supercomputer Will Go Into Production Next Year | TOP500](https://www.top500.org/news/new-tacc-supercomputer-will-go-into-production-next-year/)), and Bridges-2 had a $10 million NSF award for its construction ([Bridges-2 | PSC](https://www.psc.edu/resources/bridges-2/)). These large investments buy a lot of hardware – Frontera’s cost equates to roughly $1.7M per peak petaflop, and Bridges-2’s ~$10M yields ~5 petaflops peak (CPUs plus GPUs), or about $2M per petaflop. Our cluster will operate in a different regime – perhaps a few hundred thousand dollars in total. We have to be strategic to get the best performance for the money. Using commodity hardware and open-source software is key: we avoid vendor lock-in and software licensing fees (the entire software stack is free). We also choose hardware at a sweet spot: AMD CPUs typically offer more cores per dollar than top-bin Intel CPUs, and previous-generation GPUs (like the NVIDIA A100) can be purchased at a discount now that the H100 has shipped – prices typically drop as new models release, and we can capitalize on that. We will not match the big systems in absolute performance, but we may achieve a similar or better cost-to-performance ratio at a smaller scale. For example, if we spend $300k to get ~0.5 petaflop of capability, that is $0.6M per petaflop – nominally better than the large-system ratio – though it is not entirely fair to scale linearly like that, because small clusters cannot reach the utilization levels of large ones.
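The cost-per-petaflop comparison above reduces to simple arithmetic; the sketch below reuses the round numbers quoted in this section, which are illustrative figures rather than procurement quotes.

```python
# Cost-efficiency comparison sketch using the round numbers quoted above.
# All values are illustrative, not procurement data.
systems = {
    "Frontera (approx.)":       {"cost_musd": 60.0, "peak_pflops": 35.0},
    "Bridges-2 (approx.)":      {"cost_musd": 10.0, "peak_pflops": 5.0},
    "Proposed cluster (guess)": {"cost_musd": 0.3,  "peak_pflops": 0.5},
}

for name, s in systems.items():
    musd_per_pf = s["cost_musd"] / s["peak_pflops"]
    print(f"{name:26s} ~${musd_per_pf:.2f}M per peak petaflop")
```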
Utilization and Operational Efficiency: Cost efficiency isn’t just about purchase cost; it’s also about utilization. Large centers have to support a wide community and often run at high utilization (90%+ busy). However, some of that time may be used by jobs that checkpoint, wait in queues, etc. On a smaller cluster dedicated to fewer users, one can potentially achieve very high useful utilization for those users. There’s little idle time if users are active, and short queue wait times mean researchers get results faster (time is money too in research productivity). So for a given research group, having their own cluster, even if smaller, might yield more scientific output per dollar spent than waiting in line on a bigger machine – especially for moderate jobs. On the other hand, large clusters benefit from economies of scale in operations (professional staff, data center cooling, etc.). We have to consider maintenance costs: a modest cluster might not have the advanced cooling or power distribution of a data center, but we avoid those overhead costs by operating at a smaller scale (e.g., we might house it in an existing server room with standard cooling). Power consumption is proportional to size; our cluster will draw far less power (few kW) compared to a huge cluster (which can be in the MW range). Thus, electrical and cooling costs will be lower, contributing to cost efficiency in operations.
Upgrade Path: Because of its modest size, the cluster can be upgraded incrementally and relatively inexpensively. Large clusters, once built, typically stay in service for about five years and are then replaced under a new grant – upgrades in between are major undertakings (like Bridges-2 adding a new GPU pod under a $4.9M NSF award ($4.9M NSF Award Funds Major Enhancement to Bridges-2 System)). Our cluster could be upgraded gradually: adding more memory to nodes next year, or replacing older GPUs with new ones as budgets allow. This incremental approach can extend the cluster’s life with small infusions of funds, which is often not possible on huge systems (where a technology refresh is all-or-nothing).
Cloud vs On-Prem: It’s worth noting cost efficiency relative to cloud computing as well. Sometimes, for modest needs, one might consider using cloud HPC instances instead of building a cluster. However, cloud costs (for GPU instances especially) accumulate quickly and often far exceed the cost of owning hardware if the hardware is utilized regularly. Our on-prem cluster, once purchased, can run jobs 24/7 without additional cost per use (aside from power). Many universities found that owning shared HPC is more cost-effective for continuous research computing than renting equivalent capacity. By building an on-prem cluster, we get fixed predictable costs and can optimize for our specific workloads (e.g., fast local scratch, high interconnect – things that are expensive or not available in cloud).
In conclusion, while our cluster cannot compete with multi-million dollar facilities on raw performance, it can be highly cost-efficient for its scale and purpose. By carefully selecting open-source solutions and commodity hardware, we maximize performance per dollar. The cluster provides great value especially when fully utilized by its intended user base. In comparison to top R1 HPC resources, the absolute cost is much lower and thus the risk is lower; yet the benefits (control, immediate access, customization) are significant for the researchers. For many workloads, the marginal cost of running on our cluster (which is already bought and paid for) is effectively zero, whereas on external systems one might have limits or allocations. This gives researchers freedom to experiment more, which can lead to innovation. Therefore, from a cost/benefit perspective, a modest HPC cluster is an excellent complement to larger shared resources – it fills the gap for everyday computational needs in a very cost-effective way, while the big centers handle the extraordinary large-scale needs.
Hardware Recommendations: Based on the analysis, we recommend investing in a balanced set of compute nodes with AMD EPYC CPUs (for cost-effective high core counts) and NVIDIA GPUs (A100 as a proven option, or a mix of A100 and a couple of the latest H100 if specific AI workloads demand the absolute state of the art). Ensure at least one node has extra-large memory (512 GB – 1 TB) if your workload mix includes memory-intensive tasks – this avoids having to offload such tasks to external resources. Incorporate an InfiniBand HDR100 network, which offers a good price-to-performance point and can scale to a moderate cluster size. For storage, start with an open-source Lustre parallel file system deployed on a small set of storage servers (using SSD/NVMe for metadata and a combination of SSD and HDD for bulk storage); this will give high I/O throughput internally. If Lustre administration is deemed too complex for the team, BeeGFS is a viable alternative known for easier setup while still providing parallel access. Plan the rack layout and power such that there is room (and power/cooling headroom) to add perhaps 25–50% more nodes in the future – this ensures the cluster can grow with needs. DPUs are an optional but forward-looking addition: if the budget can stretch, consider adding NVIDIA BlueField-2 DPUs on the storage and head nodes, which can offload networking and potentially enable advanced features (like isolating user I/O traffic). If the budget is tight, DPUs can be skipped initially without hurting base functionality – just keep an eye on the technology for future upgrades as prices fall, since it aligns with the trend toward in-network computing in HPC.
Software Stack Recommendations: Deploy the OpenHPC stack (latest release) as a baseline to get Warewulf, Slurm, and common libraries installed quickly. On top of that, layer Open OnDemand 4.x to provide a user-friendly interface; this will significantly improve adoption and user satisfaction, as users who are not HPC specialists can access cluster resources easily through a web browser (RCAC - Knowledge Base: Anvil User Guide: Open OnDemand). Configure interactive apps in OnDemand (Jupyter, RStudio, MATLAB if licensed) to cater to AI/data-science workflows – documentation from other universities (e.g., UC Berkeley, FSU) is available on setting these up (Open OnDemand Usage - UCR’s HPCC). For software deployment, use Spack as the primary package manager for scientific applications – it allows you to deploy updates or new packages on request far faster than building by hand. Also maintain environment modules (Lmod) for users who prefer module load semantics, possibly generating module files from Spack installations. Enable Open XDMoD and integrate it with Slurm’s accounting logs so that you can track usage per user/project and produce reports for stakeholders. XDMoD’s latest version can also monitor GPU usage and job efficiency (ubccr/xdmod: An open framework for collecting and … - GitHub) ([XDMoD Tool | Ohio Supercomputer Center](https://www.osc.edu/supercomputing/knowledge-base/xdmod_tool)), which will help in optimizing the cluster (for example, identifying jobs that bottleneck on I/O or memory). Keep all these software components updated periodically, but avoid disrupting active users – schedule maintenance windows for upgrades the way big centers do (since our cluster is smaller, updates will be faster, but they should still be planned).
One specific recommendation on MPI libraries: provide multiple MPI implementations (OpenMPI and MPICH/Intel MPI) via Spack, as some applications perform better with one or the other. Ensure the InfiniBand interface is correctly configured for MPI (using UCX or OFI drivers). Testing MPI bandwidth and latency with tools like osu-microbenchmarks after setup is advised to verify the cluster is getting the expected low-latency communication.
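In addition to the OSU microbenchmarks, a crude mpi4py ping-pong (sketched below, assuming mpi4py is installed) can serve as a first smoke test that MPI traffic is actually moving over InfiniBand; it is a rough check, not a substitute for the real benchmark suite.

```python
# Crude point-to-point bandwidth check with mpi4py. Not a replacement for
# the OSU microbenchmarks; just a quick smoke test that MPI works across
# nodes. Run across two nodes, e.g.:
#   mpirun -np 2 --map-by node python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "run with exactly 2 ranks"

nbytes = 4 * 1024 * 1024                 # 4 MiB message
buf = np.zeros(nbytes, dtype=np.uint8)
reps = 100

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    else:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
elapsed = MPI.Wtime() - t0

if rank == 0:
    one_way_ms = elapsed / (2 * reps) * 1e3           # avg per 4 MiB transfer
    bandwidth = (2 * reps * nbytes) / elapsed / 1e9   # effective GB/s
    print(f"avg one-way transfer time (4 MiB): {one_way_ms:.3f} ms")
    print(f"effective bandwidth: {bandwidth:.2f} GB/s")
```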
For data management, set up a Globus Connect Server endpoint on the Lustre/BeeGFS filesystem. This requires a Globus subscription (if the institution does not already have one; many universities hold campus licenses). The benefit is substantial: easy, reliable data transfer for users and the ability to share data securely with external collaborators (hpc.nih.gov). It complements the HPC usage by solving the data-movement challenge. Additionally, consider a backup or archive strategy for important data – perhaps an on-campus object store or tape library if available (some big clusters have tape archives; for us, syncing important data to a cloud bucket or an external NAS may suffice).
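For scripted or recurring transfers, the Globus Python SDK (globus-sdk) can drive the same endpoint programmatically. The sketch below is illustrative only: the endpoint UUIDs, paths, and access token are placeholders, and real use requires a Globus authentication flow and valid collection IDs on both sides.

```python
# Sketch of a scripted Globus transfer using the globus-sdk package.
# The collection UUIDs, paths, and token are placeholders; a real setup
# needs a Globus auth flow and the actual collection IDs.
import globus_sdk

TOKEN = "REPLACE_WITH_TRANSFER_ACCESS_TOKEN"                  # from a Globus auth flow
LOCAL_COLLECTION = "11111111-2222-3333-4444-555555555555"     # our cluster's endpoint
REMOTE_COLLECTION = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"    # e.g. a center's endpoint

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

task = globus_sdk.TransferData(tc, LOCAL_COLLECTION, REMOTE_COLLECTION,
                               label="stage results to archive")
task.add_item("/lustre/project/results/run42/", "/archive/run42/",
              recursive=True)

response = tc.submit_transfer(task)
print("submitted Globus task:", response["task_id"])
```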
Cost and Performance Trade-offs: Given the modest budget, prioritize spending on components that directly impact computational performance. This typically means CPUs, GPUs, and RAM take precedence. Network (InfiniBand) is also critical for multi-node work, but you might not need the absolute latest (HDR100 vs HDR200 saves cost while still very fast). Storage is vital but can be scaled gradually – perhaps start with a smaller parallel file system and expand capacity as data needs grow to avoid large upfront costs. If forced to choose due to budget, it might be better to get an extra GPU or two rather than the latest DPU or the largest memory, because GPUs will yield more immediate performance benefits for AI workloads. However, ensure you meet a baseline of memory per core (at least 1–2 GB/core) or the CPUs can’t be fully utilized. One trade-off to consider: new vs. slightly older generation hardware. Often in HPC procurement, one generation older CPUs/GPUs can be significantly cheaper while only marginally less powerful. For instance, AMD “Milan” EPYC CPUs (2021) might be cheaper than “Genoa” (2023) but still provide excellent performance per dollar. Similarly, NVIDIA A100 GPUs (Ampere) might offer a better deal than brand-new H100s (Hopper) where the price premium is very high for maybe 2x performance. Since we want the latest software but not necessarily the bleeding-edge hardware if it blows the budget, a pragmatic mix could be: AMD Milan CPUs, NVIDIA Ampere GPUs, and a plan to upgrade to next-gen CPUs/GPUs in a couple of years when prices normalize.
Comparison Summary: Our cluster design holds up well in software capability and user experience against the big R1 university clusters – by using the same open-source tools, we ensure researchers get a familiar and powerful environment. In performance, we have acknowledged the gap: we recommend communicating to stakeholders that this cluster is optimized for throughput and convenience on typical research workloads, not for the largest hero calculations that require huge scale. Those can be outsourced to national HPC centers as needed (and our software environment makes that transition easy). Essentially, we position the cluster along an “80/20 rule”: it can handle roughly 80% of computational tasks in-house (daily analyses, medium-scale simulations, model development and testing, data processing, etc.), while for the largest ~20% of experiments users might still turn to external supercomputers. This approach maximizes return on investment, since the in-house cluster will be heavily used for the bulk of the work and expensive external resources are tapped only when truly necessary.
Conclusion: Implementing the proposed multipurpose AI HPC cluster with the described hardware and software stack will create a powerful, flexible computing resource tailored to our researchers’ needs. By integrating modern cluster management (Warewulf), web portal access (Open OnDemand), comprehensive monitoring (XDMoD), and a rich set of scientific tools (via Spack, Jupyter, RStudio, MPI, etc.), we ensure both high performance and high usability. Our recommendations emphasize using open-source, widely adopted solutions – this not only controls costs but also aligns our cluster with the environments at top research institutions. The comparison with leading R1 university clusters shows that while we operate at a smaller scale, we are adopting best practices and technologies from those environments. This means our users will benefit from a “miniaturized” version of a world-class HPC center – one that is cost-efficient and can be managed with the resources we have. Overall, this cluster will significantly enhance our computational research capabilities, providing a foundation for advances in data science, AI, and simulation-driven science on campus, and bridging smoothly to larger-scale resources when needed. With careful planning and execution, the cluster will be a scalable, compatible, and efficient addition to our research infrastructure, enabling new discoveries while respecting budgetary limits.
Sources:
Texas Advanced Computing Center – Frontera supercomputer announcement ([New TACC Supercomputer Will Go Into Production Next Year | TOP500](https://www.top500.org/news/new-tacc-supercomputer-will-go-into-production-next-year/))
PSC Bridges-2 overview and features ([Bridges-2 | PSC](https://www.psc.edu/resources/bridges-2/))
Open XDMoD overview (OSC) – metrics and utilization monitoring ([XDMoD Tool | Ohio Supercomputer Center](https://www.osc.edu/supercomputing/knowledge-base/xdmod_tool))
Warewulf documentation (SUSE) – HPC cluster provisioning introduction ([Install your HPC Cluster with Warewulf | SUSE Communities](https://www.suse.com/c/install-your-hpc-cluster-with-warewulf/))
Open source in HPC (MAAS blog) – Slurm usage in the Top500 and OnDemand with Slurm ([MAAS blog | Open Source in HPC [part 5]](https://maas.io/blog/open-source-in-hpc-part-5))