This document summarizes the process of setting up a reinforcement learning environment for a genomics task on a Slurm-based HPC cluster, then developing the code and running it first on the CPU and then on a GPU.
We began by discussing the best practices for managing Python environments on an HPC cluster. We concluded that while Conda is user-friendly, its large number of small files can strain shared filesystems. We opted for Micromamba, a lightweight and fast alternative.
I installed Micromamba in the user’s home directory:
# Micromamba was installed to:
/home/forsythc_ucr_edu/bin/micromamba
Our goal was to create an environment for a reinforcement learning project in genomics. We created a Micromamba environment named rl_genomics.
Our first attempt to create the environment with tensorflow-agents failed due to package incompatibilities. We then pivoted to a PyTorch-based approach, which is well suited to the available hardware.
The final environment was created with the following command:
/home/forsythc_ucr_edu/bin/micromamba create -n rl_genomics python=3.10 pytorch jupyter -c conda-forge -y
We then installed the necessary Python packages using pip:
source ~/.bashrc
micromamba activate rl_genomics
pip install torchrl stable-baselines3[extra] gymnasium ale-py pygame
We started by creating a Python script, genomics_rl.py, that uses a simple Q-learning algorithm to design a DNA sequence that binds to a specific transcription factor.
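The full script is not reproduced here, but the core idea can be sketched as follows. This is a minimal, illustrative reconstruction, not the actual genomics_rl.py: the target motif, the per-position match reward, and all hyperparameters are assumptions made for the example.

# Minimal, illustrative Q-learning sketch (not the actual genomics_rl.py):
# the target motif, match-based reward, and hyperparameters are assumptions.
import random

BASES = "ACGT"
TARGET = "TATAAT"                       # assumed target binding motif
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration
EPISODES = 5000

# State = position in the sequence; action = which base to place there.
Q = [[0.0] * len(BASES) for _ in range(len(TARGET))]

for _ in range(EPISODES):
    for pos in range(len(TARGET)):
        # Epsilon-greedy choice among the four bases.
        if random.random() < EPSILON:
            a = random.randrange(len(BASES))
        else:
            a = max(range(len(BASES)), key=lambda b: Q[pos][b])
        # Reward 1 for matching the target base at this position, else 0.
        r = 1.0 if BASES[a] == TARGET[pos] else 0.0
        # One-step Q-learning update; the last position is terminal.
        best_next = max(Q[pos + 1]) if pos + 1 < len(TARGET) else 0.0
        Q[pos][a] += ALPHA * (r + GAMMA * best_next - Q[pos][a])

designed = "".join(BASES[max(range(len(BASES)), key=lambda b: Q[p][b])]
                   for p in range(len(TARGET)))
print("Designed sequence:", designed, "| target:", TARGET)

After training, the greedy policy reads out one base per position, which is how the learned sequence can be compared against the target motif.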
The script was run on the CPU using the following commands:
source ~/.bashrc
micromamba activate rl_genomics
python genomics_rl.py
The script successfully trained the agent, which learned to generate sequences that were close to the target motif.
To leverage the available L4 GPUs on the cluster, we adapted our script to use PyTorch for GPU-based computation.
We first investigated the Slurm configuration to determine the correct partition and GPU type. We used the following commands:
# List each partition and its generic resources (GRES)
sinfo -o "%P %G"
# Show the configuration of the gpu partition
scontrol show partition gpu
# Inspect one of the GPU nodes
scontrol show node cluster-g2s4nodeset-0
To definitively identify the GPU, we submitted a simple Slurm job to run nvidia-smi:
sbatch --wait check_gpu.sbatch
The output confirmed the presence of an NVIDIA L4 GPU.
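A complementary check, not part of the original workflow but useful from inside the rl_genomics environment on a GPU node, is to ask PyTorch directly (a minimal sketch):

import torch

# Prints True and the device name (e.g., "NVIDIA L4") when a CUDA GPU is visible
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))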
I created a new script, genomics_rl_gpu.py, a modified version of the original script that uses PyTorch tensors and automatically moves them to the GPU when one is available.
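The full script is not reproduced here, but the device-selection pattern it relies on is the standard PyTorch idiom (the Q-table shape below is an assumed example):

import torch

# Fall back to the CPU when no CUDA device is visible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Tensors created with device=... live on the GPU when one is available,
# e.g., a Q-table for a 6-position sequence over the 4 bases (assumed shape)
q_table = torch.zeros((6, 4), device=device)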
We then created a Slurm submission script, run_on_gpu.sbatch, to run the GPU-enabled script. After an initial failure due to an incorrect constraint, we arrived at the following working script:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --output=genomics_rl_gpu.out
#SBATCH --error=genomics_rl_gpu.err

# Activate the rl_genomics environment, then run the GPU-enabled script
source ~/.bashrc
micromamba activate rl_genomics
python genomics_rl_gpu.py
The job was submitted to the Slurm queue using:
sbatch run_on_gpu.sbatch
The job ran successfully on an L4 GPU, and the output file, genomics_rl_gpu.out, confirmed that the script used the cuda device.
Together, these steps demonstrate a complete workflow for setting up a reinforcement learning environment, developing code, and running it on both the CPU and GPU resources of an HPC cluster.