This article will guide you through running an example genomics workflow using Nextflow on UCR’s High Performance Computing Cluster (HPCC), also known as Ursa Major. Nextflow is a powerful workflow management system that simplifies the creation and execution of complex computational pipelines, particularly in bioinformatics. This guide will help you get started with a basic Nextflow workflow on HPCC, leveraging Slurm for job scheduling.
Before you begin, make sure you can log in to HPCC. The preferred method to access HPCC is through the web console's browser-based SSH terminal.
Visual Aid: Screenshot of the UCR Research Computing web console login page and the option to launch a web-based SSH terminal.
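If you prefer a local terminal instead of the web console, you can also connect with a standard SSH client. The hostname below is the commonly used HPCC login address, but verify it against UCR Research Computing's documentation before relying on it:
ssh <your_username>@cluster.hpcc.ucr.edu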
Nextflow needs to be available in your environment on HPCC. A recommended way to manage software on HPCC is through environment modules.
Check if Nextflow is available as a module: In your SSH terminal, run the command:
module avail nextflow
If Nextflow is available as a module: Load the desired Nextflow version using the module load command. For example:
module load nextflow
Replace nextflow with a specific version if needed, e.g., module load nextflow/23.10.0.
If Nextflow is not available as a module (or if you need a specific version): You can install Nextflow locally in your home directory. Here's how to install it using curl:
curl -s https://get.nextflow.io | bash
This command downloads and installs Nextflow in your current directory. To make it easily accessible, move the nextflow executable to a directory in your PATH, such as ~/bin (create ~/bin if it doesn't exist, and add export PATH=$PATH:$HOME/bin to your ~/.bashrc or ~/.bash_profile for future sessions).
mkdir -p ~/bin
mv nextflow ~/bin/
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc
source ~/.bashrc # Apply changes to the current session
Verify Nextflow installation: After either loading the module or installing it locally, check if Nextflow is correctly installed by running:
nextflow -version
This should display the installed Nextflow version.
For this example, we will use a simple Nextflow workflow that runs FastQC, a popular tool for quality control of sequencing data.
Create a file named fastqc_workflow.nf using a text editor like nano or vim on HPCC.
nano fastqc_workflow.nf
Paste the following Nextflow script into fastqc_workflow.nf:
#!/usr/bin/env nextflow

params.reads = './path/to/your/reads/*.fastq.gz' // Replace with your reads path

process fastqc {
    container 'quay.io/biocontainers/fastqc:0.12.1--0' // Using a Biocontainer image for reproducibility
    publishDir '.', mode: 'copy'                       // Copy results back to the launch directory

    input:
    path(reads)

    output:
    path "fastqc_results"

    script:
    """
    mkdir fastqc_results
    fastqc ${reads} -o fastqc_results
    """
}

workflow {
    fastqc(Channel.fromPath(params.reads))
}
Explanation of the script:
- #!/usr/bin/env nextflow: Shebang line so the script is executed with Nextflow.
- params.reads = './path/to/your/reads/*.fastq.gz': Defines a reads parameter that specifies the path to your input FASTQ files. You need to replace './path/to/your/reads/*.fastq.gz' with the actual path to your FASTQ files on HPCC. For testing purposes, you can download example FASTQ files to your HPCC account or use publicly available datasets.
- process fastqc { ... }: Defines a Nextflow process named fastqc.
- container 'quay.io/biocontainers/fastqc:0.12.1--0': Specifies a container image from Biocontainers to run FastQC. This helps ensure reproducibility.
- publishDir '.', mode: 'copy': Copies the declared outputs back to the directory from which you launch the workflow.
- input: path(reads): Declares that the process takes its input from the reads variable, which is expected to be a path to one or more files.
- output: path "fastqc_results": Declares that the process produces a directory named fastqc_results.
- script: ...: Contains the shell commands executed by the process: mkdir fastqc_results creates the output directory, and fastqc ${reads} -o fastqc_results runs FastQC on the input reads and writes the reports to fastqc_results.
- workflow { fastqc(Channel.fromPath(params.reads)) }: Defines the workflow, which in this simple case creates a channel of FASTQ files matching params.reads and runs the fastqc process on them.
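Note that the container directive only takes effect when a container engine is enabled for the run. On HPC systems this is usually Singularity/Apptainer. The nextflow.config below is a minimal sketch that assumes Singularity is available on HPCC (for example via a module you load in your job script); check module avail singularity and adjust to what the cluster actually provides:
// nextflow.config -- minimal sketch; assumes Singularity/Apptainer is installed on the cluster
singularity {
    enabled    = true   // run process containers with Singularity
    autoMounts = true   // automatically bind the host paths Nextflow needs inside the container
}
Place this file in the same directory as fastqc_workflow.nf; Nextflow picks it up automatically. If Singularity is provided as a module, you may also need to load it in your Slurm script before running Nextflow.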
To run this Nextflow workflow on HPCC, you need to submit it as a Slurm job.
You have already created the Nextflow script fastqc_workflow.nf in the previous step. Make sure to edit the params.reads line in fastqc_workflow.nf to point to the correct location of your input FASTQ files on HPCC. If you don't have your own data readily available, you can use a public dataset or create dummy files for testing (see the example below).
For example, if your FASTQ files are in /home/<your_username>/data/fastq/, you would change the line to:
params.reads = '/home/<your_username>/data/fastq/*.fastq.gz'
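If you just want to smoke-test the workflow without real data, you can create a tiny gzipped FASTQ file. The commands below are only an illustration and reuse the example path above; adjust them to your own directory layout:
mkdir -p /home/<your_username>/data/fastq
cat > /home/<your_username>/data/fastq/test_sample.fastq << 'EOF'
@read1
ACGTACGTACGTACGT
+
IIIIIIIIIIIIIIII
EOF
gzip /home/<your_username>/data/fastq/test_sample.fastq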
Create a Slurm submission script, for example run_fastqc.sh, using a text editor:
nano run_fastqc.sh
Paste the following Slurm script into run_fastqc.sh:
#!/bin/bash
#SBATCH --job-name=nf_fastqc
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4GB
#SBATCH --output=fastqc_job.out
#SBATCH --error=fastqc_job.err
module load nextflow # Or ensure nextflow is in your PATH if locally installed
nextflow run fastqc_workflow.nf
Explanation of the Slurm script:
- #!/bin/bash: Shebang line to execute the script with bash.
- #SBATCH --job-name=nf_fastqc: Sets the job name to nf_fastqc (you can customize this).
- #SBATCH --nodes=1: Requests 1 node. For this simple workflow, 1 node is sufficient.
- #SBATCH --cpus-per-task=2: Requests 2 CPU cores per task (Nextflow process). Adjust this based on the resource needs of your actual workflow.
- #SBATCH --mem=4GB: Requests 4 GB of memory. Adjust based on your workflow's memory requirements.
- #SBATCH --output=fastqc_job.out: Specifies the file for standard output.
- #SBATCH --error=fastqc_job.err: Specifies the file for standard error.
- module load nextflow: Loads the Nextflow module. If you installed Nextflow locally, ensure that the directory containing the nextflow executable is in your PATH instead of using module load.
- nextflow run fastqc_workflow.nf: The command that executes your Nextflow workflow script.
Important Notes for UCR HPCC Slurm scripts:
Do not include #SBATCH --time, #SBATCH --mail-user, #SBATCH --mail-type, or #SBATCH --account in your Slurm submission scripts on UCR HPCC. Time limits are managed by the system, and email notifications and allocation names are not required in SBATCH directives, per UCR Research Computing's recommendations.
Submit the Slurm job using the sbatch command from the directory where you saved both fastqc_workflow.nf and run_fastqc.sh:
sbatch run_fastqc.sh
This command submits the run_fastqc.sh script to the Slurm scheduler. Slurm will then queue and execute your Nextflow workflow based on the requested resources.
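If the submission is accepted, sbatch responds with the ID assigned to your job; the number shown here is only illustrative:
Submitted batch job 123456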
You can monitor the status of your Slurm job using commands like squeue. To see jobs for your user:
squeue -u <your_username>
Replace <your_username> with your UCR username.
You can also check the output and error files (fastqc_job.out and fastqc_job.err) for progress information and any errors that might occur during workflow execution.
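For example, you can follow the standard output file as the job runs:
tail -f fastqc_job.out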
Nextflow also provides its own monitoring capabilities. In the launch directory it writes a .nextflow.log file and a .nextflow/ cache directory, and it stores each task's intermediate files under the work/ directory. You can examine these logs for more detailed information about the workflow's progress.
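You can also ask Nextflow to list the runs it has recorded in that log:
nextflow log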
Once the Nextflow workflow completes successfully, the files declared by the output directive are copied to the location given by publishDir. In our example, the FastQC results will be in the fastqc_results directory created in the same directory where you ran the workflow.
You can use standard Linux commands like ls, cd, and scp (if you need to transfer files to your local machine) to access and manage your output files.
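For example, to copy the results to your local machine, you could run something like the following from your local terminal; the hostname is an assumption, so substitute the login address and path you actually use:
scp -r <your_username>@cluster.hpcc.ucr.edu:/path/to/your/workflow/fastqc_results .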
If you encounter issues or have further questions, please don't hesitate to reach out to UCR Research Computing for assistance.
We are here to help you make the most of HPCC for your research!