The CNIO HPC cluster

These are some guidelines to using our HPC cluster and the Slurm workflow system.

Cluster resources

Compute nodes

The cluster currently features 12 compute nodes with the following configurations:

Count  Node names  CPU cores  RAM    GPUs
1      bc001       24         32Gb   -
6      bc00[2-7]   52         512Gb  -
3      bc00[8-10]  128        1Tb    -
1      hm001       224        2Tb    -
1      gp001       112        768Gb  2 x Nvidia A100 80Gb


Storage

The cluster features 120Tb of standard storage (where users' home directories are mounted) and 512Tb of high-performance storage (for the input and output of compute jobs).

Paragraph for adding to grant applications, etc.

The CNIO HPC cluster currently consists of: two login nodes working in active/passive mode (2x 40 core/392Gb RAM); a total of 728 compute cores distributed across 9 "standard" compute nodes (6x 52 core/512Gb RAM and 3x 64 core/768Gb RAM) and a high-memory node (1x 224 core/2Tb RAM); and a dedicated GPU node featuring 2 x Nvidia A100 80Gb GPUs. Local storage consists of 48 4TB disks in a dual-channel enclosure organized into a RAID-10 unit with 172TB of effective available storage, which hosts the home directories, plus a high-performance Lustre storage system with 512TB of effective available storage for computation. The cluster runs a Slurm queuing system with fairshare priority management.

Cluster usage

Introduction to HPC

See this tutorial for an introduction to HPC concepts and usage (highly recommended if you have little or no experience in using HPC environments).

Requesting access

Please fill out the access request form in order to request access to the cluster.

Accessing the cluster

The cluster's hostname is cluster1.cnio.es (cluster1 for short), and it can be accessed via SSH.
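
For example, assuming your cluster username is myuser (replace it with your own):

ssh myuser@cluster1.cnio.es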

Note

If you'd like to SSH into the cluster through the CNIO VPN you will need to ask IT for VPN access.

Mailing list

Important notifications about the cluster are sent to the CNIO HPC mailing list. Make sure to subscribe using this link to stay up to date. The list has very low traffic.

Note

Do subscribe to the list!

Storage

Warning

There are two types of storage in the cluster: your home directory and scratch space.

Your home directory is well suited for storing software (e.g. a conda installation), configuration files, etc. It is, however, not fast enough to hold the input or output files of your analyses: using your home directory for this will likely hang your jobs and corrupt your files. Use the scratch space instead.

Warning

Scratch space is to be considered "volatile": you should copy your input files, execute, copy your output files, and delete everything.

Home directories are not volatile, but data safety is not guaranteed in the long term (so keep backups).

Home directories

You have an allocation of 300Gb in your home directory, located in /home/<yourusername>/.

Scratch space

User space

You have an allocation of 500Gb in your "scratch" directory, located at /storage/scratch01/users/<yourusername>/.

This space is meant to be used for testing and ephemeral compute operations.
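
As a minimal sketch, you could prepare a per-project working directory inside your scratch space before launching jobs (the directory name myproject is just a placeholder):

mkdir -p /storage/scratch01/users/$(whoami)/myproject
cd /storage/scratch01/users/$(whoami)/myproject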

Project space

To request high-performance scratch space for a new project, please fill out the storage request form.

Checking your quotas

You can check your current quotas with the following commands:

$ zfs get userquota@$(whoami) homepool/home # check your home quota

$ quotr quota list # check your scratch quotas

Copying data to/from the cluster

The most efficient way of copying large amounts of data is to use rsync, which allows you to resume a transfer in case something goes wrong.

To transfer a directory to the cluster using rsync you would do something like:

rsync -avx --progress mydirectory/ cluster1:/storage/scratch01/users/myuser/mydirectory/

Note

Pay attention to the trailing slashes, which completely change rsync's behaviour if missing.

For small transfers you could also use scp if you prefer:

scp -r mydirectory/ cluster1:/storage/scratch01/users/myuser/

In both cases you can reverse the order of the local and remote directories to copy from the cluster to your local computer instead.
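
For example, a sketch of copying results back from the cluster (the paths are placeholders; adjust them to your own scratch and local directories):

rsync -avx --progress cluster1:/storage/scratch01/users/myuser/mydirectory/ mydirectory/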

Submitting jobs

Warning

As a general rule, no heavy processes should be run directly on the login nodes. Instead you should use the "sbatch" or "srun" commands to send them to the compute nodes. See examples below.

The command structure to send a job to the queue is the following:

sbatch -o <logfile> -e <errfile> -J <jobname> -c <ncores> --mem=<total_memory>G \
    -t<time_minutes> --wrap "<command>"

For example, to submit the command bwa index mygenome.fasta as a job:

sbatch -o log.txt -e error.txt -J index_genome -c 1 --mem=4G -t120 \
    --wrap "bwa index mygenome.fasta"

Job resources

-c, --mem, and -t tell the scheduler how many cores, how much RAM, and how much time your job will need.

Time limits

Note

Ideally, you should aim for jobs lasting between 1 and 8 hours. Shorter jobs cause too much scheduling overhead, and longer jobs make it harder for the scheduler to fill the queues optimally.

If no time limit is specified, a default of 30 minutes will be assigned automatically.

The "short" queue has a limit of 2 hours per job and the highest priority (i.e. jobs in this queue will run sooner compared to other queues).

The "main" queue has a limit of 24 hours per job and medium priority.

The "long" queue has a limit of 168 hours (7 days) per job and 4 concurrent jobs, and the lowest priority.

Note

You can specify the partition (queue) you want to use with the -p argument to sbatch or srun.
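
For example, a hypothetical job that fits in the "short" queue (the fastqc command and the resource values are placeholders):

sbatch -p short -o log.txt -e error.txt -J quick_test -c 4 --mem=8G -t 60 \
    --wrap "fastqc sample.fastq.gz"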

Resources and their effect on job priority

The resources you request for a job influence how soon it can start compared to other jobs: the more resources you request, the longer you may have to wait for those resources to become available.

In addition, the future priority of your jobs is also influenced by the resources you request (request, not use: they count against your priority even if your job never uses them): the more you request, the lower your priority for future jobs.

If one of your jobs has lower priority than another, but its running time would not delay that higher-priority job from starting and there are enough free resources, your job may start first (this is known as backfilling). That's why it's important to assign reasonably accurate time limits (a.k.a. walltimes) to your jobs.

You can check how efficiently a job used its assigned resources with the seff <jobid> command (once the job is finished).
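
For example (the job ID is a placeholder; use the one reported by sbatch when you submitted the job):

seff 123456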

Snakemake executor for Slurm

Warning

The previous way of having Snakemake send jobs to the nodes using a profile (--profile $SMK_PROFILE_SLURM) is still functional but deprecated. You should ideally move to the native executor described below, and report any issues to the list.

Snakemake features a native Slurm executor, which will send jobs to Slurm instead of running them locally. To use it, simply add the --slurm argument to your snakemake command.

Note

In order to indicate the resources required by each Snakemake rule you should use the threads and resources options.

Note

The snakemake command will remain active while your jobs run, so it's recommended that you launch it inside a detachable terminal multiplexer (e.g. GNU Screen) so you can disconnect from the cluster and keep your jobs running.
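
As a minimal sketch, launching a workflow from inside a Screen session could look like this (the session name and the -j concurrent-job limit are arbitrary examples):

screen -S smk                # start a detachable session (detach with Ctrl-a d)
snakemake --slurm -j 50      # submit up to 50 concurrent jobs to Slurm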

Interactive sessions

During development and testing you may want to be able to run commands interactively. You can request an interactive session, which will run on a compute node, using the srun command with the --pty argument.

For example, to request a 2-hour session with 4Gb RAM and 2 CPUs, you would do:

srun --mem=4096 -c 2 -t 120 --pty /bin/bash

Note

Interactive sessions are limited to a maximum of 120 minutes.

GPUs

Warning

GPU usage is "experimental". Please expect changes in the setup (which will be announced through the mailing list).

The cluster features two Nvidia A100 GPUs with 80Gb of VRAM each.

The GPUs are split into instances of different capacities in order to increase usability. Available instances are:

  • 1 x 7g.80gb: 80Gb of GPU VRAM and 7 streaming multiprocessors
  • 3 x 2g.20gb: 20Gb of GPU VRAM and 2 streaming multiprocessors
  • 1 x 1g.20gb: 20Gb of GPU VRAM and 1 streaming multiprocessor

To run a command using GPUs you need to specify the gpu partition and request the GPU instances you require using the --gres=gpu:<instance_name>:<number_of_gpus> argument of sbatch. Here's an example that runs a Python script on two 2g.20gb instances:

sbatch -p gpu --gres=gpu:2g.20gb:2 --wrap "python train_my_net.py"

Installing software

Software management is left up to the user; we recommend installing mambaforge and using the bioconda channels.

If you're completely unfamiliar with conda and bioconda, we recommend following this tutorial. Notice that the tutorial uses miniconda, which is equivalent to the recommended mambaforge.

Note

Remember to use your home directory to install software (including conda), as Lustre performs poorly when small files are read frequently.
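
As a rough sketch, once mambaforge is installed in your home directory, setting up the bioconda channels and creating an environment could look like this (the environment name and tools are just examples):

conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
mamba create -n mapping bwa samtools
conda activate mapping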