11. Using Supercomputers for DL training


The content of this last part of the course is very practical, in the sense that it was planned as a “learn-by-doing” approach using the MN supercomputer, and the little additional theory it requires is included in the detailed description of the hands-on exercises. A side effect of this approach is that students can do the hands-on exercises at their own pace.

Hands-on exercise motivation

Our world has changed, and it has changed forever, due to the impact of the recent, rapid growth of artificial intelligence (AI) technology. And what has caused this AI explosion? As we learned previously in this Master’s course, Supercomputers Architecture, supercomputers are one of the key pillars of AI progress. What has really accelerated the advances in AI in recent years, by providing enough effective computing power, is the possibility of speeding up the training of the algorithms on today’s parallel, distributed supercomputers.

The reason is that deep learning (DL) algorithms are highly parallel, which makes them not only well suited to GPU acceleration but also scalable to multiple GPUs and multiple nodes. This is what we will explore from now on in this course. The following four topics (and the following four hands-on exercises) will allow you to program these new parallel, distributed supercomputers for a real artificial intelligence problem.

Let’s start!

Hands-on exercise description

Until now, we have been doing all the programming tasks in Jupyter notebooks (.ipynb). But how can the same DL code be run on a supercomputer directly as a Python program? This is what we will practice in today’s hands-on exercise, learning how to allocate and use supercomputing resources for a DL problem. By the end of this hands-on, the student will know different ways to allocate resources on the CTE-POWER supercomputer to train a neural network.

1 — BSC’s CTE-POWER Cluster

This hands-on exercise uses BSC’s CTE-POWER cluster. Let’s review its characteristics. CTE-POWER is a cluster based on IBM Power9 processors, with a Linux operating system and an InfiniBand interconnection network. CTE-POWER has 54 compute servers, each with:

  • 2 x IBM Power9 8335-GTG @ 3.00GHz (20 cores and 4 threads/core, total 160 threads per node)
  • 512GB of main memory distributed in 16 DIMMs x 32GB @ 2666MHz
  • 2 x SSD 1.9TB as local storage
  • 2 x 3.2TB NVME
  • 4 x GPU NVIDIA V100 (Volta) with 16GB HBM2.
  • Single Port Mellanox EDR
  • GPFS via one fiber link 10 GBit
  • The operating system is Red Hat Enterprise Linux Server 7.4.

One CTE-POWER computer server (Image from bsc.es)

CTE-POWER supercomputer  cluster at Barcelona Supercomputing Center (Image by author)
 

More details about its characteristics can be found in the CTE-POWER user’s guide and in the manufacturer’s information on the AC922 servers.

The allocation of cluster resources for the execution of our code starts with an ssh login to one of the login nodes using your account:

ssh -X nct01xxx@plogin1.bsc.es

Task 1:

Once you have a login username and its associated password, you can get into the CTE-POWER cluster. Check that you have access to your home directory.


2 —  Warm-up example: MNIST classification

For convenience, we will consider the same neural network that we used to classify MNIST digits, programmed previously in the Jupyter notebook (link).

2.1 TensorFlow version

The following lines show the code of the TensorFlow version of the MNIST classifier described in class.

Note: Be careful when copy-pasting the following code lines! Some symbols may be “converted” by the HTML translator into non-standard characters. If a command or code line does not work properly, retype it by hand.

import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt

print(tf.__version__)

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.utils import to_categorical

# Convolutional neural network for MNIST digit classification
model = Sequential()
model.add(Conv2D(32, (5, 5), activation='relu',
          input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (5, 5), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.summary()

# The dataset is preloaded on GPFS (compute nodes have no internet access)
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data(path='/gpfs/projects/nct00/nct00036/basics-utils/mnist.npz')
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(train_images, train_labels, batch_size=100,
          epochs=5, verbose=1)
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

This will be the code MNIST.py, which we will use as a first case study to show how to launch programs on the CTE-POWER supercomputer.


Task 2:

Write your MNIST classifier program in a file named MNIST.py. Include all the code executed in your answers to the exercise survey.


2.2  TensorFlow versus PyTorch

We will use the TensorFlow framework; however, the equivalent PyTorch code does not differ too much (take a look at this brief post). We will use the Keras API because, since the release of TensorFlow 2.0, the tf.keras.Model API has become the easiest way for beginners to build neural networks (particularly those not requiring custom training loops).
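
To make the comparison concrete, below is a rough PyTorch sketch of the same MNIST architecture. It is not part of the exercise, and it assumes that torch is installed in your environment (the python/3.7.4_ML module does not necessarily provide it):

import torch
import torch.nn as nn

# Same architecture as the Keras model above: two conv blocks + a 10-way classifier
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5), nn.ReLU(),   # 28x28 -> 24x24
    nn.MaxPool2d(2),                              # 24x24 -> 12x12
    nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(),  # 12x12 -> 8x8
    nn.MaxPool2d(2),                              # 8x8 -> 4x4
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 10),                    # logits; softmax is inside the loss
)

loss_fn = nn.CrossEntropyLoss()                   # expects integer labels, not one-hot
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training is an explicit loop instead of model.fit() (train_loader is a hypothetical DataLoader):
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = loss_fn(model(images), labels)
#     loss.backward()
#     optimizer.step()

The two most visible differences relevant to Task 3 are the explicit training loop and the integer (non-one-hot) labels expected by CrossEntropyLoss.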

Before moving on, take a look at this post again to answer the following task.


Task 3:

Briefly describe, in your own words, the main differences you see between TensorFlow and PyTorch.


3 —  Software stack required for deep learning applications

It is important to remember that, before executing a Deep Learning application (in TensorFlow), we need to load all the packages that make up the application’s software stack environment.

3.1 Load required modules

On the CTE-POWER supercomputer, this is done through modules, loaded with the module load command before running the corresponding .py code.

In our case study, we will load the following modules that include the required libraries:

module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML

The python/3.7.4_ML module contains the TensorFlow packages.
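
Once the modules are loaded, a quick optional sanity check (just a sketch, not required for the exercise) is to start python and confirm that TensorFlow imports correctly and was built with CUDA support:

import tensorflow as tf

# Verify that the software stack provided by the modules is usable
print('TensorFlow version:', tf.__version__)
print('Built with CUDA support:', tf.test.is_built_with_cuda())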


Task 4:

Load the required modules. Explain in the survey the purpose of loading the cuda, cudnn, and nccl modules.


3.2 Interactive execution of the code

 

How can we execute our code on the login node? (Remember, we are still connected to the login node.) Since it is ordinary Python code, we can simply use python:

python MNIST.py

If we want to separate the standard output from the standard error messages, we can add the redirection 2>err.txt:

python MNIST.py 2>err.txt

Redirecting the standard error allows us to see the training results that Keras prints to the standard output, without the messages related to the execution environment:

Epoch 1/5
600/600 [======] - 2s 3ms/step - loss: 0.9553 - accuracy: 0.7612
Epoch 2/5
600/600 [======] - 1s 2ms/step - loss: 0.2631 - accuracy: 0.9235
Epoch 3/5
600/600 [======] - 2s 3ms/step - loss: 0.1904 - accuracy: 0.9446
Epoch 4/5
600/600 [======] - 2s 3ms/step - loss: 0.1528 - accuracy: 0.9555
Epoch 5/5
600/600 [======] - 2s 3ms/step - loss: 0.1288 - accuracy: 0.9629
313/313 [======] - 1s 2ms/step - loss: 0.1096 - accuracy: 0.9671
Test accuracy: 0.9671000242233276

 


Task 5:

Launch your MNIST.py sequential program on one login node of the CTE-POWER cluster (separating the standard output from the standard error messages). (*) This can be done on the login node only because it takes a few seconds; do not do this for larger programs.


 

So far, our code has been executed on the login node, which is shared with other users who are submitting jobs to the SLURM system. What we really need is to allocate dedicated resources for our code. How can we do that?

4 — How to allocate computing resources with SLURM

To run code on the CTE-POWER cluster, we use the SLURM workload manager. An excellent Quick Start User Guide can be found here. We have two ways to use it: the sbatch and salloc commands.

4.1  sbatch

The method for submitting jobs on which we will focus in today’s hands-on exercise is the SLURM sbatch command. sbatch submits a batch script to SLURM. The batch script may be given to sbatch through a file name on the command line (a .sh file). The batch script may contain options preceded by #SBATCH before any executable commands in the script. sbatch will stop processing further #SBATCH directives once the first non-comment, non-whitespace line has been reached in the script.

sbatch exits immediately after the script is successfully transferred to the SLURM controller and assigned a SLURM job ID. The batch script is not necessarily granted resources immediately; it may sit in the queue of pending jobs for some time before its required resources become available.

By default, both standard output and standard error are directed to the files indicated by:

#SBATCH --output=MNIST_%j.out
#SBATCH --error=MNIST_%j.err

where “%j” is replaced with the job allocation number. The files will be generated on the first node of the job allocation. When the job allocation is finally granted for the batch script, SLURM runs a single copy of the batch script on the first node of the set of allocated nodes.

An example of a job script that allocates a node with 1 GPU for our case study looks like this (MNIST.sh ):

#!/bin/bash
#SBATCH --job-name="MNIST"
#SBATCH -D .
#SBATCH --output=MNIST_%j.out
#SBATCH --error=MNIST_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML
python MNIST.py

You can consult the official documentation page to learn all the options that can be used in the batch script, preceded by #SBATCH.

These are the basic directives to submit and monitor jobs with SLURM that we will use in our case study:

  • sbatch <job_script> submits a job script to the queue system.
  • squeue shows all the submitted jobs with their <job_id>.
  • scancel <job_id> removes the job from the queue system, canceling the execution of its processes if they were still running.

In summary, this can be an example of a sequence of command lines, and the expected output of their execution will be:

[CTE-login-node ~]$ sbatch MNIST.sh
Submitted batch job 4910352
[CTE-login-node ~]$ squeue
JOBID    PARTITION  NAME    USER    ST TIME  NODES  NODELIST
4910352  main       MNIST   userid  R  0:01  1      p9r1n16
[CTE-login-node ~]$ ls
MNIST.py
MNIST.sh
MNIST_4910352.err
MNIST_4910352.out

The standard output and standard error are directed to the files MNIST_4910352.out and MNIST_4910352.err, respectively. Here, the number 4910352 is the job id assigned to the job by SLURM.

4.2 salloc

An alternative way to run a job is with the salloc command, which is used to allocate resources for a job in real time. Typically, it is used to allocate resources and spawn a shell; the shell is then used to execute srun commands to launch parallel tasks. One example can be:

salloc -t 00:10:00 -n 1 -c 160 --gres=gpu:4 -J debug --partition interactive srun --pty /bin/bash

When the salloc command obtains the requested allocation, it runs the command specified by the user; this may be any program the user wishes. In our case, the command is srun, which creates an interactive session where we can execute commands.

Note that with this command we are allocating 4 GPUs of a node. It is very likely that we will have to wait until the resources are granted, as the machine is often very busy:

salloc: Required node not available (down, drained or reserved)
salloc: Pending job allocation <job_id>
salloc: job <job_id> queued and waiting for resources

We are informed with the following message when the resources are available:

salloc: Nodes p9login1 are ready for job

Once we are in an interactive session, we can check if we have 4 GPUs assigned with the following command:

nvidia-smi

Remember that we need to run the commands to load the modules and then run the corresponding code:

module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML

python MNIST.py
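
Once inside the interactive session (and with the modules loaded), a quick way to double-check that TensorFlow actually sees the four allocated GPUs is a short Python snippet like the following sketch (depending on the TensorFlow version provided by python/3.7.4_ML, you may need tf.config.experimental.list_physical_devices instead):

import tensorflow as tf

# List the GPUs visible inside this allocation; with --gres=gpu:4 we expect four entries
gpus = tf.config.list_physical_devices('GPU')
print('TensorFlow', tf.__version__, 'sees', len(gpus), 'GPU(s):')
for gpu in gpus:
    print(' ', gpu)
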
Task 6:

Execute your MNIST.py program with the salloc command presented in this section. Summarize in your report the main differences between using salloc and sbatch.


4.3 Resource reservation for SA-MIRI 2022

BSC has made a special reservation of supercomputer nodes to be used by this course:

To use the reservations, you must add this line in the SLURM script:

#SBATCH --reservation=<ReservationName>

e.g., for Monday 21/11/22 you could use:
#SBATCH --reservation=SA-MIRI-1

WARNING: Remember that the <ReservationName> is different for each day!


Task 7:

Execute your MNIST.py program with the SLURM workload manager system using a job script that allocates a node with 1 GPU in CTE-POWER. Inspect the .out and .err files obtained. Include in the survey the slurm script and also the relevant parts of .out and .err files.



Task 8:

TO DO ON MONDAY 21/11/22

Execute your MNIST.py program with the SLURM workload manager system using the SA-MIRI queue reservation #SBATCH --reservation=SA-MIRI-1. Check that everything is working fine.


5 — Transfer Learning Case Study

5.1 Dataset: CIFAR10

CIFAR-10 is an established computer-vision dataset used for object recognition. It is a subset of the 80 million tiny images dataset and consists of 60,000 32×32 color images containing 10 object classes, with 6000 images per class. It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. There are 50,000 training images and 10,000 test images (Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009).

 

We have preloaded the CIFAR-10 dataset on the CTE-POWER supercomputer in the directory /gpfs/projects/nct00/nct00036/cifar-utils/cifar-10-batches-py, downloaded from http://www.cs.toronto.edu/~kriz/cifar.html.

For academic purposes, to make the training harder and to obtain longer training times for better comparison, we have applied a resize operation to make the images 128×128. We created a custom data-loading function (/gpfs/projects/nct00/nct00036/cifar-utils/load_cifar.py) that applies this resize operation and splits the data into training and test sets. We can use it as:

sys.path.append('/gpfs/projects/nct00/nct00036/cifar-utils')
from cifar import load_cifar

load_cifar.py can be obtained from this GitHub repository for readers who want to review it (the students of this course do not need to download it).
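
To give an idea of what such a loader does, here is a rough sketch of an equivalent resize-and-split pipeline written with tf.data. It is hypothetical and not the actual load_cifar.py; in particular, tf.keras.datasets.cifar10.load_data() downloads the dataset from the internet, which is not possible from the CTE-POWER compute nodes (that is why the data is preloaded on GPFS):

import tensorflow as tf

def load_cifar_sketch(batch_size, image_size=128):
    # CIFAR-10 images are 32x32x3; labels are integers in [0, 9]
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

    def preprocess(image, label):
        # Normalize to [0, 1] and resize to image_size x image_size
        image = tf.image.resize(tf.cast(image, tf.float32) / 255.0,
                                (image_size, image_size))
        return image, label

    train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                .map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
                .shuffle(10000)
                .batch(batch_size)
                .prefetch(tf.data.experimental.AUTOTUNE))
    test_ds = (tf.data.Dataset.from_tensor_slices((x_test, y_test))
               .map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
               .batch(batch_size))
    return train_ds, test_ds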

5.2 Neural Networks architecture: ResNet

Now we are going to use a neural network with a specific architecture known as ResNet. In this scientific community, many networks are known by their own name. For instance, AlexNet, by Alex Krizhevsky, is the neural network architecture that won the ImageNet 2012 competition. GoogLeNet, with its Inception module, drastically reduced the number of parameters of the network (15 times fewer than AlexNet). Others, such as VGGNet, helped to demonstrate that the depth of the network is a critical component for good results. The interesting thing about many of these networks is that they come already preloaded in most Deep Learning frameworks.

Keras Applications are prebuilt deep learning models made available with the framework. These models differ in architecture and number of parameters; you can try some of them to see how the larger models train more slowly than the smaller ones and achieve different accuracies.

A list of all available models can be found here (the top-1 and top-5 accuracy refer to the model’s performance on the ImageNet validation dataset). For this hands-on, we will consider one architecture from the ResNet family as a case study: ResNet50V2. ResNet is a family of extremely deep neural network architectures showing compelling accuracy and nice convergence behavior, introduced by He et al. in their 2015 paper, Deep Residual Learning for Image Recognition. A few months later, the same authors published a new paper, Identity Mappings in Deep Residual Networks, with a new proposal for the basic component, the residual unit, which makes training easier and improves generalization. This led to the V2 versions:

tf.keras.applications.ResNet50V2(
    include_top=True,
    weights="imagenet",
    input_tensor=None,
    input_shape=None,
    pooling=None,
    classes=1000,
    classifier_activation="softmax",
)

The “50” stands for the number of weight layers in the network. The arguments of the network are:

  • include_top: whether to include the fully-connected layer at the top of the network.
  • weights: one of None (random initialization), ‘imagenet’ (pre-training on ImageNet), or the path to the weights file to be loaded.
  • input_tensor: optional Keras tensor (i.e. the output of layers.Input()) to use as image input for the model.
  • input_shape: optional shape tuple, only to be specified if include_top is False (otherwise, the input shape has to be (224, 224, 3) (with 'channels_last' data format) or (3, 224, 224) (with 'channels_first' data format)). It should have exactly 3 input channels, and the width and height should be no smaller than 32. E.g. (200, 200, 3) would be one valid value.
  • pooling: Optional pooling mode for feature extraction when include_top is False. (a)None means that the output of the model will be the 4D tensor output of the last convolutional block. (b) avg means that global average pooling will be applied to the output of the last convolutional block, and thus the output of the model will be a 2D tensor. (c)max means that global max pooling will be applied.
  • classes: optional number of classes to classify images into, only to be specified if include_top is True, and if no weights argument is specified.
  • classifier_activation: A str or callable. The activation function to use on the “top” layer. Ignored unless include_top=True. Set classifier_activation=None to return the logits of the “top” layer.

Note that if weights="imagenet", TensorFlow requires an internet connection to download the ImageNet weights (pre-training on ImageNet). Since we are not focusing on accuracy here, we did not download the file with the ImageNet weights; therefore, weights=None must be used.
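
As a starting point for Task 9, the following sketch (optional; run it with the modules loaded) builds both networks without training them and prints the number of layers and parameters of each one:

import tensorflow as tf

# Build both V2 networks with the same configuration used in this hands-on
for name, constructor in [('ResNet50V2', tf.keras.applications.ResNet50V2),
                          ('ResNet152V2', tf.keras.applications.ResNet152V2)]:
    model = constructor(include_top=True, weights=None,
                        input_shape=(128, 128, 3), classes=10)
    print('{}: {} layers, {:,} parameters'.format(
        name, len(model.layers), model.count_params()))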

 


Task 9:

Have a look at the ResNet50V2 and ResNet152V2 neural networks and describe the main differences.


6 — How to use a GPU to accelerate the training

Before showing how to train a neural network in parallel, let’s start with a sequential version of the training in order to become familiar with the classifier.

6.1 Python code

The sequential code to train the previously described CIFAR-10 classification problem using a ResNet50 neural network could be the following (we will refer to it as ResNet50_seq.py):

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models
import numpy as np
import argparse
import time
import sys

sys.path.append('/gpfs/projects/nct00/nct00036/cifar-utils')
from cifar import load_cifar

parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=5)
parser.add_argument('--batch_size', type=int, default=2048)
args = parser.parse_args()
batch_size = args.batch_size
epochs = args.epochs

train_ds, test_ds = load_cifar(batch_size)

model = tf.keras.applications.resnet_v2.ResNet50V2(
        include_top=True,
        weights=None,
        input_shape=(128, 128, 3),
        classes=10)

opt = tf.keras.optimizers.SGD(0.01)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])
model.fit(train_ds, epochs=epochs, verbose=2)

The ResNet50_seq.py file can be downloaded from the course GitHub repository.

6.2 SLURM script

To run this Python code with the SLURM system, as you already know, we can use the following SLURM script (we will refer to it as ResNet50_seq.sh):

#!/bin/bash
#SBATCH --job-name="ResNet50_seq"
#SBATCH -D .
#SBATCH --output=RESNET50_seq_%j.out
#SBATCH --error=RESNET50_seq_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH --time=00:05:00
module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML
python ResNet50_seq.py --epochs 5 --batch_size 256

The ResNet50_seq.sh script can be downloaded from the course GitHub repository.


Task 10:

Write your ResNet50_seq.py program and execute it with 5 epochs on CTE-POWER with the SLURM workload manager, using the job script ResNet50_seq.sh presented in this section (with 5 minutes as the maximum time). What is the result? Include in the answer your code of ResNet50_seq.py, ResNet50_seq.sh, and the relevant parts of the .out and .err files.


 

If we use the same SLURM script for all the executions, pay attention to indicating the maximum number of GPUs required with --gres=gpu:4.

A note from support@bsc.es: on CTE-POWER, if you request 160 cores, SLURM will assign 4 GPUs to your job even without using the --gres=gpu flag.

As a hint, the following error appeared in my execution:

slurmstepd: error: *** JOB <job_id> ON p9r2n12 CANCELLED AT 2025-11-19T09:48:59 DUE TO TIME LIMIT ***

If you observe that the SLURM system does not enforce the time limit correctly, I suggest cancelling the job yourself after 10 minutes.

6.3 Using a GPU for training

Unlike the MNIST problem, we cannot train this neural network with a single CPU. It is clear that we need more computing power to train this problem. In this case, remember that we can add this line to the SLURM script:

#SBATCH --gres=gpu:1


Task 11:

Execute the same ResNet50_seq.py program with the job script ResNet50_seq.sh, including the allocation of one GPU. What is the result? Include in the answer your code of ResNet50_seq.sh and the relevant parts of the .out and .err files.


In the .out file we can see the output that Keras gives us, specifying the time required for each epoch, the loss value, and the accuracy achieved in that epoch:

Epoch 1/5
196/196 - 41s - loss: 2.0176 - accuracy: 0.2584

Task 12:

Analyzing the .out file, what is the Accuracy obtained for this problem in this execution? What can we do to improve the Accuracy?


6.4 Improving the Accuracy

From the results of the previous tasks, you can conclude that within 5 minutes you can execute three or four epochs. However, the accuracy obtained is not good (only slightly above 50%). What we can do is increase the number of epochs; in that case, we also need to increase the allocated time.


Task 13:

Modify your job script ResNet50_seq.sh (the SLURM time flag and the network’s epochs parameter) so that you obtain a model with an Accuracy greater than 95%. In the answer, include your code of ResNet50_seq.sh and the relevant part of the .out file.


7 — Comparing different ResNet networks

We can use the same code presented in the previous sections to train any other network available in Keras. We only need to change the piece of code in the program that identifies the network (resnet_v2.ResNet50V2 in our example) to the corresponding network, as shown in the sketch below.
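
For example, for Task 14 the only change needed in ResNet50_seq.py is the constructor of the model (the rest of the code stays the same):

# ResNet152V2 instead of ResNet50V2; everything else is unchanged
model = tf.keras.applications.resnet_v2.ResNet152V2(
        include_top=True,
        weights=None,
        input_shape=(128, 128, 3),
        classes=10)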

For the purposes of this post, to measure time (the metric we will use to compare performance), we can use the per-epoch time that Keras itself reports (sometimes we discard the first epoch, as it differs from the rest: it has to create structures in memory and initialize them). Remember that we are in a teaching example, and this approximate measure of time is enough for the course goals.
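
If you prefer to record these times programmatically instead of reading them from the .out file, a simple Keras callback such as the following sketch (hypothetical, not required for the exercise) does the job:

import time
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    # Records the wall-clock time of each epoch (a rough, teaching-level measure)
    def on_train_begin(self, logs=None):
        self.epoch_times = []
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()
    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.time() - self._start)

# Usage: pass the callback to model.fit and discard the first epoch if you wish
# timer = EpochTimer()
# model.fit(train_ds, epochs=epochs, verbose=2, callbacks=[timer])
# print('Per-epoch times:', timer.epoch_times)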


Task 14:

Train resnet_v2.ResNet152V2 and compare the required time per epoch with that of training resnet_v2.ResNet50V2. Justify why the times are so different. In the answer, include the SLURM script and the .py code used.


Hands-on Report


Task 15:

Write a report for this hands-on exercise that includes all the tasks, detailing the steps followed, the code used, and the results. Once finished, generate a PDF version and submit it to the “racó” in the mailbox “exercise 13”.


Acknowledgement: Many thanks to Juan Luis Domínguez and Oriol Aranda, who wrote the first version of the codes that appear in this hands-on, and to Carlos Tripiana and Félix Ramos for the essential support using the CTE-POWER cluster. Also, many thanks to Alvaro Jover Alvarez, Miquel Escobar Castells, and Raul Garcia Fuentes for their contributions to the proofreading of previous versions of this post.

The code used in this post is based on the GitHub repository https://github.com/jorditorresBCN/Fundamentals-DL-CTE-POWER