11. Using Supercomputers for DL training


Last Monday, all the attendees (including the teacher) agreed that today's class could be held online in our Meet room, using the hands-on exercise described below.

The experience will serve as a proof of concept to see whether we can follow the remaining classes of this part 2 of the course without our mandatory presence in class, while isolating us from potential health risks, given the current situation in which health indicators in Barcelona are worsening very fast (also considering that, as a warning sign, last week one of us could not attend the midterm exam for this reason).

Fortunately, the content of this last part of the course is very practical: it was planned as a “learn-by-doing” approach using the MN supercomputer, and the little additional theory it requires can be included in the detailed description of each hands-on exercise.

As a side benefit, students can do the hands-on exercises at their own pace, which makes this proof of concept worth trying.

Hands-on exercise motivation

Our world has changed, and it has changed forever, due to the impact of the recent, rapid growth of artificial intelligence (AI) technology. And what has caused this AI explosion? As we learned previously in this Master's course Supercomputers Architecture, supercomputers are one of the key pillars of the progress of AI. Furthermore, what has really accelerated the advances in AI over the last years, providing enough effective compute, is the possibility of speeding up the training of the algorithms thanks to today's parallel, distributed supercomputers.

The reason is that deep learning (DL) algorithms are highly parallel, which makes them not only well suited to taking advantage of GPU acceleration but also scalable to multiple GPUs and multiple nodes. And this is what we will explore from now on in this course. The following four topics (and the corresponding four hands-on exercises) will allow you to program these new parallel, distributed supercomputers for a real artificial intelligence problem.

Let’s start!

Hands-on exercise description

Until now, we have been doing all the programming tasks in Jupyter notebooks. But how can the same DL code be run directly on a supercomputer as a Python program? This is what we will practice in today's hands-on exercise, learning how to allocate and use supercomputing resources for a DL problem. By the end of this hands-on, the student will know different ways to allocate resources on the CTE-POWER supercomputer to train a neural network.

1 — BSC’s CTE-POWER Cluster

This hands-on exercise uses BSC's CTE-POWER cluster. Let's review its characteristics. CTE-POWER is a cluster based on IBM Power9 processors, with a Linux operating system and an InfiniBand interconnection network. CTE-POWER has 54 compute servers, each with:

  • 2 x IBM Power9 8335-GTG @ 3.00GHz (20 cores and 4 threads/core, total 160 threads per node)
  • 512GB of main memory distributed in 16 DIMMs x 32GB @ 2666MHz
  • 2 x SSD 1.9TB as local storage
  • 2 x 3.2TB NVME
  • 4 x GPU NVIDIA V100 (Volta) with 16GB HBM2.
  • Single Port Mellanox EDR
  • GPFS via one 10 GBit fiber link
  • The operating system is Red Hat Enterprise Linux Server 7.4.

One CTE-POWER computer server (Image from bsc.es)

CTE-POWER supercomputer  cluster at Barcelona Supercomputing Center (Image by author)
 

More details of its characteristics can be found in the CTE-POWER user's guide and also in the manufacturer's information about the AC922 servers.

The allocation of cluster resources for the execution of our code starts with an ssh login to the cluster, using one of the login nodes and your account:

ssh -X nct01xxx@plogin1.bsc.es

Task 1:

Once you have a login username and its associated password, you can get into the CTE-POWER cluster. Check that you have access to your home directory.
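A quick way to check this once you are logged in (a minimal sketch; the exact path printed depends on your account):

pwd      # prints the path of your home directory
ls -la   # lists its contents, including hidden configuration files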


2 —  Warm-up example: MNIST classification

For convenience, we will use the same neural network that we used to classify MNIST digits, programmed previously in a Jupyter notebook.

2.1 TensorFlow version


Task 2:

Write your MNIST classifier program with the file name MNIST.py.


Below are the lines of code of the TensorFlow version of the MNIST classifier described in class.

Note: for the following code lines, be careful with copy & paste! Some symbols may be "converted" by the HTML translator into non-standard characters. If a command does not work properly, repeat it by typing it.

import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt

print(tf.__version__)

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten

# Simple convolutional network for classifying 28x28 grayscale MNIST digits
model = Sequential()
model.add(Conv2D(32, (5, 5), activation='relu',
          input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (5, 5), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.summary()

from tensorflow.keras.utils import to_categorical

# Load MNIST from the local copy stored on GPFS
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data(
    path='/gpfs/projects/nct00/nct00002/basics-utils/mnist.npz')

# Reshape to (samples, 28, 28, 1) and scale pixel values to [0, 1]
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255

# One-hot encode the labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(train_images, train_labels, batch_size=100,
          epochs=5, verbose=1)
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

This is the code of MNIST.py (available on GitHub), which we will use as a first case study to show how to launch programs on the CTE-POWER supercomputer.

2.2  TensorFlow or PyTorch

We will use the TensorFlow framework; however, the equivalent PyTorch code does not differ too much. We will use the Keras API because, since the release of TensorFlow 2.0, the tf.keras.Model API has become the easiest way for beginners to build neural networks, particularly those not requiring custom training loops.

Before moving on, take a look at this post to answer the following task.


Task 3:

Briefly describe, in textual form, the main differences that you consider your MNIST classifier code has with respect to its PyTorch equivalent.


3 —  Software stack required for deep learning applications

It is important to remember that before executing a DL application, it is required to load all the packages that build the application’s software stack environment.

3.1 Load required modules

On the CTE-POWER supercomputer, this is done through modules, which can be loaded with the module load command before running the corresponding .py code.

In our case study, we need the following modules that include the required libraries:

module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML
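To confirm that the environment is correctly set up, you can list the modules loaded in your session and check that the Python interpreter provided by the stack can import TensorFlow (a quick sanity check; the exact version printed depends on the python/3.7.4_ML module):

module list                                                  # show the modules currently loaded in this session
python -c "import tensorflow as tf; print(tf.__version__)"   # should print the TensorFlow version provided by the ML stack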

Task 4:

Load the required modules.


3.2 Interactive execution of the code

 

How do we execute our code on the login node?

python MNIST.py

If we want to separate the standard output from the standard error messages, we can add the argument 2>err.txt:

python MNIST.py 2>err.txt

Redirecting the standard error allows us to see the training results that Keras prints to the standard output, without the information related to the execution environment:

Epoch 1/5
600/600 [======] - 2s 3ms/step - loss: 0.9553 - accuracy: 0.7612
Epoch 2/5
600/600 [======] - 1s 2ms/step - loss: 0.2631 - accuracy: 0.9235
Epoch 3/5
600/600 [======] - 2s 3ms/step - loss: 0.1904 - accuracy: 0.9446
Epoch 4/5
600/600 [======] - 2s 3ms/step - loss: 0.1528 - accuracy: 0.9555
Epoch 5/5
600/600 [======] - 2s 3ms/step - loss: 0.1288 - accuracy: 0.9629
313/313 [======] - 1s 2ms/step - loss: 0.1096 - accuracy: 0.9671
Test accuracy: 0.9671000242233276

 


Task 5:

Launch your MNIST.py sequential program on the CTE-POWER supercomputer (separating the standard output from the standard error messages).


 

Well, our code is executed on the login node, which is shared with other users who are trying to submit jobs to the SLURM system, but what we really need is to allocate compute resources for our code. How can we do it?

4 — How to allocate computing resources with SLURM

To run a code on CTE-POWER, we use the SLURM workload manager. An excellent Quick Start User Guide can be found here. There are two ways to use it: the sbatch and salloc commands.

4.1  sbatch

The method for submitting jobs on which we will center today's hands-on exercise is using the SLURM sbatch command directly. sbatch submits a batch script to SLURM. The batch script may be given to sbatch through a file name on the command line (a .sh file). The batch script may contain options preceded by #SBATCH before any executable commands in the script. sbatch will stop processing further #SBATCH directives once the first non-comment, non-whitespace line has been reached in the script.

sbatch exits immediately after the script is successfully transferred to the SLURM controller and assigned a SLURM job ID. The batch script is not necessarily granted resources immediately; it may sit in the queue of pending jobs for some time before its required resources become available.

By default, both standard output and standard error are directed to the files indicated by:

#SBATCH --output=MNIST_%j.out
#SBATCH --error=MNIST_%j.err

where "%j" is replaced with the job allocation number. These files will be generated on the first node of the job allocation. When the job allocation is finally granted for the batch script, SLURM runs a single copy of the batch script on the first node in the set of allocated nodes.

An example of a job script that allocates a node with 1 GPU for our case study looks like this (MNIST.sh):

#!/bin/bash
#SBATCH --job-name="MNIST"
#SBATCH -D .
#SBATCH --output=MNIST_%j.out
#SBATCH --error=MNIST_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML
python MNIST.py

You can consult the official documentation page to see all the options that can be used in the batch script, preceded by #SBATCH.

These are the basic directives to submit and monitor jobs with SLURM that we will use in our case study:

  • sbatch <job_script> submits a job script to the queue system.
  • squeue shows all the submitted jobs with their <job_id>.
  • scancel <job_id> removes the job from the queue system, cancelling the execution of the processes if they were still running.

In summary, this is an example of a sequence of command lines and the expected output of their execution:

[CTE-login-node ~]$ sbatch MNIST.sh
Submitted batch job 4910352
[CTE-login-node ~]$ squeue
JOBID    PARTITION  NAME    USER    ST TIME  NODES  NODELIST
4910352  main       MNIST   userid  R  0:01  1      p9r1n16
[CTE-login-node ~]$ ls
MNIST.py
MNIST.sh
MNIST_4910352.err
MNIST_4910352.out

The standard output and standard error are directed to the files MNIST_4910352.out and MNIST_4910352.err, respectively, where the number 4910352 indicates the job id assigned to the job by SLURM.
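While the job is in the queue or running, you can monitor its state and follow the output file as it is being written; for example, with the job id shown above:

squeue -u $USER               # show only your own jobs and their state (PD pending, R running)
tail -f MNIST_4910352.out     # follow the standard output of the job while it runs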

4.2  salloc

An alternative way to run a job is using the salloc command, which is used to allocate resources for a job in real time. Typically, this is used to allocate resources and spawn a shell; the shell is then used to execute srun commands to launch parallel tasks. One example can be:

salloc -t 00:10:00 -n 1 -c 40 --gres=gpu:1 -J debug --partition interactive srun --pty /bin/bash

When the salloc command obtains the requested allocation, it runs the command specified by the user. The command may be any program the user wishes. In our case, the command is srun, which creates an interactive session where we can execute commands.

Note that with this command we are allocating one GPU from a node. It is very likely that we will have to wait to be granted the resources, as the machine is often very busy:

salloc: Required node not available (down, drained or reserved)
salloc: Pending job allocation <job_id>
salloc: job <job_id> queued and waiting for resources

We are informed with the following message when the resources are available:

salloc: Nodes p9login1 are ready for job

Once we are in an interactive session, we can check the GPUs assigned to us with:

nvidia-smi

Remember that we need to run the commands to load the modules and then run the corresponding code:

module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML

python MNIST.py

When using the salloc command, if we want to separate the standard output from the standard error messages, we can also add the argument 2>err.txt:

python MNIST.py 2>err.txt
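When you finish working in the interactive session, exit the shell to release the allocation, so that the resources become available to other users again:

exit     # leaving the interactive shell ends the srun step and releases the salloc allocation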

4.3 Resource reservation for SA-MIRI 2021

BSC has made a special reservation of supercomputer nodes to be used by this course:

ReservationName=SA-MIRI21-1711 
StartTime=2021-11-17T08:00:00 
EndTime=2021-11-17T10:00:00 

ReservationName=SA-MIRI21-2211 
StartTime=2021-11-22T08:00:00 
EndTime=2021-11-22T10:00:00 

ReservationName=SA-MIRI21-2411 
StartTime=2021-11-24T08:00:00 
EndTime=2021-11-24T10:00:00 

ReservationName=SA-MIRI21-2911 
StartTime=2021-11-29T08:00:00 
EndTime=2021-11-29T10:00:00 

ReservationName=SA-MIRI21-0112 
StartTime=2021-12-01T08:00:00 
EndTime=2021-12-01T10:00:00 

ReservationName=SA-MIRI21-1312 
StartTime=2021-12-13T08:00:00 
EndTime=2021-12-13T10:00:00 

ReservationName=SA-MIRI21-1512 
StartTime=2021-12-15T08:00:00 
EndTime=2021-12-15T10:00:00 

ReservationName=SA-MIRI21-2012 
StartTime=2021-12-20T08:00:00 
EndTime=2021-12-20T10:00:00 

ReservationName=SA-MIRI21-2212 
StartTime=2021-12-22T08:00:00 
EndTime=2021-12-22T10:00:00

 

To use the reservations, you must add this line to the SLURM job script:

#SBATCH --reservation=<ReservationName>

WARNING: Remember that the <ReservationName> is different for each day!
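For example, during the first reserved slot the job script MNIST.sh would look like this (a sketch; replace the reservation name with the one corresponding to your session day):

#!/bin/bash
#SBATCH --job-name="MNIST"
#SBATCH -D .
#SBATCH --output=MNIST_%j.out
#SBATCH --error=MNIST_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --reservation=SA-MIRI21-1711
module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML
python MNIST.py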


Task 6:

Execute your MNIST.py program with the SLURM workload manager system using a job script that allocates a node with 1 GPU in CTE-POWER. Inspect the .out and .err files obtained.


Final report


Task 7:

Write a report for this hands-on exercise that includes all the tasks detailing the steps that are done, the code used, and the results.

Once finished, generate a PDF version and submit it to the “racó” in the mailbox “exercise 11”.


Acknowledgement: Many thanks to Juan Luis Domínguez and Oriol Aranda, who wrote the first version of the codes that appear in this hands-on, and to Carlos Tripiana and Félix Ramos for the essential support using the CTE-POWER cluster. Also, many thanks to Alvaro Jover Alvarez, Miquel Escobar Castells, and Raul Garcia Fuentes for their contributions to the proofreading of previous versions of this post.

The code used in this post is based on the GitHub repository https://github.com/jorditorresBCN/PATC-2021