12. Parallel and Distributed Deep Learning Frameworks

Hands-on exercise description

In today’s hands-on exercise we will review the basic concepts and an image classification problem (based on the CIFAR10 dataset) using a neural network model as a classifier (the neural network ResNet50V2 with 25,613,800 parameters). We will introduce and analyze all the required steps to train this classifier sequentially using one GPU. This classifier will be used as a case study for parallelisation and distribution its training in the next two hands-on exercises.

Resource reservation for SA-MIRI 2021

Remember that BSC has made a special reservation of supercomputer nodes to be used for this hands-on exercise:

EndTime= 2021-11-22T10:00:00 



For using the corresponding reservations remember to add this line in the SLURM script with the corresponding <ReservationName> for each day:

#SBATCH --reservation=<ReservationName>

1 — Basic concepts

“Methods that scale with computation are the future of Artificial Intelligence”— Rich Sutton, father of reinforcement learning (video 4:49)

Deep Neural Networks (DNN) base their success on building high learning capacity models with millions of parameters that are tuned in a data-driven fashion. These models are trained by processing millions of examples so that the development of more accurate algorithms is usually limited by the throughput of the computing devices on which they are trained.


1.1 Performance metrics: Speedup, Throughput, and Scalability

In order to make the training process faster, we are going to need some performance metrics to measure it. The term performance in these systems has a double interpretation. On the one hand, it refers to the predictive accuracy of the model. On the other, to the computational speed of the process.

Accuracy is independent of the computational resources, and it is the performance metric to compare different DNN models.

In contrast, the computation speed depends on the platform on which the model is deployed. We will measure it by metrics such as Speedup, the ratio of solution time for the sequential algorithms (using one GPU in our hands-on exercises)  versus its parallel counterpart (using many GPUs). This is a prevalent concept in our daily argot in the supercomputing community.

Another important metric is Throughput. In general terms, throughput is the rate of production or the rate at which something is processed;  for example, the number of images per unit time that can be processed. This can give us a good benchmark of performance (although it depends on the neural network type).

Finally, a concept that we usually use is Scalability. It is a more generic concept that refers to the ability of a system to handle a growing amount of work efficiently. These metrics will be highly dependent on the computer cluster configuration, the type of network used, or the framework’s efficiency using the libraries and managing resources.

All these metrics will be used in the following hands-on exercises.

1.2 Parallel computer platforms

The parallel and distributed training approach is broadly used by Deep Learning practitioners. This is because DNNs are compute-intensive, making them similar to traditional supercomputing (high-performance computing, HPC) applications. Thus, large learning workloads perform very well on accelerated systems such as general-purpose graphics processing units (GPU) that have been used in the Supercomputing field.

The main idea behind this computing paradigm is to run tasks in parallel instead of serially, as it would happen in a single machine (or single GPU). Multiple GPUs increase both memory and compute available for training a DNN. In a nutshell, we have several choices, given a minibatch of training data that we want to classify. In the next subsection, we will go into an introduction of the main options.

1.3 Types of parallelism

To achieve the distribution of the training step, there are two principal implementations, and it will depend on the needs of the application to know which one will perform better, or even if a mix of both approaches can increase the performance.

For example, different layers in a Deep Learning model may be trained in parallel on different GPUs. This training procedure is commonly known as Model parallelism. Another approach is Data parallelism, where we use the same model for every execution unit, but train the model in each computing device using different training samples.


1.3.1 Data parallelism

In this mode, the training data is divided into multiple subsets, and each one of them is run on the same replicated model in a different GPU (worker nodes). These will need to synchronize the model parameters (or its “gradients”) at the end of the batch computation to ensure they are training a consistent model (just as if the algorithm run on a single processor) because each device will independently compute the errors between its predictions for its training samples and the labeled outputs (correct values for those training samples). Therefore, each device must send all of its changes to all of the models on all the other devices.

One interesting property of this setting is that it will scale with the amount of data available, and it speeds up the rate at which the entire dataset contributes to the optimization. Also, it requires less communication between nodes, as it benefits from a high amount of computations per weight. On the other hand, the model has to fit on each node entirely, and it is mainly used for speeding the computation of convolutional neural networks with large datasets.

1.3.2 Model parallelism

We could partition the network layers across multiple GPUs (even we could split the work required by individual layers). That is, each GPU takes as input the data flowing into a particular layer, processes data across several subsequent layers in the neural network, and then sends the data to the next GPU.

In this case (also known as Network Parallelism), the model will be segmented into different parts that can run concurrently, and each one will run on the same data in different nodes. This method’s scalability depends on the degree of task parallelization of the algorithm, and it is more complex to implement than the previous one. It may decrease the communication needs, as workers need only to synchronize the shared parameters (usually once for each forward or backward-propagation step) and works well for GPUs in a single server that shares a high-speed bus.

In practice, model parallelism is only used with models that are too large to fit n any single device. In this case, the parallelization of the algorithm is more complex to implement than run the same model in a different node with a subset of data.

1.3.3 Pipeline parallelism

Therefore, a natural approach for really large models is to split the model into layers so that a small consecutive set of layers are grouped into one GPU. However, this approach could lead to under-utilization of resources due to the implementation for running every data batch through multiple GPUs with sequential dependency. A new strategy called Pipeline parallelism is introduced recently.  The pipeline parallelism strategy combines model parallelism with data parallelism to reduce these limitations. The idea behind this approach is to split each mini-batch into several micro-batches and allow each GPU to process one micro-batch simultaneously. Nowadays, we have different approaches to pipeline parallelism depending on the inter-communication and the gradient aggregation used, however this is out of scope of this course. 

Task 1:

Summarizes the main advantages and disadvantages between the data parallelism and model parallelism approach.

In this hands-on, we will focus on the Data Parallelism approach.

2— Case study description

2.1 Dataset: CIFAR10

CIFAR-10 is an established computer-vision dataset used for object recognition. It is a subset of the 80 million tiny images dataset and consists of 60,000 32×32 colour images containing 10 object classes, with 6000 images per class. It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. There are 50,000 training images and 10,000 test images (Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009).


We have preloaded the CIFAR-10 dataset at CTE-POWER supercomputer in the directory /gpfs/projects/nct00/nct00002/cifar-utils/cifar-10-batches-py downloaded from http://www.cs.toronto.edu/~kriz/cifar.html.

For academic purposes, to make the training even harder and to be able to see larger training times for better comparison, we have applied a resize operation to make the images of 128×128 size. We created a custom load_data function (/gpfs/projects/nct00/nct00002/cifar-utils/load_cifar.py) that applies this resize operation and splits the data into training and test sets. We can use it as:

from cifar import load_cifar

load_cifar.py can be obtained from this repository GitHub for readers that want to review it (for the students of this course it is not necessary to download it).

2.2 Neural Networks architecture: ResNet

Now we are going to use a neural network that has a specific architecture known as ResNet. In this scientific community, we find many networks with their own name. For instance, AlexNet, by Alex Krizhevsky, is the neural network architecture that won the ImageNet 2012 competition. GoogleLeNet, which with its inception module drastically reduces the parameters of the network (15 times less than AlexNet). Others, such as the VGGnet, helped to demonstrate that the depth of the network is a critical component for good results. The interesting thing about many of these networks is that we can find them already preloaded in most of the Deep Learning frameworks.

Keras Applications are prebuilt deep learning models that are made available. These models differ in architecture and the number of parameters; you can try some of them to see how the larger models train slower than the smaller ones and achieve different accuracy.

A list of all available models can be found here (the top-1 and top-5 accuracy refers to the model’s performance on the ImageNet validation dataset.). For this hands-on, we will consider one architecture from the family of ResNet as a case study: ResNet50v2. ResNet is a family of extremely deep neural network architectures showing compelling accuracy and nice convergence behaviors, introduced by He et al. in their 2015 paper, Deep Residual Learning for Image RecognitionA few months later, the same authors published a new paper, Identity Mapping in Deep Residual Network, with a new proposal for the basic component, the residual unit, which makes training easier and improves generalization. And this lets the V2 versions:


The “50” stand for the number of weight layers in the network. The arguments for the network are:

  • include_top: whether to include the fully-connected layer at the top of the network.
  • weights: one of None (random initialization), ‘imagenet’ (pre-training on ImageNet), or the path to the weights file to be loaded.
  • input_tensor: optional Keras tensor (i.e. output of layers.Input()) to use as image input for the model.
  • input_shape: optional shape tuple, only to be specified if include_top is False (otherwise the input shape has to be (224, 224, 3) (with 'channels_last' data format) or (3, 224, 224)(with 'channels_first' data format). It should have exactly 3 inputs channels, and width and height should be no smaller than 32. E.g. (200, 200, 3)would be one valid value.
  • pooling: Optional pooling mode for feature extraction when include_top is False. (a)None means that the output of the model will be the 4D tensor output of the last convolutional block. (b) avg means that global average pooling will be applied to the output of the last convolutional block, and thus the output of the model will be a 2D tensor. (c)max means that global max pooling will be applied.
  • classes: optional number of classes to classify images into, only to be specified if include_topis True, and if no weights argument is specified.
  • classifier_activation: A str or callable. The activation function to use on the “top” layer. Ignored unless include_top=True. Set classifier_activation=None to return the logits of the “top” layer.

Note that if weights="imagenet", Tensorflow middleware requires a connection to the internet to download the imagenet weights (pre-training on ImageNet). Due we are not centering our interest in Accuracy, we didn’t download the file with the imagenet weights; therefore, it must be used weights=None.


Task 2:

Have a look at the ResNet50v2 and ResNET152V2 neural networks and describe the main differences.

2.3 Frameworks for parallel and distributed training

We can use frameworks like TensorFlow or Pytorch to program multi-GPU training in one server. To parallelize the training of the model, you only need to wrap the model with torch.nn.parallel.DistributedDataParallelif you use PyTorch or with tf.distribute.MirroredStrategy if you use TensorFlow. We will do it in the next hands-on exercise.

However, the number of GPUs that we can place in a server is very limited, and the solution goes through putting many of these servers together, as we did at the BSC, with the CTE-POWER supercomputer, where 54 servers are linked together with an InfiniBand network on optical fiber.

In this new scenario, we need an extension of the software stack to deal with multiple distributed GPUs in the neural network training process. There are many options, as Horovod and Ray. Both plug into TensorFlow or PyTorch. In the last hands-on you will be accompanied in learning this process of distribution of a neural network.

3—Sequential version of ResNet

Before showing how to train a neural network in parallel, let’s start with a sequential version of the training in order to get familiarized with the classifier.

3.1 Python code

The sequential code to train the previously described problem of classification of the CIFAR10 dataset using a ResNet50 neural network could be the following (we will refer to it as ResNet50_seq.py):

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models
import numpy as np
import argparse
import time
import sys
from cifar import load_cifar
parser = argparse.ArgumentParser()
parser.add_argument(‘ — epochs’, type=int, default=5)
parser.add_argument(‘ — batch_size’, type=int, default=2048)
args = parser.parse_args()
batch_size = args.batch_size
epochs = args.epochs
train_ds, test_ds = load_cifar(batch_size)
model = tf.keras.applications.resnet_v2.ResNet50V2(    
        input_shape=(128, 128, 3), 
opt = tf.keras.optimizers.SGD(0.01)
model.fit(train_ds, epochs=epochs, verbose=2)

ResNet50_seq.py file can be downloaded from the course repository GitHub.

3.2 SLURM script

To run this python code using the SLURM system, as you know, it can be done using the following SLURM script (we will refer to it as ResNet50_seq.sh):

#SBATCH — job-name=”ResNet50_seq”
#SBATCH — output=RESNET50_seq_%j.out
#SBATCH — error=RESNET50_seq_%j.err
#SBATCH — nodes=1
#SBATCH — ntasks=1
#SBATCH — cpus-per-task=160
#SBATCH — time=00:05:00
module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML
python ResNet50_seq.py --epochs 5 --batch_size 256

ResNet50_seq.sh can be downloaded from the course repository GitHub.

Task 3:

Write your  ResNet50_seq.py program and execute it with 5 epochs in CTE-POWER with the SLURM workload manager system using the job script ResNet50_seq.sh presented in this section (with 5 minutes as a maximum time) What is the result? Include in the answer your code of ResNet50_seq.py  , ResNet50_seq.sh, and the relevant part of the  .out and .err files. 

As a hint, in my execution appears the following error:

slurmstepd: error: *** JOB <job_id> ON p9r2n12 CANCELLED AT 2025-11-19T09:48:59 DUE TO TIME LIMIT ***

If you observe that SLURM system does not control the time limit correctly, I propose to cancel the job after 10 minutes.

3.3 Using a GPU for training

Unlike the MNIST problem, in this problem, we cannot train the neural network with a single CPU. It is clear that we need more computing power for training this problem. In this case, remember that we can add this line to the SLURM script:

#SBATCH --gres=gpu:1

Task 4:

Execute the same  ResNet50_seq.py program the job script ResNet50_seq.sh including the allocation of one GPU. What is the result? Include in the answer your code ofResNet50_seq.sh, and the relevant part of the  .out and .err files. 

In the .out file we can see the result of the output that gives us the Keras that specify the time required for one epoch, the loss value and de accuracy achieved with this new epoch:

Epoch 1/5
196/196 - 41s - loss: 2.0176 - accuracy: 0.2584

Task 5:

Analyzing the .out  file, what is the Accuracy obtained for this problem in this execution? What is the Accuracy obtained? What can we do to improve the Accuracy?

3.4 Improving the Accuracy

From the results of the previous Tasks, you can conclude that with 5 minutes you can execute three  or four epochs.   However, the Accuracy obtained is not good (>50%). What we can do is to increase the number of epochs. In this case, it is required to increase the time required.

Task 6:

Modify your  job script ResNet50_seq.sh (SLURM time flag and networks epochs parameter)  so you can obtain a model with an Accuracy greater than 95. In the answer include the your code ofResNet50_seq.sh, and the relevant part of the  .out file. 

4—Comparing different  ResNet networks

We can use the same code presented in the previous sections to train any other networks available in Keras. We only need to change in the program the piece of code that identifies the network (resnet_v2.ResNet50V2 in our example) for the corresponding network.

For the purpose of this post, to calculate the time (which will be the metric that we will use to compare performance), we can use the time that Keras himself tells us that it takes an epoch (sometimes we discard the first epoch as it is different from the rest since it has to create structures in memory and initialize them). Remember that we are in a teaching example, and with this approximate measure of time, we have enough for the course goals.

Task 7:

Train the resnet_v2.ResNet152V2  and compare the required time per epoch in comparison of the training resnet_v2.ResNet50V2. Jusfity why the time is so different. In the answer include  your slurm script used and the .py code used.

Final report

Task 8:

Write a report for this hands-on exercise that includes all the tasks detailing the steps that are done, the code used, and the results.

Once finished, generate a PDF version and submit it to the “racó” in the mailbox “exercise 12”.


Acknowledgement: Many thanks to Juan Luis Domínguez and Oriol Aranda from BSC-CNS, who wrote the first version of the codes that appear in this hands-on, and to Carlos Tripiana and Félix Ramos for the essential support using the POWER-CTE cluster. Also, many thanks to Alvaro Jover Alvarez, Miquel Escobar Castells, and Raul Garcia Fuentes for their contributions to the proofreading of this document. The code used in this post is based on the GitHub https://github.com/jorditorresBCN/PATC-2021