14. Accelerate Transformers using a multi-GPU Parallel Server

Hands-on description

The release of large language models like ChatGPT, the latest question-answering chatbot, only reinforces the perception that the type of neural network called the Transformer still has a long way to go; given their computational complexity, studying and understanding the scalability of these models is crucial.

Transformer-based models have become very useful because they can process text in parallel, so they’re not limited by the bottleneck of having to process text sequentially, like RNN-based models. And thanks to the parallel infrastructures we have available today, the training of these models can be significantly accelerated.

In this hands-on, we will verify this through a set of experiments that the student will carry out, measuring the speedup that a Transformer's training can achieve. Specifically, we will use a version of the well-known BERT, one of the first large language models developed by Google. We will use it to create a sentiment classifier and parallelize it over several GPUs using different strategies.

Also, in this hands-on, we will use PyTorch instead of TensorFlow; as you will see, its use does not differ much from TensorFlow's, since both are Python libraries.

1- Background basics

Natural Language Processing

Natural Language Processing (NLP) is a field of artificial intelligence focused on automatically understanding human language, trying to capture not only the meaning of single words but also their context. Example applications include text classification, text generation, question answering, summarization, translation, etc. Transformer models are used to solve all of these applications.


As we already presented in a previous post, the Transformer is a kind of neural network architecture introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017. It was followed by several influential Transformer models, which we also covered in that previous post.

All these Transformer models have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training that does not require labeled input data: we simply mask some percentage of the input tokens at random and train the model to predict those masked tokens.

(image source)
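A toy, framework-free sketch of this masking objective may help (the 15% masking rate follows BERT's recipe; `mask_tokens` and everything else here is illustrative, not a library API):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """Replace a random subset of tokens with a mask symbol.

    Returns the masked sequence plus the hidden positions and their
    original tokens: the self-supervised training targets.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the movie was surprisingly good and well acted".split()
masked, targets = mask_tokens(tokens)
```

No labels were needed: the raw text itself provides the prediction targets, which is what makes pretraining on huge unannotated corpora possible.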

Transfer Learning

We refer to this type of model as a language model: it develops a statistical understanding of the language it has been trained on. However, such a model is not very useful by itself for specific practical tasks. Because of this, we use these pre-trained language models in a process called transfer learning, which consists of training a model for one task and reusing the knowledge acquired to perform a different one. During this process, the model (pretrained in an unsupervised way) is fine-tuned in a supervised way (using human-annotated labels) on a given task.

Pretraining is the act of training a model from scratch, without any prior knowledge, usually on very large amounts of data. Fine-tuning, in contrast, is the training done (with a dataset specific to your task) after a model has been pretrained.

Transformer Architecture

In a nutshell, Transformers are composed of two main blocks: the encoder and the decoder.

(image source)

While the encoder receives an input and builds a representation of it (of its features), the decoder uses the encoder's representation (those features) together with other inputs to generate a target sequence. Without going into detail, note that these two blocks can also be used independently, depending on the task we are pursuing. For example, for sentence classification, which requires an understanding of the input, we use encoder-only models. For tasks such as text generation, we instead use decoder-only models.

Finally, the encoder-decoder models (also known as sequence-to-sequence models) are good for generative tasks that require input, such as translation, summarization, etc. In fact, the Transformer architecture was originally designed for translation so that during training, the encoder receives inputs (sentences) in a certain language while the decoder receives the same sentences in the desired target language.

Original Transformer Architecture (image source)

In a nutshell, a Transformer is built on a combination of fully connected and self-attention layers; self-attention is the mechanism a Transformer uses to bake the “understanding” of other relevant words into the one currently being processed.

Attention layers

The title of the paper introducing the Transformer architecture was “Attention Is All You Need”, indicating that these models are built with special layers called attention layers. Given the introductory nature of this hands-on, all we need to know is that an attention layer tells the model to pay specific attention to certain words in the input sentence and less to others. The idea is that a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word before or after the one being studied.

Consider the task of translating text from English to French. Given the input “You like this course”, a translation model will need to also attend to the adjacent word “You” to get the proper translation for the word “like”, because in French the verb “like” is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for the translation of that word. In the same vein, when translating “this” the model will also need to pay attention to the word “course”, because “this” translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of “this”. With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.

(extracted from HuggingFace tutorial)
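To make the idea concrete, here is a minimal, framework-free sketch of scaled dot-product attention, the operation at the heart of these layers (toy vectors, pure Python; a conceptual illustration, not the library's implementation):

```python
import math

def softmax(xs):
    # Numerically stable softmax: turn scores into weights summing to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention on plain Python lists.

    For each query, score every key, convert the scores to weights
    with softmax, and return the weighted average of the value vectors.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# A query aligned with the first key attends mostly to the first value.
out = attention([[1.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 2.0], [3.0, 4.0]])
```

The softmax weights are exactly the "pay more attention to some words than others" idea from the translation example: each output token is a context-dependent mixture of all the others.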

2- Case study

BERT model

There are many Transformer-based neural networks. In this hands-on, we are going to use BERT (Bidirectional Encoder Representations from Transformers), one of the most famous Transformers, introduced in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” published by researchers at Google AI Language. Specifically, we are going to use BERT Base, which has 12 encoder layers stacked on top of each other (and about 110M parameters; the larger BERT Large has 340M). This model was introduced in this paper and first released in this repository (this model is case-sensitive: bert-base-cased).

Hugging Face library

Hugging Face has become a popular platform that provides state-of-the-art Transformers models using the power of transfer learning, offering an open-source library to build, train, and share Transformers models. HuggingFace’s transformers library is the de-facto standard for Natural Language Processing because it’s powerful, flexible, and easy to use. This is the reason we will use it in this hands-on.

The Transformers library provides the functionality to create and use shared models, and the Model Hub contains pre-trained models that we can download and use.

As a demonstration of its operation and accessibility, in this post you can see how to program a simple “Hello World” that shows how this library can be used.


As we saw in this other post, this pipeline groups together three steps: (1) preprocessing, (2) passing the inputs through the model, and (3) postprocessing:

(image source)
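As a toy, framework-free illustration of these three stages (the vocabulary, the dummy model, and all function names here are invented for illustration; the real library replaces each stage with a tokenizer, a Transformer, and a label mapper):

```python
# Toy stand-ins for the three pipeline stages.
VOCAB = {"i": 0, "loved": 1, "hated": 2, "this": 3, "movie": 4}

def preprocess(text):
    # Step 1: tokenize the raw string into model-readable integer IDs.
    return [VOCAB[w] for w in text.lower().split()]

def model(token_ids):
    # Step 2: a dummy "model" that scores positivity; a real
    # Transformer would return logits computed from learned weights.
    return 1.0 if VOCAB["loved"] in token_ids else -1.0

def postprocess(logit):
    # Step 3: map the raw score to a human-readable label.
    return "POSITIVE" if logit > 0 else "NEGATIVE"

def sentiment_pipeline(text):
    return postprocess(model(preprocess(text)))
```

The real pipeline has exactly this shape; only the middle step hides millions of parameters.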

The first step that we need to do is tokenization using a tokenizer. Tokenization is the process of encoding a string of text into transformer-readable token ID integers. The following figure shows a visual example of this step:

(image source)

All this preprocessing needs to be done exactly the same way as when the model was pre-trained, so we first need to download that information from the Model Hub. To do this, we use the AutoTokenizer class and its from_pretrained() method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model's tokenizer and cache it (so it is only downloaded the first time you run the code below).
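To illustrate what tokenization does under the hood, here is a toy, greedy longest-match-first subword splitter in the spirit of BERT's WordPiece (the vocabulary below is invented for illustration; the real tokenizer and its vocabulary are what AutoTokenizer.from_pretrained fetches for you):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split, WordPiece style.

    Continuation pieces carry the '##' prefix; a word that cannot be
    split from the vocabulary maps to the unknown token.
    """
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = ("##" if start > 0 else "") + word[start:end]
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary: full words plus '##'-prefixed continuations.
vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}
```

Splitting rare words into known subwords is what lets a fixed-size vocabulary cover arbitrary text; the resulting pieces are then mapped to integer IDs, as in the figure above.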

As a recap, remember that in transformers, each model architecture is associated with three main classes:

– A model class to load a particular pre-trained model.

– A tokenizer class to pre-process the data and make it compatible with a particular pre-trained model.

– A configuration class with the configuration of a particular pre-trained model.

IMDB dataset

The model we will build as a case study is a sentiment classifier for the IMDB dataset: it determines the attitude, or sentiment, of a customer's opinion, categorizing it as positive or negative (for each movie review text, the model has to predict a positive/negative sentiment label). So, the purpose of our Transformer will be to act as a classifier that predicts whether a review is positive or negative.

The IMDB sentiment classification dataset was created by researchers at Stanford University and consists of 50K movie reviews in text form from IMDB users, labeled as either positive or negative. It is a dataset for binary sentiment classification: the IMDB reviews are split into 25,000 highly polar movie reviews for training and 25,000 for testing.

3- Code and Supercomputing resources

CTE-POWER cluster

In this hands-on, we will use one node of the CTE-POWER cluster at BSC-CNS. This means that we use very specific arguments in many methods, highly dependent on our cluster. However, the reader will notice that the sequence of method calls is the same as if we were using an infrastructure with internet access to download data online.


The code used in this post is available on GitHub: https://github.com/jorditorresBCN/Fundamentals-DL-CTE-POWER


Due to the lack of internet connection on the BSC clusters, we have to load locally some data that would otherwise be downloaded during the training process. We placed all of it in the MareNostrum GPFS file system, in a folder named data in the teacher's account. You can obtain these data by copying everything required locally with the following command:

cp -r /gpfs/projects/nct00/nct00036/transformers/data data

As we will see later, in the Python code we must indicate that the data will be accessed offline (not online, as is the default) using this line:

os.environ['HF_DATASETS_OFFLINE'] = '1'

Software stack

In this hands-on, we will use v4.10 of the Transformers library and v1.8 of Datasets. All the required libraries, such as transformers, pytorch, etc., are already installed on the cluster. The software stack can be loaded with:

module purge; module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML arrow/3.0.0 torch/1.9.0a0 text-mining/2.1.0

Job submission

To run a code in the CTE-POWER cluster, we use the SLURM workload manager.

In the previous hands-on, “Using Supercomputers for DL training”, you can find more information about how to use SLURM on CTE-POWER.

In short, the basic structure of a SLURM shell script we need can be:

#SBATCH --job-name <job-name>
#SBATCH -D <folder-name>
#SBATCH --output %j.out
#SBATCH --error %j.err
#SBATCH --ntasks-per-node=1
#SBATCH -n 1
#SBATCH -c 40
#SBATCH --gres='gpu:1'
#SBATCH --time 01:00:00
module purge; module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML arrow/3.0.0 torch/1.9.0a0 text-mining/2.1.0
python main.py --json_file './config.json'

4- How to use one GPU to accelerate the training with PyTorch

PyTorch library

A few years ago, the first version of PyTorch appeared, and without question it has gained great momentum. Initially incubated by Facebook, PyTorch rapidly developed a reputation as an ideal, flexible framework for rapid experimentation and prototyping, gaining thousands of fans within the deep learning community. For instance, Ph.D. students in my research team chose PyTorch because it allows them to write native-looking Python code and still get all the benefits of a good framework, such as automatic differentiation and built-in optimization. Just last week, PyTorch 2.0 was introduced!

How to launch a job and how to specify training arguments

In the BSC CTE-POWER cluster, we can launch an execution as:

python main.py --json_file './config.json'

As we will see later, the Trainer class from Hugging Face provides an API for feature-complete training in PyTorch for most standard use cases. You must pass a TrainingArguments object with all the information about the training. You can find all parameters here.

In our case, we will modify some of them using the config.json file:

    "TrainingArgs": {
        "output_dir": "/gpfs/scratch/bsc??/bsc?????/test_trainer", 
        "evaluation_strategy": "epoch",
        "num_train_epochs": 2,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "learning_rate": 5e-5,
        "fp16": false,
        "logging_strategy": "no",
        "save_strategy": "no",
        "report_to": "none",
        "disable_tqdm": true,

Warning: you must change the output directory in your config.json file to one of your own. The teacher's test config.json can be found on GitHub for your convenience.

Get started with python code

But before we start running the experiments, let's review the code to understand its details. The code discussed below can be found on GitHub for your convenience (main.py file). First, we have to import all the required packages:

from datasets import load_dataset
from datasets import load_metric
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
import os
import numpy as np
import json
import argparse

As we mentioned, we should indicate that the data will be obtained offline (from your data folder):

os.environ['HF_DATASETS_OFFLINE'] = '1'

The load of the TrainingArguments can be done as:

parser = argparse.ArgumentParser()
parser.add_argument('--json_file', type=str)
args = parser.parse_args()
with open(args.json_file, 'r') as f:
    config = json.load(f)

As we mentioned, because the data cannot be obtained online, we need to use different methods and parameters compared to the online version of the code. For instance, to load the data with the load_dataset method, we cannot simply use the argument "imdb" (load_dataset("imdb")).

In summary, we need to specify as an argument the function that loads the data and the folder where the data are stored (previously downloaded in your  data folder in your local GPFS file system):

raw_datasets = load_dataset("data/imdb.py", cache_dir="data/cache")

The same happens with the tokenizer, the equivalent of the online code (AutoTokenizer.from_pretrained("bert-base-cased")) requires more complex code:

def tokenize_function(examples):
    return tokenizer(examples["text"],
           padding="max_length", truncation=True)
tokenizer = AutoTokenizer.from_pretrained("data/cache/tokenizer")
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]

In the offline case, we need to download the bert-base-cased tokenizer into the local folder /data/cache/tokenizer. After loading the tokenizer, we apply it using the map method, which returns the tokens (padded to the max_length specified in the tokenizer).

The pre-trained model is obtained in the same way (including the path to the folder /data/cache that contains the model):

model = AutoModelForSequenceClassification.from_pretrained("data/cache", num_labels=2)

In summary, if we download the data and classes online, the parameter pretrained_model_name of the common class method from_pretrained(pretrained_model_name, ...) is a string with the shortcut name of the pre-trained model/tokenizer/configuration to load, e.g. bert-base-cased. We can find all the shortcut names in the transformers documentation here. Remember that because we work in offline mode on the BSC cluster, the specification of these arguments becomes a little more complicated.

Remember that to evaluate the model we need a metric (for example, accuracy). The Hugging Face library already provides metrics, which we also had to download locally into data beforehand:

metric = load_metric("./data/accuracy.py")

We will use the compute_metrics function to see the accuracy; although it is informative, the Trainer does not use it for training.

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = metric.compute(predictions=predictions,
                         references=labels)
    return acc

Remember that the Trainer class from Hugging Face is what trains the model. The model, training arguments, train dataset, evaluation dataset, and metric function must be passed as arguments.

training_args = TrainingArguments(**config['TrainingArgs'])
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=full_train_dataset,
    eval_dataset=full_eval_dataset,
    compute_metrics=compute_metrics,
)

Finally, all the parameters that we have not specified in our config.json file take their default values.

Now we are ready to run the training by invoking the train method:

trainer.train()
Let’s remember that since the configuration file config.json has these values:

"evaluation_strategy": "epoch",
"num_train_epochs": 2,

this indicates that we measure the time to train for only two epochs, with two evaluations. A short execution has been deliberately chosen, since we only want to show how the scaling of the algorithm behaves (and we must make limited use of the resources offered by BSC).

From the output of the execution of the program (in our case main.py ), we can obtain the following results:

'train_runtime': 2507.3917,
'train_samples_per_second': 19.941,
'train_steps_per_second': 1.247,
'train_loss': 0.20414053440399071,
'epoch': 2.0

The 'train_runtime' line indicates the time required for training.
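As a quick sanity check (assuming training runs over the full 25,000-review IMDB training split), the reported throughput is consistent with that runtime:

```python
train_samples = 25_000       # IMDB training split size
epochs = 2                   # from config.json
train_runtime = 2507.3917    # seconds, from the log above

# Samples processed per second over the whole training run.
samples_per_second = train_samples * epochs / train_runtime
print(round(samples_per_second, 3))  # matches the logged 19.941
```

The same kind of back-of-the-envelope check is useful later for spotting anomalous runs in the parallel experiments.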

Task 1: Sequential training using one GPU

Train the Transformer on one GPU using the main.py code you can download from the course GitHub. Execute it on the CTE-POWER cluster using the SLURM job job-transformer-sequential.sh from the course GitHub. Remember to create the configuration file config.json in your local directory and check that the "output_dir" flag points to a folder in your file system.

Inspect the .out file to obtain the execution time required for this training ('train_runtime'). If the .out file does not contain the expected information, inspect the .err file to see what happened.

Include in the answer the Python program main.py used, your SLURM script job-transformer-sequential.sh, and the JSON config file config.json.

WARNING: the SLURM job can take around 45 minutes of execution time.


5- Basics in parallel use of GPUs with PyTorch

Parallel execution in PyTorch

In PyTorch, there are DataParallel and DistributedDataParallel to make full use of a number of GPUs (there are significant caveats to using CUDA models with multiprocessing).

The DataParallel module splits a batch of data into several mini-batches and feeds each mini-batch to one GPU; each GPU holds a copy of the model (data parallelism). After each forward pass, all gradients are sent to the master GPU, and only the master GPU performs the backpropagation and updates the parameters; it then broadcasts the updated parameters to the other GPUs.

The DistributedDataParallel module is similar to DataParallel, but it uses the AllReduce operation, which makes data exchange between GPUs more efficient. A more detailed explanation of this process can be found in this post from Dongda Li.
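The core idea can be pictured with a tiny, framework-free simulation of data parallelism (a toy one-parameter model; all names here are invented for illustration): each replica computes gradients on its own shard of the batch, and an AllReduce-style average reproduces exactly the gradient of the full batch.

```python
def local_gradients(weights, minibatch):
    """Each replica computes gradients on its own shard of the batch.

    Toy 'model': mean-squared error of a single weight w against the
    targets in the shard, whose gradient is mean(2 * (w - y)).
    """
    w = weights[0]
    g = sum(2 * (w - y) for y in minibatch) / len(minibatch)
    return [g]

def all_reduce_mean(grads_per_gpu):
    """Average gradients across replicas, as AllReduce does in DDP."""
    n = len(grads_per_gpu)
    return [sum(g[i] for g in grads_per_gpu) / n
            for i in range(len(grads_per_gpu[0]))]

# Split one global batch into per-GPU mini-batches (data parallelism).
batch = [1.0, 2.0, 3.0, 4.0]
shards = [batch[0:2], batch[2:4]]
weights = [0.0]
grads = [local_gradients(weights, s) for s in shards]
avg = all_reduce_mean(grads)   # identical to the full-batch gradient
```

Because the averaged gradient equals the full-batch gradient (for equal-sized shards), every replica can apply the same update and the copies of the model stay synchronized without a master GPU.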

This hands-on deals with how to use distributed data parallelism on a single machine with multiple GPUs (not distributed across multiple servers).

According to the official PyTorch documentation, it is recommended to use DistributedDataParallel instead of DataParallel for multi-GPU training, even on a single node.

Parallel execution using DistributedDataParallel library

With DistributedDataParallel we can use the torch.distributed.launch utility to launch the Python program; on the BSC cluster, we can launch this parallel execution (using the distributed data parallel (DDP) library) as:

python -m torch.distributed.launch --nproc_per_node=2 main_ddp.py --json_file './config.json'

torch.distributed.launch will spawn multiple processes for you; nproc_per_node is set to the number of GPUs on the node. As we can see in the code, in DDP each process knows its own local_rank value, provided by torch.distributed.launch and updated during execution.

To use the DDP library, we need to add an additional line to the code to load the local_rank argument provided by torch.distributed.launch:

parser.add_argument("--local_rank", type=int, default=0)

The python code that contains the DDP version can be found as main_ddp.py  on GitHub.
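Putting the argument handling together, here is a minimal self-contained sketch of what main_ddp.py must parse (the build_parser helper is an invention of this sketch; torch.distributed.launch injects --local_rank into each spawned process):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # Our own configuration-file argument, as in main.py.
    parser.add_argument("--json_file", type=str, default="./config.json")
    # Injected per process by torch.distributed.launch.
    parser.add_argument("--local_rank", type=int, default=0)
    return parser

# Simulate what process 1 of a 2-GPU launch would receive.
args = build_parser().parse_args(["--local_rank", "1"])
```

Each spawned process thus sees a different local_rank, which the DDP code uses to select its GPU.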

Task 2: Parallel execution with DDP

Create and submit the corresponding SLURM job scripts (*) to measure the time required to execute the training on 2 and 4 GPUs using main_ddp.py. From the .out file, obtain the time, compute the speedup, and take note of the accuracy. Present the results as a table. In the answer, discuss the obtained results.

(*) You can use the job-transformer-DDP-2-GPUs.sh and job-transformer-DDP-4-GPUs.sh files from GitHub if you want.
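When filling in the table, speedup and parallel efficiency can be computed as sketched below (the runtimes here are placeholders purely to illustrate the format; use the 'train_runtime' values from your own .out files):

```python
def speedup(t_serial, t_parallel):
    # How many times faster than the 1-GPU baseline.
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_gpus):
    # Fraction of ideal linear scaling actually achieved.
    return speedup(t_serial, t_parallel) / n_gpus

# Placeholder runtimes in seconds, keyed by number of GPUs.
runtimes = {1: 2500.0, 2: 1350.0, 4: 750.0}
for n, t in runtimes.items():
    print(f"{n} GPUs: {t:8.1f} s  "
          f"speedup {speedup(runtimes[1], t):5.2f}  "
          f"efficiency {efficiency(runtimes[1], t, n):6.1%}")
```

Efficiency below 100% is expected: gradient communication between GPUs adds overhead that does not shrink as you add devices.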

6- Mixed Precision

There are numerous benefits to using numerical formats with lower precision than 32-bit floating point. First, they require less memory, enabling the training and deployment of larger neural networks. Second, they require less memory bandwidth, thereby speeding up data transfer operations. Third, math operations run much faster in reduced precision, especially on GPUs with Tensor Core support for that precision. Mixed-precision training achieves all these benefits while ensuring that no task-specific accuracy is lost compared to full-precision training (in our case study, because we train for only two epochs, a slight difference may appear). It does so by identifying the steps that require full precision and using 32-bit floating point only for those steps, while using 16-bit floating point everywhere else.

A more detailed explanation of this process can be found in this post from NVIDIA. The parameter fp16_opt_level can also be modified to obtain different types of quantization.
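In our setup, enabling mixed precision amounts to flipping the fp16 flag in the TrainingArgs section of config.json (all other fields stay exactly as in the sequential run); a minimal fragment:

```json
{
    "TrainingArgs": {
        "per_device_train_batch_size": 16,
        "fp16": true
    }
}
```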


Task 3: Evaluate the benefits of using Mixed Precision 

Create and submit a job script to run an experiment on 1 GPU with the fp16 flag enabled in config.json. From the .out file, obtain the time, compute the speedup, and take note of the accuracy (the first row is the same as Task 1). Present the results as a table. In the answer, discuss the obtained results.

(*) As a help, you can use the files config_mp.json and job-transformer-mixed-precision.sh from the course GitHub if you want.

Hands-on Report

Task 4:  

Write a report for this hands-on exercise that includes all the tasks, detailing the steps, the code used, and the results. Once finished, generate a PDF version and submit it to the “racó” in the mailbox “Exercise 14”.

Acknowledgment: Many thanks to Juan Luis Domínguez, who wrote the first version of the codes in this hands-on. Also, many thanks to Cristina Peralta for her contribution to the proofreading of this document.