Accelerate the learning with Distributed training using a multiple parallel servers

IMPORTANT:  Before starting to carry out the 10 Tasks in this exercise, it is recommended to take a look at the overall activity in order to have a clear vision of the work to be submitted at the end of the exercise.
WARNING: The resources assigned to this master’s course allow using up to 16 GPUs at CTE-POWER. If it fails during the execution of this hands-on lab, please, use up to 8 GPUs (2 servers) as a minimum. It could be that the SLURM system allows you to execute up to 32 GPUs (4 servers), but I can not assure it at the moment of writing this post.
Hands-on exercise description (exercise 15

In the previous hands-on exercise, we explored how to scale the training on multiple GPUs in one server with TensorFlow using tf.distributed.MirroredStrategy(). In today’s hands-on exercise, we will scale the training on multiple servers following data parallelism strategies using the middleware Horovod. Specifically, we will employ Horovod to distribute the work present in a Deep Learning model to solve the CIFAR-10 dataset. This hands-on exercise has two basic parts: (1) A step-by-step guide to studying the scaling efficiency of Horovod for the ResNet50V2 model (50% grade) and (2) a proposal to compare the scaling efficiency of Horovod for ResNet50V2 model versus ResNet152V2 model (50% grade).

Resource reservation for SA-MIRI 2022

Remember that BSC has made a special reservation of supercomputer nodes to be used for this hands-on exercise.

To use the corresponding reservations, remember to add this line in the SLURM script with the corresponding <ReservationName> for each day:

#SBATCH --reservation=<ReservationName>

Remember that you can also use the regular SLURM submission system, just not including this additional line.

1—Horovod basics

Uber Engineering introduced Michelangelo, an internal ML-as-a-service platform that makes it easy to build and deploy these systems at scale. Horovod, a component of Michelangelo, is an open-source distributed training framework for TensorFlow, PyTorch, and MXNet. Its goal is to make distributed Deep Learning fast and easy to use via ring-allreduce and requires only a few lines of modification to user code (an existing training script can be scaled up in just a few lines of Python code). Horovod is hosted by the LF AI & Data Foundation.

1.1 Why Horovod

According to the Horovod documentation page, the primary motivation for this project was to make it easy to take a single-GPU training script and successfully scale it to train across many GPUs in parallel. This has two aspects:

  1. How much modification does one have to make to a program to make it distributed, and how easy is it to run it?
  2. How much faster would it run in distributed mode?

Once a training script has been written for scale with Horovod, it can run on a single-GPU, multiple-GPUs, or even multiple hosts without any further code changes. In addition to being easy to use, according to the official documentation, Horovod is fast as you could observe later in this exercise.

1.2 Horovod concepts

(source of this section)

Horovod core principles are based on the MPI concepts size, rank, local rank, allreduce, allgather, broadcast, and alltoall. These are best explained by example. Say we launched a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:

  • Size would be the number of processes, in this case, 16.
  • Rank would be the unique process ID from 0 to 15 (size – 1).
  • Local rank would be the unique process ID within the server from 0 to 3.
  • Allreduce is an operation that aggregates data among multiple processes and distributes results back to them. Allreduce is used to average dense tensors. Here’s an illustration from the MPI Tutorial:

Allreduce Illustration

  • Allgather is an operation that gathers data from all processes on every process. Allgather is used to collect values of sparse tensors. Here’s an illustration from the MPI Tutorial:

Allgather Illustration

  • Broadcast is an operation that broadcasts data from one process, identified by root rank, onto every other process. Here’s an illustration from the MPI Tutorial:
    Broadcast Illustration
  • Alltoall is an operation to exchange data between all processes. Alltoall may be useful to implement neural networks with advanced architectures that span multiple devices.

1.3 A Data-parallel Distributed Training Paradigm

Conceptually, the data-parallel distributed training paradigm under Horovod is straightforward:

1. Run multiple copies of the training script and each copy:

  • reads a chunk of the data
  • runs it through the model
  • computes model updates (gradients)

2. Average gradients among those multiple copies

3. Update the model

4. Repeat (from Step 1)

Horovod applies Baidu’s algorithm for averaging gradients and communicating those gradients to all nodes (steps 2 and 3 above) that follows the ring-allreduce decentralized scheme. The algorithm was based on the approach introduced in the 2009 paper by Patarasuk and Yuan. Horovod replaced the Baidu ring-allreduce implementation with NCCL-2, which is NVIDIA’s library for collective communication that provides a highly optimized version of ring-allreduce across multiple machines.

The following figure from the paper by Sergeev and Balso shows the ring-allreduce algorithm that allows workers nodes to average gradients and disperses them to all nodes without the need for a centralized scheme with a parameter server.

Source: Sergeev, A., & Del Balso, M. Horovod: fast and easy distributed deep learning in TensorFlow

A more clear and visual explanation can be obtained in this post from Medium: “Visual intuition on ring-allreduce for distributed Deep Learning”.

In this ring-allreduce algorithm, each of N nodes communicates with two of its peers 2∗(N−1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N-1 iterations, received values are added to the values in the node’s buffer. In the second N-1 iteration, received values replace the values held in the node’s buffer. Patarasuk and Yuan suggest that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network.

1.4 Installing Horovod

Horovod is a python package installed using pip. In general, it assumes the installation of MPI for worker discovery and reduction coordination and Nvidia’s NCCL-2 libraries to support inter-GPU communication (NCCL is supported for Allreduce, Allgather, Broadcast, and Alltoall operations). This is because, as you know, MPI is used extensively in the supercomputing community for high-performance parallel computing.

Here you can find more details on installing Horovod with GPU support. For the complete list of Horovod installation options, read the Installation Guide.  If you are going to use MPI, read Horovod with MPI. Finally, if you want to use Docker, read Horovod on the Docker web page.

However, if MPI is not installed, Horovod includes Gloo, an open-source collective communications library developed by Facebook. Gloo is a more recent controller for Horovod. When Gloo is used in combination with NCCL, it performs almost identically to MPI on standard benchmarks.

In this exercise at CTE-POWER BSC cluster we will use gloo

1.5 Horovod Usage

Horovod introduces an hvd  object that has to be initialized and has to wrap the optimizer (Horovod averages the gradients using allreduce or allgather). A GPU is bound to this process using its local rank, and we broadcast variables from rank 0 to all other processes during initialization.

To use Horovod in Tensorflow, you must make the following additions to your program:

  1. Import Horovod:
import tensorflow as tf
import horovod.tensorflow as hvd

2. Horovod must be initialized before starting to run hvd.init():


3. Pin each GPU to a single process to avoid resource contention. We will be employing local rank for that, such that the first process in the server will be pinned to the first GPU, the second process to the second GPU, and so forth:

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
                 gpus[hvd.local_rank()], 'GPU')

4. Scale the learning rate by the number of workers. Effective batch size in synchronous distributed training is scaled by the number of workers. An increase in the learning rate compensates for the increased batch size. You can learn more about this in this Facebook paper.

opt = tf.keras.optimizers.SGD(0.0005 * hvd.size())

5. Wrap optimizer in hvd.DistributedOptimizer. The distributed optimizer delegates gradient computation to the original optimizer averages gradients using allreduce or allgather, and then applies those averaged gradients.

opt = hvd.DistributedOptimizer(opt)

6. Specify experimental_run_tf_function=False to ensure TensorFlow uses Horovod’s distributed optimizer to compute gradients.

      loss= ... ,  
      optimizer= ... ,  
      metrics= ... , 


7. Add hvd.callbacks.BroadcastGlobalVariablesCallback(0) (as a parameter in fit()) to broadcast initial variable states from rank 0 to all other processes. This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.

callbacks = [

8. If you need to save checkpoints, do it only on worker 0 to prevent other workers from corrupting them. Or if you want to run an evaluation or print information to the standard output, it is recommended to do it on worker 0. This can be accomplished with hvd.rank() = 0.

if hvd.rank() == 0:

See the oficial examples directory for full training examples.

1.6 Running Horovod

A Horovod Python program is launched using the horovodrun  command. It takes as parameters the hostname of each server as well as the number of GPUs to be used on each server. For example, to run on a machine with 4 GPUs:

$ horovodrun -np 4 -H localhost:4 python

Or if we want to run our program on 4 machines with 4 GPUs each, the parameters of the horovodruncommand looks like this:

$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python

We already introduced that Horovod includes Gloo, an open-source collective communications library. We can use it at runtime by passing the --gloo argument to horovodrun:

$ horovodrun --gloo -np 4 -H localhost:4 python


2—Case Study

To learn how to use Horovod, we will use the same example presented in the previous hands-on exercises that train a classifier of the CIFAR10 dataset based on a ResNet50 using the CTE-POWER machine.

2.1 ResNet50 code

Following the steps presented in the above subsection 1.5 on how to use Horovod API, below we present the resulting parallel code based on your code used in the previous hands-on exercise:

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models
import horovod.tensorflow.keras as hvd
import numpy as np
import argparse
import time
import sys
from cifar import load_cifar
parser = argparse.ArgumentParser()
parser.add_argument(‘ -- epochs’, type=int, default=5)
parser.add_argument(‘ -- batch_size’, type=int, default=256)
args = parser.parse_args()
batch_size = args.batch_size
epochs = args.epochs
model_name = args.model_name
gpus = tf.config.experimental.list_physical_devices(‘GPU’)
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    gpus[hvd.local_rank()], ‘GPU’)
train_ds, test_ds = load_cifar(batch_size)
model = tf.keras.applications.resnet_v2.ResNet50V2(
        include_top=True, weights=None,
        input_shape=(128, 128, 3), classes=10)
if hvd.rank() == 0:
opt = tf.keras.optimizers.SGD(0.0005 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
callbacks = [
if hvd.rank() == 0:
   verbose = 2
   verbose=0, epochs=epochs, 
          verbose=verbose, callbacks=callbacks)


Task 1: (10% grade)

Analyze the previous code (included on the GitHub for this course as highlighting, and describing in a textual way, the lines of code that correspond exclusively to the use of Horovod presented in section 1.5.

(*) The code used in this post is based on the GitHub

REMINDER: The resources assigned to this master’s course allow using up to 16 GPUs at CTE-POWER. If it fails during the execution of this hands-on lab, please, use up to 8 GPUs (2 servers) as a minimum. It could be that the SLURM system allows you to execute up to 32 GPUs (4 servers), but I can not assure it at the moment of writing this post.

2.2 SLURM script

After modifying your sequential python code with Horovod API calls, you must write your job script to submit the job to the SLURM workload manager as we did in the previous section.

For the concrete example introduced in the previous section, the SLURM script to send the jobs to run on one server with 4GPUs looks like this:

#SBATCH --job-name horovod1
#SBATCH --output hvd_1_%j.out
#SBATCH --error hvd_1_%j.err
#SBATCH --nodes=1
#SBATCH --gres='gpu:4'
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task 40
#SBATCH --time 00:25:00
module purge; module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML
horovodrun -np $SLURM_NTASKS -H localhost:$SLURM_NTASKS --gloo \
python3.7 --epochs 10 --batch_size 512

If we take a look at this SLURM script, we can find the following variables that will determine how our model will be distributed:

  • nodes: Amount of servers that will be employed to distribute the workload
  • gres: Amount of GPUs per server that will be used.
  • ntasks-per-node: Number of processes running in a server. In Horovod, each GPU gets pinned to a process. Therefore this should be equal to the number of GPUs employed in each server (it must be added to create SLURM_NTASKS variable that will be required for the -np flag in horovodrunthat specifies the number of processes).
  • cpu-per-task: according to the recommendations of the support team of CTE-POWER it must be equal to 40 per GPU used.

Remember that in previous hands-on exercises, you can find more detailed information about all the #SBATCH commands we usually use in our supercomputer.

The previous listing shows different names than the previous one for the #SBATCH options. It is done deliberately with the purpose of showing that SLURM has some “flexibility” with the names used.

2.3 Horovodrun command flags

Detailed information about the possible flags for horovodrun command can be obtained by executing the command horovodrun --help.

In the above SLURM script, we are using one server with 4 GPUs. If we would like to employ 8 GPUs to do our computation, we would need two servers, as each server in the CTE-POWER has 4 GPUs.

But how can we run our program on more than one server, for example, on two?  For this, we can use the following SLURM script:

#SBATCH --job-name horovod-multinode
#SBATCH --output hvd_multinode_%j.out
#SBATCH --error hvd_multinode_%j.err
#SBATCH --nodes=2
#SBATCH --gres=’gpu:4'
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task 40
#SBATCH --time 00:50:00
module purge; module load gcc/8.3.0 cuda/10.2 cudnn/7.6.4 nccl/2.4.8 tensorrt/6.0.1 openmpi/4.0.1 atlas/3.10.3 scalapack/2.0.2 fftw/3.3.8 szip/2.1.1 ffmpeg/4.2.1 opencv/4.1.1 python/3.7.4_ML
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
horovodrun --start-timeout 120 --gloo-timeout-seconds 120 \
-np $SLURM_NTASKS $HOSTS_FLAG --network-interface ib0 --gloo \
python3.7 --epochs 10 --batch_size 512

Using this example, let’s describe in a little more detail the flags of thehorovodrun command.  Specifically, if we display these two variables, SLURM_NTASKS and HOST_FLAGS,  including these two lines in our SLURM job script:


we can obtain their values:

HOST_FLAG= -H p9r1n01-ib0:4,p9r1n03-ib0:4

As expected, SLURM_NTASKS indicates the number of processes in parallel and HOST_FLAGSthe name of hosts used. In this case,  the SLURM manager system assigned the nodes identified by hp9r1n01-ib0  and hp9r1n03-ib0 (these nodes are known once the execution of our SLURM job has started). :4 indicates that our job employs 4 GPUs of each server (therefore 8 GPUs in total).

The other three flags ( start-timeout , gloo-timeout-seconds ,network-interface ib0 ) are included in order to solve some runtime settings at the launching horovodrun server when we were preparing this hands-on exercise. We will not go into details to not lose focus in this exercise.

Task 2: 

Based on the previous code (included on the GitHub for this course as  write a new job script with the name job_hvd_multi_node_16gpus.shthat allows submitting a job that uses 16 GPUs for training your python code Highlight and describe the main lines of the job script (as a hint, you can inspect the that uses 32 GPUs included on GitHub).


3—Scaling Efficiency Study

In this section, we will present a set of experiments that will help us explain how the Horovod API works and, at the same time, experiment with its scalability.

3.1 Ideal definition of the experiment

The canonical test that would be more suitable to evaluate the scalability of Horovod is to perform a set of tests where the same number of epochs are executed with different GPUs: 1,2,4,8,16, and 32. But, unfortunately, this is not feasible given the limitations of resources assigned to this Master’s course. Let’s see why.

Let’s assume that with the 32 GPU test, for example, we ran a minimum number of 8 epochs. In the ideal canonical comparison test, this would involve running 256 epochs on 1 GPU (32×8) in order to compare the tests. Even in the small problem of classifying the CIFAR40 dataset with a ResNet50, each epoch takes around 40 seconds. That means that it takes around 3 hours to train the model with 1 GPU.

Also, more executions need to be added to a real problem. A hyperparameter tuning should be performed, which may require various executions until we obtain the expected results with the validation data. Furthermore, remember that we should carry out several executions to give statistical validity to the tests given the innate randomness in the behavior of the software stack or the initializations of the model’s initializations.

In short, a huge amount of computing resources is required for each student’s group — totally unfeasible given the computing resources available for this Master’s course.

3.2 A limited but sufficient experiment for this course

Given these constraints, we have defined a hands-on exercise that performs specific tests that cover the academic objectives of this exercise: understand how Horovod works. Specifically, we have designed this hands-on exercise centered on measuring only the throughput (images per unit of time) and not considering the accuracy metric when we compare the different tests performed with different amounts of GPUs in order to measure Horovod scaling efficiency. This allows us to define smaller experiments in terms of resource usage.

We can do that because we are assuming that if we perform a correct hyperparameter tuning of the models,  the Accuracy remains fairly constant in relation to the number of total epochs made, regardless of the number of GPUs between which the execution of the epochs has been distributed. This assumption is acceptable considering the results of Facebook’s paper, among others, which describes the adjustments needed to model hyperparameters to achieve the same accuracy in a distributed training job compared to training the same model on a single GPU, demonstrating the feasibility of training a TensorFlow model on 256 GPUs.

In summary, we have designed a limited academic exercise composed of six tests using, respectively,  1,2,4,8, and 16GPUs. In all test cases, each GPU performs the same number of epochs (we suggest 10 epochs). That means that, for instance, the test with 16 GPUs will perform 16 times the number of epochs performed by the test with 1 GPU. In this scenario, the measurement of time of the training process of each test can serve as a metric of the quality of scaling at the throughput level. For example, if Horovod’s scaling efficiency showed a perfect linear speedup, the time in all tests should be exactly the same. On the contrary, if it is observed that the execution time of the tests increases as more GPUs are used, this indicates that there is an overhead due to the distribution.

REMINDER: The resources assigned to this master’s course allow using up to 16 GPUs at CTE-POWER. If it fails during the execution of this hands-on lab, please, use up to 8 GPUs (2 servers) as a minimum. It could be that the SLURM system allows you to execute up to 32 GPUs (4 servers), but I can not assure it at the moment of writing this post.

3.3 Measuring the training execution time

Then, as we focus our hands-on exercise on the computational speed of the process rather than the model’s accuracy, we will only compare the time required to execute the .fitmethod with the different tests. For this purpose, we will use the following simple code to measure the time:

start = time.time(), epochs=epochs, verbose=verbose, callbacks=callbacks)
end = time.time()
if hvd.rank() == 0:
   print(‘Total Time:’, round((end — start), 2), ‘(s)’)

Once we run the previous python code in the file .out that has stored the standard output, we will find the execution time required to execute the method .fit().

There are different ways to design the experiment, but a simple way seems to be to send a job to the SLURM management system for each test: 1,2,4,8,16 GPUs. We created a job script for each test that issues them to the queue system sequentially:

[nct010xx@p9login2]$ ls

Here you can download some of these job scripts if you want to use them (You can try with 32 GPUs, I’m not sure if your account can allocate 4 servers).


And after the execution of each model, each of these SLURM jobs generated a .out file containing the time it took to train 10 epochs:

[nct010xx@p9login2]$ tail hvd_1_4gpus_4721140.out

Total Time: 461.1 (s)

where the last line indicates the required time to complete a total training.

In summary, we will have 6 .out files, one for each test. Remember that we have designed a limited academic exercise composed of six tests (for 1,2,4,8,16, and 32 GPUs). In all executions, each GPU performs the same number of epochs (10 epochs). That means that, for instance, a test with 32 GPUs will perform 32 times the epochs performed by a test with 1 GPU.

Note: The default  shared in the GitHub only outputs the results for process 0 (the time required for process 0). If you want to follow the training process of the other processes, change the verbosevariable value.

Task 3: (10% grade)

Create and submit all job scripts required to measure the time required to execute all different tests for 1,2,4,8,16 and 32.  

Include in the answer a table that summarizes the results obtained. the SLURM scripts for each test (1,2,4,8,16 and 32).

Warning: This exercise requires paying attention to the design and execution of the different tests.

 Note: We have employed two styles of SLURM scripts to deploy this model to the CTE-Power cluster. The first one we have used is prepared to run on one server. Where ntasks-per-node  and gres='gpu:X' define the number of GPUs we will use in the node.  The number of nodes is 1, as we will only deploy this in a single server. The second SLURM file style we have used is prepared to run the code in multiple servers, as we showed in section 2.3, requiring setting the nodes variable to the number of servers.

3.4 Analysis of the results

In summary, in the .out files, we have all the times that interest us in order to discover the scaling efficiency of Horovod. The following graph plots the expected measured time required for executing 10 epochs using up to 64 GPUs (execution done with teacher’s account that has access up to 64 GPUs):

We can observe that the more GPUs we have, despite having the same job per GPU (epochs = 10 * num GPUs), there is more overhead time to add. It should be noted that these times come from a single execution and can vary between several executions. It is evident that if we want to do a good study, we should carry out several tests and then take the averages of the times obtained. But given that the purpose of this academic exercise is to learn by doing the use of Horovod (and the cost of resources, which we have to save!), it is suitable for only one execution.


Task 4: (20% grade)

Plot the results of your execution times required for executing 10 epochs using up to 32 GPUs and compare it with the previous plot.

 Finally, as we already mentioned in previous hands-on exercises, speedup is one of the popular metrics in our community of supercomputing. From the previous data, we can 

The following graph plots the speedup in blue and in white which is missing to reach the optimal linear speed up. As we can see, we are pretty close to it.

Each column n is obtained by, first, dividing the time required to execute 10 epochs with 1 GPU by the time required when we execute the same workload with n GPUs. And finally, multiply the resulting value by n (the number of GPUs used).

We can conclude that the training really scales with regard to images per second processed (which does not mean that in terms of Accuracy it happens the same since, in this example, we have ignored the hyperparameter tunning stage and therefore, we cannot draw conclusions from the Accuracy we get).


Task 5: (10% grade)

Plot the speedup of your experiment in the same way that we did in the previous plot and compare it with the optimal liner speed-up (use your preferred plotting tool).


4—ResNet50 vs ResNet152 Scaling Efficiency

Now it’s time to compare the behavior of Horovod API for different models. For simplicity, we suggest comparing two models from the same family of models, for example, ResNet50V2 and ResNet152V2. However, the  shared in GitHub can be used to measure the training time for any other model.

4.1 ResNet152 experiments

Now it’s time to reproduce the above results in section 4 for the ResNet152V2. This should help us to review and consolidate all the essential concepts of parallelization with Horovod API.

You should pay attention to change, if it is required, the value for the parameters and hyperparameters used to train the ResNet50V2model. For instance, the batch_size  for the ResNet152V2 model probably should be modified since the model is very deep and we have to use a reduced batch_size of 256, compared to the 512 batch_size used in the ResNet50V2 model.

Task 6: (20% grade)

Repeat  Task 3 for ResNet152V2 model and summarize in a table the results obtained. Explain why you think these results come out

4.2 ResNet152 results

Task 7: (10% grade)

Plotting in the same graph the results for  ResNet50V2 model obtained in Task 3 and the results for ResNet152V2 model obtained in Task 6.  Explain the results plotted in the graph.

Note: As a clue to check that the ResNet152V2 tests are being done correctly, we expect an execution times (time to train per GPU) roundly the shown in this graph:

4.3 ResNet152 vs ResNet152

Task 8: (10% grade)

Plot the speedup of your experiment with ResNet152V2  model in the same way that we did in Task 5 with ResNet50V2 (including the reference to the ideal speedup).

Task 9: (10% grade)

Plotting in the same graph the results obtained in Task 5 for  ResNet50V2 model and the results obtained in Task 8 for ResNet152V2 model.  Explain why you think these results come out and compare the results.

Note: As a clue, here are some things to see from the plots: (1) Does the ResNet152V2 model take overall more/equal time? (2) Can we see or not see similarities in the scalability of both cases? (3) Are suboptimal results in one/both cases? (4) Could we conclude that horovod utilizes a very convenient scalability method? 


Final report

In this hands-on exercise, we have practiced how to distribute the training of a single Deep Neural Network over many GPU servers using Horovod.


Task 10:

Write a report for this hands-on exercise that includes all the tasks detailing the steps that are done, the code used, the plots, and the results.  Once finished, generate a PDF version and submit it to the “racó” in the mailbox “exercise 15” of the practices section.


Acknowledgment: Many thanks to Juan Luis Domínguez and Oriol Aranda from BSC-CNS, who wrote the first version of the codes that appear in this series of posts, and to Carlos Tripiana and Francisco Gonzalez for the essential support with the deployment of the software stack of POWER-CTE Supercomputer. Also, many thanks to Alvaro Jover Alvarez, Miquel Escobar Castells, and Raul Garcia Fuentes for their contributions to the proofreading of this document.