Horovod on NVIDIA Jetson


Distributed training with Keras on NVIDIA Jetson TX2 and Xavier

Horovod (github) is a software framework for distributed training of neural networks with Keras, Tensorflow, MxNet or PyTorch. Developed at Uber since 2017 it is still in beta version (latest release is v0.16.4 as of 2019/06/18).

This is a short report on my setup for using a number of NVIDIA Jetson Developer Kits for distributed training.


Install on all machines: MPI, NCCL, Tensorflow, Keras, Horovod.

MPI OpenMPI should have been already installed with the Jetpack. It is important to have the same version on all computers. Check with $ mpirun --version


$ sudo apt-get install libffi6 libffi-dev
$ pip install --user horovod

Run distributed training

If installation finished without errors, you are ready to run a test. I used Keras MNIST example.

It is important to have the script file on the PATH on all machines. If it’s not the case, you can use -x option to mpirun command to copy environment variable (PATH) to all machines.

To start training I used the following command:

$ mpirun -np 3 -H localhost:1,jetson1:1,jetson2:1 \
-mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0 \
-bind-to none -map-by slot python keras_mnist.py

-np 3 tells MPI that I need 3 processes: 3 machines each with 1 GPU: 3 processes in total.
-H localhost:1,jetson1:1,jetson2:1 use 1 GPU on localhost (Xavier in my case), 1 GPU on jetson1 and one on jetson2 host. You must be able to login to these hosts with SSH without password ( ~/.ssh/config can help).
-mca btl_tcp_if_include eth0 and -x NCCL_SOCKET_IFNAME=eth0 options for using eth0 network interface.
-bind-to none -map-by slot OpenMPI options.

Below is the video of running Horovod Keras MNIST sample on the 3 machines.

How fast is parallel training? Check my notes below the video.

Performance Analysis

After training the sample network in parallel, I trained it on Xavier and on TX2 with the same parameters without parallelisation. The only hyper-parameter changed is the learning rate, which is set in the sample code to be larger for parallel training: lr = 1.0 * hvd.size(). hvd.size() is the number of MPI processes, which in my case is 3 for parallel training and 1 for one-machine training. I ran training in parallel on 3 machines, on Xavier only and on TX2 only until validation accuracy reached 0.99.

The graph below shows time and validation accuracy per epoch for each training.

Regarding epoch time (seconds per epoch) you can see, that:

  1. For parallel training one epoch time is about the same as epoch time on TX2 (the slowest type of machine).
  2. Epoch time on Xavier is the shortest – about 3 times shorter than on TX2.


Though accuracy gain per epoch for parallel training is higher than for training on one machine, one parallel training epoch time is equal to epoch time on the slowest of machines. That is because parallel training is done in a synchronous manner and after each epoch all workers share model weight gradients. That means the process on Xavier finishes one epoch earlier, but have to wait for the other two processes on TX2-s to finish.

All-in-all, unbalance in performance between machines makes parallel training even slower than on one faster machine – Xavier. That, however, may change if more machines were used in training making accuracy gain per epoch even higher.

You can learn more about parallel training from this thorough post Distributed TensorFlow using Horovod.