Repository for Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models
Leader Stochastic Gradient Descent [LSGD Blog]

This is the code repository for the paper Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models.


Install PyTorch from the official website. It is one of the most popular deep learning frameworks.


  1. On the first GPU node, use the command ifconfig in a terminal to check its IP address;

  2. Open the bash file and fill in ip_addr with the IP address obtained in step 1, and num_groups with the total number of GPU nodes you have (i.e., num_groups=n);

  3. On the j-th GPU node, type bash [j-1] in a terminal (i.e., the index starts from 0) to run the code.
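The configuration in step 2 would look roughly like the following. This is a hypothetical excerpt, assuming the script uses the variable names given above; the actual layout of the repo's bash file may differ:

```shell
# Hypothetical configuration section of the launch script (values are examples)
ip_addr=192.168.1.10   # IP address of the first GPU node, from `ifconfig` (step 1)
num_groups=4           # total number of GPU nodes n (step 2)
```

Each node then runs the script with its zero-based group index as the argument (step 3).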

Possible Issues

  1. Type nvidia-smi in a terminal to check the NVIDIA driver and CUDA compatibility;

  2. Check that the PyTorch versions are consistent across the machines.


This page mainly explains the LSGD distributed training library I developed.

The general architecture is the following:

At the beginning of each iteration, the local worker sends a request to its local server, and the local server passes the worker's request on to the global server. The global server checks the current status and replies to the local server, which passes the global server's message back to the worker. Finally, depending on the message from the global server, the worker follows either the local training or the distributed training scheme.

More specifically, the communication between the local server and the workers is based on shared queues. At the beginning of each iteration, each worker sends a request to the local server and then waits for the answer.

The local server gets the answer from the global server and puts it into the queue shared with the workers. The worker then reads and follows the answer.

The advantage of using queues is that: (1) the number of requests the local server passes on to the global server is always equal to the number of requests raised by the workers (a shared tensor storing the total number of requests could give an inaccurate count, since it can be modified simultaneously by several workers), and (2) each worker blocks until it gets its answer from the local server.
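The request/answer pattern above can be sketched with standard Python queues. This is a minimal illustration, not the repo's actual code: each worker shares one request queue with the local server and has its own answer queue, and the blocking get() is what makes the worker wait for its answer.

```python
import threading
import queue

NUM_WORKERS = 4
request_q = queue.Queue()                                 # workers -> local server
answer_qs = [queue.Queue() for _ in range(NUM_WORKERS)]   # local server -> each worker

def local_server():
    """Serve one request per worker; each take from the queue is one forwarded request."""
    served = 0
    while served < NUM_WORKERS:
        wid = request_q.get()            # exactly one item per worker request
        # ... the real server would forward the request to the global server here ...
        answer_qs[wid].put("local")      # pretend the global server said: train locally
        served += 1

def worker(wid, results):
    request_q.put(wid)                   # send a request at the start of the iteration
    results[wid] = answer_qs[wid].get()  # block until the answer arrives

results = {}
server = threading.Thread(target=local_server)
server.start()
workers = [threading.Thread(target=worker, args=(i, results)) for i in range(NUM_WORKERS)]
for t in workers:
    t.start()
for t in workers:
    t.join()
server.join()
print(results)
```

Because get() blocks, a worker can never run ahead without an answer, and the server forwards exactly as many requests as the workers raised.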

For each local server, the global server creates a thread for listening and responding to it. All the threads share an iteration counter which stores the total number of iterations of all workers since the last communication. Again, the iteration counter is a queue rather than a plain number or a single-value tensor, since two or more threads could modify the value simultaneously. Although the Global Interpreter Lock should prevent such conflicts, in practice I found using a queue safer.
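The queue-backed counter can be sketched as follows (assumed semantics, not the repo's code): a thread pops the current count, updates it, and pushes it back, so the blocking get() gives each thread exclusive ownership of the value and updates cannot interleave.

```python
import threading
import queue

counter = queue.Queue()
counter.put(0)  # shared iteration count since the last communication

def listener(iterations_done):
    total = counter.get()              # take exclusive ownership of the count
    counter.put(total + iterations_done)  # put the updated count back

# 100 listener threads each report one completed iteration.
threads = [threading.Thread(target=listener, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total_iterations = counter.get()
print(total_iterations)  # 100
```

With a plain shared integer, two threads could both read the same value and one increment would be lost; the queue serializes the read-modify-write.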

Each Node/Cluster/DevBox typically provides 4/8/16 GPUs (workers). We call all GPUs (workers) in the same Node/Cluster/DevBox a group.


Hyperparameter Selection

For learning rates and pulling strengths, within a reasonable range, smaller values usually yield a lower final test error while larger values encourage faster convergence.

The communication period depends on the network conditions. Our lab uses Ethernet, which is usually slower than InfiniBand. The data transfer speed can be checked with the network tool iperf.

Batch Normalization

One dilemma in distributed training is how to handle the running mean and running variance of batch normalization. To our knowledge, there is so far no perfect way to synchronize the running mean and running variance, especially under asynchronous communication. A common option is to skip the running statistics when synchronizing, so that each model (worker) keeps its own. In our implementation, we average both the running mean and the running variance across all workers every communication period. (As a side note, in PyTorch 1.0, synchronizing empty tensors causes an error, so make sure to comment out the parts that synchronize the buffers.)
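The averaging step can be sketched as below. This is a simplified illustration with NumPy arrays standing in for the PyTorch BatchNorm buffers; the dictionary layout and function name are assumptions, not the repo's actual API:

```python
import numpy as np

def average_bn_buffers(workers):
    """Replace each worker's running_mean / running_var with the group average.

    workers: list of dicts, each holding 'running_mean' and 'running_var' arrays
    (stand-ins for a BatchNorm layer's buffers).
    """
    for key in ("running_mean", "running_var"):
        avg = np.mean([w[key] for w in workers], axis=0)
        for w in workers:
            w[key] = avg.copy()  # every worker now holds the averaged statistics

workers = [
    {"running_mean": np.array([0.0, 2.0]), "running_var": np.array([1.0, 1.0])},
    {"running_mean": np.array([2.0, 4.0]), "running_var": np.array([3.0, 1.0])},
]
average_bn_buffers(workers)
print(workers[0]["running_mean"])  # [1. 3.]
```

In the real implementation this average would be computed once per communication period, alongside the synchronization of the model parameters.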


The code was developed when PyTorch first officially released its distributed training library. There may be issues I haven't yet discovered, due to the limits of my knowledge. Feedback and questions are welcome!
