.. _distributed:

Training a model across multiple GPUs
=====================================

Before getting started, please read the :ref:`Usage Guide <usage>` and
the `PyTorch Distributed documentation <https://pytorch.org/docs/stable/distributed.html>`_.

How distributed training works
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With a single GPU and model, training follows this process:

1. Compute the loss from a minibatch of data (the forward pass of the model).
2. Backpropagate that loss through the model to compute gradients (the backward pass of the model).
3. Sum the gradients across the minibatch.
4. Run the optimizer by calling its ``step()`` function.

Now assume we have several GPUs, each holding exactly the same model.
Steps 1 and 2 are `embarrassingly parallel <https://en.wikipedia.org/wiki/Embarrassingly_parallel>`_ and can be distributed to many nodes.
As long as we can sum the gradients across nodes to complete step 3 (this is known as an "all-reduce"), each node can run step 4 on the same gradients,
and the resulting models will again be identical. This is the premise behind distributed training.
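The four steps above can be simulated without any GPUs. The following is a toy sketch in plain Python, not ReAgent's implementation: the one-parameter model, the ``all_reduce_sum`` helper, and the data are all assumptions made for illustration. It shows that when every worker applies the same summed gradient, the model copies stay identical and match single-worker training on the full minibatch.

```python
"""Toy simulation of data-parallel training with an all-reduce (no GPUs needed)."""


def gradient(w, batch):
    # Gradient of the summed squared error sum((w * x - y) ** 2) with respect to w.
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def all_reduce_sum(values):
    # Stand-in for a real all-reduce: every worker receives the same total.
    total = sum(values)
    return [total for _ in values]


def train_step(weights, shards, lr):
    # Steps 1-2: each worker computes gradients on its own shard, independently.
    local_grads = [gradient(w, shard) for w, shard in zip(weights, shards)]
    # Step 3: sum the gradients across workers.
    summed = all_reduce_sum(local_grads)
    # Step 4: each worker applies the same update to its own model copy.
    return [w - lr * g for w, g in zip(weights, summed)]


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[:2], data[2:]]  # two workers, half of the minibatch each
weights = [0.0, 0.0]           # identical initial models

for _ in range(50):
    weights = train_step(weights, shards, lr=0.01)

# Reference run: one worker that sees the whole minibatch.
single = 0.0
for _ in range(50):
    single = single - 0.01 * gradient(single, data)
```

In a real setup the summation is performed by the communication backend described in the PyTorch Distributed documentation rather than by a Python list comprehension.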

Training on a single node
~~~~~~~~~~~~~~~~~~~~~~~~~

Using multiple GPUs on a single node is relatively straightforward. When running either the ``dqn_workflow`` or the ``parametric_dqn_workflow``,
set the ``use_all_avail_gpus`` parameter in the input config (the JSON file) to ``true``. ReAgent will detect the number of available GPUs and
run on all of them without any additional effort.
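As a rough sketch, the flag can also be switched on programmatically before launching a workflow. The helper name ``enable_all_gpus``, the file name ``my_config.json``, and the assumption that ``use_all_avail_gpus`` is a top-level key are all hypothetical; check the layout of your own config file.

```python
import json
import tempfile
from pathlib import Path


def enable_all_gpus(config_path):
    """Set use_all_avail_gpus to true in a JSON config file and return the config."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    # ASSUMPTION: the parameter is a top-level key; check your own config layout.
    config["use_all_avail_gpus"] = True
    path.write_text(json.dumps(config, indent=2))
    return config


# Demo on a throwaway file; "my_config.json" and its contents are hypothetical.
demo_path = Path(tempfile.mkdtemp()) / "my_config.json"
demo_path.write_text(json.dumps({"use_all_avail_gpus": False}))
updated = enable_all_gpus(demo_path)
```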

Training on multiple nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~

Multi-node training requires more setup. Some prerequisites:

1. A networked filesystem (such as NFS) that all trainers can access.
2. A local filesystem that is unique to each trainer.

To set up the environment for *N* nodes, do the following:

1. Create an empty file on the networked filesystem, which the *N* machines will use to communicate.
2. Either split the dataset into *N* equal parts, **or** shuffle the dataset to create *N* shuffled copies.
3. Put one part (or copy) of the dataset onto each machine's local filesystem at exactly the same path.
4. Set the ``use_all_avail_gpus`` parameter to ``true``, as above.
5. Also set the ``num_nodes`` parameter to *N*.
6. On each machine, run the workflow with the ``--node_index=n`` flag, where ``n`` is the index of that machine.
7. The machine with ``--node_index=0`` will save the final model to the specified output path.
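The first three setup steps could be scripted along these lines. This is a minimal sketch under stated assumptions: the dataset is treated as a list of line-oriented examples, a temporary directory stands in for both the networked and the local filesystems, and the names ``split_dataset``, ``init_file``, and ``part_<i>.txt`` are hypothetical rather than anything ReAgent requires.

```python
import tempfile
from pathlib import Path


def split_dataset(lines, num_nodes):
    """Split a list of examples into num_nodes contiguous, near-equal parts."""
    base, extra = divmod(len(lines), num_nodes)
    parts, start = [], 0
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first parts
        parts.append(lines[start:start + size])
        start += size
    return parts


num_nodes = 4
work_dir = Path(tempfile.mkdtemp())  # stand-in for the real filesystems

# Step 1: an empty file; in practice it lives on the networked filesystem.
init_file = work_dir / "init_file"
init_file.touch()

# Step 2: split a hypothetical line-oriented dataset into N parts.
dataset = [f"example_{i}" for i in range(10)]
parts = split_dataset(dataset, num_nodes)

# Step 3 stand-in: write one part per node; on real machines each part would be
# copied to the same local path on its own node.
for node_index, part in enumerate(parts):
    (work_dir / f"part_{node_index}.txt").write_text("\n".join(part) + "\n")
```

When the dataset size is not a multiple of *N*, this sketch gives the first few parts one extra example each, so the parts are only approximately equal.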