File size: 330 Bytes
5fa1a76
 
 
 
1
2
3
4
An additional level of debug is to add NCCL_DEBUG=INFO environment variable as follows:

NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
This will dump a lot of NCCL-related debug information, which you can then search online if you find that some problems are reported.