Validation on GPU Systems - NCCL over HPE ProLiant Gen10
Scenario 1: 2-GPU All-Reduce
Hardware Setup
2 GPUs connected in a ring
Each GPU has 6 NVLinks at 25 GB/s per link
NCCL ring algorithm
ASTRA-Sim setup
Modelled with the ASTRA-Sim Analytical Backend
Bidirectional Ring
Collectives run
All-Reduce
Reduction operation - Sum
Results
Geomean error rate = 11.4%
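To make concrete what an analytical model of a ring All-Reduce estimates, here is a minimal sketch of the standard latency-bandwidth cost model: 2(N-1) steps, each moving 1/N of the payload over one link. The function name, its parameters, and the placeholder values in the example call are assumptions for illustration only; this is not the ASTRA-Sim implementation, and the effective bandwidth actually used per ring is not stated here.

```python
def ring_all_reduce_time(size_bytes, num_gpus, link_bw_bytes_per_s,
                         link_latency_s=0.0, warmup_s=0.0):
    """Estimate ring All-Reduce time with a simple latency-bandwidth model.

    Assumed model (not taken from ASTRA-Sim): reduce-scatter followed by
    all-gather, i.e. 2 * (N - 1) steps, each sending size / N bytes.
    """
    if num_gpus < 2:
        raise ValueError("a ring needs at least 2 GPUs")
    steps = 2 * (num_gpus - 1)
    chunk_bytes = size_bytes / num_gpus
    per_step = link_latency_s + chunk_bytes / link_bw_bytes_per_s
    return warmup_s + steps * per_step


if __name__ == "__main__":
    # Hypothetical inputs: 64 MiB payload, 2 GPUs, 25 GB/s effective bandwidth.
    t = ring_all_reduce_time(64 * 1024 * 1024, 2, 25e9)
    print(f"estimated time: {t * 1e3:.3f} ms")
```

The latency and warm-up terms default to zero; in practice they would be filled in with empirically measured values.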
Scenario 2: 4-GPU All-Reduce
Hardware Setup
4 GPUs connected in a ring
Each GPU has 6 NVLinks at 25 GB/s per link
NCCL ring algorithm
ASTRA-Sim setup
Modelled with the ASTRA-Sim Analytical Backend
Bidirectional Ring
Collectives run
All-Reduce
Reduction operation - Sum
Results
Geomean error rate = 7.9%
Scenario 3: 8-GPU All-Reduce
Hardware Setup
8 GPUs connected in a hybrid cube mesh
Each GPU has 6 NVLinks at 25 GB/s per link
NCCL ring algorithm
ASTRA-Sim setup
Modelled with the ASTRA-Sim Analytical Backend
3 bidirectional rings
Collectives run
All-Reduce
Reduction operation - Sum
Results
Geomean error rate = 2.8%
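The geomean error rates reported for each scenario summarize the per-run gap between simulated and measured collective times. A minimal sketch of one way to compute such a figure follows. It assumes the per-run error is the relative difference |simulated - measured| / measured and that the geometric mean is taken over all runs (for example, one per message size); the exact definition used for the results above is not stated. The function name and the sample numbers are hypothetical.

```python
import math


def geomean_error(simulated, measured, floor=1e-12):
    """Geometric mean of per-run relative errors.

    Assumption: error_i = |sim_i - meas_i| / meas_i. A small floor keeps
    log() defined when a simulated value matches the measurement exactly.
    """
    if len(simulated) != len(measured) or not simulated:
        raise ValueError("inputs must be non-empty and of equal length")
    log_sum = 0.0
    for sim, meas in zip(simulated, measured):
        rel_err = abs(sim - meas) / meas
        log_sum += math.log(max(rel_err, floor))
    return math.exp(log_sum / len(simulated))


if __name__ == "__main__":
    # Hypothetical timings in microseconds, not real measurements.
    sim_times = [110.0, 95.0, 208.0]
    real_times = [100.0, 100.0, 200.0]
    print(f"geomean error: {geomean_error(sim_times, real_times) * 100:.2f}%")
```

A geometric mean is less sensitive than an arithmetic mean to a single run with a large error, which matters when message sizes span several orders of magnitude.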
Recommended practices
Empirically extract the warm-up latency by running small collectives
Empirically extract the practical link latency by first running small collectives while varying the number of NPUs/GPUs
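One way to carry out these two practices is a least-squares line fit, sketched below. It assumes that for very small messages the bandwidth term is negligible, so the measured ring All-Reduce time is roughly warm-up latency plus 2(N-1) times the per-link latency, where N is the number of GPUs. That model, the function names, and the sample timings are illustrative assumptions rather than a procedure prescribed by ASTRA-Sim or NCCL.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extract_latencies(gpu_counts, small_collective_times_s):
    """Return (warmup_latency_s, link_latency_s) from small-message runs.

    Assumed model: time = warmup + 2 * (N - 1) * link_latency.
    """
    steps = [2 * (n - 1) for n in gpu_counts]
    link_latency, warmup = fit_line(steps, small_collective_times_s)
    return warmup, link_latency


if __name__ == "__main__":
    # Hypothetical small-message All-Reduce timings for 2, 4 and 8 GPUs.
    warmup, link = extract_latencies([2, 4, 8], [14e-6, 22e-6, 38e-6])
    print(f"warm-up: {warmup * 1e6:.2f} us, link latency: {link * 1e6:.2f} us")
```

The intercept of the fit gives the warm-up latency and the slope gives the per-step link latency; both can then be fed back into the simulator configuration.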
For more information, contact Saeed Rashidi (saeed.rashidi@gatech.edu)