Validation on GPU Systems - NCCL over HPE ProLiant Gen10
Scenario 1: 2-GPU All-Reduce
Hardware Setup
- 2 GPUs connected in a ring
- Each GPU has 6 NVLink links at 25 GB/s each
- NCCL ring algorithm
ASTRA-Sim setup
- Modelled with the ASTRA-Sim Analytical Backend 
- Bidirectional Ring 
Collectives run
- All-Reduce 
- Reduction operation - Sum 
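As a rough point of reference for the setup above, the textbook alpha-beta cost model for a ring All-Reduce can be sketched as follows. This is a minimal sketch, not the ASTRA-Sim Analytical Backend's actual implementation. The function name, the 1 us per-step latency, the 64 MiB message size, and the choice to treat the 6 links as a single 150 GB/s pipe are all illustrative assumptions.

```python
def ring_allreduce_time(message_bytes, num_gpus, bandwidth_bytes_per_s, step_latency_s):
    """Estimate ring All-Reduce time with the alpha-beta model.

    A ring All-Reduce is a reduce-scatter followed by an all-gather.
    Each phase takes (num_gpus - 1) steps, and every step moves a chunk
    of message_bytes / num_gpus over one link.
    """
    if num_gpus < 2:
        return 0.0  # nothing to communicate with a single GPU
    steps = 2 * (num_gpus - 1)
    chunk_bytes = message_bytes / num_gpus
    return steps * (step_latency_s + chunk_bytes / bandwidth_bytes_per_s)


# Hypothetical inputs (assumptions, not measured values):
aggregate_bandwidth = 6 * 25e9   # 6 links x 25 GB/s treated as one pipe
step_latency = 1e-6              # assumed 1 us per step
estimate_s = ring_allreduce_time(64 * 2**20, 2, aggregate_bandwidth, step_latency)
print(f"Estimated 2-GPU All-Reduce time: {estimate_s * 1e6:.1f} us")
```

The latency term dominates for small messages and the bandwidth term dominates for large ones, which is why the latency parameters need to be calibrated separately.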
Results

Geomean error rate = 11.4%
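The page does not state how the geomean error rate is computed. One plausible reading, sketched here as an assumption, is the geometric mean of the per-message-size relative errors between simulated and measured collective times. The function name and the error definition |simulated - measured| / measured are hypothetical.

```python
import math


def geomean_error_rate(simulated_times, measured_times):
    """Geometric mean of relative errors |sim - meas| / meas.

    Assumed definition; the source does not spell out the formula.
    """
    if not simulated_times or len(simulated_times) != len(measured_times):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(s - m) / m for s, m in zip(simulated_times, measured_times)]
    if any(e == 0.0 for e in errors):
        # log(0) is undefined, so an exact match cannot enter a geometric mean
        raise ValueError("geometric mean is undefined when an error is zero")
    return math.exp(sum(math.log(e) for e in errors) / len(errors))
```

A geometric mean is less sensitive than an arithmetic mean to a single message size with an unusually large error.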
Scenario 2: 4-GPU All-Reduce
Hardware Setup
- 4 GPUs connected in a ring
- Each GPU has 6 NVLink links at 25 GB/s each
- NCCL ring algorithm
ASTRA-Sim setup
- Modelled with the ASTRA-Sim Analytical Backend 
- Bidirectional Ring 
Collectives run
- All-Reduce 
- Reduction operation - Sum 
Results

Geomean error rate = 7.9%
Scenario 3: 8-GPU All-Reduce
Hardware Setup
- 8 GPUs connected in a hybrid cube mesh
- Each GPU has 6 NVLink links at 25 GB/s each
- NCCL ring algorithm
ASTRA-Sim setup
- Modelled with the ASTRA-Sim Analytical Backend 
- 3 bidirectional rings
Collectives run
- All-Reduce 
- Reduction operation - Sum 
Results

Geomean error rate = 2.8%
Recommended practices
- Empirically extract the warm-up latency by running small collectives
- Empirically extract the practical link latency by running small collectives while varying the number of NPUs/GPUs
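One way to apply both practices is a straight-line fit, sketched below under stated assumptions. For small messages the bandwidth term is negligible, so the measured time is modelled as warm-up latency plus per-step link latency times the number of ring steps, 2 * (N - 1). The function name and this model are illustrative and are not necessarily the procedure the authors used.

```python
def fit_latencies(num_gpus_list, small_collective_times_s):
    """Least-squares fit of time = warm_up + link_latency * ring_steps.

    Assumes a ring algorithm with 2 * (N - 1) steps and messages small
    enough that transfer time is latency dominated.
    Returns (warm_up_latency_s, link_latency_s).
    """
    steps = [2 * (n - 1) for n in num_gpus_list]
    count = len(steps)
    if count < 2 or count != len(small_collective_times_s):
        raise ValueError("need at least two matched (GPU count, time) samples")
    mean_steps = sum(steps) / count
    mean_time = sum(small_collective_times_s) / count
    spread = sum((x - mean_steps) ** 2 for x in steps)
    if spread == 0:
        raise ValueError("GPU counts must differ to separate the two latencies")
    covariance = sum(
        (x - mean_steps) * (y - mean_time)
        for x, y in zip(steps, small_collective_times_s)
    )
    link_latency = covariance / spread                   # slope of the fit
    warm_up_latency = mean_time - link_latency * mean_steps  # intercept
    return warm_up_latency, link_latency
```

The intercept captures fixed per-collective overhead, and the slope captures the cost added by each extra hop. Both values can then be supplied to the simulator's network configuration.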
For more information, contact Saeed Rashidi (saeed.rashidi@gatech.edu).