Validation on GPU Systems - NCCL over HPE ProLiant Gen10

Scenario 1: 2-GPU All-Reduce

Hardware Setup

2 GPUs connected in a ring
Each GPU has 6 NVLinks at 25 GB/s
NCCL ring algorithm

ASTRA-Sim setup

Modelled with the ASTRA-Sim Analytical Backend
Bidirectional Ring

Collectives run

All-Reduce
Reduction operation - Sum
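As a reference for what the collective computes, the following is a minimal functional sketch of a ring All-Reduce with a sum reduction, written in plain Python. It is not NCCL or ASTRA-Sim code: the function name, the unidirectional ring, and the list-based buffers are illustrative assumptions, and it models only data movement, not timing.

```python
def ring_all_reduce_sum(buffers):
    """Simulate a unidirectional ring All-Reduce (sum) over per-GPU buffers.

    buffers: list of equal-length lists, one per simulated GPU (hypothetical layout).
    Returns a new list of buffers, each holding the element-wise sum.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all buffers must have the same length")
    if length % n != 0:
        raise ValueError("buffer length must be divisible by the number of GPUs")

    chunk = length // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: n - 1 steps. In step s, rank r sends chunk
    # (r - s) mod n to its right neighbour, which accumulates it.
    for step in range(n - 1):
        # Snapshot the payloads first so all sends in a step are simultaneous.
        sends = []
        for rank in range(n):
            c = (rank - step) % n
            sends.append((rank, c, [data[rank][i] for i in span(c)]))
        for rank, c, payload in sends:
            dst = (rank + 1) % n
            for off, i in enumerate(span(c)):
                data[dst][i] += payload[off]

    # Rank r now holds the fully reduced chunk (r + 1) mod n.
    # Phase 2, all-gather: n - 1 steps. In step s, rank r forwards chunk
    # (r + 1 - s) mod n, which the right neighbour copies in place.
    for step in range(n - 1):
        sends = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            sends.append((rank, c, [data[rank][i] for i in span(c)]))
        for rank, c, payload in sends:
            dst = (rank + 1) % n
            for off, i in enumerate(span(c)):
                data[dst][i] = payload[off]

    return data
```

Each GPU sends and receives 2(N - 1) chunks of size S/N in total, which is why the ring algorithm is usually described as bandwidth-efficient for large messages.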

Results

Geomean error rate = 11.4%
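This page does not spell out how the geomean error rate is computed. One common reading is the geometric mean of the per-message-size relative errors between measured and simulated collective times. The sketch below follows that assumption; the function name and the sample values in the usage note are hypothetical and are not the measurements behind the numbers reported here.

```python
import math


def geomean_error(measured, simulated):
    """Geometric mean of relative errors |sim - meas| / meas.

    measured, simulated: equal-length sequences of positive times
    (one entry per message size). This is an assumed definition.
    """
    if len(measured) != len(simulated) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(m <= 0 for m in measured):
        raise ValueError("measured times must be positive")

    errors = [abs(s - m) / m for m, s in zip(measured, simulated)]

    # A single exact match makes the geometric mean zero; return early
    # because log(0) is undefined.
    if any(e == 0 for e in errors):
        return 0.0

    # exp(mean(log(e))) avoids overflow/underflow from a long product.
    return math.exp(sum(math.log(e) for e in errors) / len(errors))
```

For example, `geomean_error([100.0, 200.0], [110.0, 180.0])` evaluates two 10% errors and returns a value of approximately 0.1.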

Scenario 2: 4-GPU All-Reduce

Hardware Setup

4 GPUs connected in a ring
Each GPU has 6 NVLinks at 25 GB/s
NCCL ring algorithm

ASTRA-Sim setup

Modelled with the ASTRA-Sim Analytical Backend
Bidirectional Ring

Collectives run

All-Reduce
Reduction operation - Sum

Results

Geomean error rate = 7.9%

Scenario 3: 8-GPU All-Reduce

Hardware Setup

8 GPUs connected in a hybrid cube mesh
Each GPU has 6 NVLinks at 25 GB/s
NCCL ring algorithm

ASTRA-Sim setup

Modelled with the ASTRA-Sim Analytical Backend
3 Bidirectional Rings
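To give a feel for how the number of rings enters a first-order estimate, the sketch below uses the textbook latency-bandwidth (alpha-beta) cost model for a ring All-Reduce. It is not the ASTRA-Sim Analytical Backend implementation. The function name, its parameters, and the assumption that the payload is split evenly across independent rings are illustrative; how the 6 NVLinks map onto rings is not stated on this page and is left to the caller.

```python
def ring_all_reduce_time(size_bytes, num_gpus, bandwidth_bytes_per_s,
                         latency_s, parallel_rings=1):
    """First-order ring All-Reduce time estimate (alpha-beta model).

    size_bytes:            total All-Reduce payload per GPU
    num_gpus:              number of GPUs on each ring
    bandwidth_bytes_per_s: bandwidth available to ONE ring (assumption)
    latency_s:             per-step link latency
    parallel_rings:        rings working on disjoint slices of the payload
                           (a bidirectional ring can be counted as 2)
    """
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    if parallel_rings < 1 or bandwidth_bytes_per_s <= 0:
        raise ValueError("parallel_rings must be >= 1 and bandwidth positive")

    # Each ring handles an equal slice, cut into one chunk per GPU.
    chunk_bytes = size_bytes / parallel_rings / num_gpus

    # Reduce-scatter and all-gather take (N - 1) steps each.
    steps = 2 * (num_gpus - 1)

    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)
```

Under this model, adding rings shrinks only the bandwidth term; the latency term 2(N - 1) x latency is unchanged, so small collectives remain latency-bound regardless of the ring count.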

Collectives run

All-Reduce
Reduction operation - Sum

Results

Geomean error rate = 2.8%

Recommended practices

Empirically extract the warm-up latency by running small collectives
Empirically extract the practical link latency by first running small collectives while varying the number of NPUs/GPUs
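One way to carry out both extractions is a straight-line fit. The sketch below assumes that, for a payload small enough to make the bandwidth term negligible, a ring All-Reduce on N GPUs takes roughly warm-up + 2(N - 1) x link latency. Under that assumption, fitting measured times against the step count 2(N - 1) yields the warm-up latency as the intercept and the practical link latency as the slope. The function name and the model are illustrative, not a procedure prescribed by ASTRA-Sim.

```python
def fit_latencies(gpu_counts, small_collective_times):
    """Least-squares fit of time = warmup + steps * link_latency.

    gpu_counts:             GPU counts used in the runs, e.g. [2, 4, 8]
    small_collective_times: measured time of a small All-Reduce per run
    Returns (warmup_latency, link_latency) in the units of the input times.
    """
    if len(gpu_counts) != len(small_collective_times):
        raise ValueError("inputs must have the same length")

    # Assumed ring All-Reduce step count for N GPUs.
    steps = [2 * (n - 1) for n in gpu_counts]
    if len(set(steps)) < 2:
        raise ValueError("need runs with at least two distinct GPU counts")

    k = len(steps)
    mean_x = sum(steps) / k
    mean_y = sum(small_collective_times) / k

    # Ordinary least squares for a single predictor.
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(steps, small_collective_times))

    link_latency = sxy / sxx
    warmup_latency = mean_y - link_latency * mean_x
    return warmup_latency, link_latency
```

The fitted values can then be used as the latency inputs to the simulation. Repeating each measurement several times and fitting the medians should make the estimate less sensitive to run-to-run noise.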

For more information, contact Saeed Rashidi (saeed.rashidi@gatech.edu).
