Validation on GPU Systems - NCCL over HGX-H100 Systems

HGX-H100 Topology

[Figure: HGX-H100 topology]

Scenario 1: 2-GPU All-Reduce

Hardware Setup

  1. 2 H100 GPUs connected in a switch topology through 4 NVSwitches

  2. Each GPU has a bidirectional bandwidth of 900 GB/s

  3. NCCL Ring Algorithm
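
The hardware parameters above can be turned into a rough first-order estimate of ring All-Reduce time. The sketch below uses the textbook alpha-beta cost model for a ring and is not ASTRA-Sim's or NCCL's actual implementation. The function name, the `step_latency_us` parameter, and the 450 GB/s per-direction figure (half of the 900 GB/s bidirectional bandwidth) are illustrative assumptions.

```python
def ring_allreduce_time_us(message_bytes, num_gpus, link_bw_gb_per_s,
                           step_latency_us=0.0):
    """Estimate ring All-Reduce time in microseconds (alpha-beta model).

    Assumption: each of the 2 * (N - 1) ring steps moves one chunk of
    message_bytes / N over a link with the given per-direction bandwidth.
    """
    steps = 2 * (num_gpus - 1)              # reduce-scatter + all-gather
    chunk_bytes = message_bytes / num_gpus  # data sent per step per GPU
    bytes_per_us = link_bw_gb_per_s * 1e3   # 1 GB/s = 1e3 bytes per microsecond
    return steps * (step_latency_us + chunk_bytes / bytes_per_us)


# Assumed per-direction bandwidth: half of the 900 GB/s bidirectional figure.
PER_DIRECTION_BW_GB_PER_S = 450.0

for n in (2, 4, 8):
    t = ring_allreduce_time_us(900e6, n, PER_DIRECTION_BW_GB_PER_S)
    print(f"{n} GPUs, 900 MB message: {t:.1f} us")
```

With zero per-step latency the estimate depends on the GPU count only through the factor 2(N - 1)/N, so it approaches twice the single-transfer time as N grows.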

ASTRA-Sim setup

  1. Modelled with the ASTRA-Sim Analytical Backend

  2. Switch Network Topology

Collectives run

  1. All-Reduce

  2. Reduction operation - Sum
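
For readers unfamiliar with the collective, the following is a minimal pure-Python sketch of a ring All-Reduce with a sum reduction. It models the data movement only (a reduce-scatter phase followed by an all-gather phase) and is not NCCL's implementation. The function name and the list-of-lists representation of per-GPU buffers are assumptions made for illustration.

```python
def ring_allreduce_sum(buffers):
    """Simulate a ring All-Reduce (sum) over equal-length per-GPU lists."""
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    size = length // n
    data = [list(b) for b in buffers]  # work on copies, one list per GPU

    def chunk(c):
        return range(c * size, (c + 1) * size)

    # Reduce-scatter: in step s, rank r sends chunk (r - s) mod n to rank r + 1,
    # which adds it to its own copy. All sends in a step are captured first so
    # they behave as if they happened simultaneously.
    for s in range(n - 1):
        sends = [((r + 1) % n, (r - s) % n) for r in range(n)]
        payloads = [[data[(dst - 1) % n][i] for i in chunk(c)] for dst, c in sends]
        for (dst, c), payload in zip(sends, payloads):
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # Rank r now holds the fully reduced chunk (r + 1) mod n.
    # All-gather: in step s, rank r forwards chunk (r + 1 - s) mod n to rank r + 1,
    # which overwrites its copy with the reduced values.
    for s in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[(dst - 1) % n][i] for i in chunk(c)] for dst, c in sends]
        for (dst, c), payload in zip(sends, payloads):
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v
    return data


result = ring_allreduce_sum([[1, 2], [10, 20]])
print(result)
```

Every rank ends with the element-wise sum of all input buffers, after 2(N - 1) communication steps in total.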

Results

[Figure: 2-GPU All-Reduce results]

Geomean error rate = 20.63%
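
The source does not spell out how the geomean error rate is computed. One plausible reading, sketched below, takes the absolute percentage error between the measured and simulated time for each message size and reports the geometric mean of those errors. The function name and the sample values are hypothetical and are not the data behind the figure above.

```python
import math


def geomean_error_pct(measured, simulated):
    """Geometric mean of per-point absolute percentage errors.

    Assumption: error_i = |simulated_i - measured_i| / measured_i * 100.
    A zero error would make the geometric mean zero, so it is rejected here.
    """
    if len(measured) != len(simulated) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(s - m) / m * 100.0 for m, s in zip(measured, simulated)]
    if any(e == 0.0 for e in errors):
        raise ValueError("geometric mean is undefined for zero errors in this sketch")
    return math.exp(sum(math.log(e) for e in errors) / len(errors))


# Hypothetical timings (microseconds) for two message sizes.
measured_us = [100.0, 200.0]
simulated_us = [110.0, 160.0]
print(f"Geomean error rate = {geomean_error_pct(measured_us, simulated_us):.2f}%")
```

A geometric mean is less dominated by a single large outlier than an arithmetic mean, which makes it a common choice when errors span a wide range of message sizes.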

Scenario 2: 4-GPU All-Reduce

Hardware Setup

  1. 4 H100 GPUs connected in a switch topology through 4 NVSwitches

  2. Each GPU has a bidirectional bandwidth of 900 GB/s

  3. NCCL Ring Algorithm

ASTRA-Sim setup

  1. Modelled with the ASTRA-Sim Analytical Backend

  2. Switch Network Topology

Collectives run

  1. All-Reduce

  2. Reduction operation - Sum

Results

[Figure: 4-GPU All-Reduce results]

Geomean error rate = 12.01%

Scenario 3: 8-GPU All-Reduce

Hardware Setup

  1. 8 H100 GPUs connected in a switch topology through 4 NVSwitches

  2. Each GPU has a bidirectional bandwidth of 900 GB/s

  3. NCCL Ring Algorithm

ASTRA-Sim setup

  1. Modelled with the ASTRA-Sim Analytical Backend

  2. Switch Network Topology

Collectives run

  1. All-Reduce

  2. Reduction operation - Sum

Results

[Figure: 8-GPU All-Reduce results]

Geomean error rate = 9.69%