
Learning path

HPC networking from hardware to routed AI fabrics

Browse the full learning path without an account. Sign in to open chapters and labs, sync your progress, and join the technical discussion.

Part 1 - Foundations

No prerequisites - Ch 0-2

Part 2 - Fabric Operations

Requires Part 1 - Ch 3-8

Chapter 3: The CLI - Reading the Fabric

45 min · Sign in required
ibstat · show dcb pfc · ethtool · Diagnostic workflow

The commands and discipline for reading HPC fabric state. Which commands run where, how to read their output, and the investigation workflow from physical layer to configuration.

Chapter 4: InfiniBand Operations - ONYX CLI and Fabric Management

50 min · Sign in required
ONYX · ibdiagnet · UFM · Subnet Manager · Error counters

The InfiniBand operations layer: ONYX CLI, error counter interpretation, Subnet Manager management, ibdiagnet fabric sweep, and UFM event correlation.

Chapter 5: PFC, ECN, and Congestion Control

55 min · Sign in required
PFC · ECN · DCQCN · Pause storm · Congestion

How losslessness actually works: PFC mechanics at the wire level, pause storm formation, ECN CE bit marking, DCQCN rate control algorithm, and the complete RoCEv2 port configuration checklist.
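The DCQCN rate-control loop this chapter covers can be sketched in a few lines. This is a minimal, illustrative model of the sender-side update rules from the DCQCN paper (cut the rate on a CNP, decay alpha, recover toward a remembered target rate); the gain constant and timer cadence here are assumptions, not vendor defaults.

```python
# Minimal sketch of DCQCN sender-side rate control. Constants are
# illustrative, not NIC firmware defaults.
G = 1 / 256          # alpha averaging gain (assumed value)

def on_cnp(rc, rt, alpha):
    """CNP received: remember the current rate as target, then cut it."""
    rt = rc
    rc = rc * (1 - alpha / 2)
    alpha = (1 - G) * alpha + G
    return rc, rt, alpha

def on_timer(rc, rt, alpha):
    """No CNP this period: decay alpha, recover toward the target rate."""
    alpha = (1 - G) * alpha
    rc = (rc + rt) / 2            # fast-recovery step
    return rc, rt, alpha

rc, rt, alpha = 100.0, 100.0, 1.0     # Gbit/s; alpha=1 means heavy congestion
rc, rt, alpha = on_cnp(rc, rt, alpha)  # ECN CE marks arrive back as a CNP
print(round(rc, 1))                    # 50.0 -> rate halved
for _ in range(4):                     # congestion clears, CNPs stop
    rc, rt, alpha = on_timer(rc, rt, alpha)
print(round(rc, 1))                    # 96.9 -> recovering toward 100
```

The point of the sketch is the shape of the curve: a multiplicative cut proportional to how congested the path has recently been, then geometric recovery toward the pre-cut rate.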

Chapter 6: Efficient Load Balancing

50 min · Sign in required
ECMP · DLB · GLB · RSHP · Flowlets · In-cast · Elephant flows

Why AI traffic is structurally low-entropy and how that breaks ECMP. The four load balancing modes (SLB, DLB, GLB, sDLB). Per-packet spraying and RSHP. In-cast congestion patterns and how to diagnose them from spine utilisation counters.
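A toy model makes the low-entropy problem concrete. The sketch below stands in for the switch's ECMP hash with CRC32 (real ASICs use their own hash functions), and the addresses and ports are invented: with only a handful of long-lived RDMA flows and four uplinks, per-link load is decided by hash luck rather than by flow volume.

```python
# Sketch: static ECMP picks the uplink as hash(5-tuple) mod n_links.
# CRC32 stands in for the ASIC hash; addresses/ports are made up.
import zlib
from collections import Counter

N_UPLINKS = 4

def ecmp_link(src, dst, sport, dport, proto="udp"):
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return zlib.crc32(key) % N_UPLINKS

# Eight GPU-to-GPU flows: typical collective traffic, one elephant each.
flows = [("10.0.0.1", f"10.0.1.{i}", 49152, 4791) for i in range(8)]
load = Counter(ecmp_link(*f) for f in flows)
print(dict(load))  # with so few flows, an even 2/2/2/2 split is unlikely
```

With thousands of short flows the law of large numbers evens this out; with eight elephants it usually does not, which is why the chapter's dynamic and spray-based modes exist.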

Chapter 7: Topology Design

60 min · Sign in required
Fat-tree · BasePOD · SuperPOD · Oversubscription · ROD · RUD · Cabling

How AI fabric scales from one switch to a SuperPOD. Fat-tree topology math, bisection bandwidth, BasePOD vs SuperPOD reference designs, oversubscription calculations, ROD vs RUD wiring, switch buffer selection, and cabling constraints.
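The oversubscription arithmetic reduces to one division per tier: downlink capacity over uplink capacity, where 1:1 is non-blocking. A worked example for a single leaf, with illustrative port counts rather than any vendor's reference design:

```python
# Oversubscription ratio for one leaf switch (illustrative numbers):
# 32 x 400G host-facing ports down, 8 x 800G uplinks toward the spines.
down = 32 * 400    # Gbit/s toward GPUs/NICs
up = 8 * 800       # Gbit/s toward spines
ratio = down / up
print(f"{ratio:.1f}:1")  # 2.0:1 -> oversubscribed, not non-blocking
```

A 2:1 leaf like this can be acceptable for storage or management tiers; AI compute fabrics are normally designed at 1:1 so that AllReduce traffic never queues at the leaf-to-spine boundary.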

Chapter 8: NCCL - The Application Layer

55 min · Sign in required
NCCL · AllReduce · busbw · nccl-tests · DCQCN tuning

How NCCL translates AllReduce into RDMA operations. Ring vs Tree vs Double-Binary Tree algorithms. The environment variables that determine whether NCCL finds RDMA or falls back to TCP. Reading nccl-tests output. Correlating busbw degradation to fabric diagnostics.
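The busbw figure that nccl-tests reports is algorithm bandwidth scaled by a collective-specific correction factor; for AllReduce that factor is 2(n-1)/n, per the nccl-tests performance notes. A quick sketch:

```python
# busbw from algbw for AllReduce: each of n ranks must send and receive
# 2*(n-1)/n of the data, so busbw = algbw * 2*(n-1)/n (nccl-tests docs).
def allreduce_busbw(algbw_gbs, n_ranks):
    return algbw_gbs * 2 * (n_ranks - 1) / n_ranks

print(allreduce_busbw(100.0, 8))   # 175.0 GB/s on the wire
```

This is why busbw, not algbw, is the number to compare against NIC line rate when deciding whether the fabric or the application is the bottleneck.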

Part 3 - Physical Layer and Infrastructure

Requires Part 2 - Ch 9-11

Part 4 - Scale and Architecture

Requires Part 3 - Ch 12-16

Labs

Scenario-based simulator work. Sign in to launch a lab and keep your troubleshooting progress attached to one account.

Lab 0: Identify the failed rail

12 min · Sign in required
CLI · Physical diagnostics · Rail state

A GPU rail has gone dark. Use the topology map and CLI tools to identify which rail, confirm the RDMA state, and isolate whether the fault is on the NIC or switch side.

Lab 1: Fix the PFC misconfiguration

10 min · Sign in required
PFC · RoCEv2 · CLI

A RoCEv2 workload is experiencing retransmissions. PFC is misconfigured. Diagnose and fix it.

Lab 2: Diagnose fabric congestion

15 min · Sign in required
Congestion · ECN · Counters

Throughput has dropped 40%. Investigate using interface counters and ECN configuration commands.

Lab 3: Diagnose uneven spine utilisation

15 min · Sign in required
ECMP · Load balancing · Spine counters

AllReduce throughput has fallen, yet no packet drops are visible. Diagnose why the spine links are unevenly loaded and fix load balancing to restore full training throughput.

Lab 4: Evaluate topology proposals

15 min · Sign in required
Oversubscription · Capacity planning · Topology

Two vendors propose different switch configurations for a 64-node DGX H100 cluster. Calculate oversubscription ratios, identify which proposal meets requirements, and submit a recommendation before the purchase order is signed.

Lab 5: Diagnose NCCL transport fallback

20 min · Sign in required
NCCL · Transport fallback · Env vars

A 16-node cluster shows 3 GB/s busbw instead of expected performance. All hardware is healthy. Diagnose why NCCL is using socket transport and fix the environment variable misconfiguration.

Lab 6: Triage a silent fabric degradation

25 min · Sign in required
Monitoring · DCGM · UFM · Telemetry

Training is 12% slower - but no hard errors anywhere. Use UFM port counters, DCGM GPU metrics, and switch counters to correlate a rising pre-FEC BER across three monitoring layers and identify the marginal physical connector before it becomes a full link failure.

Lab 7: Uncover the hidden pause storm

15 min · Sign in required
PFC · Pause storm · ECN

A switch looks healthy at a glance, but the NIC reveals a severe pause storm. Check both ends, identify missing ECN, and restore rate control before continuous PFC pauses collapse throughput.

Lab 8: Fix the PFC priority mismatch

18 min · Sign in required
PFC · Priority mapping · RoCE

PFC is enabled, but on the wrong traffic class. Cross-check NIC drops, PFC priority output, and RoCE DSCP-to-priority mapping, then move PFC protection back to priority 3.

Lab 9: Recover the err-disabled rail

20 min · Sign in required
Optics · Err-disable · Rail recovery

A rail has gone err-disabled after a physical fault. Recognise the NIC-side active-state trap, confirm the switch port failure, replace the optic, clear err-disable, and verify full recovery.

Lab 10: ECMP hotspot - BGP bandwidth community

22 min · Sign in required
BGP · ECMP · Weighted paths

A reduced-capacity spine is still receiving equal ECMP traffic. Identify the missing BGP Link Bandwidth community and restore weighted ECMP before PFC storms spread.

Lab 11: BGP suboptimal routing - spine ASN design

24 min · Sign in required
BGP · Failover · ASN design

A link failure triggers a bad 3-hop path because the spines use different ASNs. Trace the suboptimal route, unify the spine ASN design, and verify clean failover behavior.