Learning path
HPC networking from hardware to routed AI fabrics
The full learning path is browsable without an account. Sign in to open chapters and labs, sync your progress, and join the technical discussion.
Part 1 - Foundations
No prerequisites - Ch 0-2
The Hardware Story
40 min · Physical layer orientation. What an HCA is, why NICs became DPUs, how a DGX node is wired, and the three separate networks.
Operating Systems and Management Platforms
35 min · What runs on every device, how you access it after power-on, the management philosophy, CLI vs orchestrated management, and the first power-on sequence.
Why HPC Networking Is Different
25 min · The AllReduce barrier, why TCP fails, tail latency math, and the mental model shift from enterprise to AI networking.
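The tail latency math this chapter refers to can be sketched in a few lines. This is an illustrative model (the function name and numbers are ours, not the course's): a synchronous AllReduce step finishes only when the slowest of N participants finishes, so a rare per-link delay becomes a near-certain per-step delay at scale.

```python
# Illustrative tail-latency model for a synchronous collective: the step
# is delayed if ANY of the n_links participants hits a slow event.

def p_step_delayed(p_slow: float, n_links: int) -> float:
    """Probability that at least one of n_links hits a slow event."""
    return 1.0 - (1.0 - p_slow) ** n_links

# A 0.1% per-link chance of a slow event, across 1024 links:
print(round(p_step_delayed(0.001, 1024), 3))  # → 0.641
```

A fault that is negligible on one link (0.1%) delays roughly two out of every three training steps at 1024 links, which is why enterprise-grade "rare" errors are unacceptable in an AI fabric.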
Part 2 - Fabric Operations
Requires Part 1 - Ch 3-8
The CLI - Reading the Fabric
45 min · The commands and discipline for reading HPC fabric state. Which commands run where, how to read their output, and the investigation workflow from physical layer to configuration.
InfiniBand Operations - ONYX CLI and Fabric Management
50 min · The InfiniBand operations layer: ONYX CLI, error counter interpretation, Subnet Manager administration, ibdiagnet fabric sweeps, and UFM event correlation.
PFC, ECN, and Congestion Control
55 min · How losslessness actually works: PFC mechanics at the wire level, pause storm formation, ECN CE-bit marking, the DCQCN rate control algorithm, and the complete RoCEv2 port configuration checklist.
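The ECN marking behaviour this chapter covers follows a RED-style curve, which is simple enough to sketch. The thresholds below are illustrative placeholders, not vendor defaults: below k_min no packets are CE-marked, above k_max every packet is marked, and the probability ramps linearly in between.

```python
# Illustrative RED-style ECN marking curve for a RoCEv2 switch queue.
# k_min/k_max are hypothetical thresholds in KB of queue depth.

def ecn_mark_prob(queue_kb: float, k_min: float = 150.0,
                  k_max: float = 1500.0, p_max: float = 1.0) -> float:
    if queue_kb <= k_min:
        return 0.0          # queue shallow: no CE marks
    if queue_kb >= k_max:
        return 1.0          # queue saturated: mark everything
    # Linear ramp between the two thresholds
    return p_max * (queue_kb - k_min) / (k_max - k_min)

print(ecn_mark_prob(825.0))  # midpoint of the ramp → 0.5
```

DCQCN then reacts to the rate of CE-marked packets the receiver echoes back, which is why the chapter treats the marking thresholds and the rate-control algorithm as one tuning problem.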
Efficient Load Balancing
50 min · Why AI traffic is structurally low-entropy and how that breaks ECMP. The four load balancing modes (SLB, DLB, GLB, sDLB). Per-packet spraying and RSHP. Incast congestion patterns and how to diagnose them from spine utilisation counters.
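The low-entropy problem can be demonstrated with a toy hash. The helper below is a stand-in for a switch's 5-tuple ECMP hash (the function and addresses are hypothetical): with only a handful of long-lived RDMA flows as input, collisions on a few uplinks are the norm rather than the exception.

```python
# Toy demonstration of ECMP collisions under low-entropy AI traffic:
# eight elephant flows hashed across four uplinks rarely land 2-2-2-2.

import collections
import zlib

UPLINKS = 4

def ecmp_pick(src: str, dst: str, sport: int, dport: int) -> int:
    # Stand-in for a switch's deterministic 5-tuple hash.
    key = f"{src}|{dst}|{sport}|{dport}".encode()
    return zlib.crc32(key) % UPLINKS

# Eight long-lived RDMA flows (UDP 4791 = RoCEv2), not thousands of
# short web flows -- the hash has almost nothing to spread.
flows = [(f"10.0.0.{i}", "10.0.1.1", 4791, 4791) for i in range(8)]
load = collections.Counter(ecmp_pick(*f) for f in flows)
print(dict(load))  # typically uneven: some uplinks carry several flows
```

With thousands of flows the law of large numbers evens this out; with eight elephants it does not, which is the motivation for DLB, per-packet spraying, and the other modes the chapter compares.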
Topology Design
60 min · How an AI fabric scales from one switch to a SuperPOD: fat-tree topology math, bisection bandwidth, BasePOD vs SuperPOD reference designs, oversubscription calculations, ROD vs RUD wiring, switch buffer selection, and cabling constraints.
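The oversubscription calculation the chapter names is a single ratio: downlink bandwidth toward servers divided by uplink bandwidth toward spines. A minimal sketch (port counts here are an example, not a reference design):

```python
# Leaf-switch oversubscription ratio: server-facing bandwidth divided
# by spine-facing bandwidth. 1.0 means a non-blocking leaf.

def oversubscription(down_ports: int, down_gbps: int,
                     up_ports: int, up_gbps: int) -> float:
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Example: 32 x 400G down to nodes, 16 x 400G up to spines -> 2:1
print(oversubscription(32, 400, 16, 400))  # → 2.0
```

AI training fabrics typically target 1:1 (non-blocking) on the compute network, so a 2:1 leaf like this would be the kind of proposal the topology lab asks you to catch.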
NCCL - The Application Layer
55 min · How NCCL translates AllReduce into RDMA operations. Ring vs Tree vs Double-Binary Tree algorithms. The environment variables that determine whether NCCL finds RDMA or falls back to TCP. Reading nccl-tests output. Correlating busbw degradation to fabric diagnostics.
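The busbw column that nccl-tests reports is derived, not measured: for AllReduce it is algbw scaled by 2(n-1)/n to reflect the data each rank actually moves. A small sketch of that arithmetic (the helper name and example numbers are ours):

```python
# How nccl-tests derives bus bandwidth from algorithm bandwidth for
# AllReduce: busbw = algbw * 2*(n-1)/n, with algbw = bytes / time.

def allreduce_busbw(gbytes_moved: float, seconds: float, n_ranks: int) -> float:
    algbw = gbytes_moved / seconds              # GB/s
    return algbw * 2 * (n_ranks - 1) / n_ranks  # GB/s on the bus

# A 1 GB AllReduce across 16 ranks finishing in 25 ms:
print(round(allreduce_busbw(1.0, 0.025, 16), 1))  # → 75.0
```

This is why busbw is the comparable figure across job sizes: a healthy fabric should hold roughly the same busbw as n grows, and a collapse to a few GB/s usually means NCCL fell back to socket transport.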
Part 3 - Physical Layer and Infrastructure
Requires Part 2 - Ch 9-11
Optics, Cabling, and the Physical Layer
40 min · The physical layer beneath the fabric: 400G/800G optics, DSPs, fiber types, form factors, cable selection, and why signal integrity and power density now shape AI cluster design.
The Storage Fabric
45 min · The separate network that feeds and protects training: storage isolation, GDS data paths, NVMe-oF transports, parallel file systems, checkpoint economics, and storage topology choices.
Monitoring, Telemetry, and Observability
48 min · Know about problems before the ML engineer's Slack message arrives: UFM REST API, DCGM GPU metrics, Prometheus alert design, threshold calibration, and cross-layer correlation across four monitoring streams.
GPU Hardware Generations
55 min · Network-relevant implications of GPU generations: the NVLink/NVSwitch generation table, SXM vs PCIe form factors, GH200, H100 CNX, and Confidential Computing.
Part 4 - Scale and Architecture
Requires Part 3 - Ch 12-16
Scale-Up Networking - NVLink Switch System
45 min · External NVLink Switch modules, 57.6 TB/s of all-to-all bandwidth at 256 GPUs, NVLink Network addressing, scale-up vs scale-out architecture decisions, and NVLink Switch diagnostics.
Alternative Topologies
45 min · Torus, folded torus, dragonfly, and TPU Pod design choices: where they came from, what workloads they suit, and why fat-tree remains dominant for AI training clusters.
IP Routing for AI/ML Fabrics
55 min · How modern AI fabrics use routed Ethernet: BGP unnumbered, ASN design, BGP DPF, RIFT comparisons, Flex Algo, SRv6 path steering, and multi-tenant EVPN-VXLAN design.
The GPU Compute Network - Packet Anatomy
55 min · A packet-level walkthrough from NCCL work queue entries to remote DMA completion: DGX interfaces, Queue Pair mechanics, ConnectX-7 processing, switch forwarding, and an end-to-end packet decode.
Labs
Scenario-based simulator work. Sign in to launch a lab and keep your troubleshooting progress attached to one account.
Identify the failed rail
12 min · A GPU rail has gone dark. Use the topology map and CLI tools to identify which rail, confirm the RDMA state, and isolate whether the fault is on the NIC or switch side.
Fix the PFC misconfiguration
10 min · A RoCEv2 workload is experiencing retransmissions because PFC is misconfigured. Diagnose and fix it.
Diagnose fabric congestion
15 min · Throughput has dropped 40%. Investigate using interface counters and ECN configuration commands.
Diagnose uneven spine utilisation
15 min · AllReduce throughput has dropped with no packet drops visible. Diagnose why spine links are unevenly loaded and fix load balancing to restore full training throughput.
Evaluate topology proposals
15 min · Two vendors propose different switch configurations for a 64-node DGX H100 cluster. Calculate oversubscription ratios, identify which proposal meets the requirements, and submit a recommendation before the purchase order is signed.
Diagnose NCCL transport fallback
20 min · A 16-node cluster shows 3 GB/s busbw instead of the expected performance. All hardware is healthy. Diagnose why NCCL is using socket transport and fix the environment variable misconfiguration.
Triage a silent fabric degradation
25 min · Training is 12% slower, but there are no hard errors anywhere. Use UFM port counters, DCGM GPU metrics, and switch counters to correlate a rising pre-FEC BER across three monitoring layers and identify the marginal physical connector before it becomes a full link failure.
Uncover the hidden pause storm
15 min · A switch looks healthy at a glance, but the NIC reveals a severe pause storm. Check both ends, identify the missing ECN configuration, and restore rate control before continuous PFC pauses collapse throughput.
Fix the PFC priority mismatch
18 min · PFC is enabled, but on the wrong traffic class. Cross-check NIC drops, PFC priority output, and the RoCE DSCP-to-priority mapping, then move PFC protection back to priority 3.
Recover the err-disabled rail
20 min · A rail has gone err-disabled after a physical fault. Recognise the NIC-side active-state trap, confirm the switch port failure, replace the optic, clear err-disable, and verify full recovery.
ECMP hotspot: BGP bandwidth community
22 min · A reduced-capacity spine is still receiving an equal ECMP share. Identify the missing BGP Link Bandwidth community and restore weighted ECMP before PFC storms spread.
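The fix this lab asks for rests on simple arithmetic: with the BGP Link Bandwidth community attached, next-hop shares become proportional to advertised bandwidth instead of equal. A sketch under assumed numbers (spine names and capacities are hypothetical):

```python
# Weighted-ECMP shares from advertised link bandwidth: each next hop
# receives traffic proportional to its bandwidth, not an equal split.

def ecmp_shares(link_gbps: dict[str, float]) -> dict[str, float]:
    total = sum(link_gbps.values())
    return {nh: bw / total for nh, bw in link_gbps.items()}

# spine1 lost half its uplinks and now advertises 200G vs spine2's 400G:
shares = ecmp_shares({"spine1": 200, "spine2": 400})
print({nh: round(s, 2) for nh, s in shares.items()})
```

Without the community, both spines get 50% and the degraded spine congests; with it, spine1 drops to a third of the traffic, matching its remaining capacity.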
BGP suboptimal routing: spine ASN design
24 min · A link failure triggers a bad 3-hop path because the spines use different ASNs. Trace the suboptimal route, unify the spine ASN design, and verify clean failover behavior.