OPEN PLATFORM / HPC NETWORKING / COMMUNITY REVIEWED
Master the fabric that runs AI.
InfiniBand. RoCEv2. RDMA. DGX SuperPOD. Spectrum-X. Learn AI and HPC networking through interactive chapters, stateful labs, and a simulator built for network engineers. Sign in once, then keep your progress and discussion history attached to one account.
scroll to explore
THE GAP
There is no Packet Tracer for HPC networking.
Network engineers who can troubleshoot BGP, reason about ECMP, and design VXLAN fabrics still walk into AI data centers and find an unfamiliar world. The knowledge is fragmented across vendor docs, conference talks, and incident writeups.
FabricLab turns that scattered knowledge into a structured, reviewable, interactive platform. Chapters explain the hardware and protocols. Labs let you test commands against live state. The community can then sharpen both.
400G
per GPU rail in a modern DGX training fabric
0
packet drops tolerated in healthy RDMA training flows
17
published chapters live in the open catalog today
CURRICULUM
A structured path from hardware to protocol.
17 chapters. 12 scenario labs. One simulator. Browse the catalog in public, then sign in to open the learning surfaces.
Chapter 0
The Hardware Story
Physical layer orientation. What an HCA is, why NICs became DPUs, how a DGX node is wired, the three separate networks.
Chapter 1
Operating Systems and Management Platforms
What runs on every device. How you access it after power-on. The management philosophy. CLI vs orchestrated. First power-on sequence.
Chapter 2
Why HPC Networking Is Different
The AllReduce barrier, why TCP fails, tail latency math, and the mental model shift from enterprise to AI networking.
Chapter 3
The CLI - Reading the Fabric
The commands and discipline for reading HPC fabric state. Which commands run where, how to read their output, and the investigation workflow from physical layer to configuration.
Chapter 4
InfiniBand Operations - ONYX CLI and Fabric Management
The InfiniBand operations layer: ONYX CLI, error counter interpretation, Subnet Manager management, ibdiagnet fabric sweep, and UFM event correlation.
Chapter 5
PFC, ECN, and Congestion Control
How losslessness actually works: PFC mechanics at the wire level, pause storm formation, ECN CE bit marking, DCQCN rate control algorithm, and the complete RoCEv2 port configuration checklist.
Chapter 6
Efficient Load Balancing
Why AI traffic is structurally low-entropy and how that breaks ECMP. The four load balancing modes (SLB, DLB, GLB, sDLB). Per-packet spraying and RSHP. Incast congestion patterns and how to diagnose them from spine utilization counters.
Chapter 7
Topology Design
How AI fabric scales from one switch to a SuperPOD. Fat-tree topology math, bisection bandwidth, BasePOD vs SuperPOD reference designs, oversubscription calculations, ROD vs RUD wiring, switch buffer selection, and cabling constraints.
Chapter 8
NCCL - The Application Layer
How NCCL translates AllReduce into RDMA operations. Ring vs Tree vs Double-Binary Tree algorithms. The environment variables that determine whether NCCL finds RDMA or falls back to TCP. Reading nccl-tests output. Correlating busbw degradation to fabric diagnostics.
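A taste of what this chapter covers: the transport NCCL selects is steered by a handful of environment variables. A minimal sketch; the variable names are real NCCL settings, but the values shown are illustrative and depend on your fabric, not recommendations:

```shell
# Illustrative NCCL transport-selection settings (values are examples only).
export NCCL_DEBUG=INFO            # log which transport NCCL actually selects
export NCCL_IB_HCA=mlx5           # restrict RDMA to HCAs matching this prefix
export NCCL_SOCKET_IFNAME=eth0    # interface for bootstrap/out-of-band traffic
export NCCL_IB_DISABLE=0          # 0 allows RDMA; 1 forces the TCP fallback
```

With `NCCL_DEBUG=INFO`, the nccl-tests log reports the chosen transport (NET/IB vs NET/Socket), which is the first thing to check when busbw comes in lower than the fabric should deliver.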
Chapter 9
Optics, Cabling, and the Physical Layer
The physical layer beneath the fabric: 400G/800G optics, DSPs, fiber types, form factors, cable selection, and why signal integrity and power density now shape AI cluster design.
Chapter 10
The Storage Fabric
The separate network that feeds and protects training: storage isolation, GDS data paths, NVMe-oF transports, parallel file systems, checkpoint economics, and storage topology choices.
Chapter 11
Monitoring, Telemetry, and Observability
Know about problems before the ML engineer's Slack message arrives. UFM REST API, DCGM GPU metrics, Prometheus alert design, threshold calibration, and cross-layer correlation across four monitoring streams.
Chapter 12
Scale-Up Networking - NVLink Switch System
External NVLink Switch modules, 57.6 TB/s all-to-all at 256 GPUs, NVLink Network addressing, scale-up vs scale-out architecture decisions, and NVLink Switch diagnostics.
Chapter 13
Alternative Topologies
Torus, folded torus, dragonfly, and TPU Pod design choices - where they came from, what workloads they suit, and why fat-tree remains dominant for AI training clusters.
Chapter 14
GPU Hardware Generations
Network-relevant implications of GPU generations: NVLink/NVSwitch generation table, SXM vs PCIe form factors, GH200, H100 CNX, and Confidential Computing.
Chapter 15
IP Routing for AI/ML Fabrics
How modern AI fabrics use routed Ethernet: BGP unnumbered, ASN design, BGP DPF, RIFT comparisons, Flex Algo, SRv6 path steering, and multi-tenant EVPN-VXLAN design.
Chapter 16
The GPU Compute Network - Packet Anatomy
A packet-level walkthrough from NCCL work queue entries to remote DMA completion: DGX interfaces, Queue Pair mechanics, ConnectX-7 processing, switch forwarding, and end-to-end packet decode.
HOW IT WORKS
Three systems. One learning loop.
Structured chapters
Each chapter connects physical hardware, transport behavior, congestion control, and operator workflow into one narrative with visual support.
Stateful CLI labs
The simulator is not static text. Commands read live lab state, so outputs change as you diagnose faults, fix configuration, or recover links.
Community feedback loop
Readers can comment directly on chapters and labs, report technical glitches, and help tighten the curriculum without waiting for a closed release cycle.
WHO THIS IS FOR
Built by a network engineer, for network engineers.
CCNP / CCIE engineers
You can read BGP tables and design VXLAN fabrics. You have not spent time inside InfiniBand or RoCE yet. FabricLab is the transition path.
HPC cluster administrators
You manage the servers, but the fabric still feels opaque. FabricLab closes the gap between compute operations and network operations.
Cloud and platform architects
You are designing GPU infrastructure and need to understand what lossless fabrics demand at the protocol and operational level.
Network engineers growing into AI infrastructure
AI fabrics are a fast-moving specialization. FabricLab gives you a structured path before the first production incident lands on your desk.
COMMUNITY
Keep the platform open. Make it sharper every week.
Comment where the issue appears
Leave technical corrections, lab glitches, and operator notes directly on the relevant chapter or lab page.
Contribute through the repo
The repository is set up for focused fixes, issue reports, new labs, and chapter improvements, without locking the platform behind a closed product wall.
Support without restricting access
FabricLab stays openly accessible. Support links are optional and simply help fund more chapter review, better labs, and platform polish.
START LEARNING
Learn the fabric in the open.
Browse the catalog in public, then sign in to open chapters and labs, join tracked discussions, and keep your progress attached to one account.