AI/HPC workloads have flipped data center design. Atlas DC1 is air-cooled and traditional. AI clusters need liquid cooling, busway distribution, sub-millisecond network fabric, and 5-100× the power density. This section maps the gap.
What Makes AI/HPC Different
Atlas DC1 was designed for traditional cloud + enterprise workloads — distributed servers, mixed densities, web/database work. AI/HPC is fundamentally different. Training a single large language model can use 25,000+ GPUs in tightly-coupled clusters with sub-millisecond network coordination. The design constraints flip:
| Dimension | Traditional DC (Atlas DC1) | AI/HPC |
|---|---|---|
| Rack density | ~12 kW/rack | 30-100+ kW/rack (training); up to 200+ kW/rack for inference |
| Cooling | Air (CRAH + containment) | Liquid (DLC, immersion) — mandatory |
| Power per row | 1-1.5 MW | 5-20+ MW |
| Network | 10-100 Gbps Ethernet, ms latency OK | InfiniBand or NVLink, sub-ms latency required |
| Workload pattern | Bursty (web requests come/go) | Constant (training runs for weeks at full load) |
| Failure tolerance | Application-level (web servers fail individually) | Cluster-level (one server failing can halt a 1,000-server training run) |
| Power continuity | UPS ride-through 5 min OK | Same — but checkpointing failures can cost days of training time |
| PUE target | 1.3-1.5 | 1.05-1.2 |
| Capital cost / MW | $15-22M/MW | $25-50M/MW (cooling + network premium) |
| Build timeline | 18-24 months | 24-36 months (custom mechanical, complex commissioning) |
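To put the density gap in numbers, here is a minimal sizing sketch: the same 200-rack data hall at traditional versus AI/HPC density. The rack count, densities, and PUE values are illustrative mid-range assumptions drawn from the table above, not Atlas DC1 design data.

```python
# Sketch of the table's arithmetic: facility power for the same 200-rack data
# hall at traditional vs. AI/HPC densities. Rack count, densities, and PUE
# values are illustrative assumptions within the ranges quoted above.

def facility_power_mw(racks: int, kw_per_rack: float, pue: float) -> float:
    """Facility power = IT load x PUE, returned in MW."""
    it_load_mw = racks * kw_per_rack / 1000.0
    return it_load_mw * pue

traditional_mw = facility_power_mw(racks=200, kw_per_rack=12, pue=1.4)   # ~3.4 MW
ai_hpc_mw = facility_power_mw(racks=200, kw_per_rack=60, pue=1.1)        # ~13.2 MW

print(f"Traditional hall: {traditional_mw:.1f} MW total")
print(f"AI/HPC hall:      {ai_hpc_mw:.1f} MW total")
```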
The AI Compute Stack — What's Inside an "AI Cluster"
| Layer | Component | Power per unit |
|---|---|---|
| Accelerator | NVIDIA H100 (700W), H200 (1000W), B100/B200 (~1200W), AMD MI300 (750W), custom (Google TPU, AWS Trainium, Meta MTIA, Microsoft Maia) | 700-1200W per chip |
| Server (DGX-style) | 8 GPUs + 2 CPUs + memory + NICs | 10-12 kW per server (NVIDIA HGX H100 = 10.2 kW) |
| Rack | 4-8 servers per rack (with cooling) | 40-80+ kW/rack |
| Pod / Cluster | 16-128 racks tightly coupled by InfiniBand | 1-10+ MW per pod |
| SuperPOD / SuperCluster | Multiple pods coordinated for very large training (NVIDIA SuperPOD = 32-127 DGX systems) | |
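As a rough check on how these layers stack, the sketch below rolls power up from accelerator to pod using the table's per-layer figures; the servers-per-rack and racks-per-pod counts are assumed mid-range values, not a specific product configuration.

```python
# Roll-up from accelerator to pod using the per-layer figures above. The
# servers-per-rack and racks-per-pod counts are assumed mid-range values.

GPU_WATTS = 700           # H100-class accelerator
GPUS_PER_SERVER = 8
SERVER_KW = 10.2          # NVIDIA HGX H100 server figure from the table
SERVERS_PER_RACK = 4      # assumed (table range: 4-8)
RACKS_PER_POD = 64        # assumed (table range: 16-128)

gpu_only_kw = GPUS_PER_SERVER * GPU_WATTS / 1000     # 5.6 kW of the 10.2 kW server
rack_kw = SERVERS_PER_RACK * SERVER_KW               # ~40.8 kW/rack
pod_mw = RACKS_PER_POD * rack_kw / 1000              # ~2.6 MW of IT per pod

print(f"GPUs alone: {gpu_only_kw:.1f} kW/server, rack: {rack_kw:.1f} kW, pod: {pod_mw:.2f} MW (IT only)")
```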
AI training requires GPUs in different servers to exchange gradients continuously throughout every training step, and standard Ethernet adds too much latency for these tightly synchronized collectives. The main interconnect technologies:
| Technology | Use | Power impact |
|---|---|---|
| NVLink (NVIDIA) | GPU-to-GPU within and between servers — 900 GB/s per link, sub-microsecond latency | Switch racks (NVLink switches) consume 5-20 kW each |
| InfiniBand (Mellanox/NVIDIA) | Server-to-server within pod — 400 Gbps per port, microsecond latency | IB switches consume 1-3 kW each |
| Ethernet (RoCE) | Alternative for scale-out; emerging Ultra Ethernet | Lower than InfiniBand |
| Optical interconnect | Cross-pod cabling at 800 Gbps+ optical | Optical transceivers add 10-30W per port |
For a 10 MW AI cluster, the network fabric alone can consume 5-10% of total power — not negligible.
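A back-of-envelope check on that 5-10% figure, using the per-device power numbers from the table above. The switch and transceiver counts are illustrative assumptions, not a real bill of materials.

```python
# Back-of-envelope check on the 5-10% fabric-power claim for a 10 MW cluster,
# using the per-device figures above. Device counts are assumptions for
# illustration, not a real bill of materials.

CLUSTER_MW = 10.0

nvlink_switch_racks = 16       # assumed count; 5-20 kW each (12 kW midpoint used)
ib_switches = 200              # assumed count; 1-3 kW each (2 kW midpoint used)
optical_ports = 10_000         # assumed count; 10-30 W each (20 W midpoint used)

fabric_kw = nvlink_switch_racks * 12 + ib_switches * 2 + optical_ports * 20 / 1000
fabric_share = fabric_kw / (CLUSTER_MW * 1000)

print(f"Fabric power: {fabric_kw:.0f} kW = {fabric_share:.1%} of the 10 MW cluster")
```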
Power Distribution Architecture for AI/HPC
| Element | Atlas DC1 (traditional) | AI/HPC equivalent |
|---|---|---|
| Service voltage | 12.47 kV utility | Same OR higher (138 kV for hyperscale campuses) |
| Service transformers | 2 × 2,500 kVA | Multiple 5-30 MVA transformers (per pod) |
| Distribution voltage | 480Y/277V to 415Y/240V | 480V or 415V → some hyperscale exploring 800V DC for direct-to-server feed |
| Per-row distribution | RPP panelboard (400 A) | Bus duct (2,000-4,000 A) |
| Per-rack delivery | 30-60 A branch circuits | 100-225 A plug-in disconnect from busway |
| UPS ride-through | 5 minutes | Same OR shorter (some designs use rotary UPS for inertia + ride-through) |
| Redundancy | 2N (dual-fed servers) | 2N OR distributed redundant (4N/3) at hyperscale; some accept N+1 at module level |
| Cooling power | ~30% of IT | ~5-15% of IT (DLC much more efficient) |
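To see where the 2,000-4,000 A bus duct figure comes from, here is a quick sketch of the per-row arithmetic: how many high-density racks a 4,000 A busway at 480Y/277V can carry. The power factor and the continuous-load derate are assumed typical values, not a code calculation.

```python
import math

# Sketch of per-row distribution sizing: racks carried by one 4,000 A bus duct.
# Power factor and the 0.8 continuous-load limit (the inverse of the 1.25
# factor used in NEC-style sizing) are assumed values for illustration.

VOLTS_LL = 480        # line-to-line voltage of a 480Y/277V busway
PF = 0.95             # assumed power factor
BUSWAY_AMPS = 4000    # bus duct ampacity from the table above
RACK_KW = 60          # a mid-range AI training rack

# Three-phase current drawn by one rack: I = P / (sqrt(3) * V_LL * pf)
rack_amps = RACK_KW * 1000 / (math.sqrt(3) * VOLTS_LL * PF)   # ~76 A

# Only 80% of the busway rating is usable for continuous load
usable_amps = 0.8 * BUSWAY_AMPS                                # 3,200 A
racks_per_busway = int(usable_amps // rack_amps)               # ~42 racks

print(f"{rack_amps:.0f} A per {RACK_KW} kW rack -> ~{racks_per_busway} racks per 4,000 A busway")
```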
Worked Example — Sizing a ~2 MW AI Pod
Example · NVIDIA HGX H100 SuperPOD: sizing the electrical system for a typical AI training cluster
IT load: ~1,460 kW (GPU servers + network + storage); applying the 1.25 continuous-load factor gives an electrical design demand of ~1,825 kW
Cooling and overhead: 10% of IT (vs 30% for air-cooled) = 146 kW, plus ~50 kW facility overhead = ~200 kW
Total facility demand: 1,825 + 200 = ~2,025 kW
PUE achieved: 2,025 / 1,460 = ~1.39 (could be lower with optimized DLC)
Electrical infrastructure
Service transformer: 2,500 kVA pad-mount (or 2 × 1,500 kVA if 2N)
UPS: 2 × 1,250 kVA online double-conversion (2N for IT)
Generators: 2 × 2,500 kW Tier 4 diesel
Power distribution: 480Y/277V busway (4,000 A) down each row
Per-rack feed: 100 A plug-in disconnect (10 kW per server × 1.25 continuous-load factor ≈ 30 A, so the 100 A feed has safety margin built in)
Cooling: Direct liquid cooling (CDUs serving multiple racks); 30°C chilled-water supply
Why this single pod is bigger than half of Atlas DC1
Atlas DC1 = 2.5 MW total. This single AI pod = 1.46 MW IT (2.0 MW total facility). One pod consumes more power than HALF of Atlas DC1's design capacity. Modern hyperscale AI campuses might have 50-100 of these pods coordinating on a single training run.
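The worked example's totals can be reproduced in a few lines. The 1,460 kW IT load and the 1.25 continuous-load factor are inferred from the example's own arithmetic (1,460 × 1.25 = 1,825 kW), so treat this as a sketch rather than a design basis.

```python
# Reproducing the worked example's numbers. The 1,460 kW IT load and the 1.25
# continuous-load factor are inferred from the figures in the example itself.

IT_KW = 1460                       # GPU servers + network + storage
DESIGN_FACTOR = 1.25               # continuous-load margin (same factor as the per-rack feed)
COOLING_FRACTION = 0.10            # DLC cooling ~10% of IT (vs ~30% air-cooled)
FACILITY_OVERHEAD_KW = 50          # lighting, controls, distribution losses

electrical_demand_kw = IT_KW * DESIGN_FACTOR                        # 1,825 kW
mechanical_kw = IT_KW * COOLING_FRACTION + FACILITY_OVERHEAD_KW     # 196 kW, call it ~200 kW
total_facility_kw = electrical_demand_kw + round(mechanical_kw, -2) # ~2,025 kW
pue = total_facility_kw / IT_KW                                     # ~1.39, as in the example

print(f"Total facility demand: {total_facility_kw:.0f} kW, PUE ~{pue:.2f}")
```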
The Frontier — Coming Architectures
| Trend (2026) | Implication |
|---|---|
| 800V DC distribution | Eliminates AC-DC-AC conversion at every PSU. Pioneered by the Open Compute Project (OCP); being adopted at hyperscale. |
| Battery backup IN the rack | Replace centralized UPS with batteries at each rack — eliminates UPS losses, simplifies redundancy |
| Microgrid + on-site generation | Pair the AI campus with on-site PV + ESS + gas turbines. 100+ MW microgrids becoming common. |
| Submersion / two-phase immersion | Pushing rack densities to 200-400 kW/rack |
| Heat reuse to district heating | Data center waste heat (50-80°C with DLC) feeds neighboring buildings or even municipal heat grids (Helsinki, Stockholm) |
| Modular AI pods | Factory-built pods shipped to site; deploy in 6 months instead of 24 |
| Co-location with renewables | Build the AI campus next to wind/solar farms; long-term PPAs lock in low-cost clean power |
If You See THIS, Think THAT
| If you see… | Think / use… |
|---|---|
| "AI/HPC data center" | 30-100 kW/rack · DLC mandatory · sub-ms network · a single training run uses 1000s of GPUs |
| "NVIDIA HGX H100" / "DGX" | NVIDIA's reference 8-GPU server. ~10 kW. The standard AI building block. |
| "SuperPOD" | NVIDIA terminology for 32-127 DGX systems coordinated by InfiniBand |
| "InfiniBand" | Low-latency fabric for tight GPU coordination. Costs more than Ethernet but is the default for training. |
| "NVLink switch" | NVIDIA's GPU-to-GPU interconnect within and between servers |
| "800V DC" | Open Compute Project standard. Direct DC to the server. Hyperscale-only currently. |
| "Liquid cooling" in 2026 context | Almost certainly DLC (cold plates), increasingly immersion. See §35. |
| "PUE 1.1" or lower | DLC or immersion. Air cooling rarely gets this low. |
| "Hyperscaler" | AWS, Google, Microsoft, Meta, Apple, Alibaba, Tencent. They operate their own DCs. |
| "Cloud GPU on-demand" | End users access these AI clusters via cloud APIs. The DC is the hyperscaler's; the GPUs are rented by the hour. |