UALink and the Battle for Rack-Scale GPU Interconnect
Abstract
Scaling AI workloads to tens of thousands of accelerators requires more than raw FLOPS — it requires the ability to connect GPUs into a coherent compute fabric. Nvidia has long dominated this domain with NVLink and NVSwitch, which provide high-bandwidth, low-latency interconnects tightly integrated into its CUDA ecosystem. These technologies allowed Nvidia to extend its dominance from chips to entire racks and pods, creating systemic lock-in for hyperscalers.
In 2024, however, an industry consortium formed around a new open standard: UALink (Ultra Accelerator Link), whose 1.0 specification followed in 2025. Backed by AMD, Intel, Broadcom, HPE, Dell, Cisco, and other industry leaders, UALink is designed to break Nvidia's monopoly at rack scale. UALink aims to provide a vendor-neutral, high-speed interconnect supporting up to 1,024 accelerators per pod, enabling coherent memory pools and multi-vendor GPU deployments.
This whitepaper examines the origins of Nvidia’s interconnect dominance, the technical underpinnings of UALink, its implications for system design, the challenges it faces, and its potential to reshape competition between AMD and Nvidia in the datacenter AI market.
1. Introduction
The exponential growth of AI workloads has forced system architects to think in terms of racks, pods, and clusters, not individual GPUs. Training a trillion-parameter model can require tens of thousands of GPUs operating as a tightly coupled system. To achieve this, accelerators must exchange data at terabytes per second within nodes and hundreds of gigabytes per second across racks.
Nvidia anticipated this early. By the mid-2010s, it realized PCIe interconnects would be insufficient for large-scale GPU workloads. It developed NVLink, a point-to-point GPU interconnect delivering up to 900 GB/s per GPU in the Hopper generation and 1.8 TB/s in Blackwell. NVLink was then extended into NVSwitch, a fully connected switch fabric allowing up to 256 GPUs in a single domain to share memory and operate like one giant accelerator.
This infrastructure, combined with InfiniBand across pods, created Nvidia's systemic lock-in: hyperscalers had to buy not just GPUs, but the entire stack. Competitors such as AMD lacked an equivalent rack-scale interconnect (Infinity Fabric links GPUs within a node, not across racks) and were confined to small-scale deployments.
UALink, whose 1.0 specification arrived in 2025, represents the first serious attempt to break this monopoly. By offering an open, multi-vendor interconnect, it could shift power away from Nvidia and toward a more diverse ecosystem.
2. Nvidia’s NVLink and NVSwitch: The Legacy
2.1 NVLink Origins
PCIe, the dominant server interconnect, offered limited bandwidth and high latency for GPU-GPU communication. Nvidia introduced NVLink in 2016 with the Pascal P100 to bypass this bottleneck. Each NVLink link offered tens of GB/s, with multiple links per GPU delivering hundreds of GB/s of aggregate bandwidth.
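As a back-of-envelope illustration of how per-link rates aggregate into the per-GPU figures above (the link counts and per-link rates below are commonly published values for two Nvidia generations, included as illustrative assumptions rather than spec citations):

```python
# Aggregate NVLink bandwidth = links per GPU x per-link rate.
# Figures are commonly published values; treat them as illustrative.

generations = {
    "Pascal P100 (NVLink 1)": {"links": 4, "gb_per_link": 40},   # ~40 GB/s bidirectional per link
    "Hopper H100 (NVLink 4)": {"links": 18, "gb_per_link": 50},  # ~50 GB/s bidirectional per link
}

for name, cfg in generations.items():
    aggregate = cfg["links"] * cfg["gb_per_link"]
    print(f"{name}: {cfg['links']} links x {cfg['gb_per_link']} GB/s = {aggregate} GB/s per GPU")
```

Run as-is, this reproduces the 900 GB/s Hopper figure cited earlier.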
2.2 NVSwitch Expansion
NVSwitch extended NVLink into a switch fabric. In the Hopper generation, NVSwitch 3 supported 900 GB/s per GPU in a fully connected topology, allowing 256 GPUs in a single system.
This architecture enabled memory pooling: GPUs could directly access one another's memory, effectively expanding the usable VRAM per workload. For training models with hundreds of billions of parameters, this was critical.
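A toy sketch of the idea behind pooling: a flat global address space striped across per-GPU memories. The striping scheme here is hypothetical and purely illustrative; real NVSwitch systems resolve remote addresses in hardware, not with a lookup like this.

```python
# Toy model of a pooled address space: a flat global address resolves to
# (gpu_id, local_offset). The layout below is a hypothetical illustration;
# real systems use hardware address translation.

GPU_MEM_BYTES = 80 * 2**30   # assume 80 GB of HBM per GPU (H100-class)
NUM_GPUS = 256               # one fully connected NVSwitch domain
POOL_BYTES = NUM_GPUS * GPU_MEM_BYTES

def resolve(global_addr: int) -> tuple[int, int]:
    """Map a flat pool address to the owning GPU and its local offset."""
    if not 0 <= global_addr < POOL_BYTES:
        raise ValueError("address outside the pooled memory range")
    return global_addr // GPU_MEM_BYTES, global_addr % GPU_MEM_BYTES

print(f"pooled capacity: {POOL_BYTES / 2**40:.0f} TiB")   # 256 x 80 GiB = 20 TiB
gpu, off = resolve(123 * 2**30)                           # 123 GiB into the pool
print(f"GPU {gpu}, local offset {off / 2**30:.0f} GiB")
```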
2.3 Strategic Implications
NVLink plus NVSwitch locked hyperscalers into Nvidia's ecosystem. Nvidia's NCCL collective-communication library was tightly optimized for this topology, delivering performance unmatched by competitors. Without an alternative, even hyperscalers wary of Nvidia's pricing had no viable second source.
3. UALink: Origins and Objectives
In 2024, AMD, Intel, Broadcom, HPE, Dell, and others announced UALink; the 1.0 specification followed in 2025. Its goals were explicit:
- Provide a standardized, open GPU interconnect at rack scale.
- Support up to 1,024 accelerators per domain, far surpassing the 256-GPU limit of Nvidia's NVSwitch domains.
- Enable coherent memory pools across heterogeneous accelerators.
- Break single-vendor dependence by ensuring interoperability.
The consortium’s motivation was clear: no single vendor could counter Nvidia alone. By joining forces, they created the possibility of a credible alternative.
4. Technical Foundations of UALink
4.1 Architecture
UALink is designed as a switched interconnect fabric, similar in topology to NVSwitch but with broader scalability. Each accelerator connects to a UALink switch via high-speed serial lanes (expected to support >1.6 Tb/s aggregate per GPU).
Multiple switches can be connected into a rack-scale fabric, with support for 256–1,024 GPUs in a coherent domain.
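Converting the headline figures quoted above into comparable units (a sketch using only the numbers in this section; whether the 1.6 Tb/s is per direction or bidirectional is left as an assumption):

```python
# Unit conversion for the headline UALink figures quoted above.

TBPS_PER_GPU = 1.6                        # >1.6 Tb/s aggregate per GPU (roadmap figure)
GB_PER_GPU = TBPS_PER_GPU * 1000 / 8      # Tb/s -> GB/s, i.e. 200 GB/s

for gpus in (256, 1024):
    injection_tb = gpus * GB_PER_GPU / 1000   # total injection bandwidth, TB/s
    print(f"{gpus:>4} GPUs x {GB_PER_GPU:.0f} GB/s = {injection_tb:.0f} TB/s into the fabric")
```

At the full 1,024-GPU scale, the fabric would absorb roughly 205 TB/s of aggregate injection bandwidth under these assumptions.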
4.2 Memory Coherency
One of UALink’s defining features is coherent memory pooling. GPUs and other accelerators can directly address each other’s memory, enabling larger effective VRAM pools. This is critical for trillion-parameter model training, where memory capacity, not just compute, is the bottleneck.
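A back-of-envelope check on why pooling matters at the trillion-parameter scale. The ~16 bytes-per-parameter figure is a common rule of thumb for mixed-precision Adam training (weights, gradients, and optimizer states), and 192 GB per GPU is an assumed MI300X-class HBM capacity; both are illustrative, not UALink numbers.

```python
# Training state for a 1T-parameter model vs. the HBM pooled across a domain.
# 16 bytes/param (fp16 weights/grads + fp32 master copy + Adam moments) and
# 192 GB/GPU are rule-of-thumb assumptions for illustration.

PARAMS = 1e12
BYTES_PER_PARAM = 16
HBM_PER_GPU_GB = 192

need_tb = PARAMS * BYTES_PER_PARAM / 1e12
print(f"training state: ~{need_tb:.0f} TB")

for n in (8, 256, 1024):
    pool_tb = n * HBM_PER_GPU_GB / 1000
    verdict = "fits" if pool_tb >= need_tb else "does not fit"
    print(f"{n:>4}-GPU pool: {pool_tb:.1f} TB -> {verdict}")
```

An 8-GPU node falls far short of the ~16 TB of training state, while a 256- or 1,024-GPU coherent pool comfortably exceeds it.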
4.3 Vendor-Neutral Protocols
UALink defines an open transport protocol, allowing any compliant GPU, accelerator, or NIC to join the fabric. Unlike NVLink, which is proprietary and closed, UALink is governed by the consortium, ensuring multi-vendor interoperability.
4.4 Integration with Ethernet
While UALink handles intra-pod connectivity, Ethernet fabrics (as standardized by the Ultra Ethernet Consortium, UEC) handle inter-pod communication. Together, they form a two-tier model: UALink for memory-coherent domains, Ethernet for scaling across racks.
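A simple analytic sketch of the two-tier model: an all-reduce decomposed into an intra-pod phase over UALink and an inter-pod phase over Ethernet. The bandwidth-only cost model and every number below are simplifying assumptions for illustration, not measured or specified figures.

```python
# Bandwidth-only cost model for a hierarchical all-reduce: reduce inside each
# UALink pod first, then ring all-reduce the partial results across pods over
# Ethernet. Latency terms are ignored; all values are illustrative assumptions.

def hierarchical_allreduce_ms(msg_gb, pod_bw_gbs, eth_bw_gbs, pods):
    intra = msg_gb / pod_bw_gbs                       # phase 1: within the pod (UALink)
    inter = msg_gb / eth_bw_gbs * (pods - 1) / pods   # phase 2: ring across pods (Ethernet)
    return (intra + inter) * 1000

# Example: 10 GB of gradients, 200 GB/s per GPU intra-pod (1.6 Tb/s),
# 100 GB/s per pod uplink (800 Gb/s Ethernet), 16 pods.
t = hierarchical_allreduce_ms(msg_gb=10, pod_bw_gbs=200, eth_bw_gbs=100, pods=16)
print(f"modelled all-reduce time: {t:.0f} ms")
```

Even in this crude model, the slower Ethernet tier dominates, which is why keeping as much traffic as possible inside the coherent UALink domain matters.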
5. System Design Implications
5.1 Pod-Level Scaling
With UALink, a single pod could include 1,024 GPUs sharing memory. This enables workloads that would otherwise require complex model parallelism. For hyperscalers, this simplifies software stacks and improves efficiency.
5.2 Heterogeneous Accelerator Pools
UALink is designed to support not only GPUs but also custom accelerators (AI ASICs, NPUs). A single domain could include AMD GPUs, Intel accelerators, and even FPGAs, all addressing a shared memory pool.
5.3 Rack-Scale Supercomputers
UALink allows pods of 1,024 GPUs to be interconnected over Ethernet, creating superclusters of 10,000–100,000 accelerators. This architecture mirrors Nvidia’s InfiniBand + NVSwitch approach but with openness and scale as differentiators.
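A sizing sketch for the supercluster figures above, using only the pod size and target totals already quoted in this section:

```python
# How many 1,024-GPU UALink pods make up the quoted supercluster sizes.

POD_SIZE = 1024
for target in (10_000, 100_000):
    pods = -(-target // POD_SIZE)   # ceiling division
    print(f"{target:>7} accelerators -> {pods} pods of {POD_SIZE} "
          f"({pods * POD_SIZE} GPUs provisioned)")
```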
6. Challenges Facing UALink
6.1 Software Ecosystem
Nvidia's dominance rests not only on hardware but on software. CUDA and NCCL are deeply optimized for NVLink/NVSwitch. For UALink to compete, ROCm and other libraries must reach a comparable level of maturity, a daunting task.
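To make the software gap concrete: much of the application layer already goes through portable APIs, as in the minimal PyTorch sketch below, where the same collective call dispatches to NCCL on CUDA builds or to RCCL on ROCm builds (ROCm exposes RCCL under the "nccl" backend name). What UALink needs is for the layers beneath this call to reach NVLink-class maturity.

```python
# Minimal sketch: the application-level collective API is already portable.
# On CUDA builds the "nccl" backend dispatches to NCCL; on ROCm builds the
# same name dispatches to RCCL. Assumes rank/world-size env vars are set by
# a launcher such as torchrun.

import torch
import torch.distributed as dist

def allreduce_gradients():
    dist.init_process_group(backend="nccl")   # NCCL on CUDA, RCCL on ROCm
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    grads = torch.ones(1024, device=device)
    # The fabric-level path under this call is what UALink-era libraries
    # must optimize to match NVLink/NVSwitch performance.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    dist.destroy_process_group()

if __name__ == "__main__":
    allreduce_gradients()
```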
6.2 Deployment Inertia
Hyperscalers have invested billions in Nvidia-based infrastructure. Shifting to UALink requires operational retraining, software refactoring, and ecosystem validation.
6.3 Performance Proof
UALink is in its infancy. While the roadmap promises >1.6 Tb/s per GPU and 1,024-GPU domains, actual deployments in 2025–26 will determine whether it can deliver real-world scaling.
7. Case Studies and Early Adoption
7.1 Supermicro and HPE
Supermicro and HPE announced early systems supporting UALink fabrics, targeting 2026 delivery. These systems promise multi-rack AMD MI350X clusters with UALink for intra-pod and Ethernet for inter-pod scaling.
7.2 Hyperscaler Pilots
Industry reports suggest that Microsoft and Oracle are evaluating UALink for next-generation clusters, given their strategic partnerships with AMD. If adopted, these would be the first hyperscale deployments outside Nvidia's stack.
8. Strategic Impact for AMD vs Nvidia
8.1 Breaking the Rack-Scale Monopoly
Nvidia’s systemic advantage rests on controlling not just chips but racks. UALink directly attacks this moat, offering hyperscalers an alternative.
8.2 Enabling AMD at Scale
For AMD, UALink is transformative. Without a rack-scale interconnect, AMD's GPUs could compete only in small clusters. With UALink, MI300/MI350 GPUs can scale toward parity with Nvidia across racks, making AMD a credible second source.
8.3 Toward a Multi-Vendor Future
If UALink succeeds, the datacenter AI market shifts from single-vendor dependence to multi-vendor ecosystems. This reduces costs for hyperscalers, increases supply resilience, and levels the playing field.
9. Conclusion
Interconnects are as critical as GPUs in the AI era. Nvidia's NVLink and NVSwitch established dominance by making racks of GPUs behave like single supercomputers. But UALink, backed by an unprecedented industry coalition, aims to break this monopoly.
By supporting 1,024-GPU pods, coherent memory pools, and vendor-neutral protocols, UALink represents a credible alternative for hyperscalers. Its success could enable AMD and others to challenge Nvidia at rack scale for the first time.
The battle for AI is not only about how fast a GPU can compute, but about how many GPUs can act as one. In that battle, UALink is the most significant interconnect challenge Nvidia has yet faced.