Ethernet Fabrics and the Rise of Open AI Networking
Abstract
Artificial intelligence has transformed datacenter architecture. The challenge today is not just how fast a single GPU can compute, but how efficiently tens of thousands of accelerators can work as one coherent, distributed system. For the past two decades, InfiniBand has dominated the interconnect that binds such systems together, offering deterministic latency and lossless characteristics critical for large-scale HPC and, more recently, AI workloads.
By 2025, however, Ethernet has emerged as a credible rival. No longer “best effort,” Ethernet now supports the stringent demands of hyperscale AI thanks to advances in RDMA over Converged Ethernet (RoCE), congestion-aware switching, SmartNIC offloads, high-speed optics, and the institutional backing of the Ultra Ethernet Consortium (UEC).
This whitepaper examines the technical evolution that enabled Ethernet fabrics to challenge InfiniBand, presents case studies from hyperscalers already deploying Ethernet-based AI clusters, and explores the strategic consequences for Nvidia and AMD. We argue that Ethernet fabrics are not just another interconnect choice but a structural shift that could open the AI datacenter to true multi-vendor competition.
1. Introduction
The last decade has seen an extraordinary escalation in the scale of AI workloads. In 2012, AlexNet was trained on just two consumer-grade GPUs. By 2020, GPT-3 required roughly 10,000 Nvidia V100 GPUs, and by 2024 GPT-4 and LLaMA-3 demanded clusters of 25,000–50,000 accelerators.
This scaling transformed the economics of AI. A training run that once cost tens of thousands of dollars now costs tens of millions. Efficiency in the fabric connecting GPUs became as critical as raw compute throughput.
Nvidia foresaw this and built a vertically integrated stack:
- CUDA as the software substrate, locking frameworks to Nvidia hardware.
- NVLink/NVSwitch for intra-node and pod-scale GPU communication.
- InfiniBand, acquired via Mellanox, to tie racks together in a lossless, deterministic fabric.
This “compute + interconnect” trifecta created lock-in. Hyperscalers couldn’t easily swap GPUs without losing the ecosystem benefits, cementing Nvidia’s >90% datacenter AI share by 2023.
Ethernet, meanwhile, was dismissed as inadequate for AI. It was cheap, ubiquitous, and fine for storage and web traffic — but packet drops and variable latency made it unsuitable for collective operations like all-reduce.
By the mid-2020s, that assumption changed. Hyperscalers facing ballooning costs and dependency on Nvidia began investing heavily in Ethernet. With its broad vendor ecosystem and standardization history, Ethernet offered openness, flexibility, and cost advantages — if it could match InfiniBand’s determinism. By 2025, the first production AI superclusters over Ethernet proved it could.
2. Technical Evolution of Ethernet for AI Workloads
The maturation of Ethernet into a first-class AI fabric was not the result of a single breakthrough but a series of cumulative advances across five domains:
2.1 RDMA over Converged Ethernet (RoCE)
Traditional Ethernet traffic traverses the CPU and TCP/IP stack, introducing tens of microseconds of latency. For AI training, this is unacceptable. RDMA allows memory-to-memory transfers between nodes without CPU intervention.
RoCEv2 extended RDMA across routable Ethernet fabrics, creating low-latency GPU-to-GPU communication. By 2024, 400G and 800G SmartNICs shipped with RoCEv2 as standard, cutting synchronization latency nearly in half.
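The benefit is easiest to see in a simple latency model. The sketch below compares a kernel TCP/IP path with an RDMA (RoCEv2-style) path for a single GPU-to-GPU transfer; all constants are assumed, order-of-magnitude values for illustration, not measurements from any specific NIC.

```python
# Illustrative latency model: kernel TCP path vs. RDMA (RoCEv2-style) path
# for one GPU-to-GPU transfer. All constants are assumed values.

def transfer_latency_us(payload_bytes, link_gbps, per_hop_us, stack_us):
    """Wire time plus fixed per-transfer overheads, in microseconds."""
    wire_us = payload_bytes * 8 / (link_gbps * 1e3)  # Gb/s -> bits per us
    return wire_us + per_hop_us + stack_us

MSG = 1 << 20  # a 1 MiB gradient shard

# Kernel TCP/IP: tens of microseconds of CPU/stack overhead per transfer.
tcp = transfer_latency_us(MSG, link_gbps=400, per_hop_us=1.0, stack_us=30.0)

# RoCEv2: the NIC writes directly into remote memory, no CPU in the path.
roce = transfer_latency_us(MSG, link_gbps=400, per_hop_us=1.0, stack_us=2.0)

print(f"TCP path:  {tcp:.1f} us")
print(f"RoCE path: {roce:.1f} us")
```

Under these assumptions the fixed stack overhead, not the wire time, dominates the gap, which is consistent with the roughly halved synchronization latency described above.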
2.2 Congestion Control and Telemetry
In Clos networks, thousands of concurrent GPU flows easily congest links. Standard Ethernet drops packets under pressure, stalling collectives.
The solution: Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and switch-level telemetry. Broadcom’s Tomahawk 5 switch (51.2 Tb/s) introduced per-queue visibility, while Arista’s EOS enabled adaptive routing. These innovations created “lossless Ethernet” approximating InfiniBand behavior.
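To make the ECN mechanism concrete, here is a minimal sender-side rate-control loop, loosely in the spirit of DCQCN (the algorithm commonly paired with RoCEv2): multiplicative decrease when packets come back ECN-marked, additive increase otherwise. The constants are illustrative and not taken from any specification.

```python
# Sketch of ECN-driven sender rate control (AIMD), loosely DCQCN-style.
# Constants are illustrative, not from any spec.

def next_rate(rate_gbps, ecn_marked, line_rate_gbps=400.0,
              cut_factor=0.5, additive_step_gbps=5.0):
    """One control-loop step: halve the rate on an ECN mark,
    otherwise add a small increment, clamped to line rate."""
    if ecn_marked:
        return rate_gbps * cut_factor              # back off on congestion
    return min(rate_gbps + additive_step_gbps, line_rate_gbps)

# A burst of ECN marks followed by recovery:
rate = 400.0
for marked in [True, True, False, False, False]:
    rate = next_rate(rate, marked)
# sequence: 400 -> 200 -> 100 -> 105 -> 110 -> 115
print(f"rate after loop: {rate} Gb/s")
```

The point of the sketch is the shape of the response: senders drain queues quickly when switches signal pressure, then probe back up slowly, which keeps buffers shallow without dropping packets.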
2.3 SmartNIC Offloads
NICs evolved into co-processors. Instead of simply moving packets, SmartNICs offload collective operations such as all-reduce.
This reduces east–west traffic by aggregating gradients locally and cuts tail latency by smoothing stragglers. Tail latency, the time taken by the slowest node, is often the limiting factor in synchronous training at scale. By minimizing it, SmartNICs allowed Ethernet clusters to approach InfiniBand's efficiency.
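A short numerical example shows why the tail, not the average, gates scaling: a synchronous all-reduce cannot complete until the slowest worker arrives, so step time is the maximum of the per-worker times. The numbers below are illustrative.

```python
# Why tail latency gates synchronous training: the collective waits for
# the slowest worker, so step time is the max, not the mean.
# Per-worker times below are illustrative.

per_worker_ms = [10.2, 10.4, 10.1, 10.3, 14.9]  # one straggler

step_time = max(per_worker_ms)                   # everyone waits for the tail
mean_time = sum(per_worker_ms) / len(per_worker_ms)

print(f"mean worker time: {mean_time:.2f} ms")
print(f"actual step time: {step_time:.2f} ms")
```

Smoothing the single straggler toward the mean would cut the step time by roughly a quarter, which is exactly the leverage a SmartNIC offload has when it absorbs jitter in the collective path.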
2.4 Optical Advances
By 2025, GPUs consumed 600–700W each. Networking had to deliver bandwidth without overwhelming power budgets.
Hyperscalers adopted 800G optics widely, with 1.6T modules on roadmap. Co-packaged optics (CPO) further reduced energy per bit by integrating optical engines beside ASICs. Vendors like Marvell, Broadcom, and Coherent drove DSP and packaging innovation, making optics the backbone of Ethernet’s AI competitiveness.
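The power argument for CPO reduces to a back-of-the-envelope product: watts per port equal energy per bit times bit rate. The pJ/bit figures below are assumed, roughly representative values, used only to show the scale of the savings at 800G.

```python
# Back-of-the-envelope optics power: W/port = (energy per bit) x (bit rate).
# The pJ/bit figures are assumed, illustrative values, not vendor specs.

def port_power_w(rate_gbps, pj_per_bit):
    # Gb/s -> bits/s, pJ -> J
    return rate_gbps * 1e9 * pj_per_bit * 1e-12

pluggable_800g = port_power_w(800, 15)  # assumed pluggable DSP module
cpo_800g = port_power_w(800, 5)         # assumed co-packaged engine

print(f"800G pluggable: {pluggable_800g:.0f} W/port")
print(f"800G CPO:       {cpo_800g:.0f} W/port")
```

Multiplied across thousands of switch ports, a few watts saved per port is the difference between optics fitting inside the rack power budget or competing with the GPUs for it.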
2.5 Open Standardization: Ultra Ethernet Consortium (UEC)
In 2023, AMD, Arista, Broadcom, Cisco, HPE, Intel, Meta, and Microsoft, among others, founded the UEC to prevent fragmentation; membership has since grown past 50 companies.
- UEC 1.0 (2025): defined transport semantics, telemetry APIs, and RoCE integration.
- UEC 2.0 (2027): will add 1.6T optics and collective offloads.
- UEC 3.0 (2029): aims to extend memory semantics across fabrics.
Unlike InfiniBand, for which Nvidia is effectively the sole supplier, UEC specifications ensure interoperability and multi-vendor sourcing, a feature hyperscalers demand.
3. Case Studies from Hyperscalers
Microsoft Azure
Azure deployed ND MI300X v5 instances using AMD accelerators over Ethernet. Benchmarks showed near-linear scaling on LLaMA fine-tuning. Microsoft engineers highlighted Ethernet’s flexibility: one fabric carried both AI and storage traffic, reducing cost and complexity.
Oracle Cloud Infrastructure
Oracle built MI300X superclusters of 16,384 GPUs, with plans for >100,000 GPU deployments in the MI355X era. All were interconnected with Ethernet. Oracle cited multi-vendor procurement and TCO reduction as drivers.
Meta
Meta used Ethernet for LLaMA-3 inference clusters. Inference requires consistent millisecond-scale latency across billions of queries per day. Congestion-aware routing in Ethernet fabrics reduced tail latency, delivering predictable performance.
4. Challenges and Remaining Gaps
Ethernet still lags InfiniBand in some respects:
- Software maturity: Nvidia’s NCCL library is deeply optimized for InfiniBand. RCCL (part of AMD’s ROCm) and the RoCE paths in frameworks such as PyTorch are catching up.
- Determinism: InfiniBand shows slightly more predictable latency under extreme congestion. Ethernet needs continued algorithmic tuning.
- Inertia: Hyperscalers have billions invested in InfiniBand estates. Migrating requires retraining operations and orchestration systems.
5. Strategic Impact for AMD vs Nvidia
Ethernet changes the dynamics of the GPU race.
- For Nvidia: InfiniBand was not just networking but a systemic moat. As Ethernet adoption grows, Nvidia loses this control point. Its GPUs remain dominant, but its fabric monopoly weakens.
- For AMD: Ethernet is an equalizer. AMD’s MI300/MI350 GPUs can scale to thousands of nodes without Nvidia’s interconnect. Hyperscalers can now deploy AMD as a true second source.
- For the market: By 2030, Ethernet adoption could push AMD’s share of AI datacenter GPUs into double digits. The race shifts from raw FLOPS to openness, TCO, and supply diversity.
6. Conclusion
Ethernet has undergone a remarkable metamorphosis. Once derided as “best effort,” it now stands shoulder-to-shoulder with InfiniBand as a substrate for AI supercomputing.
With RoCE, congestion-aware switching, SmartNIC offloads, high-speed optics, and UEC standardization, Ethernet fabrics deliver the determinism required by AI while preserving openness and vendor diversity.
For hyperscalers, this promises freedom from lock-in and lower costs. For AMD, it provides a path to compete against Nvidia not just on GPU design but on system-level parity.
The battle for AI infrastructure is not only about FLOPS per GPU. It is about how those GPUs are networked. In that battle, Ethernet has become the great equalizer.