- Huawei stacks thousands of NPUs to demonstrate brute-force supercomputing dominance
- Nvidia delivers polish, stability, and proven AI performance that enterprises trust
- AMD teases a radical networking fabric to push scalability into new territory

The race to build the most powerful AI supercomputing systems is intensifying, and major manufacturers now want a flagship cluster that proves they can handle the next generation of trillion-parameter models and data-heavy research.
Huawei's recently announced Atlas 950 SuperPoD, Nvidia's DGX SuperPOD, and AMD's upcoming Instinct MegaPod each represent a different approach to solving the same problem.
All of them aim to deliver massive compute, memory, and bandwidth in a single scalable package, powering AI tools for generative models, drug discovery, autonomous systems, and data-driven science. But how do they compare?
| Category | Huawei Ascend 950DT | NVIDIA H200 | AMD Radeon Instinct MI300 |
|---|---|---|---|
| Chip family / name | Ascend 950 series | H200 (GH100, Hopper) | Radeon Instinct MI300 (Aqua Vanjaram) |
| Architecture | Proprietary Huawei AI accelerator | Hopper GPU architecture | CDNA 3.0 |
| Process / foundry | Not yet publicly confirmed | 5 nm (TSMC) | 5 nm (TSMC) |
| Transistors | Not specified | 80 billion | 153 billion |
| Die size | Not specified | 814 mm² | 1017 mm² |
| Optimization | Decode-stage inference & model training | General-purpose AI & HPC acceleration | AI/HPC compute acceleration |
| Supported formats | FP8, MXFP8, MXFP4, HiF8 | FP16, FP32, FP64 (via Tensor/CUDA cores) | FP16, FP32, FP64 |
| Peak performance | 1 PFLOPS (FP8 / MXFP8 / HiF8), 2 PFLOPS (MXFP4) | FP16: 241.3 TFLOPS, FP32: 60.3 TFLOPS, FP64: 30.2 TFLOPS | FP16: 383 TFLOPS, FP32/FP64: 47.87 TFLOPS |
| Vector processing | SIMD + SIMT hybrid, 128-byte memory access granularity | SIMT with CUDA and Tensor cores | SIMT + Matrix/Tensor cores |
| Memory type | HiZQ 2.0 proprietary HBM (decode & training variant) | HBM3e | HBM3 |
| Memory capacity | 144 GB | 141 GB | 128 GB |
| Memory bandwidth | 4 TB/s | 4.89 TB/s | 6.55 TB/s |
| Memory bus width | Not specified | 6144-bit | 8192-bit |
| L2 cache | Not specified | 50 MB | Not specified |
| Interconnect bandwidth | 2 TB/s | Not specified | Not specified |
| Form factors | Cards, SuperPoD servers | PCIe 5.0 x16 (server/HPC only) | PCIe 5.0 x16 (compute card) |
| Base / boost clock | Not specified | 1365 / 1785 MHz | 1000 / 1700 MHz |
| Cores / shaders | Not specified | 16,896 CUDA cores, 528 Tensor cores (4th gen) | 14,080 shaders, 220 CUs, 880 Tensor cores |
| Power (TDP) | Not specified | 600 W | 600 W |
| Bus interface | Not specified | PCIe 5.0 x16 | PCIe 5.0 x16 |
| Outputs | None (server use) | None (server/HPC only) | None (compute card) |
| Target scenarios | Large-scale training & decode inference (LLMs, generative AI) | AI training, HPC, data centers | AI/HPC compute acceleration |
| Launch / availability | Q4 2026 | Nov 18, 2024 | Jan 4, 2023 |
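One way to read the table is through a roofline-style lens: how many bytes of memory bandwidth back each FLOP of peak compute. Here is a minimal Python sketch using only the FP16 figures quoted above (the Ascend 950DT is omitted because the table quotes only FP8/FP4 peaks):

```python
# Back-of-envelope bytes-per-FLOP from the spec table above.
# These are peak datasheet numbers, not measured throughput.
chips = {
    # name: (memory bandwidth in TB/s, peak FP16 in TFLOPS)
    "NVIDIA H200": (4.89, 241.3),
    "AMD MI300": (6.55, 383.0),
}

for name, (bw_tbps, fp16_tflops) in chips.items():
    # Bytes of memory bandwidth available per FP16 operation at peak.
    bytes_per_flop = (bw_tbps * 1e12) / (fp16_tflops * 1e12)
    print(f"{name}: {bytes_per_flop:.3f} bytes/FLOP (FP16)")
```

Both land around 0.02 bytes per FLOP, which is why keeping these accelerators fed is as much a memory and interconnect problem as a compute one.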
The philosophy behind each system
What makes these systems interesting is how they reflect the strategies of their makers.
Huawei is leaning heavily on its Ascend 950 chips and a custom interconnect called UnifiedBus 2.0; the emphasis is on building out compute density at an extraordinary scale, then networking it together seamlessly.
Nvidia has spent years refining its DGX line and now offers the DGX SuperPOD as a turnkey solution, integrating GPUs, CPUs, networking, and storage into a balanced environment for enterprises and research labs.
AMD is preparing to join the conversation with the Instinct MegaPod, which aims to scale around its future MI500 accelerators and a brand-new networking fabric called UALink.
While Huawei talks about exaFLOP levels of performance today, Nvidia highlights a stable, battle-tested platform, and AMD pitches itself as the challenger offering superior scalability down the road.
At the heart of these clusters are heavy-duty processors built to deliver immense computational power and handle data-intensive AI and HPC workloads.
Huawei's Atlas 950 SuperPoD is designed around 8,192 Ascend 950 NPUs, with reported peaks of 8 exaFLOPS in FP8 and 16 exaFLOPS in FP4, so it is clearly aimed at handling both training and inference at enormous scale.
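Those cluster-level figures line up with the per-chip numbers in the spec table, assuming peak throughput scales linearly with chip count (real-world efficiency will be lower). A quick sketch:

```python
# Cluster peaks from per-chip figures (spec table above), assuming
# linear scaling across the pod; sustained performance will be lower.
NPUS = 8192                 # Ascend 950 NPUs per Atlas 950 SuperPoD
FP8_PFLOPS_PER_CHIP = 1.0   # FP8 / MXFP8 / HiF8 peak per chip
FP4_PFLOPS_PER_CHIP = 2.0   # MXFP4 peak per chip

fp8_eflops = NPUS * FP8_PFLOPS_PER_CHIP / 1000
fp4_eflops = NPUS * FP4_PFLOPS_PER_CHIP / 1000
print(f"FP8 peak:   {fp8_eflops:.1f} EFLOPS")  # ~8.2, matches the ~8 claim
print(f"MXFP4 peak: {fp4_eflops:.1f} EFLOPS")  # ~16.4, matches the ~16 claim
```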
Nvidia's DGX SuperPOD, built on DGX A100 nodes, delivers a different flavor of performance: with 20 nodes containing a total of 160 A100 GPUs, it looks smaller in terms of chip count.
Still, each GPU is optimized for mixed-precision AI tasks and paired with high-speed InfiniBand to keep latency low.
AMD's MegaPod is still on the horizon, but early details suggest it will pack 256 Instinct MI500 GPUs alongside 64 Zen 7 "Verano" CPUs.
While its raw compute numbers have not yet been revealed, AMD's goal is to rival or exceed Nvidia's efficiency and scale, especially as it uses next-generation PCIe Gen 6 and 3-nanometer networking ASICs.
Feeding thousands of accelerators requires staggering amounts of memory and interconnect speed.
Huawei claims the Atlas 950 SuperPoD carries more than a petabyte of memory, with a total system bandwidth of 16.3 petabytes per second.
That kind of throughput is designed to keep data moving without bottlenecks across its racks of NPUs.
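The headline figure is also consistent with the per-chip interconnect number in the spec table, if the 16.3 PB/s is read as aggregating the 2 TB/s UnifiedBus links across all 8,192 NPUs; that aggregation method is our assumption, not Huawei's published methodology:

```python
# Consistency check on Huawei's 16.3 PB/s headline, assuming it sums
# per-chip interconnect bandwidth (our assumption, not Huawei's).
NPUS = 8192
INTERCONNECT_TBPS_PER_CHIP = 2.0  # TB/s per Ascend chip, from the table

aggregate_pbps = NPUS * INTERCONNECT_TBPS_PER_CHIP / 1000
print(f"Aggregate: {aggregate_pbps:.1f} PB/s")  # ~16.4, close to the 16.3 claim
```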
Nvidia's DGX SuperPOD doesn't attempt to match such headline numbers, instead relying on 52.5 terabytes of system memory and 49 terabytes of high-bandwidth GPU memory, coupled with InfiniBand links of up to 200Gbps per node.
The focus here is on predictable performance for workloads that enterprises already run.
AMD, meanwhile, is targeting the bleeding edge, with its Vulcano switch ASICs offering 102.4Tbps of capacity and 800Gbps of external throughput per tray.
Combined with UALink and Ultra Ethernet, this suggests a system that could surpass current networking limits once it launches in 2027.
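The two Vulcano numbers also hang together arithmetically: if all of a 102.4 Tbps ASIC's capacity were exposed as 800 Gbps ports (an assumption on our part; AMD has not published the port configuration), each switch would drive 128 of them:

```python
# Port math for a 102.4 Tbps switch ASIC, assuming every lane is
# exposed as an 800 Gbps port (hypothetical split, not confirmed by AMD).
ASIC_CAPACITY_GBPS = 102_400  # 102.4 Tbps Vulcano switch capacity
PORT_SPEED_GBPS = 800         # matches the quoted per-tray external rate

print(f"800G ports per ASIC: {ASIC_CAPACITY_GBPS // PORT_SPEED_GBPS}")  # 128
```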
One of the biggest differences between the three contenders lies in how they are physically built.
Huawei's design allows for expansion from a single SuperPoD to half a million Ascend chips in a SuperCluster (at 8,192 NPUs per pod, that implies roughly 64 linked SuperPoDs).
There are also claims that an Atlas 950 configuration could involve more than 100 cabinets spread over a thousand square meters.
Nvidia's DGX SuperPOD takes a more compact approach, with its 20 nodes integrated in a cluster design that enterprises can deploy without needing a stadium-sized data hall.
AMD's MegaPod splits the difference, with two racks of compute trays plus one dedicated networking rack, showing that its architecture is centered on a modular but powerful layout.
In terms of availability, Nvidia's DGX SuperPOD is already on the market, Huawei's Atlas 950 SuperPoD is expected in late 2026, and AMD's MegaPod is planned for 2027.
That said, these machines are fighting very different battles under the same banner of AI supercomputing supremacy.
Huawei's Atlas 950 SuperPoD is a show of brute force, stacking thousands of NPUs and jaw-dropping bandwidth to dominate at scale, but its size and proprietary design may make it harder for outsiders to adopt.
Nvidia's DGX SuperPOD looks smaller on paper, yet it wins on polish and reliability, offering a proven platform that enterprises and research labs can plug in today without waiting on promises.
AMD's MegaPod, still in development, has the makings of a disruptor, with MI500 accelerators and a radical new networking fabric that could tilt the balance once it arrives; until then, it's a challenger talking big.
Via Huawei, Nvidia, TechPowerUp