
Infinite Scale: The Architecture Behind the Azure AI Superfactory

Published by qimuai | First-hand translation



Source: https://blogs.microsoft.com/blog/2025/11/12/infinite-scale-the-architecture-behind-the-azure-ai-superfactory/

Summary:

Microsoft today announced that Fairwater, its next-generation Azure AI datacenter in Atlanta, Georgia, is now online. The facility is interconnected with the first Fairwater site in Wisconsin and with prior generations of AI supercomputers to form the world's first planet-scale AI superfactory, packing compute densely enough to meet explosive global demand for AI computation.

The new datacenter departs from the traditional cloud model with a flat network architecture that can integrate hundreds of thousands of the latest NVIDIA Blackwell GPUs into one massive supercomputer. A two-story building design and liquid-cooling innovations push rack power to roughly 140 kW, maximizing compute density while reducing latency. The closed-loop cooling system needs only a one-time fill of water, equivalent to the annual consumption of about 20 homes, and is designed to run for more than six years, making it highly resource-efficient.

On the power side, the Atlanta site was chosen for resilient utility infrastructure that delivers highly available electricity, achieving four-nines (99.99%) availability at roughly the cost of a three-nines (99.9%) design. Microsoft has also developed coordinated hardware and software power-management schemes that use supplementary workloads, GPU-level power capping and on-site energy storage to keep the grid stable.

A dedicated AI WAN backbone stitches Azure datacenters around the world into an elastic pool of compute, allowing diverse AI workloads such as pre-training, fine-tuning and reinforcement learning to be scheduled dynamically. An open networking ecosystem and the SONiC switch operating system avoid vendor lock-in while letting hundreds of thousands of GPUs cooperate on a single flat network.

The Fairwater architecture marks a step change in cloud infrastructure, giving developers worldwide AI compute that spans frontier model development through industry applications and helping make AI more broadly accessible.


Original article:

Today, we are unveiling the next Fairwater site of Azure AI datacenters in Atlanta, Georgia. This purpose-built datacenter is connected to our first Fairwater site in Wisconsin, prior generations of AI supercomputers and the broader Azure global datacenter footprint to create the world’s first planet-scale AI superfactory. By packing computing power more densely than ever before, each Fairwater site is built to efficiently meet unprecedented demand for AI compute, push the frontiers of model intelligence and empower every person and organization on the planet to achieve more.
To meet this demand, we have reinvented how we design AI datacenters and the systems we run inside of them. Fairwater is a departure from the traditional cloud datacenter model and uses a single flat network that can integrate hundreds of thousands of the latest NVIDIA GB200 and GB300 GPUs into a massive supercomputer. These innovations are a product of decades of experience designing datacenters and networks, as well as learnings from supporting some of the largest AI training jobs on the planet.
While the Fairwater datacenter design is well suited for training the next generation of frontier models, it is also built with fungibility in mind. Training has evolved from a single monolithic job into a range of workloads with different requirements (such as pre-training, fine-tuning, reinforcement learning and synthetic data generation). Microsoft has deployed a dedicated AI WAN backbone to integrate each Fairwater site into a broader elastic system that enables dynamic allocation of diverse AI workloads and maximizes GPU utilization of the combined system.
Below, we walk through some of the exciting technical innovations that support Fairwater, from the way we build datacenters to the networking within and across the sites.
Maximum density of compute
Modern AI infrastructure is increasingly constrained by the laws of physics. The speed of light is now a key bottleneck in our ability to tightly integrate accelerators, compute and storage with performant latency. Fairwater is designed to maximize the density of compute to minimize latency within and across racks and maximize system performance.
One of the key levers for driving density is improving cooling at scale. AI servers in the Fairwater datacenters are connected to a facility-wide cooling system designed for longevity, with a closed-loop approach that reuses the liquid continuously after the initial fill with no evaporation. The water used in the initial fill is equivalent to what 20 homes consume in a year and is only replaced if water chemistry indicates it is needed (it is designed for 6-plus years), making it extremely efficient and sustainable.
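
For a rough sense of scale, here is a back-of-the-envelope conversion of that one-time fill (the per-household water figure below is an assumption used only for illustration; the post says only that the fill equals what 20 homes consume in a year):

```python
# Back-of-the-envelope estimate of the closed-loop system's one-time water fill.
# The per-household figure is an assumption for illustration; the post says only
# that the fill equals what 20 homes consume in a year.
GALLONS_PER_HOME_PER_DAY = 300      # assumed typical U.S. household usage
LITERS_PER_GALLON = 3.785

fill_gallons = 20 * GALLONS_PER_HOME_PER_DAY * 365
fill_m3 = fill_gallons * LITERS_PER_GALLON / 1000   # liters -> cubic meters

print(f"One-time fill: ~{fill_gallons:,.0f} gallons (~{fill_m3:,.0f} m^3)")
# -> roughly 2.2 million gallons, on the order of 8,000 cubic meters, once,
#    for 6-plus years of operation.
```
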
Liquid-based cooling also provides much higher heat transfer, enabling us to maximize rack and row-level power (~140kW per rack, 1,360 kW per row) to pack compute as densely as possible inside the datacenter. State-of-the-art cooling also helps us maximize utilization of this dense compute in steady-state operations, enabling large training jobs to run performantly at high scale. After cycling through a system of cold plate paths across the GPU fleet, heat is dissipated by one of the largest chiller plants on the planet.
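
Putting the published figures together gives a feel for the layout. A quick worked calculation (the 72-GPU rack size is taken from the scale-up section below, and the per-slot number folds in CPUs, NVLink switches and other rack overhead, so it is a rough budget rather than a chip spec):

```python
# Worked arithmetic on the published density figures.
rack_power_kw = 140        # ~140 kW per rack (from the post)
row_power_kw = 1_360       # 1,360 kW per row (from the post)
gpus_per_rack = 72         # rack size cited in the scale-up section below

racks_per_row = row_power_kw / rack_power_kw
power_per_gpu_slot_kw = rack_power_kw / gpus_per_rack   # includes CPUs, switches, etc.

print(f"Racks per row: ~{racks_per_row:.1f}")                         # ~9.7
print(f"Power budget per GPU slot: ~{power_per_gpu_slot_kw:.2f} kW")  # ~1.94 kW
```
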
Another way we are driving compute density is with a two-story datacenter building design. Many AI workloads are very sensitive to latency, which means cable run lengths can meaningfully impact cluster performance. Every GPU in Fairwater is connected to every other GPU, so the two-story datacenter building approach allows for placement of racks in three dimensions to minimize cable lengths, which in turn improves latency, bandwidth, reliability and cost.
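
The latency sensitivity follows directly from propagation physics. A minimal sketch (the roughly-two-thirds-of-c speed of light in fiber is general knowledge, not a figure from the post):

```python
# How cable run length turns into latency: light travels at roughly two thirds
# of c in optical fiber.
SPEED_OF_LIGHT_M_PER_S = 3.0e8
FIBER_VELOCITY_FACTOR = 0.67

def one_way_delay_ns(cable_meters: float) -> float:
    """Propagation delay in nanoseconds, ignoring switch and serdes time."""
    return cable_meters / (SPEED_OF_LIGHT_M_PER_S * FIBER_VELOCITY_FACTOR) * 1e9

for meters in (10, 50, 100):
    print(f"{meters:>3} m of fiber ~ {one_way_delay_ns(meters):4.0f} ns one way")
# Shortening runs by tens of meters saves hundreds of nanoseconds per hop, which
# adds up across the millions of messages in a tightly synchronized training step.
```
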
High-availability, low-cost power
We are pushing the envelope in serving this compute with cost-efficient, reliable power. The Atlanta site was selected with resilient utility power in mind and is capable of achieving 4×9 availability at 3×9 cost. By securing highly available grid power, we can also forgo traditional resiliency approaches for the GPU fleet (such as on-site generation, UPS systems and dual-corded distribution), driving cost savings for customers and faster time-to-market for Microsoft.
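
For readers less familiar with the "nines" shorthand, the downtime budgets behind "4x9 availability at 3x9 cost" work out as follows (a generic conversion, nothing Azure-specific):

```python
# Convert availability "nines" into an allowed-downtime budget per year.
HOURS_PER_YEAR = 24 * 365

for label, availability in (("3x9 (99.9%)", 0.999), ("4x9 (99.99%)", 0.9999)):
    downtime_h = (1 - availability) * HOURS_PER_YEAR
    print(f"{label}: ~{downtime_h:.1f} hours (~{downtime_h * 60:.0f} min) of downtime per year")
# 99.9%  -> ~8.8 hours/year; 99.99% -> ~0.9 hours (~53 minutes)/year.
# The claim is the tighter budget while paying roughly the looser design's cost,
# by leaning on resilient utility power instead of on-site generation and UPS.
```
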
We have also worked with our industry partners to codevelop power-management solutions to mitigate power oscillations created by large scale jobs, a growing challenge in maintaining grid stability as AI demand scales. This includes a software-driven solution that introduces supplementary workloads during periods of reduced activity, a hardware-driven solution where the GPUs enforce their own power thresholds and an on-site energy storage solution to further mask power fluctuations without utilizing excess power.
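
As a purely hypothetical illustration of the software-driven approach (the control band, numbers and function names below are invented, and a real system would coordinate this with the hardware power caps and on-site storage mentioned above):

```python
# Hypothetical sketch of software-side power smoothing: when the training job's
# draw dips between synchronous steps, schedule supplementary work so the site's
# draw stays inside a band the grid can tolerate; when it spikes, fall back to
# GPU power capping. Illustrative only, not Microsoft's implementation.
from dataclasses import dataclass

@dataclass
class PowerBand:
    floor_mw: float     # minimum draw the grid operator expects from the site
    ceiling_mw: float   # contracted maximum draw

def plan_interval(training_draw_mw: float, band: PowerBand) -> dict:
    """Decide what to do this control interval to keep total draw inside the band."""
    if training_draw_mw < band.floor_mw:
        return {"action": "add_supplementary_load",
                "mw": round(band.floor_mw - training_draw_mw, 1)}
    if training_draw_mw > band.ceiling_mw:
        return {"action": "cap_gpu_power",
                "mw": round(training_draw_mw - band.ceiling_mw, 1)}
    return {"action": "hold", "mw": 0.0}

band = PowerBand(floor_mw=220.0, ceiling_mw=300.0)   # invented numbers
for draw in (180.0, 250.0, 320.0):
    print(f"{draw:5.1f} MW draw -> {plan_interval(draw, band)}")
```
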
Cutting-edge accelerators and networking systems
Fairwater’s world-class datacenter design is powered by purpose-built servers, cutting-edge AI accelerators and novel networking systems. Each Fairwater datacenter runs a single, coherent cluster of interconnected NVIDIA Blackwell GPUs, with an advanced network architecture that can scale reliably beyond traditional Clos network limits with current-gen switches (hundreds of thousands of GPUs on a single flat network). This required innovation across scale-up networking, scale-out networking and network protocols.
In terms of scale-up, each rack of AI accelerators houses up to 72 NVIDIA Blackwell GPUs, connected via NVLink for ultra-low-latency communication within the rack. Blackwell accelerators provide the highest compute density available today, with support for low-precision number formats like FP4 to increase total FLOPS and enable efficient memory use. Each rack provides 1.8 TB/s of GPU-to-GPU bandwidth, with over 14 TB of pooled memory available to each GPU.
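
Two quick consequences of those numbers, as derived arithmetic (the 1-trillion-parameter model size below is chosen only for illustration): the pooled-memory figure implies roughly 200 GB of HBM per GPU, and FP4 cuts weight storage to half a byte per parameter.

```python
# Derived arithmetic from the scale-up figures quoted in the post.
gpus_per_rack = 72
pooled_memory_tb = 14              # "over 14 TB" pooled per rack

hbm_per_gpu_gb = pooled_memory_tb * 1024 / gpus_per_rack
print(f"Implied HBM per GPU: ~{hbm_per_gpu_gb:.0f} GB")        # ~199 GB

# Why low-precision formats matter for memory: weight storage per number format,
# for a hypothetical 1-trillion-parameter model (size chosen for illustration).
params = 1e12
for fmt, bytes_per_param in (("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)):
    print(f"{fmt}: ~{params * bytes_per_param / 1e12:.1f} TB of weights")
# FP4 holds the same weights in a quarter of the FP16 footprint, which is part of
# how denser models stay resident in a single rack's pooled memory.
```
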
These racks then use scale-out networking to create pods and clusters that enable all GPUs to function as a single supercomputer with minimal hop counts. We achieve this with a two-tier, Ethernet-based backend network that supports massive cluster sizes with 800 Gbps GPU-to-GPU connectivity. Relying on the broad Ethernet ecosystem and SONiC (Software for Open Network in the Cloud, our own operating system for our network switches) also helps us avoid vendor lock-in and manage cost, as we can use commodity hardware instead of proprietary solutions.
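
To see why two tiers can reach that scale, here is an idealized leaf-spine capacity calculation (a simplified model with assumed switch radices, not Azure's actual design parameters):

```python
# Idealized two-tier (leaf-spine) capacity model: each leaf switch splits its
# ports half toward GPUs and half toward spines, and every leaf connects to every
# spine. Real designs add oversubscription, rail optimization and failure domains
# that this ignores; the radices are illustrative, not disclosed Azure specs.
def max_endpoints_two_tier(switch_radix: int) -> int:
    gpu_facing_ports = switch_radix // 2   # downlinks per leaf
    max_leaves = switch_radix              # one uplink from each leaf per spine port
    return gpu_facing_ports * max_leaves

for radix in (64, 128, 512):
    print(f"switch radix {radix:>3}: up to {max_endpoints_two_tier(radix):,} endpoints")
# High-radix switches are what put "hundreds of thousands of GPUs on a single
# flat network" within reach of only two switching tiers.
```
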
Improvements across packet trimming, packet spray and high-frequency telemetry are core components of our optimized AI network. We are also working to enable deeper control and optimization of network routes. Together, these technologies deliver advanced congestion control, rapid detection and retransmission, and agile load balancing, ensuring ultra-reliable, low-latency performance for modern AI workloads.
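
A minimal sketch of how packet spraying differs from conventional per-flow ECMP load balancing (illustrative pseudologic, not the production data path):

```python
# Per-flow ECMP pins every packet of a flow to one path, so a single elephant flow
# can hot-spot one link while parallel links idle. Spraying rotates packets of the
# same flow across all equal-cost paths, trading possible reordering (which the
# transport must tolerate) for even link utilization. Illustrative only.
import itertools

PATHS = ["uplink-0", "uplink-1", "uplink-2", "uplink-3"]   # equal-cost paths

def ecmp_per_flow(flow_id: str) -> str:
    return PATHS[hash(flow_id) % len(PATHS)]    # same path for every packet of the flow

_rotation = itertools.cycle(PATHS)
def spray_per_packet(_flow_id: str) -> str:
    return next(_rotation)                      # each packet takes the next path

flow = "gpu017->gpu402:allreduce"
print("per-flow ECMP:", [ecmp_per_flow(flow) for _ in range(4)])
print("packet spray :", [spray_per_packet(flow) for _ in range(4)])
```
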
Planet scale
Even with these innovations, compute demands for large training jobs (now measured in trillions of parameters) are quickly outpacing the power and space constraints of a single facility. To serve these needs, we have built a dedicated AI WAN optical network to extend Fairwater’s scale-up and scale-out networks. Leveraging our scale and decades of hyperscale expertise, we delivered over 120,000 new fiber miles across the US last year — expanding AI network reach and reliability nationwide.
With this high-performance, high-resiliency backbone, we can directly connect different generations of supercomputers into an AI superfactory that exceeds the capabilities of a single site across geographically diverse locations. This empowers AI developers to tap our broader network of Azure AI datacenters, segmenting traffic based on their needs across scale-up and scale-out networks within a site, as well as across sites via the continent-spanning AI WAN.
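
Conceptually, placement becomes a tiering decision. A heavily simplified, hypothetical sketch of that choice (everything below except the 72-GPU rack size is invented for illustration):

```python
# Hypothetical traffic-segmentation logic: route a job to the network tier that
# matches its size and latency tolerance. Names, thresholds and the per-site GPU
# count are invented; only the 72-GPU rack size comes from the post.
def choose_network_tier(gpus_needed: int, latency_sensitive: bool) -> str:
    RACK_SIZE = 72             # NVLink scale-up domain
    SITE_CAPACITY = 300_000    # assumed GPUs per site, illustrative only
    if gpus_needed <= RACK_SIZE:
        return "scale-up (NVLink within a rack)"
    if gpus_needed <= SITE_CAPACITY:
        return "scale-out (Ethernet backend within one Fairwater site)"
    if latency_sensitive:
        return "scale-out per site, partitioned to tolerate AI WAN latency"
    return "AI WAN (multiple Fairwater sites working as one superfactory)"

for gpus, sensitive in ((48, True), (20_000, True), (1_000_000, False)):
    print(f"{gpus:>9} GPUs -> {choose_network_tier(gpus, sensitive)}")
```
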
This is a meaningful departure from the past, where all traffic had to ride the scale-out network regardless of the requirements of the workload. Not only does it provide customers with fit-for-purpose networking at a more granular level, it also helps create fungibility to maximize the flexibility and utilization of our infrastructure.
Putting it all together
The new Fairwater site in Atlanta represents the next leap in the Azure AI infrastructure and reflects our experience running the largest AI training jobs on the planet. It combines breakthrough innovations in compute density, sustainability and networking systems to efficiently serve the massive demand for computational power we are seeing. It also integrates deeply with other AI datacenters and the broader Azure platform to form the world’s first AI superfactory. Together, these innovations provide a flexible, fit-for-purpose infrastructure that can serve the full spectrum of modern AI workloads and empower every person and organization on the planet to achieve more. For our customers, this means easier integration of AI into every workflow and the ability to create innovative AI solutions that were previously unattainable.
Find out more about how Microsoft Azure can help you integrate AI to streamline and strengthen development lifecycles here.
Scott Guthrie is responsible for hyperscale cloud computing solutions and services including Azure, Microsoft’s cloud computing platform, generative AI solutions, data platforms and information and cybersecurity. These platforms and services help organizations worldwide solve urgent challenges and drive long-term transformation.
Editor’s note: An update was made to more clearly explain how we optimize our network.
