Alibaba has revealed the layout of its LLM training data center, which consists of an Ethernet-based network in which each host contains eight GPUs and nine NICs, each NIC with two 200 Gb/s ports.
The tech giant, which also offers one of the best large language models (LLMs) in its 110-billion-parameter Qwen model, says this design has been used in production for eight months and aims to maximize a GPU's PCIe capabilities, boosting the network's send/receive capacity.
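As a quick sanity check, the aggregate network bandwidth per host follows directly from those figures (the arithmetic below is ours, not Alibaba's):

```python
# Per-host network capacity implied by the article's figures:
# 9 NICs per host, 2 ports per NIC, 200 Gb/s per port.
NICS_PER_HOST = 9
PORTS_PER_NIC = 2
PORT_GBPS = 200

host_gbps = NICS_PER_HOST * PORTS_PER_NIC * PORT_GBPS
print(f"Aggregate per-host bandwidth: {host_gbps} Gb/s ({host_gbps / 1000:.1f} Tb/s)")
# -> Aggregate per-host bandwidth: 3600 Gb/s (3.6 Tb/s)
```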
Another speed-boosting feature is the use of NVLink for the intra-host network, which provides more bandwidth within a host than is available between hosts. Each port on a NIC connects to a different top-of-rack (ToR) switch, avoiding a single point of failure, in a design Alibaba calls rail-optimized.
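A minimal sketch of the dual-ToR idea, using hypothetical switch names (the article does not describe the wiring at this level of detail): each NIC's two ports attach to two different ToR switches, so a single switch failure never strands a NIC.

```python
from dataclasses import dataclass

@dataclass
class NicWiring:
    nic_id: int
    port_a_tor: str  # ToR switch reached by the NIC's first port
    port_b_tor: str  # ToR switch reached by the NIC's second port

def wire_host(nics_per_host: int = 9) -> list[NicWiring]:
    """Attach each NIC's two ports to two distinct ToR switches.

    Hypothetical rail-style naming: NIC i on every host lands on the
    switch pair for rail i, so the same NIC index across hosts shares
    a rail of the network.
    """
    return [
        NicWiring(i, f"tor-rail{i}-a", f"tor-rail{i}-b")
        for i in range(nics_per_host)
    ]

for nic in wire_host():
    # Losing either ToR switch still leaves this NIC one live uplink.
    assert nic.port_a_tor != nic.port_b_tor
```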
Each module contains 15,000 GPUs
A new type of network is required because traffic patterns in LLM training differ from those of general cloud computing: the traffic has low entropy and arrives in bursts. There is also heightened sensitivity to single-point failures and glitches.
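To make "low entropy" concrete, here is a small illustration with made-up flow distributions: LLM training traffic is dominated by a few large, synchronized flows, so it has far less entropy across flows than typical cloud traffic, which spreads over many small connections.

```python
import math

def flow_entropy(bytes_per_flow: list[float]) -> float:
    """Shannon entropy (in bits) of how traffic spreads across flows."""
    total = sum(bytes_per_flow)
    probs = [b / total for b in bytes_per_flow if b > 0]
    return -sum(p * math.log2(p) for p in probs)

# Made-up distributions for illustration only.
llm_flows = [1000] * 4    # a handful of synchronized bulk flows
cloud_flows = [10] * 400  # many small, uncorrelated flows

print(f"LLM-style traffic:   {flow_entropy(llm_flows):.1f} bits")    # 2.0
print(f"cloud-style traffic: {flow_entropy(cloud_flows):.1f} bits")  # ~8.6
```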
“Based on the unique characteristics of LLM training, we decided to build a new network architecture specifically for LLM training. It should achieve the following goals: scalability, high performance, and single-ToR fault tolerance,” the company stated.
Another part of the infrastructure revealed was the cooling mechanism. Because no supplier could provide a solution to keep the chips below 105°C, the temperature at which switches start to shut down, Alibaba designed and built its own vapor chamber heat sink, along with more wick pillars at the center of the chips to carry heat away more efficiently.
The LLM array design is encapsulated in modules of 15,000 GPUs, each of which fits in a single data center building. “All data center buildings in operation on Alibaba Cloud have a total power constraint of 18 MW, and an 18 MW building can house approximately 15,000 GPUs. Combined with HPN, each individual building perfectly houses a complete module, keeping the predominant links within the same building,” Alibaba wrote.
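Taking the quoted figures at face value, the implied building-level power budget per GPU works out as below (our arithmetic; the roughly 1,200 W covers each GPU's share of hosts, network, and cooling, not the accelerator alone):

```python
# Building power divided by GPU count, using the figures Alibaba quotes.
BUILDING_POWER_W = 18_000_000  # 18 MW constraint per data center building
GPUS_PER_MODULE = 15_000       # one complete module per building

print(f"Power budget per GPU: {BUILDING_POWER_W / GPUS_PER_MODULE:.0f} W")
# -> Power budget per GPU: 1200 W
```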
Alibaba also wrote that it expects model parameters to keep growing by an order of magnitude in the coming years, from one trillion to 10 trillion, and that the new architecture is designed to support this growth and scale to 100,000 GPUs.
Via The Register