Meta has revealed details about its AI training infrastructure, disclosing that it currently relies on almost 50,000 Nvidia H100 GPUs to train its open source Llama 3 LLM.
The company says it will have more than 350,000 Nvidia H100 GPUs in service by the end of 2024 and, once hardware from other sources is included, computing power equivalent to nearly 600,000 H100s.
The figures were revealed as Meta shared details about its two 24,576-GPU data-center-scale clusters.
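The fleet figures above are simple to sanity-check. The sketch below reproduces the arithmetic; the cluster sizes come from Meta's announcement, while the per-GPU throughput figure is Nvidia's published dense BF16 number for the H100 SXM and is used here only to convey rough scale:

```python
# Back-of-envelope arithmetic for the fleet figures above.
CLUSTER_SIZE = 24_576          # GPUs per cluster, per Meta's announcement
NUM_CLUSTERS = 2               # the two clusters described in this article

gpus_today = CLUSTER_SIZE * NUM_CLUSTERS
print(f"GPUs across both clusters: {gpus_today:,}")  # 49,152 -> "almost 50,000"

H100_BF16_TFLOPS = 989         # H100 SXM, dense BF16 (assumption: no sparsity)
fleet_2024 = 350_000           # H100s Meta says it will field by end of 2024
h100_equivalents = 600_000     # total compute expressed in H100 equivalents

print(f"End-2024 H100 fleet:  ~{fleet_2024 * H100_BF16_TFLOPS / 1e6:.0f} EFLOPS (BF16)")
print(f"Incl. other hardware: ~{h100_equivalents * H100_BF16_TFLOPS / 1e6:.0f} EFLOPS (BF16)")
```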
The company explained: “These clusters support our current and next-generation AI models, including Llama 3, the successor to Llama 2, our publicly released LLM, as well as AI research and development in GenAI and other areas.”
The clusters are built on Grand Teton (named after the national park in Wyoming), an open GPU hardware platform designed in-house. Grand Teton integrates power, control, compute, and fabric interfaces into a single chassis for better overall performance and scalability.
The clusters also feature high-performance network fabrics, allowing them to support larger and more complex models than before. Meta says one cluster uses a remote direct memory access (RDMA) over converged Ethernet fabric based on the Arista 7800, while the other features an Nvidia Quantum2 InfiniBand fabric. Both solutions interconnect 400 Gbps endpoints.
“The efficiency of the high-performance network fabrics within these clusters, some of the key storage decisions, combined with the 24,576 NVIDIA Tensor Core H100 GPUs in each, allow both cluster versions to support larger and more complex models than could be supported in the RSC [Research SuperCluster] and pave the way for advancements in GenAI product development and AI research,” Meta said.
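To put the 400 Gbps endpoints in perspective, a rough calculation shows how long a single link would take to move the weights or gradients of a large model. The model size and data type below are illustrative assumptions, not figures from Meta's post:

```python
# What a 400 Gbps endpoint means for training traffic (illustrative).
LINK_GBPS = 400                          # per-endpoint bandwidth in both fabrics
link_bytes_per_s = LINK_GBPS / 8 * 1e9   # 400 Gb/s -> 50 GB/s

params = 70e9                        # a Llama-2-70B-scale model (assumption)
bytes_per_param = 2                  # BF16
payload = params * bytes_per_param   # ~140 GB of weights or gradients

# Ideal, zero-overhead transfer of one full gradient copy over a single
# link; real collectives (ring/tree all-reduce) move roughly 2x the data
# but split it across many links in parallel.
print(f"One full gradient copy over one link: {payload / link_bytes_per_s:.1f} s")
```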
Storage is another critical aspect of AI training, and Meta has developed a Linux Filesystem in Userspace (FUSE) API backed by a version of its 'Tectonic' distributed storage solution optimized for flash media. This solution reportedly enables thousands of GPUs to save and load checkpoints in a synchronized fashion, as well as “providing flexible, high-performance exabyte-scale storage required for data loading.”
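Meta hasn't published the code behind this, but the pattern it describes is familiar: every rank writes its own shard to a shared flash-backed mount, then all ranks rendezvous before training resumes. A minimal sketch in PyTorch might look like the following; the mount point is hypothetical, and this is illustrative code, not Meta's FUSE/Tectonic implementation:

```python
import os
import torch
import torch.distributed as dist

CHECKPOINT_DIR = "/mnt/tectonic/checkpoints"  # hypothetical shared mount

def save_checkpoint(model, optimizer, step):
    """Sharded, synchronized checkpoint save across all ranks."""
    rank = dist.get_rank()
    path = os.path.join(CHECKPOINT_DIR, f"step{step:09d}")
    os.makedirs(path, exist_ok=True)

    # Each rank persists only its local shard; thousands of writers hit
    # the storage layer at once, which is the load a flash-optimized,
    # exabyte-scale backend is built to absorb.
    torch.save(
        {"model": model.state_dict(), "optim": optimizer.state_dict()},
        os.path.join(path, f"rank{rank:05d}.pt"),
    )

    # Block until every rank has finished writing, so the checkpoint is
    # globally consistent before training continues.
    dist.barrier()
```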
While the company's current AI infrastructure leans heavily on Nvidia GPUs, it's unclear how long that will remain the case. As Meta continues to build out its AI capabilities, it will inevitably focus on developing and producing more of its own hardware. The company has already announced plans to deploy its own AI chip, codenamed Artemis, in servers this year, and it previously revealed that it was preparing to build custom RISC-V silicon.