The world's most powerful supercomputer has used just over 8% of its GPUs to train a large language model (LLM) with a trillion parameters, a size comparable to OpenAI's GPT-4.
Frontier, based at Oak Ridge National Laboratory, used 3,072 of its AMD Instinct MI250X GPUs to train the trillion-parameter model, and 1,024 of them (roughly 2.5%) to train a 175-billion-parameter model, essentially the same size as ChatGPT.
The researchers needed a minimum of 14TB of RAM to achieve these results, according to their paper, but each MI250X GPU has only 64GB of VRAM, so they had to spread the model across many GPUs. That, however, introduced a parallelism challenge of its own: the GPUs must communicate with one another quickly and efficiently, and the communication overhead grows as the total pool of resources used to train the LLM increases.
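A quick back-of-the-envelope check shows why so many GPUs are needed just to hold the model. The sketch below only uses the two figures quoted above (14TB of model state, 64GB per GPU); the byte-per-parameter reading is an illustrative assumption, not a number from the paper.

```python
import math

# Illustrative sketch based on the figures quoted in the article:
# ~14 TB of model state for the trillion-parameter model, 64 GB of VRAM per MI250X GPU.
total_model_state_bytes = 14e12
memory_per_gpu_bytes = 64e9

# Minimum number of GPUs needed just to hold the model state,
# ignoring activations, communication buffers and other overhead.
min_gpus = math.ceil(total_model_state_bytes / memory_per_gpu_bytes)
print(f"At least {min_gpus} GPUs are needed to fit ~14 TB of model state")  # -> 219

# Equivalently, 14 TB across 1e12 parameters works out to ~14 bytes per parameter,
# roughly the footprint of mixed-precision weights plus optimizer state.
print(f"~{total_model_state_bytes / 1e12:.0f} bytes per parameter")
```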
Putting the world's most powerful supercomputer to work
LLMs are typically not trained on supercomputers, but rather on specialized servers, and they require many more GPUs. ChatGPT, for example, was trained on more than 20,000 GPUs, according to TrendForce. But the researchers wanted to demonstrate whether a supercomputer could train an LLM much more quickly and efficiently by taking advantage of several techniques made possible by its architecture.
The scientists used a combination of tensor parallelism (groups of GPUs sharing parts of the same tensor) and pipeline parallelism (groups of GPUs hosting neighboring layers of the model). They also employed data parallelism to process a large number of tokens at once across a larger pool of computing resources. The combined effect was a much shorter training time.
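As a minimal single-process sketch of the first two ideas (NumPy arrays standing in for GPU shards; this is a conceptual illustration, not the team's actual training setup), tensor parallelism splits one layer's weight matrix across workers, while data parallelism gives each worker a different slice of the batch and averages their gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))        # a batch of 8 inputs, hidden size 16
W = rng.standard_normal((16, 32))       # one layer's weight matrix

# --- Tensor parallelism: split the weight matrix column-wise across 2 "GPUs".
# Each shard computes part of the output; concatenating the pieces recovers x @ W.
W_shard_0, W_shard_1 = np.hsplit(W, 2)
y_parallel = np.concatenate([x @ W_shard_0, x @ W_shard_1], axis=1)
assert np.allclose(y_parallel, x @ W)

# --- Data parallelism: each "GPU" keeps a full copy of W but sees a different
# slice of the batch; local gradients are averaged (an all-reduce) before the update.
x_half_0, x_half_1 = np.vsplit(x, 2)
grad_0 = x_half_0.T @ np.ones((4, 32)) / 4   # toy gradient of mean(x @ W) on GPU 0's half
grad_1 = x_half_1.T @ np.ones((4, 32)) / 4   # same on GPU 1's half
grad_avg = (grad_0 + grad_1) / 2             # equivalent to one all-reduce step
full_grad = x.T @ np.ones((8, 32)) / 8       # gradient computed on the whole batch
assert np.allclose(grad_avg, full_grad)

# Pipeline parallelism (not shown) would instead place consecutive layers on
# different GPUs and stream micro-batches through them.
```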
For the 22 billion parameter model, they achieved 38.38% of peak performance (73.5 TFLOPS), 36.14% (69.2 TFLOPS) for the 175 billion parameter model, and 31.96% (61.2 TFLOPS) for the 1 trillion parameter model.
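Those percentages are consistent with a baseline of roughly 191.5 TFLOPS of half-precision peak throughput per MI250X GPU (an assumption based on AMD's published specs; the article does not state which peak figure is used):

```python
# Rough consistency check of the quoted efficiency figures, assuming they are
# measured against a ~191.5 TFLOPS half-precision peak per MI250X GPU.
PEAK_TFLOPS = 191.5

for model, achieved_tflops in [("22B", 73.5), ("175B", 69.2), ("1T", 61.2)]:
    print(f"{model}: {achieved_tflops / PEAK_TFLOPS:.2%} of peak")
# -> 22B: 38.38%, 175B: 36.14%, 1T: 31.96%, matching the reported figures.
```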
They also achieved 100% weak scaling efficiency, as well as strong scaling efficiency of 89.93% for the 175 billion parameter model and 87.05% for the 1 trillion parameter model.
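The article does not define these terms. Roughly, weak scaling keeps the per-GPU workload fixed as GPUs are added, while strong scaling fixes the total problem size and adds GPUs. The snippet below is a generic way of computing both efficiencies, with made-up numbers, not the paper's exact methodology:

```python
def weak_scaling_efficiency(throughput_1, throughput_n, n):
    """Per-GPU workload fixed: ideal aggregate throughput grows linearly with n GPUs."""
    return throughput_n / (n * throughput_1)

def strong_scaling_efficiency(time_1, time_n, n):
    """Total problem fixed: ideal time to solution shrinks by a factor of n."""
    return time_1 / (n * time_n)

# Example with made-up numbers: 8x more GPUs but only 7.2x more throughput
# corresponds to 90% weak-scaling efficiency.
print(weak_scaling_efficiency(throughput_1=1.0, throughput_n=7.2, n=8))  # 0.9
```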
Although the researchers were open about the computing resources used and the techniques involved, they did not say how long it took to train an LLM this way.
TechRadar Pro asked the researchers about the training times, but they had not responded at the time of writing.