AMD, Broadcom, Cisco, Google, Hewlett Packard Enterprise (HPE), Intel, Meta and Microsoft are combining their expertise to create an open industry standard for an AI chip technology called Ultra Accelerator Link. The setup will improve high-speed, low-latency communications between AI accelerator chips in data centers.
An open standard will improve the performance of AI/ML clusters across the industry, meaning that no single company will disproportionately capitalize on demand for the latest and greatest AI/ML, high-performance computing, and cloud applications.
Notably absent from the so-called UALink Promoter Group are NVIDIA and Amazon Web Services. In fact, the Promoter Group likely intends for its new interconnect standard to topple the two companies' dominance in the AI hardware and cloud markets, respectively.
The UALink Promoter Group expects to establish a consortium of companies to manage the continued development of the UALink standard in the third quarter of 2024; companies that join will be given access to UALink 1.0 around the same time. A higher-bandwidth version is planned for release in the fourth quarter of 2024.
SEE: Gartner predicts global chip revenue will rise 33% in 2024
What is UALink and who will it benefit?
Ultra Accelerator Link, or UALink, is a defined way of connecting AI accelerator chips into servers to enable faster and more efficient communication between them.
AI accelerator chips, such as GPUs, TPUs, and other specialized AI processors, are the core of all AI technologies. Each can perform a large number of complex operations simultaneously. However, to achieve the high workloads needed to train, run, and optimize AI models, they need to be connected. The faster the data transfer between accelerator chips, the faster they can access and process the necessary data and the more efficiently they can share workloads.
The first standard to be released by the UALink Promoter Group, UALink 1.0, will allow up to 1,024 GPU AI accelerators, distributed across one or more racks, to be connected to a single Ultra Accelerator Switch. According to the UALink Promoter Group, this will “enable direct loads and stores between memory connected to AI accelerators and generally increase speed while reducing data transfer latency compared to existing interconnect specifications.” It will also simplify scaling up workloads as demands increase.
While full technical details of UALink have yet to be released, group members said in a briefing on Wednesday that UALink 1.0 would draw on AMD's Infinity Fabric architecture, while the Ultra Ethernet Consortium would cover connecting multiple “pods,” or switches. Its release will benefit system OEMs, IT professionals, and system integrators looking to configure their data centers in a way that supports high speeds, low latency, and scalability.
What companies joined the UALink Promoter Group?
- AMD.
- Broadcom.
- Cisco.
- Google.
- HPE.
- Intel.
- Meta.
- Microsoft.
Microsoft, Meta, and Google have spent billions of dollars on NVIDIA GPUs for their respective cloud and AI technologies, including Meta's Llama models, Google Cloud, and Microsoft Azure. However, supporting NVIDIA's continued hardware dominance does not bode well for their respective futures in the space, so it is wise for them to explore an exit strategy.
A standardized UALink switch will allow vendors other than NVIDIA to offer compatible accelerators, giving AI companies a range of alternative hardware options on which to build their systems without suffering vendor lock-in.
This benefits many of the group's companies that have developed, or are developing, their own accelerators. Google has a custom TPU and the Axion processor; Intel has Gaudi; Microsoft has its Maia and Cobalt chips; and Meta has MTIA. All of these could be connected via a UALink switch, which will likely be provided by Broadcom.
SEE: Intel Vision 2024 offers a new look at the Gaudi 3 AI chip
Which companies in particular have not joined the UALink Promoter Group?
Nvidia
NVIDIA likely hasn't joined the group for two main reasons: its market dominance in AI-related hardware and the enormous leverage it derives from its high valuation.
The company currently holds an estimated 80% of the GPU market share, and is also a major player in interconnect technology with NVLink, InfiniBand, and Ethernet. NVLink specifically is a GPU-to-GPU interconnect technology that can connect accelerators within one or multiple servers, just like UALink. It is therefore not surprising that NVIDIA does not want to share that innovation with its closest rivals.
Furthermore, according to its latest financial results, NVIDIA is close to surpassing Apple to become the second most valuable company in the world, having more than doubled its valuation to over $2 trillion in just nine months.
The company stands to gain little from the standardization of AI interconnect technology, and its current position gives it little incentive to join. Time will tell whether the first UALink products can challenge its crown, or whether NVIDIA's offerings will remain too integral to data center operations to be toppled.
SEE: Supercomputing '23: NVIDIA's high-performance chips power AI workloads
Amazon Web Services
AWS is the only one of the major public cloud providers not to have joined the UALink Promoter Group. As with NVIDIA, this is likely related to its influence as the current cloud market leader and the fact that it is working on its own accelerator chip families, such as Trainium and Inferentia. Furthermore, with a strong NVIDIA partnership spanning more than 12 years, AWS may also be content to stand behind NVIDIA in this area.
Why are open standards in AI necessary?
Open standards help prevent the disproportionate dominance of an industry by one company that happened to be in the right place at the right time. The UALink Promoter Group will allow multiple companies to collaborate on essential hardware for AI data centers so that no single organization can control it all.
This is not the first instance of this kind of pushback in AI. In December, more than 50 organizations partnered to form the AI Alliance to promote responsible, open-source AI and help prevent closed-model developers from gaining too much power.
Knowledge sharing also serves to accelerate advances in AI performance on an industry-wide scale. Demand for AI compute is continually growing, and for technology companies to keep up, they need best-in-class scalability. The UALink standard will provide a “robust, low-latency, and efficient scale-up network that can easily add computing resources to a single instance,” according to the group.
Forrest Norrod, executive vice president and general manager of AMD's Data Center Solutions Group, said in a press release: “The work being done by the companies in UALink to create an open, high-performance, and scalable accelerator fabric is critical for the future of AI.
“Together, we bring extensive experience in creating large-scale AI and high-performance computing solutions that are based on open standards, efficiency, and robust ecosystem support. AMD is committed to contributing our expertise, technologies, and capabilities to the group, as well as other open industry efforts to advance all aspects of AI technology and solidify an open AI ecosystem.”