There is a common misconception that one GPU cloud is much like another, but that is not the case. Each is designed with different technologies and architectures, and each presents its own challenges, advantages and disadvantages.
Today, the most advanced AI cloud operators are developing new GPU data center models that deploy NVIDIA H100 GPUs in Kubernetes or other virtualized environments to achieve new levels of performance for AI processing.
For the customer, the specifications look basically the same. AI cloud computing service providers boast about NVIDIA HGX H100 systems and the fact that they offer 3.2Tbps of InfiniBand connectivity, but that is because they all use the same network cards. If all clouds look the same from a technical point of view, customers will make decisions based on price.
But technical specifications alone don't tell the whole story. You can buy a Toyota Corolla with 100 kilowatts of power and a Mercedes with 100 kilowatts of power, but they are not the same. The build quality is different, the cost is different, and the user experience is different.
The same is true for data centers. If the CFO oversaw the architecture, we'd probably have the Toyota Corolla of data centers, and that's fine for some, but given the choice, most organizations will choose the Mercedes. A data center built with cost savings as the goal may work for some customers, but it will be slower and/or offer less cloud storage, and may even be less secure.
GPU Clouds
How GPU clouds are built varies wildly across data centers. A common misconception is that AI infrastructure can simply be built to the NVIDIA DGX reference architecture. But that is the easy part, and it is only the minimum viable baseline. The differentiating factor is how far organizations go beyond it: AI cloud vendors are building highly differentiated solutions by applying management and storage networking on top of that baseline, which can dramatically accelerate AI computing productivity.
Deploying GPU-based data centers as AI infrastructure is a complex and challenging task that requires a deep understanding of how to balance technologies to maximize performance. High-quality management and security systems have a clear impact on the customer experience.
Another important factor governing the performance of AI clouds is the storage architecture. Using dynamically allocated WEKA architectures, NVMe (non-volatile memory express) disks, and GPUDirect Storage can improve execution speed by up to 100% for certain workloads, such as the large language models (LLMs) used in machine learning.
WEKA’s data platform delivers unmatched performance and scalability, particularly for feeding data to large-scale GPU environments. By transforming stagnant data silos into dynamic data pipelines, it effortlessly feeds data-starved GPUs, enabling them to operate with up to 20x greater efficiency and sustainability.
Storage Access
The speed at which you access storage is critical in AI, because you are dealing with very large data sets that are likely to be made up of small chunks. You could be dealing with 100 billion pieces of data spread across a network. Compare that with digital media, where you are dealing with a few thousand assets at most (although each could be hundreds of gigabytes), and it is a very different profile. Traditional hard drives provide good speeds for digital media because the reads are large and sequential. An AI workload, by contrast, is very random: you grab a small piece of data here and another there, and do that millions of times per second.
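To make the contrast concrete, here is a back-of-the-envelope sketch in Python; the asset counts, object sizes and chunk sizes are illustrative assumptions, not measurements from any particular cloud.

```python
# Back-of-the-envelope comparison of two storage access profiles.
# All figures are illustrative assumptions, not benchmarks.

# Digital media: a few thousand large assets, read sequentially.
media_assets = 5_000
media_asset_size_gb = 200                      # hundreds of gigabytes each
print(f"Media: {media_assets:,} objects, "
      f"~{media_assets * media_asset_size_gb / 1_000:,.0f} TB, large sequential reads")

# AI training data: billions of small items, read at random.
ai_items = 100_000_000_000                     # "100 billion pieces of data"
ai_item_size_kb = 64                           # assumed small-chunk size
print(f"AI:    {ai_items:,} objects, "
      f"~{ai_items * ai_item_size_kb / 1_000_000_000:,.0f} TB, small random reads")

# The AI profile is judged on random read operations per second,
# not on raw sequential bandwidth.
```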
Another important difference between AI architectures and traditional storage models is the absence of a data caching requirement: everything is done via direct requests. GPUs communicate directly with disks over the network rather than going through CPUs or the TCP/IP stack. Because the GPUs are connected directly to the network fabric, they bypass most of the network layers and talk straight to storage, which eliminates most of the network delay.
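As a rough illustration of that direct path, the sketch below uses the open-source RAPIDS KvikIO library with CuPy (my choice for illustration; the article does not name a specific API) to read a file straight into GPU memory. Where GPUDirect Storage is enabled, the transfer avoids staging data through a CPU bounce buffer.

```python
# Minimal sketch of a GPU-direct read, assuming the RAPIDS KvikIO and CuPy
# libraries are installed. With GPUDirect Storage enabled, data moves from
# NVMe (or NVMe-over-Fabrics) into GPU memory without a CPU bounce buffer.
import cupy
import kvikio

def load_shard_to_gpu(path: str, nbytes: int) -> cupy.ndarray:
    """Read up to `nbytes` from `path` directly into GPU memory."""
    buf = cupy.empty(nbytes, dtype=cupy.uint8)   # destination buffer on the GPU
    with kvikio.CuFile(path, "r") as f:
        read = f.read(buf)                       # storage-to-GPU read (GDS if available)
    return buf[:read]

# Hypothetical usage: load one 256MB training shard into device memory.
# shard = load_shard_to_gpu("/data/shard-00042.bin", 256 * 1024**2)
```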
AI Infrastructure Architecture
AI infrastructure architecture should be designed to maximize processing power for the next wave of AI workloads. Additionally, network architectures should be designed to be completely uncontended (non-blocking), so workloads never have to compete for bandwidth. Many organizations promise that, but you need a vendor with enough resources to deliver that level of assurance.
Leading AI users, such as Tesla and Meta, are designing cloud infrastructures to meet the needs of different applications, where cloud AI architectures can be dynamically optimized for specific workloads. However, most cloud providers don’t have the luxury of knowing exactly what they’re building for.
Going back to the automotive analogy, most modern transportation networks in major cities around the world were not built with current traffic volumes in mind. In fact, the problem with building a data center based on a current or even projected target is that data centers will reach capacity sooner than you think. Clouds need to be over-provisioned and extremely scalable.
If you don't know exactly what you're building for, you just need to build the biggest, fastest, most secure, and easiest-to-use platform possible. To optimize performance, data centers need a highly distributed storage architecture, with hundreds of disks generating tens of millions of input/output operations per second across all of your servers.
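A quick, purely illustrative calculation shows how a distributed NVMe layer gets to those numbers; the server count, drive count and per-drive IOPS figures below are assumptions, not a specification.

```python
# Illustrative aggregate-IOPS estimate for a distributed NVMe storage layer.
# Per-drive and per-server figures are assumptions; real hardware varies.
storage_servers = 32
nvme_drives_per_server = 12
random_read_iops_per_drive = 100_000      # conservative for data center NVMe

total_drives = storage_servers * nvme_drives_per_server
aggregate_iops = total_drives * random_read_iops_per_drive
print(f"{total_drives} drives -> ~{aggregate_iops:,} random read IOPS "
      f"({aggregate_iops / 1e6:.1f} million), before network overhead")
```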
Supporting Infrastructure
GPU clouds also depend on supporting infrastructure. For example, if you are running Kubernetes, you need control-plane (master) nodes, orchestration nodes, nodes for data ingestion, and login nodes you can use to run dashboards. The cloud provider must therefore offer very significant amounts of non-GPU compute in the same region.
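As a small sketch of what that mix looks like in practice, the snippet below uses the official Kubernetes Python client to separate GPU worker nodes from supporting nodes; it assumes a reachable cluster and kubeconfig, and relies on the standard nvidia.com/gpu resource name advertised by NVIDIA's device plugin.

```python
# Sketch: split GPU worker nodes from supporting nodes in a Kubernetes cluster.
# Assumes the official `kubernetes` Python client and a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()
nodes = client.CoreV1Api().list_node().items

gpu_nodes, support_nodes = [], []
for node in nodes:
    capacity = node.status.capacity or {}
    if int(capacity.get("nvidia.com/gpu", 0)) > 0:
        gpu_nodes.append(node.metadata.name)
    else:
        # control-plane, orchestration, ingestion and login/dashboard nodes
        support_nodes.append(node.metadata.name)

print(f"GPU worker nodes: {len(gpu_nodes)}, supporting nodes: {len(support_nodes)}")
```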
Building true clouds is neither easy nor cheap. Many data center providers call themselves “clouds,” but they are really more of a managed hardware environment. It is certainly less risky from a financial perspective to sign x-year contracts with organizations and then build a facility that meets the contract demands. And there are some benefits, particularly around security and performance, but it is not the cloud.
The cloud is self-service, API-driven – you log in, click a button, and you have access to the processing power you need for as long as you need it. There are many organizations that don’t have the resources or requirements for ongoing data center support – they may only need the processing power for a short time, and the cloud gives them that option. NexGen Cloud is democratizing AI by opening up access to high-performance shared architectures.
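To illustrate what "self-service, API-driven" means in practice, here is a minimal sketch; the endpoint, token and payload fields are entirely hypothetical placeholders and do not describe NexGen Cloud's, or any other provider's, actual API.

```python
# Sketch of self-service, API-driven provisioning. The endpoint, token and
# payload fields below are hypothetical placeholders, not a real provider API.
import requests

API = "https://api.example-gpu-cloud.com/v1"            # hypothetical endpoint
headers = {"Authorization": "Bearer <YOUR_API_TOKEN>"}   # hypothetical token

payload = {
    "name": "llm-finetune-01",
    "gpu_type": "H100",        # hypothetical field names
    "gpu_count": 8,
    "region": "eu-north-1",
}

resp = requests.post(f"{API}/instances", json=payload, headers=headers, timeout=30)
resp.raise_for_status()
print("Provisioned instance:", resp.json().get("id"))
```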
A final consideration, which is becoming increasingly important, is energy consumption. Organizations of all sizes are being asked not only to monitor their emissions but to reduce them, and the pressure comes not just from customers and society at large but also from regulators. Google and Microsoft recently announced an agreement with Nucor for a clean energy initiative to power data centers and ultimately reach net zero for AI processing. ESG performance is also proving to be a critical metric in terms of shareholder value, and AI consumes an incredible amount of energy.
Ultimately, organizations should partner with a vendor they can trust: a partner that can offer guidance and provide engineering and support. Companies that use cloud infrastructure do so to focus on their own core differentiators. They are not in the business of running AI infrastructure in the cloud; they want convenience, security and reliability, and the true cloud provides all of that on demand.
This article was produced as part of TechRadarPro's Expert Insights channel, where we showcase the best and brightest minds in the tech industry today. The views expressed here are those of the author, and not necessarily those of TechRadarPro or Future plc. If you're interested in contributing, find out more here.