Cerebras has unveiled its latest AI inference chip, which it touts as a formidable rival to Nvidia's DGX H100.
The chip features 44GB of high-speed memory, allowing it to handle AI models with billions to trillions of parameters.
For models that exceed the memory capacity of a single wafer, Cerebras splits them at layer boundaries and distributes them across multiple CS-3 systems. A single CS-3 system can accommodate a 20-billion-parameter model, while a 70-billion-parameter model needs just four systems.
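To make the idea concrete, here is a minimal sketch of what a layer-boundary split across multiple systems could look like. It is purely illustrative: Cerebras has not published its partitioning scheme, and the layer and system counts below are only an example (Llama3.1 70B has 80 transformer blocks, which would map to 20 contiguous blocks per system across four CS-3s).

```python
# Illustrative only: a naive contiguous split of a model's transformer blocks
# across N systems. This is NOT Cerebras's published partitioning scheme.
def split_layers(num_layers: int, num_systems: int) -> list[range]:
    """Assign contiguous layer ranges to each system."""
    per_system = -(-num_layers // num_systems)  # ceiling division
    return [
        range(start, min(start + per_system, num_layers))
        for start in range(0, num_layers, per_system)
    ]

# Example: an 80-block model split across four systems -> 20 blocks each.
for system_id, layers in enumerate(split_layers(80, 4)):
    print(f"system {system_id}: blocks {layers.start}-{layers.stop - 1}")
```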
Support for additional models is coming soon.
Cerebras emphasizes the use of 16-bit model weights to maintain accuracy, unlike some competitors that reduce weight precision to 8-bit, which can degrade output quality. According to Cerebras, its 16-bit models score up to 5% higher on multi-turn conversation, math, and reasoning tasks than 8-bit models, delivering more accurate and reliable results.
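The precision argument is easy to demonstrate in isolation. The sketch below round-trips a tensor of 16-bit weights through a naive symmetric int8 quantizer and measures the rounding error; it is a generic illustration of why 8-bit weights can lose fidelity, not a reproduction of Cerebras's or any competitor's actual quantization scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
w_fp16 = rng.normal(0, 0.02, size=4096).astype(np.float16)  # toy weight tensor

# Symmetric per-tensor int8 quantization: scale to [-127, 127], round, rescale.
scale = float(np.abs(w_fp16).max()) / 127.0
w_int8 = np.clip(np.round(w_fp16 / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float16) * scale

# The difference is the information permanently lost by storing weights in 8 bits.
print("mean absolute rounding error:", float(np.abs(w_fp16 - w_dequant).mean()))
```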
The Cerebras inference platform is available via chat and API access, and is designed to be easy to adopt for developers already familiar with OpenAI's Chat Completions format. The platform runs Llama3.1 70B at 450 tokens per second, which Cerebras claims makes it the only solution delivering effectively instant responses from a model of that size. For developers, Cerebras is offering 1 million free tokens per day at launch, and pricing for large-scale deployments is said to be significantly lower than popular GPU clouds.
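Because the API follows OpenAI's Chat Completions format, switching an existing application over should mostly be a matter of changing the endpoint and credentials. The snippet below is a hedged sketch of what that might look like with the official OpenAI Python client; the base URL, model identifier, and environment variable name are assumptions for illustration, so check Cerebras's documentation for the actual values.

```python
import os
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at a Cerebras-hosted, OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed environment variable
)

response = client.chat.completions.create(
    model="llama3.1-70b",  # assumed model identifier
    messages=[{"role": "user", "content": "Explain wafer-scale inference in two sentences."}],
)
print(response.choices[0].message.content)
```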
Cerebras will initially launch with the Llama3.1 8B and 70B models, with plans to add support for larger models such as Llama3 405B and Mistral Large 2 in the near future. The company highlights that fast inference is crucial for enabling more complex AI workflows and improving real-time LLM intelligence, particularly with techniques such as scaffolding, which require substantial token usage.
Patrick Kennedy of ServeTheHome saw the product in action at the recent Hot Chips 2024 symposium and noted, “I had the chance to sit down with Andrew Feldman (CEO of Cerebras) before the talk and he showed me the live demos. It’s incredibly fast. The reason this is important is not just for humans to initiate the interaction. Instead, in an agent world where computer AI agents talk to multiple other computer AI agents. Imagine if each agent takes seconds to generate a result and there are multiple steps in that chain. If you think about automated AI agent chains, then you need fast inference to reduce the time for the entire chain.”
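Kennedy's point about chained agents is essentially arithmetic: sequential LLM calls add up. The back-of-the-envelope sketch below shows how per-step generation speed compounds over a hypothetical ten-step chain; the 450 tokens-per-second figure is the Llama3.1 70B rate Cerebras quotes, while the 50 tokens-per-second baseline and the step and token counts are assumptions for illustration, not measured benchmarks.

```python
def chain_latency(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Total generation time for a sequential chain of LLM calls, ignoring other overhead."""
    return steps * tokens_per_step / tokens_per_second

# Hypothetical chain: 10 sequential agent steps, 500 generated tokens per step.
for rate in (50.0, 450.0):
    total = chain_latency(steps=10, tokens_per_step=500, tokens_per_second=rate)
    print(f"{rate:>5.0f} tokens/s -> {total:.1f} s end to end")
```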
Cerebras positions its platform as a new standard in open LLM development and deployment, offering record-breaking performance, competitive pricing, and broad API access. You can try it out by visiting inference.cerebras.ai or scanning the QR code on the slide below.