In 2023, Meta unveiled its first-generation in-house AI inference accelerator, designed to power the ranking and recommendation models at the core of Facebook and Instagram.
The Meta Training and Inference Accelerator (MTIA) chip, which despite its name handles inference but not training, was updated in April 2024, doubling the compute and memory bandwidth of the first version.
At the Hot Chips symposium last month, Meta presented its next-generation MTIA and acknowledged that using GPUs for recommendation engines is not without challenges. The social media giant noted that peak performance does not always translate into effective throughput, that large deployments can be resource-intensive, and that capacity constraints are exacerbated by the growing demand for generative AI.
A mysterious memory expansion
With this in mind, Meta's development goals for the next-generation MTIA included improving performance per TCO and per watt over the previous generation, handling models efficiently across multiple Meta services, and improving developer efficiency to reach high-volume deployment quickly.
Meta's latest MTIA delivers a significant generation-over-generation performance boost: GEMM throughput increases 3.5x to 177 TFLOPS at BF16, hardware-based tensor quantization achieves accuracy comparable to FP32, and optimized support for PyTorch Eager Mode enables job launch times under 1 microsecond and job replacement in under 0.5 microseconds. Additionally, TBE (table batched embedding) optimization improves flush and prefetch times for embedding indices, achieving 2–3x faster runtimes than the previous generation.
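To make the TBE point concrete, here is a minimal sketch in plain PyTorch of the table-batched embedding access pattern that recommendation models generate. The table sizes, batch size, and dimensions below are hypothetical, and this illustrates the shape of the workload rather than Meta's actual TBE implementation:

```python
# Illustrative sketch only: a toy "table-batched embedding" lookup in plain
# PyTorch, showing the access pattern that MTIA's TBE optimization targets.
# Table sizes, batch size, and embedding dims are hypothetical, not Meta's.
import torch

# A recommendation model keeps many embedding tables, one per sparse feature.
tables = [torch.nn.EmbeddingBag(num_embeddings=n, embedding_dim=16, mode="sum")
          for n in (1000, 5000, 250)]

batch_size = 4
pooled = []
for table in tables:
    # Each feature contributes a bag of indices per sample; fetching these
    # index lists ahead of the lookup is the flush/prefetch step that the
    # Hot Chips talk says runs 2-3x faster on the new MTIA.
    indices = torch.randint(table.num_embeddings, (batch_size, 3))
    pooled.append(table(indices))          # (batch_size, 16) pooled vectors

# Dense interaction layers consume the concatenated pooled embeddings.
features = torch.cat(pooled, dim=1)        # (batch_size, 48)
print(features.shape)
```

Because each lookup touches a few scattered rows across many large tables, this stage is dominated by memory traffic rather than arithmetic, which is why faster index prefetching pays off directly in end-to-end runtime.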
Built on TSMC's 5nm process, the MTIA chip runs at 1.35GHz with a gate count of 2.35 billion and delivers GEMM performance of 354 TOPS at INT8 and 177 TOPS at FP16, paired with 128GB of LPDDR5 memory offering 204.8GB/s of bandwidth, all within a 90W TDP.
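A rough back-of-envelope calculation from these published figures shows the chip's compute-to-bandwidth balance. The arithmetic-intensity thresholds below are derived here, not Meta-provided numbers:

```python
# Roofline-style ratio from the published MTIA figures: how many operations
# the chip must perform per byte fetched from LPDDR5 to stay compute-bound.
int8_ops = 354e12           # INT8 GEMM peak, ops/s
fp16_ops = 177e12           # FP16 GEMM peak, ops/s
bandwidth = 204.8e9         # LPDDR5 bandwidth, bytes/s

print(f"INT8: {int8_ops / bandwidth:.0f} ops/byte")   # ~1729
print(f"FP16: {fp16_ops / bandwidth:.0f} ops/byte")   # ~864
```

Those high ratios suggest the design leans on generous memory capacity (128GB for large embedding tables) rather than raw bandwidth, consistent with the inference-focused, embedding-heavy workloads described above.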
The processing elements are built on RISC-V cores with scalar and vector extensions, and the Meta accelerator module includes dual CPUs. At Hot Chips 2024, ServeTheHome noted a memory expansion tied to the PCIe switch and CPUs. When asked whether this was CXL, Meta answered somewhat coyly: "It is an option to add memory in the chassis, but it is not currently being implemented."