Technology companies are shifting their focus from building ever-larger language models (LLMs) to developing small language models (SLMs) that can match or even surpass them.
Meta's Llama 3 (400 billion parameters), OpenAI's GPT-3.5 (175 billion parameters), and GPT-4 (reportedly about 1.8 trillion parameters) are famously large models, while Microsoft's Phi-3 family ranges from 3.8 billion to 14 billion parameters, and Apple Intelligence has only about 3 billion parameters.
Having far fewer parameters may seem like a disadvantage, but the appeal of SLMs is clear: they consume less power, can run locally on devices such as smartphones and laptops, and are a good fit for smaller companies and labs that can't afford expensive hardware setups.
David vs Goliath
As IEEE Spectrum reports, “The rise of SLMs comes at a time when the performance gap between LLMs is rapidly narrowing and technology companies are looking to deviate from standard scaling laws and explore other avenues to improve performance.”
In a recent round of testing conducted by Microsoft, Phi-3-mini, the tech giant’s smallest model at 3.8 billion parameters, rivaled Mixtral 8x7B and GPT-3.5 in some areas, despite being small enough to fit on a phone. Microsoft credits its success to the training dataset, which consisted of “publicly available web data and heavily filtered synthetic data.”
While SLMs achieve a level of language understanding and reasoning similar to much larger models, their size still limits them on certain tasks, and they cannot store much “factual” knowledge. That limitation can be addressed by pairing the SLM with an online search engine.
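To make that pairing concrete, here is a minimal Python sketch (not from the article): a web search step supplies fresh factual context, and the small model handles the language work. The `search_web` and `generate` helpers are hypothetical placeholders for whichever search API and local SLM runtime are actually used.

```python
def search_web(query: str, k: int = 3) -> list[str]:
    """Hypothetical: return the top-k text snippets for a query from a search API."""
    raise NotImplementedError("plug in a real search API here")

def generate(prompt: str) -> str:
    """Hypothetical: run a local small language model on the prompt."""
    raise NotImplementedError("plug in a local SLM runtime here")

def answer_with_search(question: str) -> str:
    # Retrieve factual context the small model cannot store on its own...
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    # ...then let the SLM do the language understanding and reasoning.
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```

The division of labor is the point: the search engine supplies facts on demand, so the model itself can stay small enough to run on a phone or laptop.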
IEEE Spectrum’s Shubham Agarwal compares SLMs to how children learn language, saying, “By the time children are 13, they are exposed to about 100 million words and are better than chatbots at language, with access to just 0.01 percent of the data.” Although, as Agarwal notes, “no one knows what makes humans so much more efficient,” Alex Warstadt, a computer science researcher at ETH Zurich, suggests that “reverse-engineering efficient human-like learning at small scales could lead to huge improvements when scaled up to LLM scales.”