Google I/O 2024 Musings: The Rise Of Smaller, More Responsive Large Language Models

A major theme of both OpenAI's and Google's announcements last week was the pivot from a relentless emphasis on ever slower but more capable models toward less capable but faster models that are better able to support the kinds of near-realtime user interaction required of intelligent assistants. To date, model progression has moved in a single direction: the constant expansion of models with ever more features, ever larger context windows, ever more advanced reasoning and ever more multimodal capabilities, at a steady cost in speed. For enterprise applications that require the most capable models available today and can tolerate latencies from tens of seconds to tens of minutes, such models represent the most advanced AI systems in existence. At the same time, the challenges of today's large models, like hallucinations, dropouts and coherence failures, mean enterprise adoption is proceeding more slowly than model vendors had hoped, while the kinds of consumer applications like intelligent assistants that are more tolerant of these limitations are closer to commercial readiness, but stymied by the slow response times of these larger models. Enter the growing emphasis on miniaturized models designed for on-device use and rapid responses.

A massive model that requires several minutes to produce a highly reasoned response isn't necessary for these kinds of consumer-centric assistant applications. Instead, the priority is speed sufficient to enable realtime conversation and especially two-way natural dialog. Reasoning quality matters less than speed in these settings, and models can be fine-tuned on their more narrowly constrained use cases to improve the quality of their responses. Critically, consumer-centric models must operate in a more natural environment than turn-based textual chat apps. Thus, not only are we seeing an emphasis on voice interaction, but critically, an emphasis on emotion and the ability to interrupt the model's output on demand. Rather than the bland machine voices of the past, companies are increasingly emphasizing prosody-like elements that make their models' voices sound more human and more readily connectable to their human users. While public demos to date don't show true prosody, they offer a facsimile tuned to word choice well enough that human interlocutors can suspend disbelief and see the machine as more relatable. Most critically, however, model developers are now offering the ability to interrupt a model at any moment and for the model to immediately resume the dialog from where it left off, incorporating the user's new information into a seamless interaction, much like a manager interrupting a subordinate. This subtly reinforces that the human is in command of the bot, eliminates the annoyance of past loquacious models that often continued speaking for tens of seconds while ignoring commands to stop, and allows rapid-fire interaction. All of this depends on models that can begin producing results within a second or less of receiving a prompt, rather than tens to hundreds of seconds.
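
To make that interaction pattern concrete, here is a minimal Python sketch of such a barge-in loop, assuming a streaming token API and an interrupt signal raised by voice-activity detection. The names stream_response and dialog_turn, and the timer that fires the interrupt, are all hypothetical stand-ins for illustration, not any vendor's actual API.

```python
import asyncio

# A minimal sketch of the barge-in pattern described above. stream_response
# and the timer-driven interrupt are hypothetical stand-ins: a real assistant
# would stream tokens from a hosted model and raise the interrupt from
# voice-activity detection on the microphone.

async def stream_response(prompt: str):
    """Simulate a low-latency token stream from a small, fast model."""
    for token in f"(reply to: {prompt})".split():
        await asyncio.sleep(0.05)  # sub-second time-to-first-token is the point
        yield token

async def dialog_turn(prompt: str, interrupted: asyncio.Event) -> str:
    """'Speak' tokens until the user barges in; return what was said so far."""
    spoken = []
    async for token in stream_response(prompt):
        if interrupted.is_set():
            break  # fall silent immediately instead of talking over the user
        spoken.append(token)
        print(token, end=" ", flush=True)
    print()
    return " ".join(spoken)

async def main():
    interrupted = asyncio.Event()
    # Simulate the user cutting in 200 ms into the reply.
    asyncio.get_running_loop().call_later(0.2, interrupted.set)
    partial = await dialog_turn("summarize my meetings today", interrupted)

    # Resume seamlessly: carry the partial reply forward as context so the
    # model picks up where it left off, plus the user's new instruction.
    interrupted.clear()
    follow_up = f"[assistant had said: '{partial}'] just the afternoon ones"
    await dialog_turn(follow_up, interrupted)

asyncio.run(main())
```

The essential design choice is that output is streamed and checked for interruption token by token, so the model can be silenced instantly and the partial reply retained as context for the next turn.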

This bifurcation of models is likely to become entrenched going forward, with enterprise backend models emphasizing absolute capability (reasoning, context size, etc.), while consumer-facing models increasingly become the public face of AI, emphasizing speed and highly tuned narrow use cases.