The Diamond Age of Voice AI Applications
Recent improvements in voice AI unlock new possibilities for applications and use cases.
Introduction
In Neal Stephenson’s The Diamond Age, a little girl named Nell grows up to become the most powerful woman in the world, with the help of a magical book that teaches her how to survive. While this book has many magical properties, it still relies on a hidden human voice actor to read its text aloud. Even in the realm of science fiction, a digital, human-sounding voice did not exist. But recent breakthroughs in voice synthesis technology stand to change this. The advent of generative AI voice models is bringing a renaissance to both consumer and enterprise applications. These applications are not only relevant for use cases where real humans are employed today, but also unlock brand-new possibilities that will transform how we work and play. This article provides an overview of voice AI technology and voice application opportunities.
Technology Overview
Although the first digital voice model, the Voder, was developed by Bell Labs back in the 1930s, voice technologies of the 2010s such as Siri and Alexa fell short of user expectations. Throughout the consumer internet boom and cloud era, no multi-billion-dollar apps were built on voice model technology. Today, this is changing with new transformer-based generative AI text-to-speech (TTS) models developed by organizations such as OpenAI, ElevenLabs, and others.
Transformer architecture enables improvements in voice models in a few ways:
Better language understanding: Transformer architectures excel at capturing relationships across long sequences. This means the model can understand how earlier parts of a text influence pronunciation, stress, or intonation in later parts.
End-to-end generation: Traditional TTS systems have a separate text front-end, acoustic model, and vocoder. Transformer models unify these pieces for more natural-sounding speech (see the sketch after this list).
Better prosody and style: Transformer models can embed rich semantic and syntactic information to disambiguate words with multiple meanings and capture tone. Ex: “lead” is pronounced with a long e as a verb (“lead the way”) but a short e as the metal, and only sentence context resolves which.
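To make the end-to-end point concrete, here is a structural sketch of the two designs. Every function below is a placeholder, not a real library; it only illustrates where the seams sit in each approach.

```python
# Structural sketch only: all functions are placeholders, not a real library.
import numpy as np

def text_frontend(text: str) -> list[str]:
    """Classic stage 1: normalize text and predict phonemes (placeholder)."""
    return list(text.lower())

def acoustic_model(phonemes: list[str]) -> np.ndarray:
    """Classic stage 2: phonemes -> acoustic features, e.g. mel frames (placeholder)."""
    return np.zeros((len(phonemes), 80))

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Classic stage 3: acoustic features -> waveform samples (placeholder)."""
    return np.zeros(mel.shape[0] * 256)

def traditional_tts(text: str) -> np.ndarray:
    # Three separately trained stages: errors at each seam compound, and
    # prosody decisions are made with only local context.
    return vocoder(acoustic_model(text_frontend(text)))

def end_to_end_tts(text: str) -> np.ndarray:
    # One transformer maps text straight to audio, so pronunciation and
    # prosody can condition on the entire input sequence at once.
    return np.zeros(len(text) * 256)  # placeholder for a single model call
```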
These improvements in accuracy and realism are driving a new wave of killer voice AI applications that have the potential to become public market companies.
Model Selection
What do builders look for when choosing AI voice models? They are constantly making tradeoffs between quality, cost, and speed.
Quality – How natural does a voice model sound? The internet has many Elo-style arenas for TTS models where people vote on their favorite voices. Network infrastructure companies such as LiveKit also help enhance the quality and real-time performance of AI voices. Unfortunately, quality tends to be inversely correlated with speed and positively correlated with cost. As of this article’s publication date, OpenAI’s TTS models lead in Artificial Analysis’ arena.
Cost – How much does it cost to serve this application? For the model itself, this is measured in cost per million characters of input text. However, an engineer’s choices around inference engine, infrastructure, and so on will have a big impact as well.
Speed – How long does a user have to wait for real-time interactions with an application? There are two components to speed:
Generation speed measures how long it takes to generate a given duration of audio, often expressed as a real-time factor (RTF). Ex: taking 1 second to generate 10 seconds of speech is an RTF of 0.1. Today, anything under 0.1x is considered acceptable, but faster is almost always better.
Latency measures time to first audio byte. This includes the generation time of the first chunk as well as serving time. Models served at the edge, on a user’s device, will have the best latency. However, existing phones and computers have limited compute capacity and can only support smaller models. Larger models get served in the cloud, which increases latency, but there are design decisions engineers can make to improve speed. A sketch of how to measure both numbers follows this list.
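Here is a minimal measurement sketch. The `stream_tts` function is a hypothetical stand-in for a provider’s streaming SDK; the measurement logic is what matters, and it transfers to any client that yields audio chunks.

```python
import time

SAMPLE_RATE = 24_000   # Hz, typical for modern TTS output
BYTES_PER_SAMPLE = 2   # 16-bit PCM

def stream_tts(text: str):
    """Hypothetical stand-in for a streaming TTS call; yields 1 s of silence."""
    yield b"\x00" * (SAMPLE_RATE * BYTES_PER_SAMPLE)

def benchmark(text: str) -> None:
    start = time.perf_counter()
    ttfb, total_bytes = None, 0
    for chunk in stream_tts(text):
        if ttfb is None:
            ttfb = time.perf_counter() - start  # latency: time to first audio byte
        total_bytes += len(chunk)
    wall = time.perf_counter() - start
    audio_secs = total_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE)
    rtf = wall / audio_secs  # generation speed: 1 s compute for 10 s audio = 0.1
    print(f"time to first byte: {ttfb:.3f}s | real-time factor: {rtf:.3f}")

benchmark("Hello! How can I help you today?")
```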
Let’s look at Kokoro, for example. This model is small – only 82M parameters – which allows builders to unlock on-device capabilities. It is also open source and free! And its quality is outstanding for its size. However, it does not sound as realistic as ElevenLabs’ voices. The tradeoff there is access and price: you can only reach ElevenLabs voices via API, and they cost at least ~$7 per hour of generated audio. These two models will be relevant for very different applications.
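For illustration, here is a minimal local-generation sketch using the open-source kokoro Python package (pip install kokoro soundfile). The voice name and defaults below come from the project’s documentation and may change between releases.

```python
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English

text = "The doctor will lead the discussion on lead poisoning."
generator = pipeline(text, voice='af_heart')  # yields audio chunk by chunk

for i, (graphemes, phonemes, audio) in enumerate(generator):
    sf.write(f'chunk_{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio
```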
Voice AI Applications
So what are the use cases and applications getting built with voice models today? I bucket them into three main categories: existing use cases in content creation, existing use cases for voice agents, and brand-new applications.
1) Existing use case – content creation: Voice models are highly relevant in certain content creation categories such as audiobook recording and advertisement dubbing. Since most voice content today is pre-recorded rather than live, latency is a smaller focus. However, with an entertainment angle, realism, emotion, and expressiveness become much more important, making higher-Elo models more relevant.
Using AI voice models can be highly cost-efficient for content creation. For example, human audiobook narration costs ~$300 per finished hour. Now compare that to the ~$7 an hour for an ElevenLabs voice.
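The back-of-envelope math, with one added assumption to connect per-character model pricing to per-hour costs: a speaking rate of roughly 150 words per minute at about six characters per word.

```python
HUMAN_RATE_PER_HOUR = 300   # USD, human narration (figure from above)
AI_RATE_PER_HOUR = 7        # USD, ElevenLabs estimate (figure from above)

# Assumption: ~150 spoken words/min * ~6 chars/word * 60 min
CHARS_PER_HOUR = 150 * 6 * 60  # ~54,000 characters of text per spoken hour

def cost_per_audio_hour(usd_per_million_chars: float) -> float:
    """Convert a model's per-million-character price into cost per audio hour."""
    return usd_per_million_chars * CHARS_PER_HOUR / 1_000_000

book_hours = 10  # a typical full-length audiobook
print(f"human: ${HUMAN_RATE_PER_HOUR * book_hours:,}")  # $3,000
print(f"AI:    ${AI_RATE_PER_HOUR * book_hours:,}")     # $70

# A hypothetical ~$130 per million characters lands near the ~$7/hour figure:
print(f"${cost_per_audio_hour(130):.2f} per audio hour")  # $7.02
```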
2) Existing use case – voice agents: Voice agents are more advanced generative AI products that can deliver human-like, real-time spoken interactions. The canonical voice agent use case is customer service – can you help me get a refund on my order? This is an existing use case where AI agents could be more effective than human agents by reducing costs, lowering wait times, and increasing resolution rates. Companies in this space include Salesforce, Parloa, and Sierra.
Due to the interactive, real-time nature of many voice agents, latency is a big driver of user experience quality, and many application tech stacks are oriented around reducing it. If there is a five-second pause between you sharing an order number and getting feedback from your AI customer service rep, that is not a great user experience. In the future, running your AI rep on-device would deliver the best latency, but most applications today still depend on cloud computing for more performant models. Additionally, many voice agent use cases are utility-related, which means that while voice quality is important, it is secondary to task completion and cost.
The modern voice agent tech stack involves multiple components. In its most basic form, it includes a speech input UI, a model that converts the speech to text, an LLM that generates a response, an evaluation mechanism to ensure the response is high quality, a voice model that converts the text back to speech, and an output UI. Finally, wrapped around all of this are tools that enable lower latency and real-time data transfer. A skeletal version of this loop is sketched below.
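In the sketch, every function is a stub standing in for a real component (an STT model, an LLM, a response evaluator, a TTS model); in production, each stage streams its output into the next rather than running sequentially like this.

```python
def speech_to_text(audio: bytes) -> str:
    return "Can I get a refund on order 1234?"   # stub for an STT model

def llm_respond(text: str) -> str:
    return "Sure, I've started that refund."     # stub for an LLM call

def passes_evals(reply: str) -> bool:
    return bool(reply)                           # stub for response evals

def text_to_speech(text: str) -> bytes:
    return text.encode()                         # stub for a TTS model

def handle_turn(audio_in: bytes) -> bytes:
    text = speech_to_text(audio_in)      # 1. user speech -> text
    draft = llm_respond(text)            # 2. LLM drafts a reply
    if not passes_evals(draft):          # 3. quality / guardrail check
        draft = "Let me connect you with a human agent."
    return text_to_speech(draft)         # 4. reply text -> speech

# A real-time transport layer (e.g., WebRTC) wraps this loop, moving audio
# between the user's device and the agent with minimal delay.
print(handle_turn(b"..."))
```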
Recent releases of multimodal models have sparked conversation around how these models could improve the technology stack by removing intermediary complexity, reducing latency, and improving quality. However, they present challenges around cost. While TTS model parameters are typically measured in millions, multimodal model parameters are measured in billions. For these reasons, I suspect pure TTS models will remain dominant in the near and medium term for voice applications.
3) Brand new applications of voice AI: I am most excited about the new possibilities voice AI unlocks that will change how we work and play. Imagine a world where you can:
Watch any YouTube video in a language you understand, regardless of the original. Today, much of the world’s digital content is available in only one language, most commonly English.
Have nuanced conversations in video games with non-player characters (NPCs) who remember your prior conversations, enhancing the gaming experience. Today, NPCs deliver pre-scripted, cookie-cutter lines that offer limited player interaction.
Practice sales pitches with a virtual customer agent that has your enterprise knowledge ahead of major customer meetings. Today, you would have to find a nice colleague to roleplay for you.
Have a non-intrusive AI relationship therapist that helps you during conflicts with a partner, in real time. Today, you would need to remember the conversation and then verbalize it to your therapist in the next session.
Give a blind person a voice agent that describes changes in their surrounding environment in real time. Today, nothing like this exists.
As models become more performant with multilingual capabilities, customization, and humanness, life-changing products like Nell’s magical book move closer to reality. The best applications are yet to come!
Special thank you to Hexgrad for ideas and feedback.
Sources: company websites, Artificial Analysis, online articles
Disclaimers: The information presented in this newsletter is the opinion of the author and does not necessarily reflect the view of any other person or entity, including Altimeter Capital Management, LP ("Altimeter"). The information provided is believed to be from reliable sources but no liability is accepted for any inaccuracies. This is for information purposes and should not be construed as an investment recommendation. Past performance is no guarantee of future performance. Altimeter is an investment adviser registered with the U.S. Securities and Exchange Commission. Registration does not imply a certain level of skill or training.
This post and the information presented are intended for informational purposes only. The views expressed herein are the author’s alone and do not constitute an offer to sell, or a recommendation to purchase, or a solicitation of an offer to buy, any security, nor a recommendation for any investment product or service. While certain information contained herein has been obtained from sources believed to be reliable, neither the author nor any of his employers or their affiliates have independently verified this information, and its accuracy and completeness cannot be guaranteed. Accordingly, no representation or warranty, express or implied, is made as to, and no reliance should be placed on, the fairness, accuracy, timeliness or completeness of this information. The author and all employers and their affiliated persons assume no liability for this information and no obligation to update the information or analysis contained herein in the future.