Further adventures with TTS models reveal serious limitations
It was a good morning. The sun was peeking over the horizon. I had a steaming mug of coffee, and my macOS scripting-and-voiceover app was behaving as expected. I'd even added voices from the ElevenLabs text-to-speech (TTS) model by adapting this Swift package. Although the package doesn't support macOS out of the box, I was able to clone it, adapt it, and attach it to my project.
So, I was ready to produce my first AI-recorded audio track. I even had a script, also produced by AI, ready to go: a pronoun lesson for the Mandarin for Beginners GPT that sparked my immediate need for a text-to-speech model.
Recording with OpenAI's TTS
I decided to get started with the OpenAI Shimmer voice. I pasted my script into the text window, checked it for errors, and hit play. At first everything sounded great. Maybe a little unemotional, but still perfectly fine for a language lesson. Transitions between English and Mandarin were smooth, and accents in both languages were accurate. But a major issue soon emerged.
The model either ignored SSML tags or read them aloud. Here's a short snippet I used for testing:
Let's first learn about the Mandarin pronouns for I and Me. I or Me is translated as 我 (wǒ).
<break time = "2s"/>
Please, repeat after me - 我 (wǒ).
And here's the recording:
The narration didn't pause. It also dropped the 我 (wǒ) at the end of the third sentence. I played around with various prompting techniques, such as adding dots and dashes, to no avail. After doing a little digging, the answer became clear. The OpenAI TTS does not yet support SSML in any meaningful way.
While it's possible to add pauses by making multiple short recordings in small files and then stringing them together, either programmatically or through audio editing software, I wasn't ready for that kind of brute-force workaround. So, while I appreciated the soothing, rich tones of OpenAI's Shimmer voice, it was time to try a new model.
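For anyone who does want to try that workaround, here is a minimal sketch of the first half of it in Python: splitting a script at its break tags so that each spoken chunk can be sent to the TTS model separately and each pause can be filled with silence. The names split_on_breaks and silence_bytes are hypothetical, and the silence helper assumes 16-bit mono PCM at 24 kHz, which may not match what your TTS service returns. The synthesis and stitching steps are left out.

```python
import re

# Matches tags such as <break time="2s"/> or <break time = "500ms" />.
BREAK_TAG = re.compile(r'<break\s+time\s*=\s*"(\d+(?:\.\d+)?)(ms|s)"\s*/>')


def split_on_breaks(script):
    """Return a list of ("speak", text) and ("pause", seconds) steps."""
    steps = []
    position = 0
    for match in BREAK_TAG.finditer(script):
        # Text between the previous tag (or the start) and this tag.
        text = script[position:match.start()].strip()
        if text:
            steps.append(("speak", text))
        value, unit = match.groups()
        seconds = float(value) / 1000 if unit == "ms" else float(value)
        steps.append(("pause", seconds))
        position = match.end()
    # Whatever follows the last tag is spoken too.
    tail = script[position:].strip()
    if tail:
        steps.append(("speak", tail))
    return steps


def silence_bytes(seconds, sample_rate=24000, sample_width=2):
    """Raw mono PCM silence. Assumed format: 16-bit samples at 24 kHz."""
    return b"\x00" * (int(seconds * sample_rate) * sample_width)


script = (
    "I or Me is translated as 我 (wǒ).\n"
    '<break time = "2s"/>\n'
    "Please, repeat after me - 我 (wǒ)."
)
for kind, value in split_on_breaks(script):
    print(kind, value)
```

Each "speak" step would become one short TTS request, and each "pause" step would become a run of silence inserted between the resulting audio files.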
Recording with ElevenLabs' TTS
It was fortunate that, as mentioned above, I had already connected my macOS app via API to my free ElevenLabs account. One thing I liked about ElevenLabs upfront was the sheer variety of voices, and I was excited to choose from among more than forty options.
However, while the variety of voices was delightful, I ran up against two major roadblocks. First, the English-speaking voices struggled with Mandarin pronunciation. During testing, I noticed that the model pronounced 我 (wǒ), which means "I" or "me" in Mandarin, as "whoa."
Even when I updated my code to ensure the API would only call the Multilingual v2 model, the issue remained. Of course, it's likely that ElevenLabs' voices have not been tuned for my very niche-y use case: switching seamlessly between Mandarin and English in the same recording.
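To make the model restriction concrete, here is a rough sketch in Python (not the Swift code from my app) of a request pinned to the multilingual model. It reflects my understanding of the ElevenLabs REST API: a POST to the text-to-speech endpoint for a given voice, an xi-api-key header, and a model_id field in the JSON body. Treat those details as assumptions and check them against the current docs. The function name build_tts_request is hypothetical, and the voice ID and API key are placeholders.

```python
import json
import urllib.request


def build_tts_request(text, voice_id, api_key, model_id="eleven_multilingual_v2"):
    """Build (but do not send) a text-to-speech request pinned to one model."""
    # Assumed endpoint shape: one URL per voice.
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder values; nothing is sent over the network here.
request = build_tts_request("Please, repeat after me - 我 (wǒ).", "VOICE_ID", "API_KEY")
print(request.full_url)
print(json.loads(request.data)["model_id"])
```

Sending the request with urllib.request.urlopen should return the audio bytes, but I have left that step out so the sketch runs without an account.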
Second, ElevenLabs doesn't yet support SSML. Although their docs suggest a limited pause capability, it didn't work in any of my tests. Neither did prompting strategies such as adding a series of dots, dashes, or line breaks.
Time for more exploration
While my initial experiments were promising, and I was pleasantly surprised by the sheer variety of synthetic voices now available, I'm going to keep looking for a natural-sounding TTS that also supports SSML. My next step will be Amazon Polly.
The experiment continues...