Text-to-Speech (TTS)

Wrestling with TTS and GPTs

Karen Isbell

14 Feb 2024 — 5 min read

After some long conversations with my Mandarin for Beginners GPT, I decided that allowing it to improvise lessons and deliver them through voice chat was not the best approach. Users could easily burn through their allowance of 40 messages every three hours on just some basic pronunciation and conversation practice. I know I did.

So I opted for a new approach: working with AI to build out simple lessons on pronouns, common greetings, and so on, with prompts for pronunciation practice; recording them through an inexpensive text-to-speech (TTS) model; and then setting up a GPT to offer the recordings to learners in sequence, along with supplemental practice.

Creating the lessons: The easy part

I started by asking ChatGPT to produce a sample lesson for new Mandarin learners that would meet these requirements:

Explain how to say and use basic pronouns, like I/me, he/she/it, this (things that are close), and that (things that are far)
Move slowly through the material
Use Speech Synthesis Markup Language (SSML) to add pauses for learners to practice pronunciation

💡

It turns out that many text-to-speech models do not fully support SSML or predictably recognize it.

Overall, this went well. After a few iterations, I ended up with this:

🎙️

Welcome to Mandarin for Beginners, lesson one! Today, we'll learn about basic pronouns and how to use "this" and "that." Let's start with the pronouns. First, "I" or "me" in Mandarin is <say-as interpret-as="characters">我</say-as> (wǒ). Let's pronounce it together: <prosody pitch="high">wǒ</prosody>. <break time="3s"/> Good job!

Next, "you" in Mandarin is <say-as interpret-as="characters">你</say-as> (nǐ). Let's say it together: <prosody pitch="high">nǐ</prosody>. <break time="3s"/> Excellent!

Now, let's move on to "he," "she," or "it." In Mandarin, they are all represented by the same character: <say-as interpret-as="characters">他</say-as>, <say-as interpret-as="characters">她</say-as>, or <say-as interpret-as="characters">它</say-as> (tā). Let's practice saying <prosody pitch="high">tā</prosody>. <break time="3s"/> Well done!

Let's review what we've learned. <say-as interpret-as="characters">我</say-as> (wǒ) for "I/me," <say-as interpret-as="characters">你</say-as> (nǐ) for "you," and <say-as interpret-as="characters">他/她/它</say-as> (tā) for "he/she/it." <break time="5s"/>

Next, we'll learn "this" and "that." "This" in Mandarin refers to things that are close to you. It is <say-as interpret-as="characters">这</say-as> (zhè). Say it with me: <prosody pitch="high">zhè</prosody>. <break time="3s"/> Nice work!

Lastly, "that" refers to things that are far away from you. It is <say-as interpret-as="characters">那</say-as> (nà). Repeat after me: <prosody pitch="high">nà</prosody>. <break time="3s"/> Great!

Now, let's practice these new words in sentences. Repeat after me:

<prosody pitch="medium">我是学生 (Wǒ shì xuésheng) - I am a student.</prosody> <break time="3s"/>
<prosody pitch="medium">你好吗？(Nǐ hǎo ma?) - How are you?</prosody> <break time="3s"/>
<prosody pitch="medium">他是我的朋友 (Tā shì wǒ de péngyǒu) - He is my friend.</prosody> <break time="3s"/>
<prosody pitch="medium">这是什么？(Zhè shì shénme?) - What is this?</prosody> <break time="3s"/>
<prosody pitch="medium">那是书 (Nà shì shū) - That is a book.</prosody> <break time="3s"/>
Great job today! Keep practicing, and I'll see you in our next lesson. Goodbye!

While there are several ways this lesson could be improved, I felt it was good enough for a sample recording. And this is where the fun begins.

Making the TTS recording: The hard part

I was a little disappointed that OpenAI's custom GPT creator does not include automatic access to its text-to-speech (TTS) model. But I thought I had a workaround. The other day, I'd had good luck using a Zapier AI Action to create Gmail drafts based on AI-written thank-you notes. Perhaps, I thought, there was an AI Action for OpenAI's TTS model, and indeed there was.

As a next step, I set up a TTS GPT to run OpenAI's TTS model via this Zapier action. Setup was relatively straightforward. The most complicated part was a short detour to generate a new OpenAI API key. And then I was in business. I fired it up and made a test recording of a short poem.

"I'm Nobody" by Emily Dickinson. Voice by AI.

Great, I thought! This will be easy. But then I tried my full-length Mandarin script, and...things got interesting:

The GPT often required I set up a new AI Action instead of using my existing one.
When the GPT populated the AI Action, it often left out big chunks of the script. Sometimes, no text was copied at all.
When I edited the text manually in the AI Action, the GPT would refuse to run it.
The AI Action would not provide a download link for recordings...and the GPT would suggest I check an audit log that I can't access because I do not have a teams account.
The SSML tags were often ignored by the OpenAI model or read aloud as text.
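One possible way around the SSML problem is to stop sending tags to the model at all: split the script wherever a <break> tag appears, strip the remaining tags, and keep the pause lengths so silence can be added between clips later. The Python sketch below is only an illustration of that idea, not something taken from the Zapier action or OpenAI's tooling. The function names split_on_breaks and clean are made up for this example, and it assumes the breaks are written exactly as in the sample lesson (time="3s" or a millisecond value).

```python
import re

# Matches <break time="3s"/> or <break time="500ms"/> as written in the sample lesson.
BREAK_RE = re.compile(r'<break\s+time="(\d+(?:\.\d+)?)(ms|s)"\s*/>')
# Matches any other opening or closing tag, e.g. <prosody pitch="high"> or </say-as>.
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")


def clean(fragment):
    """Remove leftover tags and collapse runs of whitespace."""
    text = TAG_RE.sub("", fragment)
    return " ".join(text.split())


def split_on_breaks(script):
    """Return a list of (plain_text, pause_seconds) pairs.

    Each pair holds the text to send to a TTS model and the silence
    that should follow it. Back-to-back breaks yield an empty text.
    """
    segments = []
    pos = 0
    for match in BREAK_RE.finditer(script):
        text = clean(script[pos:match.start()])
        value, unit = float(match.group(1)), match.group(2)
        pause = value / 1000 if unit == "ms" else value
        segments.append((text, pause))
        pos = match.end()
    tail = clean(script[pos:])
    if tail:
        segments.append((tail, 0.0))  # trailing text with no pause after it
    return segments


if __name__ == "__main__":
    sample = (
        'Let\'s say it together: <prosody pitch="high">nǐ</prosody>. '
        '<break time="3s"/> Excellent!'
    )
    for text, pause in split_on_breaks(sample):
        print(f"{pause:>4}s pause after: {text}")
```

Each text segment could then be recorded on its own, with silence of the matching length stitched in between clips using an audio tool of your choice.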

Eventually, the GPT became as frustrated as I was:

My GPT politely suggests I try a new approach.

Since Zapier's AI Actions are still alpha tools, it's very possible the OpenAI TTS connection is just not yet robust enough to reliably record more than a few sentences. For now, I'm going to try recording my Mandarin lesson in Zapier's AI Actions sandbox and look into accessing TTS models via API.
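For anyone curious what the API route might look like, here is a minimal Python sketch that calls OpenAI's speech endpoint directly using only the standard library. Several details are assumptions that do not come from this post and should be checked against OpenAI's current documentation: the endpoint URL, the tts-1 model name, the alloy voice, the 4,096-character limit on input text, and MP3 as the default output format. The helper names chunk_text, build_request, and synthesize are hypothetical.

```python
import json
import re
import urllib.request

# Assumed endpoint and limit; verify both against OpenAI's API reference.
API_URL = "https://api.openai.com/v1/audio/speech"
MAX_CHARS = 4096


def chunk_text(text, limit=MAX_CHARS):
    """Split text into pieces of at most `limit` characters,
    breaking between sentences where possible."""
    sentences = re.split(r"(?<=[.!?。！？])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def build_request(text, api_key, model="tts-1", voice="alloy"):
    """Build (but do not send) the POST request for one chunk of text."""
    payload = json.dumps({"model": model, "voice": voice, "input": text})
    return urllib.request.Request(
        API_URL,
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, api_key, out_prefix="lesson"):
    """Send each chunk to the API and save the audio it returns.

    Writes lesson_01.mp3, lesson_02.mp3, ... and returns the file paths.
    This makes real network calls and needs a valid API key.
    """
    paths = []
    for i, chunk in enumerate(chunk_text(text), start=1):
        with urllib.request.urlopen(build_request(chunk, api_key)) as resp:
            audio = resp.read()
        path = f"{out_prefix}_{i:02d}.mp3"
        with open(path, "wb") as f:
            f.write(audio)
        paths.append(path)
    return paths
```

Calling synthesize(script_text, api_key) with a real key would write one audio file per chunk. Splitting on sentence boundaries keeps each request under the assumed size limit, which may also sidestep the truncated-script problem described above, since no GPT has to copy the text into an action.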

The experiment continues...

P.S. While IMO my TTS GPT is too flaky for the GPT Store, you are welcome to try it.