Let the Robots do the talking — Exploring TTS
Speaking has always been a big part of being a lawyer. You use your voice to make submissions in the highest courts of the land. Even in client meetings, you use your voice to persuade. Hell, when I write my emails, I imagine saying the words out loud to make sure they are in my voice.
So, suggesting that a synthesized voice can be useful is going to be controversial. You might think a computer's voice is soulless and can't hold its own against a lawyer's. However, with the advances led by smart assistants like Google Home and Siri, Text to Speech (TTS) is certainly worth exploring.
Why use robots?
Talking is really convenient: you just open your mouth and start talking (though some babies will disagree). However, working from home has shown how difficult it can be to record and transmit good-quality sound. Feedback and distortion are just some of the problems people regularly face when using basic equipment for online meetings. It's frustrating.
If you think better equipment resolves this, know that it gets expensive very quickly. Notice how many people are involved in producing your favourite podcast. You are going to need all sorts of equipment, like microphones and DAC mixers. Hire engineers? What does a mixer even do, actually?
Furthermore, human performance is subject to various imperfections. The pitch or tone is not quite right here. Sometimes you lose concentration or get interrupted in the middle of your speech. All this means you may have to record something several times before you get a delivery you are happy with. And if you aren't confident about your English, or would like to say something in another language, getting a computer to voice it helps you overcome that.
So a synthesized voice can be cheap, fast and consistent. If the quality is good enough, you can focus on the script. Personally, I am interested in improving the quality of my online training. Explaining stuff doesn't need a Leonard Cohen-quality delivery. It's probably far less distracting anyway.
Experiments with TTS
I will take two major Text to Speech (TTS) solutions for a spin — Google Cloud and Mozilla's TTS (open source). The Python code used for these experiments is available on my GitHub.
Google Cloud
It's quite easy to try Google Cloud's TTS. A demo allows you to set the text and then convert it with a click of a button. If you want to know how it sounds, try it!
To generate audio files, you're going to need a Google Cloud account and some programming knowledge. However, it's pretty straightforward, and I mostly copied from the quickstart. You can hear the first two paragraphs of this blog post here.
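For the curious, the core of my script looks something like this. It's a minimal sketch based on the quickstart, assuming the `google-cloud-texttospeech` package is installed and your credentials are configured; the voice name and the `chunk_text` helper (to keep each request under the API's per-request size limit) are my own illustrative choices, not part of the quickstart.

```python
def chunk_text(text: str, limit: int = 4500) -> list[str]:
    """Split text into word-boundary chunks that each fit one API request.

    Illustrative helper: 4500 bytes is a conservative guess at the
    per-request cap, not an official figure.
    """
    chunks, buf = [], ""
    for word in text.split():
        candidate = f"{buf} {word}".strip()
        if len(candidate.encode("utf-8")) > limit and buf:
            chunks.append(buf)
            buf = word
        else:
            buf = candidate
    if buf:
        chunks.append(buf)
    return chunks


def synthesize(text: str, out_path: str = "output.mp3") -> None:
    """Convert text to an MP3 with Google Cloud TTS (needs credentials)."""
    from google.cloud import texttospeech  # pip install google-cloud-texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-US",
            name="en-US-Wavenet-D",  # assumption: swap in any premium voice name
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)

# Usage (uncomment once credentials are set up):
# for i, chunk in enumerate(chunk_text(open("post.txt").read())):
#     synthesize(chunk, f"part_{i}.mp3")
```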
Here's my personal view of Google Cloud's TTS:
- It ain't free. Premium voices are free only for the first 1 million characters; after that, it's a hefty USD16 for every 1 million characters. For scale, the first two paragraphs of this blog post come to 629 characters, so if you are just converting text, it's hard to bust that limit.
- The voices sound nice, in my opinion. However, listening to them for a long stretch might not be easy.
- Developer experience is great, and as you can see, converting lines of text to speech is straightforward.
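To put the pricing point in perspective, here's a back-of-the-envelope calculator using the figures above (USD16 per million premium-voice characters after a 1-million-character free tier):

```python
FREE_TIER_CHARS = 1_000_000
PRICE_PER_MILLION_USD = 16.0  # premium voices, beyond the free tier

def premium_voice_cost(chars: int) -> float:
    """Estimated cost in USD for synthesizing `chars` characters."""
    billable = max(0, chars - FREE_TIER_CHARS)
    return billable / 1_000_000 * PRICE_PER_MILLION_USD

print(premium_voice_cost(629))        # the first two paragraphs: 0.0
print(premium_voice_cost(2_000_000))  # 16.0
```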
Mozilla's TTS
Using Mozilla's TTS, you get much closer to the machine-learning side of text to speech. This includes training your own model, that is, if you have roughly 24 hours of recordings of your own voice to spare.
However, for this experiment, we don't need that, as we will use pre-trained models. Using Python's built-in subprocess module, we can run the command-line tool that comes with the package. This generates WAV files. You can hear the first two paragraphs of this blog post here.
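That subprocess call can be sketched like this. A caveat: the model identifier and the flag names (`--text`, `--model_name`, `--out_path`) are assumptions about the package's `tts` console script; check `tts --help` against your installed version.

```python
import subprocess

def build_tts_command(text: str, out_path: str = "speech.wav",
                      model: str = "tts_models/en/ljspeech/tacotron2-DDC"):
    """Assemble the command line for the package's `tts` script.

    The flag names and the default model id are assumptions; the package
    can list its available pre-trained models for you.
    """
    return ["tts", "--text", text, "--model_name", model, "--out_path", out_path]

# Actually synthesizing requires the TTS package to be installed:
# subprocess.run(build_tts_command("Let the robots do the talking."), check=True)
```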
Here's my personal view of Mozilla's TTS:
- It's open-source, and I am partial to open source.
- It also teaches you how to train a model on your own voice, so that is a possibility.
- It sounds terrible, but partly that's because the audio is more dynamic than Google's: some parts sound louder, which makes other parts sound softer. There is also quite a lot of noise, which may be due to the recording quality of the source data. I did normalise the loudness for this sample.
- Leaving those two points aside, it sounds more interesting to me. The variation feels a tad more natural to me.
- There are no characters to choose from (male, female, etc.), so this may not be practical for every use.
- Considering I was not doing much more than running a command line, the developer experience was OK. Notably, choosing a pre-trained model was confusing at first, and I had to experiment a lot. Also, depending on what you choose, the model can take a fair bit of time and computing power to produce audio. Mine took roughly 15 minutes, and my laptop was wheezing throughout.
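On the loudness normalisation mentioned above: proper loudness (LUFS) normalisation needs a dedicated library, but a simple peak normalisation, a rough stand-in for what I did, looks like this:

```python
def normalize_peak(samples: list[float], target_peak: float = 0.9) -> list[float]:
    """Scale audio samples (floats in [-1.0, 1.0]) so the loudest one
    reaches target_peak. A crude stand-in for true loudness (LUFS)
    normalisation, which weights samples by perceived volume instead."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

# A quiet clip gets boosted so its loudest sample sits at 0.9:
quiet = [0.1, -0.45, 0.3]
print(normalize_peak(quiet))  # [0.2, -0.9, 0.6]
```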
Conclusion
If you thought robots would replace lawyers in court, this isn't the post to persuade you. However, some use cases, such as online training courses, are certainly worth trying. In this regard, Google Cloud is production-ready and gives the most presentable results today. Mozilla TTS is open source and definitely far more interesting, but it needs more time to develop. Do you think there are other ways to use TTS?
#tech #NaturalLanguageProcessing #OpenSource #Programming
Love.Law.Robots. – A blog by Ang Hou Fu