Untangling speech recognition

Dealing with language is so complicated! In this post I want to focus on speech, voice, audio — but bear in mind that text is also language, and unlike humans, a machine must be able to process text if it’s going to do anything at all with language.

The speech part of machine learning goes two ways: The machine can “hear” speech as audio (it receives audio and simultaneously creates a digital representation of it) — but to make sense of it, to use it (to find the answer to your question, for example), the machine must convert the audio into text. On the other hand, before the machine can “speak,” it needs text — and that text must be converted into digital audio. For the machine, these are not just one thing and its reverse.

Until I began researching this, I hadn’t given any thought to accents. I had thought about the differences among languages (and I still don’t know whether it’s harder, easier or the same to train a speech-recognition system in tonal languages such as the Chinese languages, or Vietnamese, as compared with a non-tonal language such as English), but I’d never considered that a person speaking English with an accent might not be “understood” by a speech-recognition system.

Behind the Mic: The Science of Talking with Computers (2014)

This breezy video from Google (7 minutes) does a good job of conveying a bit of the actual science behind how Siri, Alexa or Google Assistant “know” what we are saying when we speak to them. Even though it’s from 2014, there’s nothing outdated (as far as I know). You can see how the machine represents the speech it takes in. Like many explanations I found, however, it kind of mushes the text part and the sound part all together, leaving the viewer with a general sense of how it all works but still in the dark as to how the parts work separately. (I don’t like how they show a human brain when they talk about neural networks. That’s very misleading.)

The video provides a quick background on the development of speech recognition, which was pretty awful until researchers started applying deep neural networks to the acoustics part, only a few years before the video was made. Like image recognition, speech recognition got a tremendous boost from the advances in computer processing hardware that now allow immense quantities of data to be analyzed at super speed.

To get a handle on how the separate parts of a speech-recognition system work, I needed to listen to this podcast from March 2020. It’s a 50-minute interview with Catherine Breslin, a U.K. machine learning scientist who specializes in speech recognition. She worked at Amazon Alexa for four and a half years. There’s a full transcript at the same URL if you’d rather read than listen.

For speech recognition, machine learning is used to train separate models — one for acoustics, and one for language. There’s also a third piece, the lexicon, which indicates the sequence of phones (the tiniest sound segments) that make up a single word. I don’t yet understand how that part is made. (Any program that reads text aloud would need to have a lexicon.)
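
To make the lexicon idea concrete for myself, here is a minimal sketch in Python. It only shows what a lexicon is (a lookup from words to phone sequences), not how a real one gets built. The phone symbols are ARPAbet-style, and the three entries, along with the names LEXICON, phones_for and words_for, are my own illustrative assumptions, not taken from any real system.

```python
# Toy pronunciation lexicon: word -> sequence of phones.
# Phone symbols are ARPAbet-style; the entries are illustrative only
# and are not copied from any real system's lexicon.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "beach": ["B", "IY", "CH"],
    "peach": ["P", "IY", "CH"],
}


def phones_for(sentence):
    """Spell out a sentence as one flat list of phones, word by word."""
    phones = []
    for word in sentence.lower().split():
        if word not in LEXICON:
            raise KeyError(f"{word!r} is not in the toy lexicon")
        phones.extend(LEXICON[word])
    return phones


def words_for(phones):
    """Return every lexicon word whose pronunciation matches exactly."""
    return [word for word, pron in LEXICON.items() if pron == list(phones)]
```

Going from text to phones (phones_for) is the direction a read-aloud program would need. Going from phones back to candidate words (words_for) is the direction speech recognition needs.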

“So if we put these together, we have an acoustic model, which tells you from some audio which sounds are likely to be spoken at that time; the lexicon tells you how those sounds combine into words, and then the language model tells you how those words combine into sequences of words.”

—Catherine Breslin

The three pieces, Breslin explains, work together in a decoding process that produces text from speech — the most likely representation of what was said. I looked at some further technical explanations of how the decoding is done, and it resembles a system for AI analysis of game moves — giant trees, many layers, lots of nodes. What the system needs to learn is the probabilities for sounds forming words forming sentences.
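
Here is a toy sketch of how the three pieces could combine during decoding. It is a teaching example under heavy assumptions, not how Alexa or any real recognizer is implemented: I pretend the acoustic model emits exactly one frame per phone, every probability is made up, and the search simply tries every word sequence that fits (real decoders prune the search instead). The names FRAMES, BIGRAMS, FLOOR and decode are hypothetical.

```python
import math

# Lexicon: word -> phones (ARPAbet-style, illustrative entries).
LEXICON = {
    "i": ["AY"],
    "eye": ["AY"],
    "ice": ["AY", "S"],
    "scream": ["S", "K", "R", "IY", "M"],
    "cream": ["K", "R", "IY", "M"],
}

# Language model: made-up bigram probabilities P(word | previous word).
# "<s>" marks the start of the utterance.
BIGRAMS = {
    ("<s>", "i"): 0.20, ("<s>", "ice"): 0.05, ("<s>", "eye"): 0.01,
    ("i", "scream"): 0.01, ("ice", "cream"): 0.60, ("eye", "scream"): 0.001,
}
FLOOR = 1e-6  # tiny probability for anything not listed

# Stand-in for acoustic model output: one dict of phone probabilities
# per frame. Assumption: one frame per phone, which real audio never gives you.
FRAMES = [
    {"AY": 0.9, "EY": 0.1}, {"S": 0.8, "Z": 0.2}, {"K": 0.9, "G": 0.1},
    {"R": 0.9, "L": 0.1}, {"IY": 0.9, "IH": 0.1}, {"M": 0.9, "N": 0.1},
]


def decode(frames):
    """Return (best word sequence, log score) over all sequences that fit."""
    best = (None, -math.inf)

    def extend(pos, words, score):
        nonlocal best
        if pos == len(frames):  # consumed all the audio
            if score > best[1]:
                best = (words, score)
            return
        prev = words[-1] if words else "<s>"
        for word, phones in LEXICON.items():
            segment = frames[pos:pos + len(phones)]
            if len(segment) < len(phones):
                continue  # word is longer than the remaining audio
            # Acoustic score: how well the frames match this word's phones.
            acoustic = sum(math.log(f.get(p, FLOOR))
                           for f, p in zip(segment, phones))
            # Language score: how likely this word is after the previous one.
            language = math.log(BIGRAMS.get((prev, word), FLOOR))
            extend(pos + len(phones), words + [word],
                   score + acoustic + language)

    extend(0, [], 0.0)
    return best


words, score = decode(FRAMES)
print(words, round(score, 2))
```

I picked this example because “I scream” and “ice cream” come out as the same phone sequence in this toy lexicon, so the acoustic scores tie and only the language model can break the tie. With the made-up numbers above, the decoder settles on “ice cream.”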

Note, all this is just to get to where the machine has the text of what was said. It hasn’t yet done any analysis of what was meant. Whew.

However, apart from voice assistants like Siri and Alexa, this process by itself has tremendous value for transcription. It is used to produce transcripts of radio programs, interviews and meetings, as well as to generate subtitles for movies and videos.

AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.