{"id":79,"date":"2020-08-17T08:30:57","date_gmt":"2020-08-17T12:30:57","guid":{"rendered":"https:\/\/www.macloo.com\/ai\/?p=79"},"modified":"2020-10-14T10:27:48","modified_gmt":"2020-10-14T14:27:48","slug":"untangling-speech-recognition","status":"publish","type":"post","link":"https:\/\/www.macloo.com\/ai\/2020\/08\/17\/untangling-speech-recognition\/","title":{"rendered":"Untangling speech recognition"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Dealing with language is so complicated! In this post I want to focus on speech, voice, audio \u2014 but bear in mind that text is also language, and <em>unlike humans, <\/em>a machine must be able to process text if it&#8217;s going to do anything <em>at all<\/em> with language.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The speech part of machine learning goes two ways: The machine can &#8220;hear&#8221; speech as audio (it receives audio and simultaneously creates a digital representation of it) \u2014 but to make sense of it, to <em>use<\/em> it (to find the answer to your question, for example), the machine must <em>convert<\/em> the audio into text. On the other hand, before the machine can &#8220;speak,&#8221; it needs text \u2014 and that text must be <em>converted<\/em> into digital audio. For the machine, these are not just one thing and its reverse. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Until I began researching this, I hadn&#8217;t given any thought to <em>accents<\/em>. 
I had thought about the differences among languages (and I still don&#8217;t know whether it&#8217;s harder, easier or the same to train a speech-recognition system in tonal languages such as the Chinese languages, or Vietnamese, as compared with a non-tonal language such as English), but I&#8217;d never considered that a person speaking English with an accent might not be &#8220;understood&#8221; by a speech-recognition system.<\/p>\n\n\n\n<figure class=\"wp-block-embed-vimeo aligncenter wp-block-embed is-type-video is-provider-vimeo wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<div class=\"jetpack-video-wrapper\"><iframe loading=\"lazy\" title=\"Behind the Mic: The Science of Talking with Computers\" src=\"https:\/\/player.vimeo.com\/video\/112133045?dnt=1&amp;app_id=122963\" width=\"739\" height=\"416\" frameborder=\"0\" allow=\"autoplay; fullscreen\" allowfullscreen><\/iframe><\/div>\n<\/div><figcaption><em>Behind the Mic: The Science of Talking with Computers (2014)<\/em><\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This breezy video from Google (7 minutes) does a good job of conveying a bit of the actual science behind how Siri, Alexa or Google Assistant &#8220;know&#8221; what we are saying when we speak to them. Even though it&#8217;s from 2014, there&#8217;s nothing outdated (as far as I know). You can see how the machine represents the speech it takes in. Like many explanations I found, however, it kind of mushes the text part and the sound part all together, leaving the viewer with a general sense of how it all works but still in the dark as to how the <em>parts<\/em> work, separately. (I don&#8217;t like how they show a human brain when they talk about neural networks. 
That&#8217;s very misleading.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The video provides a quick background on the development of speech recognition, which was pretty awful until just a few years ago when researchers started applying deep neural networks to the acoustics part. Just like image recognition, speech recognition got a tremendous boost from the advances in computer processing hardware that now allow immense quantities of data to be analyzed at super speed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To get a handle on how the separate parts of a speech-recognition system work, I needed to listen to <a rel=\"noreferrer noopener\" href=\"https:\/\/changelog.com\/practicalai\/82#t=00:07:57.14\" target=\"_blank\">this podcast<\/a> from March 2020. It&#8217;s a 50-minute interview with <a rel=\"noreferrer noopener\" href=\"https:\/\/www.catherinebreslin.co.uk\/\" target=\"_blank\">Catherine Breslin<\/a>, a U.K. machine learning scientist who specializes in speech recognition. She worked at Amazon Alexa for four and a half years. There&#8217;s a full transcript at the same URL if you&#8217;d rather read than listen.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For speech recognition, machine learning is used to train separate models \u2014 one for <em>acoustics,<\/em> and one for <em>language<\/em>. There&#8217;s also a third piece, the <em>lexicon,<\/em> which indicates the sequence of <em>phones<\/em> (the tiniest sound segments) that make up a single word. I don&#8217;t yet understand how that part is made. 
(Any program that reads text aloud would need to have a lexicon.)<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>\u201cSo if we put these together, we have an acoustic model, which tells you from some audio which sounds are likely to be spoken at that time; the lexicon tells you how those sounds combine into words, and then the language model tells you how those words combine into sequences of words.&#8221;<\/p><cite>\u2014Catherine Breslin<\/cite><\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">The three pieces, Breslin explains, work together in a decoding process that produces text from speech \u2014 the <em>most likely<\/em> representation of what was said. I looked at some further technical explanations of how the decoding is done, and it resembles a system for AI analysis of game moves \u2014 giant trees, many layers, lots of nodes. What the system needs to <em>learn<\/em> is the probabilities for sounds forming words forming sentences.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Note, all this is just to get to where the machine has the <em>text<\/em> of what was said. It hasn&#8217;t yet done any analysis of what was <em>meant<\/em>. Whew.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, <em>apart from<\/em> voice assistants like Siri and Alexa, this process by itself has tremendous value for transcription. 
It is used to produce transcripts of radio programs, interviews and meetings, as well as to generate subtitles for movies and videos.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a rel=\"license\" href=\"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/\"><img decoding=\"async\" alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https:\/\/i.creativecommons.org\/l\/by-nc-nd\/4.0\/88x31.png\"><\/a><br>\n<small><span xmlns:dct=\"http:\/\/purl.org\/dc\/terms\/\" property=\"dct:title\"><strong>AI in Media and Society<\/strong><\/span> by <span xmlns:cc=\"http:\/\/creativecommons.org\/ns#\" property=\"cc:attributionName\">Mindy McAdams<\/span> is licensed under a <a rel=\"license\" href=\"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/\">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License<\/a>.<br>\nInclude the author&#8217;s name (Mindy McAdams) and a link to the original post in any reuse of this content.<\/small><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Dealing with language is so complicated! In this post I want to focus on speech, voice, audio \u2014 but bear in mind that text is also language, and unlike humans, a machine must be able to process text if it&#8217;s going to do anything at all with language. 
The speech part of machine learning goes&hellip; <a class=\"more-link\" href=\"https:\/\/www.macloo.com\/ai\/2020\/08\/17\/untangling-speech-recognition\/\">Continue reading <span class=\"screen-reader-text\">Untangling speech recognition<\/span> <span class=\"meta-nav\" aria-hidden=\"true\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[2],"tags":[28,26,27],"class_list":["post-79","post","type-post","status-publish","format-standard","hentry","category-nlp","tag-audio","tag-speech_recognition","tag-virtual_assistants"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/79","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/comments?post=79"}],"version-history":[{"count":10,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/79\/revisions"}],"predecessor-version":[{"id":552,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/79\/revisions\/552"}],"wp:attachment":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/media?parent=79"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/categories?post=79"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/tags?post=79"}],"cu
ries":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}