{"id":744,"date":"2021-05-30T10:01:48","date_gmt":"2021-05-30T14:01:48","guid":{"rendered":"https:\/\/www.macloo.com\/ai\/?p=744"},"modified":"2021-05-30T10:12:18","modified_gmt":"2021-05-30T14:12:18","slug":"attention-in-machine-learning-and-nlp","status":"publish","type":"post","link":"https:\/\/www.macloo.com\/ai\/2021\/05\/30\/attention-in-machine-learning-and-nlp\/","title":{"rendered":"Attention, in machine learning and NLP"},"content":{"rendered":"\n<p>Let&#8217;s begin at the beginning, with <a rel=\"noreferrer noopener\" href=\"https:\/\/papers.nips.cc\/paper\/2017\/hash\/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html\" target=\"_blank\">Attention Is All You Need<\/a> (Vaswani et al., 2017). This is a conference paper with eight authors, six of whom then worked at Google. They contended that neither recurrent neural networks nor convolutional neural networks are necessary for machine translation of languages, and hence the <strong>Transformer,<\/strong> &#8220;a new simple network architecture,&#8221; was born. (Note: It relies on feed-forward neural networks.)<\/p>\n\n\n\n<p><a href=\"https:\/\/www.macloo.com\/ai\/2021\/05\/14\/figuring-it-out-transformers-for-nlp\/\">Transformers<\/a> are the basis for machine translation and other tasks relying on language models. <strong>GPT-3<\/strong> has recently become infamous; others include BERT (from Google) and ELMo.<\/p>\n\n\n\n<p>Before the work by Vaswani and his co-equal co-authors, progress in NLP was limited (although it had advanced a lot since 2012) because of the ways in which RNN models depend on the <em>sequence<\/em> and <em>position<\/em> of words in a text. Transformers eliminate those limitations. With recurrent neural networks, there are impediments to parallel processing. Other researchers had previously cracked that nut using ConvNets, but then <em>other<\/em> limitations were inherent (exponential increase in the number of computational operations). 
Transformers <em>also<\/em> eliminate <em>those<\/em> limitations.<\/p>\n\n\n\n<p>So the Transformer was a first in NLP, a breakthrough. For machine translation, the paper claimed &#8220;a new state of the art&#8221; (p. 10).<\/p>\n\n\n\n<p>I had learned that an encoder and a decoder <em>connected by <\/em>an <strong>attention module<\/strong> is a standard architecture for machine language translation, e.g. Google Translate. This was true <em>before<\/em> 2017, so what is the difference effected by the Transformer? It eliminates RNNs and ConvNets from the architecture, yes (&#8220;our model contains no recurrence and no convolution&#8221;) \u2014 but what else?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Attention used in a new way<\/h2>\n\n\n\n<p>&#8220;<strong>An attention function<\/strong> can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all <strong>vectors<\/strong>. The output is computed as a <strong>weighted sum<\/strong> of the values, where the weight assigned to each value is computed by <strong>a compatibility function<\/strong> of the query with the corresponding key&#8221; (Vaswani et al., 2017, p. 3). I&#8217;m okay with that, although I doubt I would be able to explain it to my non\u2013computer science students. (I do explain weights and features when I introduce neural nets to them, and I explain word vectors when we start NLP. The trouble is they don&#8217;t know how to write a program, and they certainly don&#8217;t understand what a function is.)<\/p>\n\n\n\n<p>There are <strong>different attention functions<\/strong> that could be used. One is additive attention; another is dot-product attention, which is multiplicative rather than additive. Dot-product is &#8220;much faster and more space-efficient in practice.&#8221; Vaswani et al. used a <em>scaled<\/em> dot-product attention function (p. 4). 
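<\/p>\n\n\n\n<p>To make that quoted definition concrete, here is a minimal NumPy sketch of scaled dot-product attention: the compatibility function is the dot product of each query with each key, scaled by the square root of the key dimension and passed through softmax, and the output is the resulting weighted sum of the values. The toy shapes and variable names here are my own, not from the paper.<\/p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # compatibility: dot product of each query with each key, scaled by sqrt(d_k)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over the keys (shift by the row max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # output: weighted sum of the value vectors
    return weights @ V

# toy example: 3 tokens, vectors of size 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```

<p>A sanity check on the sketch: if all queries are zero, every score ties, the softmax weights become uniform, and each output row is simply the average of the value vectors; with trained projections, the weights instead concentrate on the most compatible keys.<\/p>\n\n\n\n<p>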
They also used <em>multi-head attention,<\/em> meaning the model uses eight parallel attention layers, or heads. The explanation was a bit beyond me, but the gist is that the model can look at multiple things at the same time, like juggling more balls simultaneously.<\/p>\n\n\n\n<p>Multi-head attention \u2014 plus the freedom of <em>no-sequence, no-position<\/em> \u2014 enables the Transformer to look at <em>all<\/em> the context for a word, and do it for multiple words <em>at the same time<\/em>.<\/p>\n\n\n\n<p>With <a href=\"https:\/\/www.macloo.com\/ai\/2020\/10\/01\/how-recurrent-neural-networks-read-sequences\/\">my rudimentary understanding of recurrent neural nets<\/a>, I have a fuzzy idea of how this use of attention functions produces better results, mainly by being able to take in and <em>compare<\/em> more of the text, a little closer to the way human brains hold an entire conversation even though it&#8217;s not a literal &#8220;recording&#8221; of the exact conversation. The way we comprehend meaning when we read has to do with millions of associations built up over a lifetime, as well as many associations within that present text. We are not processing separate little slices of a sentence \u2014 our brains handle a text more holistically. <\/p>\n\n\n\n<p>A Transformer <em>does<\/em> use <strong>word embeddings<\/strong> to convert the tokens (both input and output) to vectors (Vaswani et al., 2017, p. 5). It uses <a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/Softmax_function\" target=\"_blank\">softmax<\/a> but no <a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/Long_short-term_memory\" target=\"_blank\">LSTMs<\/a> (because, again, &#8220;no recurrence&#8221;). 
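<\/p>\n\n\n\n<p>And here is a rough, self-contained sketch of the multi-head idea, with toy dimensions of my own choosing: the model projects each token vector into several smaller subspaces, runs scaled dot-product attention independently in each one (so each head can attend to something different), then concatenates the heads and applies a final linear projection. Slicing one big weight matrix per projection is a simplification of the paper&#8217;s separate per-head weight matrices.<\/p>

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=8):
    d_head = X.shape[-1] // n_heads
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        # each head projects the tokens into its own smaller subspace
        q, k, v = X @ Wq[:, cols], X @ Wk[:, cols], X @ Wv[:, cols]
        heads.append(attention(q, k, v))  # heads are independent of each other
    # concatenate the heads and mix them with a final linear projection
    return np.concatenate(heads, axis=-1) @ Wo

# toy example: 5 tokens, model size 16, 8 heads of size 2 each
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 16))
Wq, Wk, Wv, Wo = (rng.standard_normal((16, 16)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)  # (5, 16)
```

<p>Because the heads operate on independent slices, they can be computed in parallel, which I take to be part of the speed advantage over sequential RNN processing. 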
<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Please help me, YouTube<\/h2>\n\n\n\n<p>I found a video (13:04) that helped me in my struggle to understand the Transformer architecture: <\/p>\n\n\n\n<figure class=\"wp-block-embed aligncenter is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<div class=\"jetpack-video-wrapper\"><iframe loading=\"lazy\" title=\"Transformer Neural Networks - EXPLAINED! (Attention is all you need)\" width=\"739\" height=\"416\" src=\"https:\/\/www.youtube.com\/embed\/TQQlZhbC5ps?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe><\/div>\n<\/div><\/figure>\n\n\n\n<p>It was still a tough climb for me, but this video was particularly helpful with how multi-head attention improves the process. (Obviously the speed improvement is huge.)<\/p>\n\n\n\n<p>Another helpful video (5:33) does a nice job summing up the sequence-based limitations of RNNs: &#8220;In general it&#8217;s easier for [RNNs] to capture relationships between points that are <em>close to each other<\/em> than it is to capture relationships between points that are <em>very far from each other <\/em>\u2014 say, several thousand points in the sequence.&#8221; In the paper, this is called &#8220;path length between long-range dependencies in the network&#8221; (Vaswani et al., 2017, p. 
6) and identified as one of three motivations for developing the <strong>self-attention layers<\/strong> in Transformer.<\/p>\n\n\n\n<figure class=\"wp-block-embed aligncenter is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<div class=\"jetpack-video-wrapper\"><iframe loading=\"lazy\" title=\"NLP for Developers: Transformers | Rasa\" width=\"739\" height=\"416\" src=\"https:\/\/www.youtube.com\/embed\/KN3ZL65Dze0?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe><\/div>\n<\/div><\/figure>\n\n\n\n<p>In fact this second video is much better than the one above, but I liked that one when I watched it first, and maybe (haha!!) <em>the order<\/em> in which I watched them had an effect. The diagrams for <strong>self-attention<\/strong> in this shorter video are <em>very<\/em> good!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Back to Vaswani et al.<\/h2>\n\n\n\n<p>Speaking of <strong>self-attention<\/strong> \u2014 it was interesting that the authors thought it &#8220;could yield more interpretable models.&#8221;  As in any hidden layer in any neural network, features are determined and weights set <em>by the system itself,<\/em> not by the human programmers. This is the &#8220;learning&#8221; in machine learning. The authors noted that the &#8220;individual attention heads clearly learn <em>to perform different tasks,<\/em>&#8221;  and that many of them &#8220;appear to exhibit behavior related to the <em>syntactic and semantic structure<\/em> of the sentences&#8221; (p. 
7; <em>my italics<\/em>).<\/p>\n\n\n\n<p>Cool.<\/p>\n\n\n\n<p>The results section of the paper describes performance using <a rel=\"noreferrer noopener\" href=\"https:\/\/towardsdatascience.com\/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b\" target=\"_blank\">BLEU scores<\/a> on two different NLP tasks (WMT 2014 English-to-German translation; WMT 2014 English-to-French translation) \u2014 reported as best-ever at that time \u2014 as well as record-breaking lower <em>training costs,<\/em> which means time to train the model factored by processor power used (number of GPUs, estimate of the number of floating-point operations).<\/p>\n\n\n\n<p>The successor to the <strong>code<\/strong> on which this seminal paper was based is <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/google\/trax\" target=\"_blank\">Trax<\/a>, available on GitHub.<\/p>\n\n\n\n<p>At the end of the paper (pages 13\u201315) there are math-free <strong>visualizations<\/strong> that illustrate what the attention mechanism does. These are well worth a look. 
<\/p>\n\n\n\n<p>.<\/p>\n\n\n\n<p><a rel=\"license\" href=\"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/\"><img decoding=\"async\" alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https:\/\/i.creativecommons.org\/l\/by-nc-nd\/4.0\/88x31.png\"><\/a><br>\n<small><span xmlns:dct=\"http:\/\/purl.org\/dc\/terms\/\" property=\"dct:title\"><strong>AI in Media and Society<\/strong><\/span> by <span xmlns:cc=\"http:\/\/creativecommons.org\/ns#\" property=\"cc:attributionName\">Mindy McAdams<\/span> is licensed under a <a rel=\"license\" href=\"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/\">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License<\/a>.<br>\nInclude the author&#8217;s name (Mindy McAdams) and a link to the original post in any reuse of this content.<\/small><\/p>\n\n\n\n<p>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let&#8217;s begin at the beginning, with Attention Is All You Need (Vaswani et al., 2017). This is a conference paper with eight authors, six of whom then worked at Google. 
They contended that neither recurrent neural networks nor convolutional neural networks are necessary for machine translation of languages, and hence the Transformer, &#8220;a new simple&hellip; <a class=\"more-link\" href=\"https:\/\/www.macloo.com\/ai\/2021\/05\/30\/attention-in-machine-learning-and-nlp\/\">Continue reading <span class=\"screen-reader-text\">Attention, in machine learning and NLP<\/span> <span class=\"meta-nav\" aria-hidden=\"true\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[2],"tags":[147,97,136],"class_list":["post-744","post","type-post","status-publish","format-standard","hentry","category-nlp","tag-attention","tag-language","tag-transformers"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/744","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/comments?post=744"}],"version-history":[{"count":10,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/744\/revisions"}],"predecessor-version":[{"id":768,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/744\/revisions\/768"}],"wp:attachment":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/media?parent=744"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.macloo.com
\/ai\/wp-json\/wp\/v2\/categories?post=744"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/tags?post=744"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}