{"id":672,"date":"2021-05-14T12:58:59","date_gmt":"2021-05-14T16:58:59","guid":{"rendered":"https:\/\/www.macloo.com\/ai\/?p=672"},"modified":"2021-05-14T13:14:13","modified_gmt":"2021-05-14T17:14:13","slug":"figuring-it-out-transformers-for-nlp","status":"publish","type":"post","link":"https:\/\/www.macloo.com\/ai\/2021\/05\/14\/figuring-it-out-transformers-for-nlp\/","title":{"rendered":"Figuring It Out: Transformers for NLP"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">It was a challenge for me to figure out how to teach non\u2013computer science students about <strong>word vectors.<\/strong> I wanted them to have a clear idea of how words and their meanings are represented for use in an AI system \u2014 otherwise, I worried they would assume something like a written dictionary with text and definitions. I also wanted them to know that it wasn&#8217;t something simple like &#8220;each word has a numerical code assigned to it.&#8221; So we spent some time talking about what a <a rel=\"noreferrer noopener\" href=\"https:\/\/www.macloo.com\/ai\/2020\/09\/28\/encoding-language-for-a-machine-learning-system\/\" target=\"_blank\">vector<\/a> is and what &#8220;<a rel=\"noreferrer noopener\" href=\"https:\/\/www.macloo.com\/ai\/2020\/09\/24\/imagining-words-as-numbers-in-n-dimensional-space\/\" target=\"_blank\">n-dimensional space<\/a>&#8221; means.<\/p>\n\n\n\n<div class=\"wp-block-image is-style-default\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"540\" src=\"https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2021\/05\/word_vectors.png\" alt=\"\" class=\"wp-image-681\" srcset=\"https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2021\/05\/word_vectors.png 960w, https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2021\/05\/word_vectors-300x169.png 300w, https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2021\/05\/word_vectors-768x432.png 768w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" 
\/><figcaption><em>Slide above by Mindy McAdams (copyright \u00a9 2021)<\/em><\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image is-style-default\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"540\" src=\"https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2021\/05\/complete_vector.png\" alt=\"\" class=\"wp-image-682\" srcset=\"https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2021\/05\/complete_vector.png 960w, https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2021\/05\/complete_vector-300x169.png 300w, https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2021\/05\/complete_vector-768x432.png 768w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/><figcaption><em>Slide above by Mindy McAdams (copyright \u00a9 2021)<\/em><\/figcaption><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Now I need to work out how to teach them about <strong>transformers.<\/strong> I found <a rel=\"noreferrer noopener\" href=\"https:\/\/hellofuture.orange.com\/en\/the-gpt-3-language-model-revolution-or-evolution\/\" target=\"_blank\">a surprisingly clear article<\/a> at Orange.com (formerly France T\u00e9l\u00e9com), on their Hello Future website about research and innovation. I&#8217;m going to quote a large section from that article:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u201cOriginally, in 2013, <strong>word embeddings<\/strong> (such as Word2Vec, Glove, or Fasttext) were able to capture representations of words in the form of <strong>vectors<\/strong> taking into account the context of neighboring words in large volumes of text. Two words appearing in similar contexts were \u2018embedded\u2019 into N-dimensional space, to neighboring points in this space. This approach has led to significant advances in the field of NLP, but also has its limitations. 
<strong>From 2018 a new way of generating these word vectors emerged.<\/strong> Rather than selecting the vector of a word in a previously learnt static \u2018dictionary,\u2019 a model is responsible for dynamically generating the vector representation of a word. A word is thus projected to a vector not only according to its prior meaning, but also <strong>according to the context in which it appears.<\/strong> The models for effective realization of these contextual projections (BERT, ELMO and derivatives, GPT and its successors) are based on a simple yet powerful architecture called <strong>Transformer<\/strong>.\u201d (Spelling and punctuation edited for American English.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I know that paragraph might not make sense if you haven&#8217;t already learned about word vectors. The key is that transformers give the machine a more accurate representation of <em>what a word or sentence means<\/em> by taking into account its context in the current input. You still have a language model, previously trained on a large <a rel=\"noreferrer noopener\" href=\"https:\/\/www.macloo.com\/ai\/2020\/09\/28\/encoding-language-for-a-machine-learning-system\/\" target=\"_blank\">corpus<\/a>, but the transformer analyzes the present text input in a more holistic way, <em>transforming<\/em> the vectors as it goes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Again quoting from the Orange.com article: \u201cWhile previous approaches &#8230; could model contextual dependencies, they were always constrained by referencing words by their <strong>positions<\/strong> [in the sentence]. <em>Attention<\/em> is about referencing by <strong>content<\/strong>. 
Instead of looking for relationships with other words in the context at given positions, attention allows you to search for <strong>relationships with all words in the context,<\/strong> and through a very effective implementation, it allows you to rely on the most similar words to improve prediction, whatever their position in context.\u201d<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The role of the <strong>attention<\/strong> module is explained in a 2017 paper that, according to Google Scholar, has been cited more than 20,000 times: <a rel=\"noreferrer noopener\" href=\"https:\/\/arxiv.org\/abs\/1706.03762\" target=\"_blank\">Attention Is All You Need<\/a>. See the PDF for diagrams of the Transformer network architecture.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Language models built on the transformer architecture include <a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/BERT_(language_model)\" target=\"_blank\">BERT<\/a> (developed by Google, which uses it in Google Search) and <a rel=\"noreferrer noopener\" href=\"https:\/\/arxiv.org\/abs\/2005.14165v2\" target=\"_blank\">GPT-3<\/a>; <a rel=\"noreferrer noopener\" href=\"https:\/\/allennlp.org\/elmo\" target=\"_blank\">ELMo<\/a>, a slightly earlier contextual model often grouped with them, is built on recurrent (LSTM) networks rather than a transformer. These so-called <strong>large language models<\/strong> <a rel=\"noreferrer noopener\" href=\"https:\/\/www.morningbrew.com\/emerging-tech\/stories\/2021\/03\/29\/one-biggest-advancements-ai-also-sparked-fierce-debate-heres\" target=\"_blank\">have raised many concerns<\/a>, particularly around ethics, as their interior processes are a black box, and their immense training data has included biased and toxic texts. 
The <a rel=\"noreferrer noopener\" href=\"https:\/\/hellofuture.orange.com\/en\/the-gpt-3-language-model-revolution-or-evolution\/\" target=\"_blank\">Orange.com article<\/a> includes two charts that illustrate differences among BERT, ELMo, and three generations of GPT.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">An important aspect of transformers is that these large language models are trained on <em>unlabeled<\/em> data. When developing applications based on such models, good results can be obtained with only a small amount of additional training data (\u201cfew-shot learning\u201d).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Orange, like many other companies, is using large language models for classification and information-extraction tasks such as &#8220;sentiment analysis, personal data detection, detection and identification of named entities, syntactic dependency analysis, semantic parsing, co-reference resolution,&#8221; and question answering. 
These tasks involve customer-service applications as well as internal data analysis.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Much of this post is based on the article <a rel=\"noreferrer noopener\" href=\"https:\/\/hellofuture.orange.com\/en\/the-gpt-3-language-model-revolution-or-evolution\/\" target=\"_blank\">The GPT-3 language model, revolution or evolution?<\/a> (February 2021).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a rel=\"license\" href=\"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/\"><img decoding=\"async\" alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https:\/\/i.creativecommons.org\/l\/by-nc-nd\/4.0\/88x31.png\"><\/a><br>\n<small><span xmlns:dct=\"http:\/\/purl.org\/dc\/terms\/\" property=\"dct:title\"><strong>AI in Media and Society<\/strong><\/span> by <span xmlns:cc=\"http:\/\/creativecommons.org\/ns#\" property=\"cc:attributionName\">Mindy McAdams<\/span> is licensed under a <a rel=\"license\" href=\"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/\">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License<\/a>.<br>\nInclude the author&#8217;s name (Mindy McAdams) and a link to the original post in any reuse of this content.<\/small><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It was a challenge for me to figure out how to teach non\u2013computer science students about word vectors. 
I wanted them to have a clear idea of how words and their meanings are represented for use in an AI system \u2014 otherwise, I worried they would assume something like a written dictionary with text and&hellip; <a class=\"more-link\" href=\"https:\/\/www.macloo.com\/ai\/2021\/05\/14\/figuring-it-out-transformers-for-nlp\/\">Continue reading <span class=\"screen-reader-text\">Figuring It Out: Transformers for NLP<\/span> <span class=\"meta-nav\" aria-hidden=\"true\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[2],"tags":[137,138,102,97,136,93],"class_list":["post-672","post","type-post","status-publish","format-standard","hentry","category-nlp","tag-bert","tag-elmo","tag-gpt3","tag-language","tag-transformers","tag-vectors"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/672","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/comments?post=672"}],"version-history":[{"count":10,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/672\/revisions"}],"predecessor-version":[{"id":691,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/672\/revisions\/691"}],"wp:attachment":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/media?parent=672"}],"wp:term":[{"taxonomy":"cate
gory","embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/categories?post=672"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/tags?post=672"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}