{"id":458,"date":"2020-09-28T15:16:42","date_gmt":"2020-09-28T19:16:42","guid":{"rendered":"https:\/\/www.macloo.com\/ai\/?p=458"},"modified":"2020-09-28T15:16:42","modified_gmt":"2020-09-28T19:16:42","slug":"encoding-language-for-a-machine-learning-system","status":"publish","type":"post","link":"https:\/\/www.macloo.com\/ai\/2020\/09\/28\/encoding-language-for-a-machine-learning-system\/","title":{"rendered":"Encoding language for a machine learning system"},"content":{"rendered":"\n<p>The vocabulary of medicine is different from the vocabulary of physics. If you&#8217;re building a vocabulary for use in machine learning, you need to start with a corpus \u2014 a collection of text \u2014 that suits your project. A <em>general-purpose vocabulary<\/em> in English might be derived from, say, 6 million articles from Google News. From this, you could build a vocabulary of, say, the 1 million most common words.<\/p>\n\n\n\n<p>Although I surely do not understand all the math, last week I read <a rel=\"noreferrer noopener\" href=\"https:\/\/arxiv.org\/abs\/1301.3781\" target=\"_blank\">Efficient Estimation of Word Representations in Vector Space<\/a>, a 2013 research article written by four Google engineers. They described their work on a then-new, more efficient way of accurately predicting word meanings \u2014 the outcome being <a rel=\"noreferrer noopener\" href=\"https:\/\/code.google.com\/archive\/p\/word2vec\/\" target=\"_blank\">word2vec<\/a>, a tool to produce a set of word vectors.<\/p>\n\n\n\n<p>After publishing <a href=\"https:\/\/www.macloo.com\/ai\/2020\/09\/24\/imagining-words-as-numbers-in-n-dimensional-space\/\">a related post<\/a> last week, I knew I still didn&#8217;t have a clear picture in my mind of where the word vectors fit into various uses of machine learning. And how do the word vectors get made, anyhow? While <strong>word2vec<\/strong> is <em>not the only system<\/em> you can use to get word vectors, it is well known and widely used. 
(Other systems: <a rel=\"noreferrer noopener\" href=\"https:\/\/fasttext.cc\/\" target=\"_blank\">fastText<\/a>, <a rel=\"noreferrer noopener\" href=\"https:\/\/nlp.stanford.edu\/projects\/glove\/\" target=\"_blank\">GloVe<\/a>.)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How the vocabulary is created<\/h2>\n\n\n\n<p>First, the <strong>corpus<\/strong>: You might choose a corpus that suits your project (such as a collection of medical texts, or a set of research papers about physics), and <em>feed it into<\/em> word2vec (or one of the other systems). At the end you will have a file \u2014 a dataset. (Note: the corpus should be a <em>very large<\/em> collection.)<\/p>\n\n\n\n<p>Alternatively, you might use a dataset that already exists \u2014 such as <strong>3 million words and phrases<\/strong>, each with 300 vector values, trained on a Google News dataset of about 100 billion words (linked on the <a rel=\"noreferrer noopener\" href=\"https:\/\/code.google.com\/archive\/p\/word2vec\/\" target=\"_blank\">word2vec homepage<\/a>): <em>GoogleNews-vectors-negative300<\/em>. This is <strong>a file you can download<\/strong> and use with a neural network or other programs or code libraries. The size of the file is 1.5 gigabytes.<\/p>\n\n\n\n<p>What word2vec <em>does<\/em> is compute the vector representations of words. What word2vec <em>produces<\/em> is a single computer file that contains those words and a list of vector values for each word (or phrase).<\/p>\n\n\n\n<p>As an alternative to Google News, you might use the full text of Wikipedia as your corpus, if you wanted a general English-language vocabulary.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The breakthrough of word2vec<\/h2>\n\n\n\n<p>Back to that (surprisingly readable) paper by the Google engineers: They set out to solve a problem: <em>scale<\/em>. There were already systems that ingested a corpus and produced word vectors, but they were limited. 
Tomas Mikolov and his colleagues at Google wanted to use a bigger corpus (billions of words) to produce a bigger vocabulary (millions of words) with high-quality vectors, which meant more dimensions, e.g., 300 instead of 50 to 100.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>&#8220;Because of the much lower computational complexity, it is possible to compute very accurate high-dimensional word vectors from a much larger data set.&#8221;<\/p><cite>\u2014Mikolov <em>et al.,<\/em> 2013<\/cite><\/blockquote>\n\n\n\n<p>With <strong>more dimensions per word vector,<\/strong> the vocabulary represents not only that <em>bigger<\/em> is related to <em>big<\/em> and <em>biggest<\/em> but also that <em>big<\/em> is to <em>bigger<\/em> as <em>small<\/em> is to <em>smaller<\/em>. Algebra can be used on the vector representations to return a correct answer (often, not always) \u2014 leading to a powerful discovery that substitutes for language understanding: Take the vector for <em>king, <\/em>subtract the vector for <em>man, <\/em>and add the vector for <em>woman<\/em>. What is the answer returned? It is the vector for <em>queen<\/em>.<\/p>\n\n\n\n<p>Algebraic equations are used to test the quality of the vectors. 
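<\/p>\n\n\n\n<p>The <em>king<\/em>-minus-<em>man<\/em>-plus-<em>woman<\/em> arithmetic can be sketched in a few lines of Python. The vectors below are invented (3 dimensions instead of hundreds) purely to show the mechanics; the answer is the vocabulary word whose vector lies closest, by cosine similarity (the usual measure of closeness for word vectors), to the computed target vector.<\/p>

```python
# Toy sketch of word-vector arithmetic. These 3-dimensional vectors are
# invented for illustration; real word2vec vectors have hundreds of
# dimensions and are learned from a large corpus.
import math

vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by finding the word nearest b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))  # prints: queen
```

<p>With a real model, the same nearest-word search runs over a vocabulary of a million or more words, which is why efficiency mattered so much to the researchers.<\/p>

<p>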
Some imperfections can be seen in the table below.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"507\" src=\"https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2020\/09\/word2vec_table.png\" alt=\"\" class=\"wp-image-471\" srcset=\"https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2020\/09\/word2vec_table.png 1024w, https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2020\/09\/word2vec_table-300x149.png 300w, https:\/\/www.macloo.com\/ai\/wp-content\/uploads\/2020\/09\/word2vec_table-768x380.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption><em>From Mikolov et al., 2013; color and circle added<\/em><\/figcaption><\/figure><\/div>\n\n\n\n<p>Mikolov and his colleagues wanted to reduce the time required for training the system that assigns the vectors to words. If you&#8217;re using only one computer, and the corpus is very large, training on a neural network could take <em>days<\/em> or even <em>weeks<\/em>. They tested various models and concluded that simpler models (<em>not<\/em> neural networks) could be trained faster, thus allowing them to use a larger corpus <em>and<\/em> more vectors (more dimensions).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How do you know if the vectors are good?<\/h2>\n\n\n\n<p>The researchers defined <strong>a test set<\/strong> consisting of 8,869 semantic questions and 10,675 syntactic questions. Each question begins with a pair of associated words, as seen in the highlighted &#8220;Relationship&#8221; column in the table above. The circled answer, <em>small: larger,<\/em> is a wrong answer; synonyms are not good enough. The authors noted that &#8220;reaching 100% accuracy is likely to be impossible,&#8221; but even so, a high percentage of answers are correct.<\/p>\n\n\n\n<p>I am not sure how the test set determined correct vs. incorrect answers. 
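<\/p>\n\n\n\n<p>One plausible scoring scheme, consistent with the circled wrong answer in the table (a synonym such as <em>larger<\/em> for <em>smaller<\/em> counts as incorrect), is an exact string match between the model&#8217;s predicted word and the expected word. A minimal sketch, with invented questions and predictions:<\/p>

```python
# Hedged sketch of exact-match scoring on analogy questions.
# The questions and the model's "predictions" here are invented examples;
# only an exact match with the expected answer counts as correct.
questions = [
    # (question, expected answer, model's predicted answer)
    ("big : bigger :: small : ?",    "smaller", "smaller"),
    ("big : bigger :: cold : ?",     "colder",  "colder"),
    ("quick : quickly :: slow : ?",  "slowly",  "slow"),    # wrong: not an exact match
    ("small : smaller :: large : ?", "larger",  "larger"),
]

correct = sum(1 for _q, expected, predicted in questions if predicted == expected)
accuracy = correct / len(questions)
print(f"{correct}/{len(questions)} correct = {accuracy:.0%}")  # prints: 3/4 correct = 75%
```

<p>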
<a href=\"https:\/\/aclweb.org\/aclwiki\/Google_analogy_test_set_(State_of_the_art)\" target=\"_blank\" rel=\"noreferrer noopener\">Test sets are complex.<\/a><\/p>\n\n\n\n<p>Mikolov <em>et al.<\/em> compared word vectors obtained from two simpler architectures, <strong>CBOW<\/strong> and <strong>Skip-gram,<\/strong> with word vectors obtained from two types of neural networks. One neural net model was superior to the other. CBOW was superior on syntactic tasks and &#8220;about the same&#8221; as the better neural net on the semantic task. Skip-gram was &#8220;slightly worse on the syntactic task&#8221; than CBOW but better than the neural net; CBOW was &#8220;much better on the semantic part of the test than all the other models.&#8221;<\/p>\n\n\n\n<p>CBOW and Skip-gram are described in the paper.<\/p>\n\n\n\n<p>Another way to test a model for accuracy in semantics is to use the data from the <a rel=\"noreferrer noopener\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/the-microsoft-research-sentence-completion-challenge\/\" target=\"_blank\">Microsoft Research Sentence Completion Challenge<\/a>. It provides 1,040 sentences in which one word has been omitted and four wrong words (&#8220;impostor words&#8221;) provided to replace it, along with the correct one. The task is to choose the correct word from the five given.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p><strong>A word2vec model<\/strong> is <em>trained<\/em> using a text corpus. The final model exists as a <strong>file, <\/strong>which you can use in various language-related machine learning tasks. The file contains words and phrases \u2014 likely more than 1 million words and phrases \u2014 together with a unique list of <strong>vectors<\/strong> for each word. <\/p>\n\n\n\n<p>The vectors represent coordinates for the word. Words that are <em>close to one another<\/em> in the vector space are related either semantically or syntactically. 
If you use a popular already-trained model, the vectors have been rigorously tested. If you use word2vec to build <em>your own model,<\/em> then you need to do the testing.<\/p>\n\n\n\n<p>The model \u2014 this collection of <em>word embeddings<\/em> \u2014 is human-language knowledge for a computer to use. It&#8217;s (obviously) not the same as humans&#8217; knowledge of human language, but it&#8217;s proved to be <em>good enough<\/em> to function well in many different applications.<\/p>\n\n\n\n<p><a rel=\"license\" href=\"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/\"><img decoding=\"async\" alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https:\/\/i.creativecommons.org\/l\/by-nc-nd\/4.0\/88x31.png\"><\/a><br>\n<small><span xmlns:dct=\"http:\/\/purl.org\/dc\/terms\/\" property=\"dct:title\"><strong>AI in Media and Society<\/strong><\/span> by <span xmlns:cc=\"http:\/\/creativecommons.org\/ns#\" property=\"cc:attributionName\">Mindy McAdams<\/span> is licensed under a <a rel=\"license\" href=\"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/\">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License<\/a>.<br>\nInclude the author&#8217;s name (Mindy McAdams) and a link to the original post in any reuse of this content.<\/small><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The vocabulary of medicine is different from the vocabulary of physics. If you&#8217;re building a vocabulary for use in machine learning, you need to start with a corpus \u2014 a collection of text \u2014 that suits your project. A general-purpose vocabulary in English might be derived from, say, 6 million articles from Google News. 
From&hellip; <a class=\"more-link\" href=\"https:\/\/www.macloo.com\/ai\/2020\/09\/28\/encoding-language-for-a-machine-learning-system\/\">Continue reading <span class=\"screen-reader-text\">Encoding language for a machine learning system<\/span> <span class=\"meta-nav\" aria-hidden=\"true\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[2],"tags":[97,93,98],"class_list":["post-458","post","type-post","status-publish","format-standard","hentry","category-nlp","tag-language","tag-vectors","tag-word2vec"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/458","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/comments?post=458"}],"version-history":[{"count":10,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/458\/revisions"}],"predecessor-version":[{"id":483,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/458\/revisions\/483"}],"wp:attachment":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/media?parent=458"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/categories?post=458"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/tags?post=458"}],"curies":[{"n
ame":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}