{"id":276,"date":"2020-09-07T09:00:25","date_gmt":"2020-09-07T13:00:25","guid":{"rendered":"https:\/\/www.macloo.com\/ai\/?p=276"},"modified":"2020-09-07T09:52:57","modified_gmt":"2020-09-07T13:52:57","slug":"comment-moderation-as-a-machine-learning-case-study","status":"publish","type":"post","link":"https:\/\/www.macloo.com\/ai\/2020\/09\/07\/comment-moderation-as-a-machine-learning-case-study\/","title":{"rendered":"Comment moderation as a machine learning case study"},"content":{"rendered":"\n<p>Continuing my summary of the\u00a0lessons in\u00a0<a rel=\"noreferrer noopener\" href=\"https:\/\/newsinitiative.withgoogle.com\/training\/course\/introduction-to-machine-learning\" target=\"_blank\">Introduction to Machine Learning<\/a>\u00a0from the <strong>Google News Initiative,<\/strong> today I&#8217;m looking at Lesson 5 of 8, &#8220;Training your Machine Learning model.&#8221; Previous lessons were covered <a href=\"https:\/\/www.macloo.com\/ai\/2020\/09\/03\/googles-machine-learning-course-for-journalists\/\">here<\/a> and <a href=\"https:\/\/www.macloo.com\/ai\/2020\/09\/04\/examples-of-machine-learning-in-journalism\/\">here<\/a>.<\/p>\n\n\n\n<p>Now we get into the real &#8220;how it works&#8221; details \u2014 but still without looking at any code or computer languages. <\/p>\n\n\n\n<p>The &#8220;lesson&#8221; (actually just a text) covers a common case for news organizations: <strong>comment moderation. <\/strong>If you permit people to comment on articles on your site, machine learning can be used to identify offensive comments and flag them so that human editors can review them.<\/p>\n\n\n\n<p>With <strong>supervised learning<\/strong> (one of three approaches included in machine learning; <a href=\"https:\/\/www.macloo.com\/ai\/2020\/09\/04\/examples-of-machine-learning-in-journalism\/\">see previous post here<\/a>), you need <strong>labeled data<\/strong>. 
In this case, that means complete comments \u2014 real ones \u2014 that have already been labeled by humans as offensive or not. You need an equally large number of both kinds of comments. Creating this dataset of comments is discussed more fully in the lesson.<\/p>\n\n\n\n<p>You will also need to choose <strong>a machine learning algorithm<\/strong>. Comments are text, obviously, so you&#8217;ll select among the existing algorithms that process language (rather than those that handle images and video). There are many from which to choose. As the lesson comes from Google, it suggests you use a Google algorithm.<\/p>\n\n\n\n<p>In all AI courses and training modules I&#8217;ve looked at, this step is boiled down to &#8220;Here, we&#8217;ll use this one,&#8221; without providing a comparison of the options available. This is something I would expect an experienced ML practitioner to be able to explain \u2014 why are they using X algorithm instead of Y algorithm for this particular job? Certainly there are reasons why one text-analysis algorithm might be better for analyzing comments on news articles than another one.<\/p>\n\n\n\n<p><strong>What is the algorithm doing? <\/strong>It is creating and refining a <strong>model<\/strong>. The more accurate the final model is, the better it will be at <em>predicting<\/em> whether a comment is offensive. Note that the model doesn&#8217;t actually <em>know<\/em> anything. It is a computer&#8217;s representation of a &#8220;world&#8221; of comments in which <em>some<\/em> \u2014 with particular features or attributes perceived in the training data \u2014 are rated as offensive, and <em>others<\/em> \u2014 which lack a sufficient quantity of those features or attributes \u2014 are rated as <em>not likely to be<\/em> offensive.<\/p>\n\n\n\n<p>The lesson goes on to discuss false positives and false negatives, which are possibly unavoidable \u2014 but the fewer, the better. 
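To make the workflow just described concrete (the lesson itself, as noted, shows no code), here is a minimal sketch in Python. The comments, the labels, and the tiny from-scratch Naive Bayes scorer are all invented for illustration; a real moderation model would be trained on many thousands of human-labeled comments using an established library, not six toy examples.

```python
# A minimal sketch of the supervised workflow described above, assuming a
# tiny invented dataset and a from-scratch Naive Bayes scorer. This only
# illustrates the idea: labeled examples go in, a "model" (here, word
# counts per label) comes out, and the model predicts labels for new text.
import math
from collections import Counter

# Hypothetical human-labeled comments: 1 = offensive, 0 = not offensive.
labeled_comments = [
    ("you are an idiot and a fraud", 1),
    ("what a stupid worthless take", 1),
    ("go away you pathetic troll", 1),
    ("thanks for this insightful article", 0),
    ("great reporting and solid sourcing", 0),
    ("interesting point about the data", 0),
]

def train(data):
    """The 'model' is just per-label word counts learned from the labels."""
    counts = {0: Counter(), 1: Counter()}
    totals = {0: 0, 1: 0}
    for text, label in data:
        for word in text.split():
            counts[label][word] += 1
            totals[label] += 1
    return counts, totals

def predict(model, text):
    """Naive Bayes with add-one smoothing; the higher-scoring label wins."""
    counts, totals = model
    vocab = len(set(counts[0]) | set(counts[1]))
    def score(label):
        s = math.log(0.5)  # equal priors, matching the balanced dataset
        for word in text.split():
            s += math.log((counts[label][word] + 1) / (totals[label] + vocab))
        return s
    return max((0, 1), key=score)

model = train(labeled_comments)
print(predict(model, "what an idiot"))              # 1: flagged for review
print(predict(model, "thanks for the great data"))  # 0: not flagged
```

Note that the model "knows" nothing: a new insult phrased with words absent from the training data would sail through unflagged, which is exactly the false-negative problem.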
We especially want to eliminate false negatives, which are offensive comments <em>not flagged<\/em> by the system.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>&#8220;The most common reason for bias creeping in is when your training data isn&#8217;t truly representative of the population that your model is making predictions on.&#8221;<\/p><cite>\u2014Lesson 6, Bias in Machine Learning<\/cite><\/blockquote>\n\n\n\n<p>Lesson 6 in the course covers <strong>bias<\/strong> in machine learning. A quick way to understand how ML systems come to be biased is to consider the comment-moderation example above. What if the labeled data (real comments) included a lot of comments <em>offensive to women<\/em> \u2014 but all of the labels were created by a team of men, with no women on the team? Surely the men would miss some offensive comments that women team members would have caught. The training data are flawed because a significant number of comments are <em>labeled incorrectly.<\/em><\/p>\n\n\n\n<p>There&#8217;s a pretty good video attached to this lesson. 
It&#8217;s only 2.5 minutes, and it illustrates <strong>interaction bias, latent bias,<\/strong> and <strong>selection bias<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-embed-youtube aligncenter wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<div class=\"jetpack-video-wrapper\"><iframe loading=\"lazy\" title=\"Machine Learning and Human Bias\" width=\"739\" height=\"416\" src=\"https:\/\/www.youtube.com\/embed\/59bMh59JQDo?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe><\/div>\n<\/div><\/figure>\n\n\n\n<p>Lesson 6 also includes a list of <strong>questions you should ask<\/strong> to help you recognize potential bias in your dataset.<\/p>\n\n\n\n<p>It was interesting to me that the lesson omits a discussion of how the <strong>accuracy of labels<\/strong> is really just as important as having representative data for training and testing in supervised learning. 
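One concrete way to act on that concern is to compare how different annotator groups labeled the same comments, as in the men-only labeling team scenario above. The sketch below uses invented IDs and labels purely for illustration; a skewed, one-sided pattern of disagreement is the signal to look for.

```python
# A small sketch of how a labeling problem could be surfaced: compare
# labels from two hypothetical annotator groups on the same comments.
# All data here is invented for illustration.
annotations = [
    # (comment_id, group_a_label, group_b_label); 1 = offensive
    (101, 1, 1),
    (102, 0, 1),  # group B flags a comment group A passed
    (103, 0, 0),
    (104, 0, 1),  # the same one-sided disagreement again
    (105, 1, 1),
]

disagreements = [(c, a, b) for c, a, b in annotations if a != b]
rate = len(disagreements) / len(annotations)
print(f"{rate:.0%} disagreement on comments {[c for c, _, _ in disagreements]}")
# If one group consistently flags comments the other passes, the labels
# themselves -- not just the sample of comments -- are a likely source of bias.
```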
This issue is covered in <a href=\"https:\/\/www.macloo.com\/ai\/2020\/08\/19\/imagenet-and-labels-for-data\/\">ImageNet and labels for data<\/a>, an earlier post here.<\/p>\n\n\n\n<p><a rel=\"license\" href=\"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/\"><img decoding=\"async\" alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https:\/\/i.creativecommons.org\/l\/by-nc-nd\/4.0\/88x31.png\"><\/a><br>\n<small><span xmlns:dct=\"http:\/\/purl.org\/dc\/terms\/\" property=\"dct:title\"><strong>AI in Media and Society<\/strong><\/span> by <span xmlns:cc=\"http:\/\/creativecommons.org\/ns#\" property=\"cc:attributionName\">Mindy McAdams<\/span> is licensed under a <a rel=\"license\" href=\"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/\">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License<\/a>.<br>\nInclude the author&#8217;s name (Mindy McAdams) and a link to the original post in any reuse of this content.<\/small><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Continuing my summary of the\u00a0lessons in\u00a0Introduction to Machine Learning\u00a0from the Google News Initiative, today I&#8217;m looking at Lesson 5 of 8, &#8220;Training your Machine Learning model.&#8221; Previous lessons were covered here and here. Now we get into the real &#8220;how it works&#8221; details \u2014 but still without looking at any code or computer languages. 
The&hellip; <a class=\"more-link\" href=\"https:\/\/www.macloo.com\/ai\/2020\/09\/07\/comment-moderation-as-a-machine-learning-case-study\/\">Continue reading <span class=\"screen-reader-text\">Comment moderation as a machine learning case study<\/span> <span class=\"meta-nav\" aria-hidden=\"true\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[43,6,47,5],"tags":[11,56,19,18],"class_list":["post-276","post","type-post","status-publish","format-standard","hentry","category-algorithms","category-ethics-and-bias","category-journalism","category-machine-learning","tag-learning","tag-model","tag-supervised_learning","tag-training"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/276","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/comments?post=276"}],"version-history":[{"count":10,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/276\/revisions"}],"predecessor-version":[{"id":289,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/posts\/276\/revisions\/289"}],"wp:attachment":[{"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/media?parent=276"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/categories?post=276"},
{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.macloo.com\/ai\/wp-json\/wp\/v2\/tags?post=276"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}