AI Is Learning to Identify Toxic Online Content
Machine-learning systems could help flag hateful, threatening or offensive language.
Social platforms large and small are struggling to keep their communities safe from hate speech, extremist content, harassment and misinformation. One solution might be AI: developing algorithms to detect and alert us to toxic and inflammatory comments and flag them for removal. But such systems face big challenges.
The prevalence of hateful or offensive language online has been growing rapidly in recent years. Social media platforms, relying on thousands of human reviewers, are struggling to moderate the ever-increasing volume of harmful content. In 2019, it was reported that Facebook moderators were at risk of suffering from PTSD as a result of repeated exposure to such distressing content. Outsourcing this work to machine learning can help manage the rising volumes of harmful content. Indeed, many tech giants have been incorporating algorithms into their content moderation[1] for years.
One such example is Google’s Jigsaw[2], a company focusing on making the internet safer. In 2017, it helped create Conversation AI, a collaborative research project aiming to detect toxic comments online. However, a tool produced by that project, called Perspective, faced substantial criticism. One common complaint was that it created a general “toxicity score” that wasn’t flexible enough to serve the varying needs of different platforms. Some websites, for instance, might require detection of threats but not profanity, while others might have the opposite requirements.
Another issue was that the algorithm learned to conflate toxic comments with nontoxic comments that contained words related to gender, sexual orientation, religion or disability. For example, one user reported that simple neutral sentences such as “I am a gay black woman” or “I am a woman who is deaf” resulted in high toxicity scores, while “I am a man” resulted in a low score.
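One rough way to surface this kind of bias is to compare how often clearly nontoxic comments are flagged when they mention an identity term and when they do not. The sketch below is illustrative only and is not the method used by Perspective or in the Kaggle challenges; the term list, the 0.5 threshold and the `score_fn` callable (any function mapping text to a score between 0 and 1) are assumptions.

```python
from typing import Callable, Iterable, List

# Hypothetical identity terms, chosen only for illustration.
IDENTITY_TERMS = {"gay", "black", "woman", "deaf"}


def mentions_identity(text: str) -> bool:
    """True if any whole word in the text is one of the identity terms."""
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    return bool(words & IDENTITY_TERMS)


def flag_rate(scores: List[float], threshold: float) -> float:
    """Fraction of scores at or above the threshold (0.0 for an empty list)."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


def identity_fpr_gap(
    nontoxic_comments: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> float:
    """False-positive rate on identity-mentioning comments minus the rate on the rest.

    Every input comment is assumed to be nontoxic, so any flag is a false positive.
    A gap near 0 is the goal; a large positive gap suggests identity-term bias.
    """
    comments = list(nontoxic_comments)
    with_identity = [score_fn(c) for c in comments if mentions_identity(c)]
    without_identity = [score_fn(c) for c in comments if not mentions_identity(c)]
    return flag_rate(with_identity, threshold) - flag_rate(without_identity, threshold)
```

Passing the same set of neutral sentences through two candidate models and comparing their gaps gives a quick, if crude, sanity check before deployment.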
Following these concerns, the Conversation AI team invited developers to train their own toxicity-detection algorithms and enter them into three competitions (one per year) hosted on Kaggle, a Google subsidiary known for its community of machine learning practitioners, public data sets and challenges. To help train the AI models, Conversation AI released two public data sets containing over one million toxic and nontoxic comments from Wikipedia and a service called Civil Comments. Some comments were rated by far more than 10 annotators (in some cases thousands) because of sampling and the strategies used to enforce rater accuracy.
The goal of the first Jigsaw challenge was to build a multilabel toxic comment classification model with labels such as “toxic”, “severe toxic”, “threat”, “insult”, “obscene”, and “identity hate”. The second and third challenges focused on more specific limitations of the Perspective API: minimizing unintended bias towards pre-defined identity groups and training multilingual models on English-only data.
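To make the multilabel idea concrete: each label gets its own independent probability, so a comment can be both “obscene” and a “threat”, and each platform can act only on the labels it cares about. The following is a minimal sketch under stated assumptions, not the challenge solutions' code: the underscored label identifiers, the example logits and the per-platform threshold dictionaries are all hypothetical.

```python
import math
from typing import Dict, List

# Label names from the first challenge, written as identifiers (an assumption).
LABELS = ["toxic", "severe_toxic", "threat", "insult", "obscene", "identity_hate"]


def sigmoid(x: float) -> float:
    """Map a raw model output (logit) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def label_probabilities(logits: Dict[str, float]) -> Dict[str, float]:
    """Apply a sigmoid to each label separately.

    Unlike a softmax, the probabilities are independent and need not sum to 1.
    """
    return {label: sigmoid(logits[label]) for label in LABELS}


def flagged_labels(probs: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Labels whose probability meets the platform's threshold.

    Labels missing from `thresholds` are ignored by that platform.
    """
    return [
        label
        for label in LABELS
        if label in thresholds and probs[label] >= thresholds[label]
    ]


# Hypothetical logits for one comment, for illustration only.
example_logits = {
    "toxic": 2.0, "severe_toxic": -3.0, "threat": 1.0,
    "insult": -1.0, "obscene": 3.0, "identity_hate": -4.0,
}
probs = label_probabilities(example_logits)

# One platform cares about threats only, another about profanity only.
print(flagged_labels(probs, {"threat": 0.5}))
print(flagged_labels(probs, {"obscene": 0.5}))
```

Keeping the thresholds outside the model is one simple way to let different sites apply different policies to the same set of scores.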
Our team at Unitary, a content-moderation AI company, took inspiration from the best Kaggle solutions and released three models, one for each of the three Jigsaw challenges. While the top Kaggle solutions for each challenge use model ensembles, which average the scores of multiple trained models, we obtained similar performance with only one model per challenge.
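For readers unfamiliar with ensembling, score averaging can be sketched in a few lines. This is an illustration of the general technique rather than any particular Kaggle solution; the three stand-in “models” are hypothetical functions that return fixed scores.

```python
from statistics import fmean
from typing import Callable, Sequence


def ensemble_score(text: str, models: Sequence[Callable[[str], float]]) -> float:
    """Average the toxicity scores that several models assign to the same text."""
    if not models:
        raise ValueError("at least one model is required")
    return fmean(model(text) for model in models)


# Hypothetical stand-ins for trained models; real ones would inspect the text.
def model_a(text: str) -> float:
    return 0.2


def model_b(text: str) -> float:
    return 0.4


def model_c(text: str) -> float:
    return 0.9


print(ensemble_score("some comment", [model_a, model_b, model_c]))
```

Averaging tends to smooth out the errors of individual models, at the cost of running every model on every comment, which is part of why a single model with comparable accuracy is attractive in production.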
While these models perform well in a lot of cases, it is important to also note their limitations. First, these models will work well on examples that are similar to the data they have been trained on. But they are likely to fail if faced with unfamiliar examples of toxic language.
Furthermore, we noticed that the inclusion of insults or profanity in a text comment will almost always result in a high toxicity score, regardless of the intent or tone of the author. As an example, the sentence “I am tired of writing this stupid essay” will give a toxicity score of 99.7 percent, while removing the word “stupid” will change the score to 0.05 percent.
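This kind of sensitivity to single words can be probed by deleting one word at a time and watching how the score moves. The sketch below assumes a `score_fn` callable that maps text to a score between 0 and 1; the keyword-based `toy_score` is a hypothetical stand-in so the example runs on its own, and it is not one of the models described in this article.

```python
from typing import Callable, List, Tuple


def word_ablation(text: str, score_fn: Callable[[str], float]) -> List[Tuple[str, float]]:
    """For each word, report how much the score drops when that word is removed.

    Results are sorted so the most influential word comes first.
    """
    words = text.split()
    base = score_fn(text)
    drops = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        drops.append((word, base - score_fn(reduced)))
    return sorted(drops, key=lambda pair: pair[1], reverse=True)


def toy_score(text: str) -> float:
    """Hypothetical scorer that reacts only to one keyword (illustration only)."""
    return 0.9 if "stupid" in text.lower().split() else 0.1


for word, drop in word_ablation("I am tired of writing this stupid essay", toy_score):
    print(f"{word}: {drop:+.2f}")
```

If one word accounts for nearly the entire score, as in the essay example above, the model is probably keying on vocabulary rather than on the intent of the sentence.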
Lastly, all three models are still likely to exhibit some bias, which can pose ethical concerns when used off-the-shelf[3] to moderate content.
Although there has been considerable progress on automatic detection of toxic speech, we still have a long way to go until models can capture the actual, nuanced meaning behind our language, beyond the simple memorization of particular words or phrases. Of course, investing in better and more representative datasets would yield incremental improvements, but we must go a step further and begin to interpret data in context, a crucial part of understanding online behavior. A seemingly benign text post on social media accompanied by racist symbolism in an image or video would be easily missed if we only looked at the text. We know that lack of context can often be the cause of our own human misjudgments. If AI is to stand a chance of replacing manual effort on a large scale, it is imperative that we give our models the full picture.
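1. content moderation: detection technology for images, text and video that automatically identifies pornographic, advertising, political, violent and sensitive-figure-related content. It reviews the images, text and video that users upload and helps customers reduce the risk of policy violations.
2. Jigsaw: a technology incubator established by Google (formerly the Google Ideas think tank), mainly responsible for building technical tools to reduce and curb online disinformation, harassment and other problems.
3. off the shelf: (of a product) ready-made, not customized. In the text it functions as an adverb, serving as an adverbial.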
By Laura Hanu et al.