七爪源码: Implementing Bag of Words in JavaScript

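Machine learning concepts for JavaScript developers

Recently I've been getting my hands dirty with machine learning, and I came across some great ideas in natural language processing: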

  1. Bag Of Words
  2. IDF
  3. TF-IDF

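So I decided to work through them with simple examples in JavaScript, to make the concepts clearer to myself.

Let's start with bag of words.

Suppose we have some product reviews, such as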

const reviews = [
  { text: "this product is amazing", score: 1 },
  { text: "this product is great", score: 1 },
  { text: "this product is horrible", score: 0 },
  { text: "this product is bad", score: 0 },
];

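Each review has two things: the text, and a score that says whether the review is positive (1) or negative (0).

Looking at the data, we can see that the positive reviews contain words like "amazing" and "great", while the negative reviews contain words like "horrible" and "bad".

So from this sample data we can say: if a review has words such as "amazing" or "great", we classify it as positive (1), and if it has words such as "horrible" or "bad", we classify it as negative (0).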
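To make that rule concrete, here is a minimal sketch of it as code. The names `positiveWords`, `negativeWords`, and `classifyByKeyword` are hypothetical, and the word lists are simply the ones visible in the four sample reviews, so this is an illustration of the idea rather than a real classifier.

```javascript
// Hypothetical word lists, taken only from the four sample reviews.
const positiveWords = new Set(["amazing", "great"]);
const negativeWords = new Set(["horrible", "bad"]);

// Returns 1 (positive), 0 (negative), or null when no known word appears.
const classifyByKeyword = (text) => {
  const words = text.split(" ");
  if (words.some((word) => positiveWords.has(word))) return 1;
  if (words.some((word) => negativeWords.has(word))) return 0;
  return null;
};

console.log(classifyByKeyword("this product is great"));    // 1
console.log(classifyByKeyword("this product is horrible")); // 0
```

A hand-written rule like this only works for words we list ourselves, which is why the text needs to be turned into numbers an algorithm can learn from.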

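That is the theory, and as humans we understand it. But the system (the ML algorithm) only understands numbers or vectors, so we need some way to turn our review text into numeric data.

Bag of words is a very simple representation: it treats a piece of text as a bag containing its own words.

So first we need to split each sentence into words, and then collect all the unique words.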

["this", "product", "is", "amazing"]
["this", "product", "is", "great"]
["this", "product", "is", "horrible"]
["this", "product", "is", "bad"]
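The splitting step above can be sketched in a few lines. The `tokenize` helper is a hypothetical name, and lowercasing plus splitting on whitespace is an assumption that happens to be enough for this sample; real text would also need punctuation handling.

```javascript
const sentences = [
  "this product is amazing",
  "this product is great",
  "this product is horrible",
  "this product is bad",
];

// Assumption: lowercase and split on runs of whitespace, dropping empty strings.
const tokenize = (sentence) =>
  sentence.toLowerCase().split(/\s+/).filter(Boolean);

const tokens = sentences.map(tokenize);
console.log(tokens[0]); // [ 'this', 'product', 'is', 'amazing' ]
```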

The next step is to keep only the unique words across all the sentences

{'this','product','is','amazing','great','horrible','bad'}

Now, for each sentence, we count how many times each word occurs in it

For example, for the word "this" we get

{ [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }

Each pair is [row index, count]:

[0, 1]: in the first row, "this" is used once
[1, 1]: in the second row, "this" is used once
[2, 1]: in the third row, "this" is used once
[3, 1]: in the fourth row, "this" is used once

In the same way, we get this data for every word across all rows

this => { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
product => { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
is => { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
amazing => { [ 0, 1 ] }
great => { [ 1, 1 ] }
horrible => { [ 2, 1 ] }
bad => { [ 3, 1 ] }

Let's look at some code that produces these results in JavaScript.

We will have two main functions, which anyone who has done ML in Python with sklearn has probably seen: fit and transform.

const reviews = [
  { text: "this product is amazing", score: 1 },
  { text: "this product is great", score: 1 },
  { text: "this product is horrible", score: 0 },
  { text: "this product is bad", score: 0 },
];

// fit: learn the vocabulary (the set of unique words) from the corpus.
const fit = (corpus) => {
  const uniqueWords = new Set();
  corpus.forEach((document) => {
    document.split(" ").forEach((word) => {
      uniqueWords.add(word);
    });
  });
  return uniqueWords;
};

// transform: for each vocabulary word, map document index -> occurrence count.
const transform = (corpus, uniqueWords) => {
  const wordCounts = new Map();
  corpus.forEach((document, index) => {
    document.split(" ").forEach((word) => {
      // Skip words that were not seen during fit.
      if (!uniqueWords.has(word)) return;
      if (!wordCounts.has(word)) {
        wordCounts.set(word, new Map());
      }
      const wordMap = wordCounts.get(word);
      wordMap.set(index, (wordMap.get(index) || 0) + 1);
    });
  });
  return wordCounts;
};

To run these functions, we call them like this

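const uniqueWords = fit(reviews.map((a) => a.text));
const wordCounts = transform(reviews.map((a) => a.text), uniqueWords);

// Map.prototype.forEach passes (value, key): the per-document counts, then the word.
wordCounts.forEach((counts, word) => {
  console.log(word, counts.entries());
});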

The output looks like this

this [Map Entries] { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
product [Map Entries] { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
is [Map Entries] { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
amazing [Map Entries] { [ 0, 1 ] }
great [Map Entries] { [ 1, 1 ] }
horrible [Map Entries] { [ 2, 1 ] }
bad [Map Entries] { [ 3, 1 ] }
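Since the goal stated earlier is to hand the algorithm numbers or vectors, it may help to see the same counts laid out as one fixed-length vector per review. The sketch below is one possible way to do that and is not part of the code above; `texts`, `vocabulary`, and `toVector` are hypothetical names, and splitting on a single space is assumed to be enough for this sample.

```javascript
const texts = [
  "this product is amazing",
  "this product is great",
  "this product is horrible",
  "this product is bad",
];

// Vocabulary in first-seen order; a word's position is its column index.
const vocabulary = [...new Set(texts.flatMap((text) => text.split(" ")))];

// One number per vocabulary word: how often it occurs in the given text.
const toVector = (text) => {
  const words = text.split(" ");
  return vocabulary.map((term) => words.filter((word) => word === term).length);
};

const vectors = texts.map(toVector);
console.log(vocabulary); // [ 'this', 'product', 'is', 'amazing', 'great', 'horrible', 'bad' ]
console.log(vectors[0]); // [ 1, 1, 1, 1, 0, 0, 0 ]
console.log(vectors[3]); // [ 1, 1, 1, 0, 0, 0, 1 ]
```

Every review becomes a row of the same length, so the first three columns ("this", "product", "is") are identical across reviews and only the last four columns tell them apart.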

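Depending on the feedback and engagement on this post, I will write more about implementing IDF and TF-IDF in pure JavaScript.

If I have made any mistakes, please correct me.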

