Machine Learning Concepts for JavaScript Developers
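Recently I've been getting my hands dirty with machine learning, and I came across some great ideas in natural language processing.
So I decided to work through them with simple JavaScript examples, to make the concepts clearer to myself.
Let's start with bag of words.
Suppose we have some product reviews, like these: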
const reviews = [
  { text: "this product is amazing", score: 1 },
  { text: "this product is great", score: 1 },
  { text: "this product is horrible", score: 0 },
  { text: "this product is bad", score: 0 },
];
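Each review has two parts: the text and a score, which says whether the review is positive (1) or negative (0).
Just by reading them, we can see that the positive reviews contain words like amazing and great, while the negative reviews contain words like horrible and bad.
So from this sample data we can say: if a review contains a word such as amazing or great, classify it as positive (1); if it contains a word such as horrible or bad, classify it as negative (0).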
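To make that rule concrete, here is a minimal sketch of it as code. The keyword lists and the `classifyByKeyword` name are my own assumptions for illustration, taken only from the four sample reviews; this is a hand-written rule, not a trained model.

```javascript
// Hypothetical keyword lists, taken from the four sample reviews only.
const positiveWords = new Set(["amazing", "great"]);
const negativeWords = new Set(["horrible", "bad"]);

// Returns 1 (positive), 0 (negative), or null when no known keyword appears.
const classifyByKeyword = (text) => {
  const words = text.split(" ");
  if (words.some((word) => positiveWords.has(word))) return 1;
  if (words.some((word) => negativeWords.has(word))) return 0;
  return null;
};

console.log(classifyByKeyword("this product is great")); // 1
console.log(classifyByKeyword("this product is bad")); // 0
```

A rule like this only knows the words we typed in by hand, which is why the next step is to turn the text into numbers an algorithm can learn from.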
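That is the theory, and as humans we understand it, but a system (an ML algorithm) only works with numbers or vectors, so we need some way to turn the review text into numeric data.
Bag of words is a very simple representation that treats a text as a bag of the words it contains, ignoring word order.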
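So first we need to split each sentence into words, and then collect all the unique words.
["this", "product", "is", "amazing"]
["this", "product", "is", "great"]
["this", "product", "is", "horrible"]
["this", "product", "is", "bad"]
Taking each distinct word once gives the set of unique words: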
{'this','product','is','amazing','great','horrible','bad'}
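The two steps above can be sketched in a few lines. This is only an illustration: the variable names `texts`, `tokenized` and `vocabulary` are my own, and splitting on a single space is an assumption that works for these clean sample sentences but not for real text with punctuation or mixed case.

```javascript
const texts = [
  "this product is amazing",
  "this product is great",
  "this product is horrible",
  "this product is bad",
];

// Step 1: split every sentence into words on single spaces.
const tokenized = texts.map((text) => text.split(" "));

// Step 2: collect each distinct word once; a Set keeps insertion order.
const vocabulary = new Set(tokenized.flat());

console.log(tokenized[0]);
// [ 'this', 'product', 'is', 'amazing' ]
console.log([...vocabulary]);
// [ 'this', 'product', 'is', 'amazing', 'great', 'horrible', 'bad' ]
```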
Now, for each word, we count how many times it appears in each sentence.
For example, for the word "this" we get
{ [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
Each pair is [row index, count]:
[ 0, 1 ] says that in the first row "this" is used only once
[ 1, 1 ] says that in the second row "this" is used only once
[ 2, 1 ] says that in the third row "this" is used only once
[ 3, 1 ] says that in the fourth row "this" is used only once
In the same way, we get the counts for every word:
this => { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
product => { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
is => { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
amazing => { [ 0, 1 ] }
great => { [ 1, 1 ] }
horrible => { [ 2, 1 ] }
bad => { [ 3, 1 ] }
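Let's look at some code that produces these results in JavaScript.
We will have two main functions, which ML people have probably already seen in Python with sklearn (scikit-learn): fit and transform.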
const reviews = [
  { text: "this product is amazing", score: 1 },
  { text: "this product is great", score: 1 },
  { text: "this product is horrible", score: 0 },
  { text: "this product is bad", score: 0 },
];

// fit: learn the vocabulary, i.e. the set of unique words in the corpus.
const fit = (corpus) => {
  const uniqueWords = new Set();
  corpus.forEach((document) => {
    document.split(" ").forEach((word) => {
      uniqueWords.add(word);
    });
  });
  return uniqueWords;
};

// transform: for each vocabulary word, map document index -> occurrence count.
const transform = (corpus, uniqueWords) => {
  const wordCounts = new Map();
  corpus.forEach((document, index) => {
    document.split(" ").forEach((word) => {
      // Skip words that were not seen during fit.
      if (!uniqueWords.has(word)) {
        return;
      }
      if (!wordCounts.has(word)) {
        wordCounts.set(word, new Map());
      }
      const wordMap = wordCounts.get(word);
      wordMap.set(index, (wordMap.get(index) || 0) + 1);
    });
  });
  return wordCounts;
};
To run these functions, we call them like this:
const texts = reviews.map((review) => review.text);
const uniqueWords = fit(texts);
const wordCounts = transform(texts, uniqueWords);

// Map.prototype.forEach passes (value, key), so the counts come first.
wordCounts.forEach((counts, word) => {
  console.log(word, counts.entries());
});
The output is as follows:
this [Map Entries] { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
product [Map Entries] { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
is [Map Entries] { [ 0, 1 ], [ 1, 1 ], [ 2, 1 ], [ 3, 1 ] }
amazing [Map Entries] { [ 0, 1 ] }
great [Map Entries] { [ 1, 1 ] }
horrible [Map Entries] { [ 2, 1 ] }
bad [Map Entries] { [ 3, 1 ] }
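Since an ML algorithm ultimately wants one numeric vector per review, here is a rough sketch of how the same counts could be laid out that way: one row per review, one column per unique word. The names `sampleTexts`, `vocab` and `toVectors` are my own and do not come from the code above; treat this as one possible layout, not the only one.

```javascript
const sampleTexts = [
  "this product is amazing",
  "this product is great",
  "this product is horrible",
  "this product is bad",
];

// Unique words in first-seen order; each word becomes one column.
const vocab = [...new Set(sampleTexts.flatMap((text) => text.split(" ")))];

// One row per review: how often each vocabulary word occurs in that review.
const toVectors = (corpus, vocabularyList) =>
  corpus.map((document) => {
    const words = document.split(" ");
    return vocabularyList.map(
      (term) => words.filter((word) => word === term).length
    );
  });

const vectors = toVectors(sampleTexts, vocab);
console.log(vocab);
// [ 'this', 'product', 'is', 'amazing', 'great', 'horrible', 'bad' ]
console.log(vectors);
// [
//   [ 1, 1, 1, 1, 0, 0, 0 ],
//   [ 1, 1, 1, 0, 1, 0, 0 ],
//   [ 1, 1, 1, 0, 0, 1, 0 ],
//   [ 1, 1, 1, 0, 0, 0, 1 ]
// ]
```

The first three columns are identical in every row, so only the last four columns tell the positive and negative reviews apart.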
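Depending on the feedback and engagement on this post, I will write more, explaining IDF and TF-IDF with pure JavaScript implementations.
If I have made any mistakes, please correct me.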