

raulnho · December 21, 2019

A question about NO.PZ2015120204000049

Question:

After creating and analyzing the visualization, Steele is concerned that some tokens are likely to be noise features for ML model training; therefore, she wants to remove them.

To address her concern in her exploratory data analysis, Steele should focus on those tokens that have:

Options:

A. low chi-square statistics.

B. low mutual information (MI) values.

C. very low and very high term frequency (TF) values.

Explanation:

C is correct. Frequency measures can be used for vocabulary pruning to remove noise features by filtering out the tokens with very high and very low TF values across all the texts. Noise features are both the most frequent and the most sparse (or rare) tokens in the dataset. On one end, noise features can be stop words that typically appear frequently in all the texts across the dataset. On the other end, noise features can be sparse terms that appear in only a few text files. Text classification involves dividing text documents into assigned classes. Frequent tokens strain the ML model's ability to choose a decision boundary among the texts because the terms are present across all of them (an example of underfitting). Rare tokens mislead the ML model into classifying texts containing those terms into a specific class (an example of overfitting). Thus, identifying and removing noise features are critical steps for text classification applications.
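As an illustrative aside (not part of the original thread): the pruning rule described above is mechanical enough to sketch in a few lines of Python. The toy corpus and the two TF cutoffs below are assumptions; in practice the cutoffs would be read off the TF distribution in the exploratory visualization.

```python
# A minimal sketch of TF-based vocabulary pruning. The toy corpus and the
# cutoffs LOW = 0.05 and HIGH = 0.10 are illustrative assumptions only.
from collections import Counter

corpus = [
    "the fed raised rates and the market fell",
    "the fed held rates steady and the market rose",
    "the fed cut rates and bond yields fell sharply",
    "tech earnings beat estimates and the market rallied",
]

# Term frequency of each token across ALL texts in the corpus.
counts = Counter(tok for doc in corpus for tok in doc.split())
total = sum(counts.values())
tf = {tok: n / total for tok, n in counts.items()}

LOW, HIGH = 0.05, 0.10  # assumed cutoffs; in practice read off the TF plot

# Very high TF -> likely stop words ("the", "and"); very low TF -> sparse
# terms seen only once. Both ends are treated as noise and pruned.
noise = {tok for tok, f in tf.items() if f < LOW or f > HIGH}
vocab = sorted(set(tf) - noise)

print("pruned as noise:", sorted(noise))
print("kept vocabulary:", vocab)
```

On this toy corpus the rule keeps the mid-frequency, class-indicative tokens ("fed", "rates", "market", "fell") and discards both the stop words and the singletons.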

I'd like to ask: do the chi-square values and MI values of noise tokens show any characteristic pattern? In other words, can these two values be used to tell whether a token is noise? Thanks.

1 Answer

星星_品职助教 · December 22, 2019

Hi,
These measures are usually applied in the other direction: we ask whether a feature should be selected, rather than setting out specifically to test whether a feature is noise.
The chi-square test checks independence, i.e., whether two events are related. A high chi-square statistic means the word appears more frequently in a particular class, i.e., the word is indicative of that class rather than independent of it, so it is a feature that should be selected.
Mutual information measures how much a word contributes to each class. MI ranges over [0, 1]: a higher MI means the word contributes more to a given class, i.e., the word appears more frequently within that class. An MI of 0 means the word appears with the same frequency across all texts, i.e., it makes no particular contribution to any class.
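As a hedged illustration of this answer (not part of the original thread): the sketch below scores each token against class labels with scikit-learn's chi2 and mutual_info_classif. The four toy documents and their labels are invented, and note that mutual_info_classif returns an unnormalized estimate in nats rather than the normalized [0, 1] MI described above; either way, high scores mark class-indicative tokens worth selecting, and near-zero scores mark uninformative ones.

```python
# Score each token's association with the class labels via chi-square and
# mutual information. Toy documents and labels are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, mutual_info_classif

docs = [
    "rates rose after the fed tightened policy",   # class 0: macro news
    "the fed held rates steady amid weak growth",  # class 0: macro news
    "earnings beat estimates on strong sales",     # class 1: earnings news
    "weak sales dragged down quarterly earnings",  # class 1: earnings news
]
labels = [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)            # document-term count matrix
tokens = vec.get_feature_names_out()

chi2_scores, _ = chi2(X, labels)       # high score -> token tied to a class
mi_scores = mutual_info_classif(X, labels, discrete_features=True,
                                random_state=0)

# Tokens with high chi-square / high MI are class-indicative and worth
# selecting; scores near zero mean the token says little about the class.
for tok, c, m in sorted(zip(tokens, chi2_scores, mi_scores),
                        key=lambda row: -row[1]):
    print(f"{tok:12s} chi2={c:5.2f}  mi={m:5.2f}")
```

A token like "weak", which appears in both classes here, should score near zero on both measures, while "earnings" or "fed" should score high for their respective classes.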

Related questions

NO.PZ2015120204000049 Don't A and B also indicate that the data has no distinguishing features? Why aren't they chosen?

2021-06-01 11:44 · 1 answer

NO.PZ2015120204000049 Could you briefly explain in what situations the other two options would be chosen? Thanks.

2021-04-15 19:42 · 1 answer

Teacher, I still don't understand: aren't chi-square and MI also part of the data exploration step? Why can't they be chosen?

2020-10-26 21:23 · 1 answer

Aren't very low TF values part of feature selection too, just like the other two options? Why choose that one?

2020-10-04 12:03 · 1 answer