HG · June 6, 2020

Question about: NO.PZ201512020300000611

* For the full question details, see the case stem.

The question:

To address her concern in her exploratory data analysis, Steele should focus on those tokens that have:

Options:

A. low chi-square statistics.

B. low mutual information (MI) values.

C. very low and very high term frequency (TF) values.

Explanation:

C is correct. Frequency measures can be used for vocabulary pruning to remove noise features by filtering the tokens with very high and low TF values across all the texts. Noise features are both the most frequent and most sparse (or rare) tokens in the dataset. On one end, noise features can be stop words that are typically present frequently in all the texts across the dataset. On the other end, noise features can be sparse terms that are present in only a few text files. Text classification involves dividing text documents into assigned classes. The frequent tokens strain the ML model to choose a decision boundary among the texts as the terms are present across all the texts (an example of underfitting). The rare tokens mislead the ML model into classifying texts containing the rare terms into a specific class (an example of overfitting). Thus, identifying and removing noise features are critical steps for text classification applications.
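The pruning step described in the explanation can be sketched roughly as follows. This is only an illustration, not the curriculum's own procedure: it assumes TF is a token's share of all token occurrences in the dataset, and the function name `prune_vocabulary`, the cutoffs `low_cut` and `high_cut`, and the sample texts are all hypothetical.

```python
from collections import Counter


def prune_vocabulary(docs, low_cut, high_cut):
    """Keep tokens whose dataset-level TF lies within [low_cut, high_cut].

    Assumption: TF = (occurrences of the token across all texts)
                     / (total number of tokens across all texts).
    """
    # Naive whitespace tokenization; a real pipeline would normalize further.
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    tf = {tok: n / total for tok, n in counts.items()}
    # Drop both tails: very frequent tokens (stop-word-like) and very rare ones.
    kept = {tok for tok, value in tf.items() if low_cut <= value <= high_cut}
    return kept, tf


# Hypothetical sample texts, for illustration only.
docs = [
    "the market rallied on the earnings news",
    "the bond market fell on the inflation news",
    "the central bank held the rate",
    "the zloty weakened",
]
kept, tf = prune_vocabulary(docs, low_cut=0.05, high_cut=0.20)
print(sorted(kept))
```

With these made-up cutoffs, "the" falls in the high-TF tail and single-occurrence tokens such as "zloty" fall in the low-TF tail, so both are filtered out as noise features. In practice the cutoffs would be chosen by inspecting the TF distribution of the actual dataset.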

Does a larger chi-square test statistic mean that the token is less independent of the class, and so should be selected?

1 answer

星星_品职助教 · June 8, 2020

Hello,

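Your conclusion is correct: the larger a token's chi-square statistic, the more it deserves to be selected. In fact, option A is not wrong when viewed on its own. However, the background in this case is about TF values, not the other two measures, so A cannot be the answer.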
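To make the point about chi-square concrete, here is a minimal sketch of the statistic for one token and one class, using the standard formula for a 2x2 contingency table of document counts. The function name `chi_square_2x2` and the counts are hypothetical and are not taken from the case.

```python
def chi_square_2x2(n11, n10, n01, n00):
    """Chi-square statistic for a token/class 2x2 table of document counts.

    n11: documents in the class that contain the token
    n10: documents outside the class that contain the token
    n01: documents in the class that lack the token
    n00: documents outside the class that lack the token
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    # An empty row or column means there is no evidence of association.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts: a token concentrated in one class.
associated = chi_square_2x2(n11=40, n10=5, n01=10, n00=45)
# Hypothetical counts: a token spread evenly across both classes.
independent = chi_square_2x2(n11=25, n10=25, n01=25, n00=25)
print(associated, independent)
```

The token concentrated in one class gets a large statistic, while the evenly spread token gets zero, which is why tokens with higher chi-square values are the ones worth keeping as features.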