Question:
Steele and Schultz then discuss how to preprocess the raw text data. Steele tells Schultz that the process can be completed in the following three steps:
Step 1 Cleanse the raw text data.
Step 2 Split the cleansed data into a collection of words so that they can be normalized.
Step 3 Normalize the collection of words from Step 2 and create a distinct set of tokens from the normalized words.
The output created in Steele’s Step 3 can be best described as a:
Options:
A. bag-of-words.
B. set of n-grams.
C. document term matrix.
Explanation:
A is correct. After the cleansed text is normalized, a bag-of-words is created. A bag-of-words (BOW) is a collection of a distinct set of tokens from all the texts in a sample dataset.
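For intuition, here is a minimal Python sketch of the three steps, assuming simple regex-based cleansing and lowercasing as the normalization; the sample texts and helper names are hypothetical and are not taken from the reading.

```python
import re

# Hypothetical sample texts (not from the reading).
raw_texts = [
    "Earnings BEAT estimates this quarter!",
    "Analysts raised earnings estimates.",
]

def cleanse(text):
    # Step 1: remove punctuation and numbers, collapse extra whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    # Step 2: split the cleansed text into a collection of words.
    return text.split()

def normalize(words):
    # Step 3 (first part): lowercase the words; stemming/lemmatization
    # and stop-word removal could also be applied here.
    return [w.lower() for w in words]

# Step 3 (second part): the bag-of-words is the distinct set of
# normalized tokens drawn from all texts in the sample dataset.
bag_of_words = set()
for text in raw_texts:
    bag_of_words.update(normalize(tokenize(cleanse(text))))

print(sorted(bag_of_words))
# ['analysts', 'beat', 'earnings', 'estimates', 'quarter', 'raised', 'this']
```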
If C were the correct answer, how would the question need to be worded?
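One way to see the difference (a sketch continuing the code above, not official exam wording): for C to be correct, Step 3 would have to produce a matrix of token counts, with one row per document and one column per BOW token, rather than a single distinct set of tokens for the whole dataset.

```python
from collections import Counter

# Option C for contrast: a document term matrix (DTM) has one row per
# document and one column per BOW token; each cell holds that token's
# count in the document. Reuses the hypothetical helpers above.
vocab = sorted(bag_of_words)
dtm = []
for text in raw_texts:
    counts = Counter(normalize(tokenize(cleanse(text))))
    dtm.append([counts[token] for token in vocab])

for row in dtm:
    print(row)
# One row of token counts per document, one column per BOW token.
```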