Question:
Steele and Schultz then discuss how to preprocess the raw text data. Steele tells Schultz that the process can be completed in the following three steps:
Step 1 Cleanse the raw text data.
Step 2 Split the cleansed data into a collection of words so that they can be normalized.
Step 3 Normalize the collection of words from Step 2 and create a distinct set of tokens from the normalized words.
The output created in Steele’s Step 3 can be best described as a:
Options:
A. bag-of-words.
B. set of n-grams.
C. document term matrix.
Explanation:
A is correct. After the cleansed text is normalized, a bag-of-words is created. A bag-of-words (BOW) is a collection of a distinct set of tokens from all the texts in a sample dataset.
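For intuition, here is a minimal Python sketch of the three steps, assuming simple regex-based cleansing and lowercasing as the normalization; the sample texts and helper names are hypothetical and are not taken from the reading.

```python
import re

# Hypothetical sample texts (not from the reading).
raw_texts = [
    "Earnings BEAT estimates this quarter!",
    "Analysts raised earnings estimates.",
]

def cleanse(text):
    # Step 1: remove punctuation and numbers, collapse extra whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    # Step 2: split the cleansed text into a collection of words.
    return text.split()

def normalize(words):
    # Step 3 (first part): lowercase the words; stemming/lemmatization
    # and stop-word removal could also be applied here.
    return [w.lower() for w in words]

# Step 3 (second part): the bag-of-words is the distinct set of
# normalized tokens drawn from all texts in the sample dataset.
bag_of_words = set()
for text in raw_texts:
    bag_of_words.update(normalize(tokenize(cleanse(text))))

print(sorted(bag_of_words))
# ['analysts', 'beat', 'earnings', 'estimates', 'quarter', 'raised', 'this']
```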
If C were the correct answer, how would the question need to be worded?
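One way to see the difference (a sketch continuing the code above, not official exam wording): for C to be correct, Step 3 would have to produce a matrix of token counts, with one row per document and one column per BOW token, rather than a single distinct set of tokens for the whole dataset.

```python
from collections import Counter

# Option C for contrast: a document term matrix (DTM) has one row per
# document and one column per BOW token; each cell holds that token's
# count in the document. Reuses the hypothetical helpers above.
vocab = sorted(bag_of_words)
dtm = []
for text in raw_texts:
    counts = Counter(normalize(tokenize(cleanse(text))))
    dtm.append([counts[token] for token in vocab])

for row in dtm:
    print(row)
# One row of token counts per document, one column per BOW token.
```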