
买买要当学霸 · December 29, 2019

Question about problem NO.PZ2015120204000047

The question reads:

Steele and Schultz then discuss how to preprocess the raw text data. Steele tells Schultz that the process can be completed in the following three steps:

Step 1 Cleanse the raw text data.

Step 2 Split the cleansed data into a collection of words for them to be normalized.

Step 3 Normalize the collection of words from Step 2 and create a distinct set of tokens from the normalized words.

The output created in Steele’s Step 3 can be best described as a:

Options:

A. bag-of-words.

B. set of n-grams.

C. document term matrix.

Explanation:

A is correct. After the cleansed text is normalized, a bag-of-words is created. A bag-of-words (BOW) is a collection of a distinct set of tokens from all the texts in a sample dataset.
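The three steps in the question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cleansing rules (strip HTML tags, punctuation, and digits), the whitespace tokenizer, the tiny stop-word list, and the function names `cleanse`, `tokenize`, `normalize`, and `bag_of_words` are all hypothetical choices and do not come from the curriculum text.

```python
import re

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "is", "of"}

def cleanse(raw):
    """Step 1: remove HTML tags, then replace anything that is not a letter or whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"[^A-Za-z\s]", " ", no_tags)

def tokenize(text):
    """Step 2: split the cleansed text into a collection of words."""
    return text.split()

def normalize(tokens):
    """Step 3 (first half): lowercase each word and drop stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

def bag_of_words(docs):
    """Step 3 (second half): collect the distinct tokens across all documents."""
    bow = set()
    for doc in docs:
        bow.update(normalize(tokenize(cleanse(doc))))
    return bow

docs = ["<p>The market is UP 3%!</p>", "The market fell."]
bow = bag_of_words(docs)
print(sorted(bow))  # ['fell', 'market', 'up']
```

Note that "market" appears in both documents but only once in the result, and that the set carries no information about where each word stood in its sentence. That is the sense of "a distinct set of tokens" in the explanation.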

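Why is B wrong? When I saw the word "distinct," I assumed it referred to fixed phrases that preserve word order, so the description seemed closer to B.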


2 answers

星星_品职助教 · September 6, 2020

@little_back

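BOW is defined as follows: "A bag-of-words (BOW) is a collection of a distinct set of tokens." The "tokens" are exactly the individual items you mean by "one by one." Building a BOW is itself a process of splitting the text into tokens.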


星星_品职助教 · December 30, 2019

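Hello,
"Distinct" here means the words are split into individual, separate items, which matches the definition of a BOW.
If the answer were n-grams, the question would include further description, for example that the order of the text is preserved, that the sequence of the words is reflected, or that the result better represents the original meaning of the text. Keep it up!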
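The difference between the two options can be seen in a small sketch. The helper `ngrams` and the two sample sentences are hypothetical and used only for illustration: both sentences contain the same distinct tokens, so their bags of words are identical, while their bigrams differ because n-grams keep word order.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in their original order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

s1 = "man bites dog".split()
s2 = "dog bites man".split()

# A BOW keeps only the distinct tokens, so the two sentences look the same.
same_bow = set(s1) == set(s2)

# Bigrams (n = 2) preserve word order, so the two sentences differ.
bigrams1 = ngrams(s1, 2)
bigrams2 = ngrams(s2, 2)

print(same_bow)   # True
print(bigrams1)   # [('man', 'bites'), ('bites', 'dog')]
print(bigrams2)   # [('dog', 'bites'), ('bites', 'man')]
```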


little_back · September 6, 2020

Teacher, if the words are split into individual items, why isn't the answer "tokens"?

