Jacksonic · March 18, 2023

How should I read the stem of this question? Is it asking which token group is currently going through data processing?

NO.PZ2021083101000012

Question:

Achler and Rivera discuss remaining text wrangling tasks—specifically, which tokens to include in the document term matrix (DTM). Achler divides unique tokens into three groups; a sample of each group is shown in Exhibit 1.

Based on Exhibit 1, which token group has most likely undergone the text preparation and wrangling process?

Options:

A. Token Group 1

B. Token Group 2

C. Token Group 3

Explanation:

A is correct.

Data preparation and wrangling involve cleansing and organizing raw data into a consolidated format.

Token Group 1 includes n-grams (“not_increas_market,” “sale_decreas”) and the words that have been converted from their inflected forms into their base word (“increas,” “decreas”), and the currency symbol has been replaced with a “currencysign” token.

N-gram tokens are helpful for keeping negations intact in the text, which is vital for sentiment prediction. The process of converting inflected forms of a word into its base word is called stemming and helps decrease data sparseness, thereby aiding in training less complex ML models.
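The steps named in this explanation (lowercasing, currency-symbol replacement, stemming, n-gram construction) can be sketched roughly as follows. This is only an illustration: the input strings are hypothetical and are not taken from Exhibit 1, the `toy_stem` helper is a simplified suffix stripper rather than a real stemming algorithm such as Porter's, and the currency pattern is an assumed, incomplete list.

```python
import re
from typing import List

# Assumed, incomplete set of currency markers (illustration only).
CURRENCY_PATTERN = re.compile(r"[$€£¥]|\b(?:usd|eur|gbp)\b")


def toy_stem(token: str) -> str:
    """Toy suffix stripper, NOT a real stemmer such as Porter's."""
    for suffix in ("ed", "ing", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> List[str]:
    """Lowercase, replace currency markers, tokenize, then stem."""
    text = text.lower()
    text = CURRENCY_PATTERN.sub(" currencysign ", text)
    tokens = re.findall(r"[a-z]+", text)  # digits and punctuation are dropped
    return [toy_stem(t) for t in tokens]


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Join each run of n consecutive tokens with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical inputs, chosen only to resemble the tokens quoted above.
bigrams = ngrams(normalize("Sales decreased"), 2)
trigrams = ngrams(normalize("EUR 5 not increased market"), 3)
print(bigrams)
print(trigrams)
```

Because the negation stays inside the n-gram, a token such as `not_increas_market` keeps a meaning that the separate tokens `not`, `increas`, and `market` would lose.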

B is incorrect because Token Group 2 includes inflected forms of words (“increased,” “decreased”) before conversion into their base words (known as stems).

Stemming (along with lemmatization) decreases data sparseness by aggregating many sparsely occurring words in relatively less sparse stems or lemmas, thereby aiding in training less complex ML models.
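To make the sparseness point concrete, here is a small sketch that builds a document term matrix before and after stemming and compares the share of zero cells. The four documents are made up for illustration, and `toy_stem` is again a simplified stand-in for a real stemmer.

```python
from collections import Counter


def toy_stem(token):
    """Toy rule for illustration only: strip a trailing 'ed' or 'es'."""
    for suffix in ("ed", "es"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def build_dtm(docs):
    """Return (vocabulary, rows); each row holds one document's token counts."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows


def zero_share(rows):
    """Fraction of DTM cells equal to zero, a simple sparseness measure."""
    cells = [c for row in rows for c in row]
    return sum(1 for c in cells if c == 0) / len(cells)


# Hypothetical documents, already lowercased and split on whitespace.
raw_docs = [d.split() for d in (
    "profit increased",
    "profit increases",
    "cost decreased",
    "cost decreases",
)]
stemmed_docs = [[toy_stem(t) for t in doc] for doc in raw_docs]

raw_vocab, raw_dtm = build_dtm(raw_docs)
stem_vocab, stem_dtm = build_dtm(stemmed_docs)

print(len(raw_vocab), round(zero_share(raw_dtm), 3))
print(len(stem_vocab), round(zero_share(stem_dtm), 3))
```

In this toy case the inflected forms collapse into shared stems, so the matrix has fewer columns and a smaller share of empty cells.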

C is incorrect because Token Group 3 includes inflected forms of words (“increased,” “decreased”) before conversion into their base words (known as stems). In addition, the “EUR” currency symbol has not been replaced with the “currencysign” token and the word “Sales” has not been lowercased.

Topic tested: Unstructured Data Wrangling (Preprocessing)



1 answer

星星_品职助教 · March 19, 2023

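Hello,

Read the stem this way: of the three groups in the table, which one has already completed (undergone) the data processing, that is, the text preparation and wrangling process?

If the text preparation and wrangling process had been completed, words such as “increased” and “decreased” would no longer appear. Token Groups 2 and 3 can therefore be eliminated directly.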