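How should the stem of this question be understood? Is it asking which token group is currently going through data processing? - Q&A - 品职教育 (finance exam training: CFA, ESG, FRM, CPA, postgraduate entrance exams)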

NO.PZ2021083101000012

The question is as follows:

Achler and Rivera discuss remaining text wrangling tasks—specifically, which tokens to include in the document term matrix (DTM). Achler divides unique tokens into three groups; a sample of each group is shown in Exhibit 1.

Based on Exhibit 1, which token group has most likely undergone the text preparation and wrangling process?

Options:

A. Token Group 1

B. Token Group 2

C. Token Group 3

Explanation:

A is correct.

Data preparation and wrangling involve cleansing and organizing raw data into a consolidated format.

Token Group 1 includes n-grams (“not_increas_market,” “sale_decreas”) and words that have been converted from their inflected forms into their base forms (“increas,” “decreas”). In addition, the currency symbol has been replaced with a “currencysign” token.

N-gram tokens are helpful for keeping negations intact in the text, which is vital for sentiment prediction. The process of converting inflected forms of a word into its base word is called stemming and helps decrease data sparseness, thereby aiding in training less complex ML models.
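To make the two steps concrete, here is a minimal sketch of stemming followed by n-gram construction. The `toy_stem` function is a hypothetical suffix stripper written only for this illustration (it is not the Porter stemmer or any library routine), and the input token lists are assumed examples chosen to reproduce the tokens quoted in the explanation.

```python
def toy_stem(token):
    """Strip one common English suffix. A toy rule set, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n):
    """Join every run of n consecutive tokens with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical token sequences (already lowercased, stop words removed).
negated = [toy_stem(t) for t in ["not", "increased", "market"]]
plain = [toy_stem(t) for t in ["sales", "decreased"]]

trigrams = ngrams(negated, 3)
bigrams = ngrams(plain, 2)

print(trigrams)
print(bigrams)
```

Because the negation "not" is fused into the same token as "increas", a bag-of-words model cannot separate the two, which is the point the explanation makes about sentiment prediction.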

B is incorrect because Token Group 2 includes inflected forms of words (“increased,” “decreased”) before conversion into their base words (known as stems).

Stemming (along with lemmatization) decreases data sparseness by aggregating many sparsely occurring words into relatively less sparse stems or lemmas, thereby aiding in training less complex ML models.
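The sparseness claim can be checked on a tiny document term matrix. The sketch below is only an illustration under stated assumptions: the four documents are made up, and `STEM_OF` is a hand-written lookup table standing in for a real stemmer or lemmatizer.

```python
from collections import Counter

# Hypothetical stem lookup, written by hand for this illustration only.
STEM_OF = {
    "increased": "increas", "increasing": "increas",
    "decreased": "decreas", "decreasing": "decreas",
}

# Hypothetical tokenized documents.
docs = [
    ["sales", "increased"],
    ["sales", "increasing"],
    ["costs", "decreased"],
    ["costs", "decreasing"],
]


def build_dtm(documents):
    """Return (vocabulary, rows): one row per document, one column per term."""
    vocab = sorted({tok for doc in documents for tok in doc})
    rows = []
    for doc in documents:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def zero_fraction(rows):
    """Share of DTM cells that are zero (higher means sparser)."""
    cells = [value for row in rows for value in row]
    return cells.count(0) / len(cells)


raw_vocab, raw_dtm = build_dtm(docs)
stemmed_docs = [[STEM_OF.get(tok, tok) for tok in doc] for doc in docs]
stem_vocab, stem_dtm = build_dtm(stemmed_docs)

print(len(raw_vocab), zero_fraction(raw_dtm))
print(len(stem_vocab), zero_fraction(stem_dtm))
```

In this toy case the DTM shrinks from six columns to four and the share of zero cells falls, which is what "less sparse" means here: fewer features for the model to fit.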

C is incorrect because Token Group 3 includes inflected forms of words (“increased,” “decreased”) before conversion into their base words (known as stems). In addition, the “EUR” currency symbol has not been replaced with the “currencysign” token and the word “Sales” has not been lowercased.
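The two steps that Token Group 3 is missing, currency replacement and lowercasing, might look like the sketch below. The pattern `CURRENCY_PATTERN`, its short list of currency codes and symbols, and the sample sentence are all assumptions made for illustration. Only the “currencysign” token name comes from the question.

```python
import re

# Assumed, non-exhaustive set of currency codes and symbols.
# Codes are matched only when not embedded in a longer word.
CURRENCY_PATTERN = re.compile(
    r"(?<![A-Za-z])(?:EUR|USD|GBP)(?![A-Za-z])|[€$£]"
)


def normalize(text):
    """Replace currency markers, lowercase, and split on whitespace."""
    # Replace first, while "EUR" is still uppercase and easy to match.
    replaced = CURRENCY_PATTERN.sub(" currencysign ", text)
    return replaced.lower().split()


tokens = normalize("Sales increased EUR 5 million")
print(tokens)
```

Replacing before lowercasing is a design choice in this sketch: it lets the pattern match uppercase codes only, so an ordinary lowercase word that happens to spell a currency code is left alone.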

Topic tested: Unstructured Data Wrangling (Preprocessing)

How should the stem of this question be understood? Is it asking which token group is currently going through data processing?
