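Step 1 only says to cleanse the data, not to remove all of it, so why is the statement wrong? - Q&A - 品职教育: CFA, ESG, FRM, CPA, graduate entrance exam, and other finance training courses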


NO.PZ2015120204000045

The question is as follows:

Steele and Schultz then discuss how to preprocess the raw text data. Steele tells Schultz that the process can be completed in the following three steps:

Step 1 Cleanse the raw text data.

Step 2 Split the cleansed data into a collection of words for them to be normalized.

Step 3 Normalize the collection of words from Step 2 and create a distinct set of tokens from the normalized words.

With respect to Step 1, Steele tells Schultz: “I believe I should remove all html tags, punctuations, numbers, and extra white spaces from the data before normalizing them.”

Is Steele’s statement regarding Step 1 of the preprocessing of raw text data correct?

Options:

A. Yes

B. No, because her suggested treatment of punctuation is incorrect.

C. No, because her suggested treatment of extra white spaces is incorrect.

Explanation:

B is correct. Although most punctuations are not necessary for text analysis and should be removed, some punctuations (e.g., percentage signs, currency symbols, and question marks) may be useful for ML model training. Such punctuations should be substituted with annotations (e.g., /percentSign/, /dollarSign/, and /questionMark/) to preserve their grammatical meaning in the text. Such annotations preserve the semantic meaning of important characters in the text for further text processing and analysis stages.
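To make the distinction concrete, here is a minimal Python sketch of a Step 1 cleanser that removes HTML tags, numbers, most punctuation, and extra white space, but substitutes annotations for the three symbols named in the explanation. The function name `cleanse`, the `ANNOTATIONS` table, and the sample sentence are illustrative assumptions, not part of the original question or any particular library.

```python
import re

# Symbols worth keeping, mapped to the annotations named in the explanation.
# (The table itself is a hypothetical helper for this sketch.)
ANNOTATIONS = {
    "%": " /percentSign/ ",
    "$": " /dollarSign/ ",
    "?": " /questionMark/ ",
}


def cleanse(raw: str) -> str:
    """Step 1 sketch: cleanse raw text while preserving useful symbols."""
    # Remove HTML tags.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Remove numbers.
    text = re.sub(r"\d+", " ", text)
    # One pass over every punctuation character: substitute an annotation
    # for the symbols we want to keep, and drop everything else.
    text = re.sub(
        r"[^\w\s]|_",
        lambda m: ANNOTATIONS.get(m.group(0), " "),
        text,
    )
    # Collapse extra white space.
    return re.sub(r"\s+", " ", text).strip()


sample = "<p>Revenue rose 5% to $12 million.  Will it last?</p>"
print(cleanse(sample))
```

In this sketch the period is dropped while the percentage sign, dollar sign, and question mark survive as annotations. Removing all punctuation outright, as Steele proposes, would discard that information, which is why her treatment of punctuation is the incorrect part.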

As in the title.

Step 1 only says to cleanse the data; it doesn't say to remove all of the data, so why is the statement wrong?
