NO.PZ2015120204000045
Question:
Steele and Schultz then discuss how to preprocess the raw text data. Steele tells Schultz that the process can be completed in the following three steps:
Step 1 Cleanse the raw text data.
Step 2 Split the cleansed data into a collection of words for them to be normalized.
Step 3 Normalize the collection of words from Step 2 and create a distinct set of tokens from the normalized words.
With respect to Step 1, Steele tells Schultz: “I believe I should remove all HTML tags, punctuation, numbers, and extra white spaces from the data before normalizing them.”
Is Steele’s statement regarding Step 1 of the preprocessing of raw text data correct?
Options:
A. Yes
B. No, because her suggested treatment of punctuation is incorrect.
C. No, because her suggested treatment of extra white spaces is incorrect.
Explanation:
B is correct. Although most punctuation is not necessary for text analysis and should be removed, some punctuation marks (e.g., percentage signs, currency symbols, and question marks) may be useful for ML model training. These marks should be replaced with annotations (e.g., /percentSign/, /dollarSign/, and /questionMark/) rather than removed, so that their semantic meaning is preserved for later text processing and analysis stages.
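The cleansing rule described in the explanation can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function name `cleanse`, the `ANNOTATIONS` mapping, and the regular expressions are assumptions made for this example. Only the three annotation strings come from the explanation itself.

```python
import re

# Assumed mapping: symbols worth keeping, replaced by the annotations
# named in the explanation. Extend it for other currencies as needed.
ANNOTATIONS = {
    "%": " /percentSign/ ",
    "$": " /dollarSign/ ",
    "?": " /questionMark/ ",
}


def cleanse(raw: str) -> str:
    """Step 1 sketch: strip HTML tags, numbers, and unneeded punctuation,
    but substitute meaningful punctuation with annotations."""
    text = re.sub(r"<[^>]+>", " ", raw)           # remove HTML tags
    text = re.sub(r"\d+(?:[.,]\d+)*", " ", text)  # remove numbers such as 5 or 1,200
    text = re.sub(r"[^\w\s%$?]", " ", text)       # drop punctuation except % $ ?
    for symbol, annotation in ANNOTATIONS.items():
        text = text.replace(symbol, annotation)   # annotate instead of deleting
    return re.sub(r"\s+", " ", text).strip()      # collapse extra white space


print(cleanse("<p>Revenue rose 5% to $1,200.  Is that sustainable?</p>"))
# Revenue rose /percentSign/ to /dollarSign/ Is that sustainable /questionMark/
```

Unneeded punctuation is removed before the substitution step so that the slashes inside the annotations are not themselves stripped as punctuation.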