NO.PZ202108310100000202
Question:
Based on the source of the data, as part of the data cleansing and wrangling process, Achler most likely needs to remove:
Options:
A. html tags and perform scaling
B. numbers and perform lemmatization
C. white spaces and perform winsorization
Explanation:
B is correct.
Achler uses a web spidering program that extracts unstructured raw content from social media webpages. Raw text data are a sequence of characters and contain other non-useful elements including html tags, punctuation, and white spaces (including tabs, line breaks, and new lines).
Removing numbers is one of the basic operations in the text cleansing/preparation process for unstructured data. When numbers (or digits) are present in the text, they should be removed or substituted with the annotation “/number/.”
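The cleansing steps described above (html tags, numbers, punctuation, white spaces) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `cleanse_text`, the regular expressions, and the order of the steps are assumptions made for this example, not something specified in the question.

```python
import re

def cleanse_text(raw: str) -> str:
    """Apply basic text cleansing steps to one raw scraped string (hypothetical helper)."""
    # 1. Strip html tags such as <p> or </b>.
    text = re.sub(r"<[^>]+>", " ", raw)
    # 2. Substitute numbers with the annotation /number/.
    #    Done before punctuation removal so that "5.2" stays one number (an assumption of this sketch).
    text = re.sub(r"\d+(?:[.,]\d+)*", " /number/ ", text)
    # 3. Remove punctuation, keeping the slashes used by the /number/ annotation.
    text = re.sub(r"[^\w\s/]", " ", text)
    # 4. Collapse tabs, line breaks, and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()

raw_post = "<p>Stock  rose 5.2% today!\n\tBuy <b>now</b></p>"
cleaned = cleanse_text(raw_post)
print(cleaned)  # Stock rose /number/ today Buy now
```

Note that every step operates on characters in a string, which is why these operations belong to text (unstructured) data rather than to numeric features.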
Lemmatization, which takes place during the text wrangling/preprocessing process for unstructured data, is the process of converting inflected forms of a word into its morphological root (known as the lemma).
Lemmatization reduces the repetition of words occurring in various forms while maintaining the semantic structure of the text data, thereby aiding in training less complex ML models.
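To illustrate how lemmatization collapses inflected forms and shrinks the vocabulary, here is a toy sketch in Python. The lookup table `LEMMA_TABLE` and the function `lemmatize` are hypothetical names invented for this example; a real lemmatizer would rely on a full dictionary and morphological analysis rather than a hand-written table.

```python
# Hypothetical, hand-written lookup table used only for illustration.
LEMMA_TABLE = {
    "analyzed": "analyze",
    "analyzing": "analyze",
    "analyzes": "analyze",
    "markets": "market",
    "rose": "rise",
    "rising": "rise",
}

def lemmatize(tokens):
    """Map each token to its lemma; tokens not in the table pass through unchanged."""
    return [LEMMA_TABLE.get(tok, tok) for tok in tokens]

tokens = ["analyzed", "analyzing", "analyzes", "markets", "market", "rose", "rising"]
lemmas = lemmatize(tokens)

vocab_before = len(set(tokens))  # distinct word forms before lemmatization
vocab_after = len(set(lemmas))   # distinct lemmas afterwards
print(vocab_before, vocab_after)  # 7 3
```

Fewer distinct tokens means fewer features for a model trained on the text, which is the sense in which lemmatization supports less complex ML models.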
A is incorrect because although html tag removal is part of text cleansing/preparation for unstructured data, scaling is a data wrangling/preprocessing process applied to structured data.
Scaling adjusts the range of a feature by shifting and changing the scale of data; it is performed on numeric variables, not on text data.
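As a contrast with the text operations, the following minimal sketch shows one common form of scaling, min-max normalization, applied to a numeric feature. The choice of min-max normalization and the function name `min_max_scale` are assumptions for illustration; the explanation above does not say which scaling method is meant.

```python
def min_max_scale(values):
    """Rescale numeric values to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    # Shift by the minimum, then divide by the range.
    return [(v - lo) / (hi - lo) for v in values]

prices = [10, 20, 30]
print(min_max_scale(prices))  # [0.0, 0.5, 1.0]
```

The arithmetic only makes sense for numbers, so it cannot be applied to raw text scraped from webpages.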
C is incorrect because although raw text contains white spaces (including tabs, line breaks, and new lines) that need to be removed as part of the data cleansing/preparation process for unstructured data, winsorization is a data wrangling/preprocessing task performed on values of data points, not on text data.
Winsorization is used for structured numerical data and replaces extreme values and outliers with the maximum (for large-value outliers) and minimum (for small-value outliers) values of data points that are not outliers.
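The replacement rule described above can be sketched as follows. The function name `winsorize` and the use of caller-supplied `lower` and `upper` thresholds to decide what counts as an outlier are assumptions of this sketch; in practice the outlier rule (for example, percentiles) would be chosen by the analyst.

```python
def winsorize(values, lower, upper):
    """Replace outliers with the min/max of the non-outlier values (hypothetical helper).

    A value is treated as an outlier if it lies outside [lower, upper].
    """
    inliers = [v for v in values if lower <= v <= upper]
    if not inliers:
        raise ValueError("no non-outlier values to winsorize against")
    floor, cap = min(inliers), max(inliers)
    # Small-value outliers are raised to the floor; large-value outliers are lowered to the cap.
    return [min(max(v, floor), cap) for v in values]

returns = [-50, 2, 3, 5, 8, 120]
print(winsorize(returns, lower=0, upper=10))  # [2, 2, 3, 5, 8, 8]
```

Like scaling, this works on the numeric values of data points and has no counterpart for raw text.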
How can we tell that the question is asking about unstructured data?