NO.PZ2021083101000011
Question:
Achler uses a web spidering program to obtain the data for the text-based model.
The program extracts raw content from social media webpages, which contains English language sentences and special characters. After curating the text, Achler removes unnecessary elements from the raw text using regular expression software and completes additional text cleansing and preprocessing tasks.
Based on the source of the data, as part of the data cleansing and wrangling process, Achler most likely needs to remove:
Options:
A. html tags and perform scaling
B. numbers and perform lemmatization
C. white spaces and perform winsorization
Explanation:
B is correct.
Achler uses a web spidering program that extracts unstructured raw content from social media webpages. Raw text data are a sequence of characters and contain other non-useful elements including html tags, punctuation, and white spaces (including tabs, line breaks, and new lines).
Removing numbers is one of the basic operations in the text cleansing/preparation process for unstructured data. When numbers (or digits) are present in the text, they should be removed or substituted with the annotation “/number/.”
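The cleansing steps described above (html tags, numbers, punctuation, white spaces) can be sketched with regular expressions. This is only an illustration, not the tool Achler uses: the function name `cleanse`, the sample sentence, and the exact patterns are assumptions made for the example.

```python
import re

def cleanse(raw: str) -> str:
    """Toy text cleansing: html tags, numbers, punctuation, white spaces."""
    # 1. Remove html tags such as <p> or </div>.
    text = re.sub(r"<[^>]+>", " ", raw)
    # 2. Substitute numbers (including forms like 12.5 or 1,000) with /number/.
    text = re.sub(r"\d+(?:[.,]\d+)*", "/number/", text)
    # 3. Remove punctuation, keeping the slashes of the /number/ annotation.
    text = re.sub(r"[^\w\s/]", " ", text)
    # 4. Collapse tabs, line breaks, and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()

sample = "<p>Profit rose 12.5% in\t2020!</p>\n"
print(cleanse(sample))  # Profit rose /number/ in /number/
```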
Lemmatization, which takes place during the text wrangling/preprocessing process for unstructured data, is the process of converting inflected forms of a word into its morphological root (known as lemma).
Lemmatization reduces the repetition of words occurring in various forms while maintaining the semantic structure of the text data, thereby aiding in training less complex ML models.
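To make the idea concrete, here is a toy lemmatizer. The lookup table `LEMMAS` is hypothetical and tiny; a real lemmatizer relies on a full dictionary and part-of-speech information. The sketch only shows how several inflected forms collapse into one lemma and shrink the vocabulary.

```python
# Hypothetical lookup table for illustration only.
LEMMAS = {
    "analyzed": "analyze",
    "analyzing": "analyze",
    "analyzes": "analyze",
    "was": "be",
    "were": "be",
    "markets": "market",
}

def lemmatize(tokens):
    """Lowercase each token and map it to its lemma when one is known."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

tokens = "Analysts were analyzing markets".split()
print(lemmatize(tokens))  # ['analysts', 'be', 'analyze', 'market']

# Three inflected forms reduce to a single distinct token.
forms = ["analyzed", "analyzing", "analyzes"]
print(len(set(forms)), "->", len(set(lemmatize(forms))))  # 3 -> 1
```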
A is incorrect because although html tag removal is part of text cleansing/preparation for unstructured data, scaling is a data wrangling/preprocessing process applied to structured data.
Scaling adjusts the range of a feature by shifting and changing the scale of data; it is performed on numeric variables, not on text data.
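A short sketch shows why scaling only makes sense for numeric features. The two functions below implement the common min-max normalization and standardization formulas; the function names and sample values are assumptions for illustration.

```python
import statistics

def min_max_scale(values):
    """Normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: center on the mean, scale by the sample std deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

print(min_max_scale([10, 20, 40]))  # first value 0.0, last value 1.0
print(standardize([10, 20, 30]))    # [-1.0, 0.0, 1.0]
```

Neither function can be applied to a raw sentence, since both require arithmetic on the values.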
C is incorrect because although raw text contains white spaces (including tabs, line breaks, and new lines) that need to be removed as part of the data cleansing/preparation process for unstructured data, winsorization is a data wrangling/preprocessing task performed on values of data points, not on text data.
Winsorization is used for structured numerical data and replaces extreme values and outliers with the maximum (for large-value outliers) and minimum (for small-value outliers) values of data points that are not outliers.
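The replacement rule can be sketched as follows. The function name `winsorize` and the explicit `lower`/`upper` outlier bounds are assumptions for the example; in practice the bounds usually come from percentiles or a similar rule.

```python
def winsorize(values, lower, upper):
    """Replace outliers with the nearest non-outlier extreme.

    Values outside [lower, upper] are treated as outliers (an assumption
    of this sketch). High outliers become the maximum of the remaining
    points; low outliers become their minimum.
    """
    inliers = [v for v in values if lower <= v <= upper]
    if not inliers:
        raise ValueError("no values fall inside the given bounds")
    lo, hi = min(inliers), max(inliers)
    return [min(max(v, lo), hi) for v in values]

data = [-50, 2, 3, 5, 8, 400]
print(winsorize(data, lower=0, upper=100))  # [2, 2, 3, 5, 8, 8]
```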
Topic: Unstructured Data Preparation (Cleansing)
How can one tell whether the data in a question is structured or unstructured?