
alina319 · May 4, 2022

How can I tell whether the data in a question is structured data or unstructured data?

NO.PZ2021083101000011

Question:

Achler uses a web spidering program to obtain the data for the text-based model.

The program extracts raw content from social media webpages, which contains English language sentences and special characters. After curating the text, Achler removes unnecessary elements from the raw text using regular expression software and completes additional text cleansing and preprocessing tasks.

Based on the source of the data, as part of the data cleansing and wrangling process, Achler most likely needs to remove:

Options:

A. html tags and perform scaling

B. numbers and perform lemmatization

C. white spaces and perform winsorization

Explanation:

B is correct.

Achler uses a web spidering program that extracts unstructured raw content from social media webpages. Raw text data are a sequence of characters and contain other non-useful elements including html tags, punctuation, and white spaces (including tabs, line breaks, and new lines).

Removing numbers is one of the basic operations in the text cleansing/preparation process for unstructured data. When numbers (or digits) are present in the text, they should be removed or substituted with the annotation “/number/.”
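The cleansing steps named above (html tags, numbers, punctuation, white spaces) can be sketched with Python's standard `re` module. This is only an illustration under stated assumptions: the function name `cleanse_text`, the regular expressions, and the sample sentence are hypothetical and are not taken from the question, and for simplicity the sketch drops all punctuation rather than keeping any that a model might need.

```python
import re


def cleanse_text(raw: str) -> str:
    """Apply four basic cleansing steps to one raw text string (illustrative only)."""
    # 1. Remove html tags such as <p> or </div>.
    text = re.sub(r"<[^>]+>", " ", raw)
    # 2. Substitute numbers (including forms like 3.5 or 1,000) with the /number/ annotation.
    text = re.sub(r"\d+(?:[.,]\d+)*", " /number/ ", text)
    # 3. Remove punctuation, keeping the slashes that belong to the annotation.
    text = re.sub(r"[^\w\s/]", " ", text)
    # 4. Collapse tabs, line breaks, and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text


sample = "<p>Shares rose 3.5% today!\n\tBuy now.</p>"  # made-up example text
print(cleanse_text(sample))  # Shares rose /number/ today Buy now
```

Every step here operates on characters, which is why these operations belong to unstructured (text) data.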

Lemmatization, which takes place during the text wrangling/preprocessing process for unstructured data, is the process of converting inflected forms of a word into its morphological root (known as the lemma).

Lemmatization reduces the repetition of words occurring in various forms while maintaining the semantic structure of the text data, thereby aiding in training less complex ML models.
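To make the "reduces the repetition of words" point concrete, here is a toy sketch of lemmatization. The lookup table `LEMMAS` and the function `lemmatize` are hypothetical and hand-written for this example; a real lemmatizer relies on a full dictionary and part-of-speech information rather than a small table like this.

```python
# Hypothetical, hand-written lemma table (an assumption for illustration only).
LEMMAS = {
    "analyzed": "analyze",
    "analyzing": "analyze",
    "analyzes": "analyze",
    "was": "be",
    "were": "be",
}


def lemmatize(tokens):
    """Map each lowercased token to its lemma; unknown tokens pass through unchanged."""
    return [LEMMAS.get(token.lower(), token.lower()) for token in tokens]


tokens = ["analyzed", "analyzing", "analyzes"]
lemmas = lemmatize(tokens)

print(len(set(tokens)))  # 3 distinct word forms before lemmatization
print(len(set(lemmas)))  # 1 distinct form afterwards
```

Three surface forms collapse into one token, so the model sees a smaller vocabulary without losing the meaning of the word.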

A is incorrect because although html tag removal is part of text cleansing/preparation for unstructured data, scaling is a data wrangling/preprocessing process applied to structured data.

Scaling adjusts the range of a feature by shifting and changing the scale of data; it is performed on numeric variables, not on text data.
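For contrast with the text operations, a minimal sketch of one common form of scaling (min-max normalization) is shown here. The function name `min_max_scale` and the sample values are assumptions for illustration; the explanation itself does not say which scaling method is meant.

```python
def min_max_scale(values):
    """Rescale numeric values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no range to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

The arithmetic only makes sense for numeric features, which is why scaling cannot be the right pairing for a text-cleansing question.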

C is incorrect because although raw text contains white spaces (including tabs, line breaks, and new lines) that need to be removed as part of the data cleansing/preparation process for unstructured data, winsorization is a data wrangling/preprocessing task performed on values of data points, not on text data.

Winsorization is used for structured numerical data and replaces extreme values and outliers with the maximum (for large-value outliers) and minimum (for small-value outliers) values of data points that are not outliers.
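The replacement rule described above can be sketched as follows. The function `winsorize` and its `lower`/`upper` cutoffs are hypothetical: the explanation does not say how outliers are identified, so this sketch simply assumes the caller supplies the bounds.

```python
def winsorize(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest non-outlier extreme."""
    inliers = [v for v in values if lower <= v <= upper]
    if not inliers:
        raise ValueError("no data points fall inside the given bounds")
    lo, hi = min(inliers), max(inliers)
    # Large outliers take the maximum inlier value; small outliers take the minimum.
    return [hi if v > upper else lo if v < lower else v for v in values]


data = [-50, 2, 3, 5, 8, 400]  # made-up values with one outlier at each end
print(winsorize(data, lower=0, upper=100))  # [2, 2, 3, 5, 8, 8]
```

Like scaling, this works on the values of numeric data points, not on characters or words.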

Topic: Unstructured Data Preparation (Cleansing)

1 answer
Accepted answer

星星_品职助教 · May 5, 2022

Hello,

If the question mentions a text-based model or text data, the data is unstructured data.

Related questions

NO.PZ2021083101000011 Can numbers be removed? In class, the instructor said numbers should be replaced with an annotation.

2023-02-19 11:29 · 1 answer

NO.PZ2021083101000011 Shouldn't B be wrong as well, since numbers should not be deleted but replaced with an annotation?

2021-12-29 10:10 · 1 answer

NO.PZ2021083101000011 Could the instructor please explain the difference?

2021-10-04 22:52 · 1 answer