NO.PZ2024030508000093
Question:
A quantitative analyst supporting the acquisitions team of a European corporate real estate firm is using the decision tree technique to create a model for forecasting property prices. The analyst compiles a training data set comprised of information from 10 recent property sales, as shown in the following table:
The table also includes the target variable of the model: a class label indicating whether the property was sold for a price greater than EUR 8,000,000. The analyst selects the occupancy status as the feature that is used as the root node of the decision tree. What is the estimated information gain of the split put forward by this root node?
Options:
A. 0.09
B. 0.37
C. 0.44
D. 0.82
Explanation:
A is correct. The information gain is calculated as $\text{Gini}_{\text{base}} - \text{Gini}_{\text{weighted}}$. We first calculate the base-level Gini measure from the target variable alone, before considering any of the features.
There are 5 properties that sold above EUR 8,000,000 and 5 that sold below.
$\text{Gini}_{\text{base}} = 1 - (5/10)^2 - (5/10)^2 = 0.50$
Using the feature “occupancy status” as the root node, we examine this feature and find that for the 4 properties that were occupied, 3 sold above the amount and only 1 sold below.
$\text{Gini}_{\text{occupied}} = 1 - (3/4)^2 - (1/4)^2 = 0.375$
In a similar fashion, we find that for the 6 properties that were not occupied, 2 sold above the amount and 4 sold below.
$\text{Gini}_{\text{not occupied}} = 1 - (2/6)^2 - (4/6)^2 = 4/9 \approx 0.444$
Thus, the weighted Gini measure for this feature is obtained as:
$\text{Gini}_{\text{weighted}} = (5/10)(0.375) + (5/10)(0.4444) \approx 0.4097$
Therefore, Information Gain $= \text{Gini}_{\text{base}} - \text{Gini}_{\text{weighted}} = 0.50 - 0.4097 = 0.0903$, or approximately 0.09.
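The arithmetic in this explanation can be checked with a short Python sketch. The function name `gini` and the variable names are illustrative assumptions, not part of the source. The class counts are those quoted in the explanation, and the equal 5/10 weights are inferred from the quoted weighted value of 0.4097.

```python
def gini(counts):
    """Gini impurity of a node, given the number of observations in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Class counts as [sold above EUR 8,000,000, sold below], taken from the explanation.
gini_base = gini([5, 5])          # all 10 properties
gini_occupied = gini([3, 1])      # 4 occupied properties
gini_not_occupied = gini([2, 4])  # 6 unoccupied properties

# Assumption: equal 5/10 weights, inferred from the quoted value of 0.4097.
gini_weighted = 0.5 * gini_occupied + 0.5 * gini_not_occupied
information_gain = gini_base - gini_weighted

print(f"Gini base:         {gini_base:.4f}")
print(f"Gini occupied:     {gini_occupied:.4f}")
print(f"Gini not occupied: {gini_not_occupied:.4f}")
print(f"Gini weighted:     {gini_weighted:.4f}")
print(f"Information gain:  {information_gain:.4f}")
```

The two child-node values, 0.375 and 0.444, are also the figures behind options B and C.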
B is incorrect. This is just the Gini measure for the sold properties that were occupied.
C is incorrect. This is just the Gini measure for the sold properties that were not occupied.
D is incorrect. This is the unweighted sum of the Gini measure for the sold properties that were occupied and the Gini measure for the sold properties that weren’t occupied (0.375 + 0.444).
Learning Objective: Show how a decision tree is constructed and interpreted.
Reference: Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023, Chapter 15, Machine Learning and Prediction [QA-15].
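I still don't quite understand why the weights are 5/10.
In the worked example in the lecture notes, the weights are based on the number of observations in each category of the feature.
On page 485 of the lecture notes, when we weight by large cap, we use large cap/total and non-large cap/total; we do not use paid dividend/total and no dividend/total.
So why doesn't this question follow the same approach?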
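To make the question concrete, the following minimal sketch compares the two weighting schemes using the counts quoted in the explanation: the equal 5/10 weights, and weights proportional to the number of observations in each child node (4/10 occupied, 6/10 not occupied), which is the approach the lecture-note example is described as using. The helper names `gini` and `weighted_gini` are illustrative assumptions. The sketch only shows the arithmetic under each assumption.

```python
def gini(counts):
    """Gini impurity of a node, given the number of observations in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children, weights):
    """Weighted average of the child-node Gini impurities."""
    return sum(w * gini(c) for c, w in zip(children, weights))


# Counts as [sold above EUR 8,000,000, sold below], taken from the explanation.
parent = [5, 5]
children = [[3, 1], [2, 4]]  # occupied, not occupied

# Scheme 1 (assumption): the equal 5/10 weights asked about in the question.
equal_weights = [5 / 10, 5 / 10]

# Scheme 2 (assumption): weights proportional to child-node size, i.e. 4/10 and 6/10.
n = sum(parent)
size_weights = [sum(c) / n for c in children]

gain_equal = gini(parent) - weighted_gini(children, equal_weights)
gain_size = gini(parent) - weighted_gini(children, size_weights)

print(f"Gain with 5/10 and 5/10 weights: {gain_equal:.4f}")
print(f"Gain with 4/10 and 6/10 weights: {gain_size:.4f}")
```

Under these assumptions the gain is about 0.0903 with the 5/10 weights and about 0.0833 with the node-size weights. Of the four options, A (0.09) is the closest in both cases.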