NO.PZ2023091601000110
Question:
An insurance company specializing in inexperienced drivers is building a decision-tree model to classify the drivers it has previously insured according to whether they made a claim in their first year as policyholders. The company has the following data on whether a claim was made ("Claim_made") and on two features (for the label and both features, "yes" = 1 and "no" = 0): whether the policyholder is a car owner ("Car_owner") and whether they have a college degree ("College_degree"):
a. Calculate the "base entropy" of the Claim_made series.
b. Build a decision tree for this problem.
Explanation:
a. The base entropy is the entropy of the output series before any splitting. There are four policyholders who made claims and six who did not. The base entropy is therefore:

entropy = -(4/10)log2(4/10) - (6/10)log2(6/10) = 0.971
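As a quick numerical check, the base-entropy arithmetic can be reproduced in a few lines of Python (the helper name `entropy` is just for illustration):

```python
from math import log2

def entropy(p_yes: float) -> float:
    """Shannon entropy (in bits) of a binary series with P(yes) = p_yes."""
    if p_yes in (0.0, 1.0):
        return 0.0  # a pure group has zero entropy
    p_no = 1.0 - p_yes
    return -(p_yes * log2(p_yes) + p_no * log2(p_no))

# 4 claims out of 10 policyholders
base = entropy(4 / 10)
print(round(base, 3))  # 0.971
```

Note that `math.log2(x)` is the base-2 logarithm; on a calculator without a log2 key, the change-of-base identity log2(x) = ln(x)/ln(2) gives the same result.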
b. Both features are binary, so there is no need to determine a threshold as there would be for a continuous series. The first stage is to calculate the entropy that would result if the split were made on each of the two features.
Examining the Car_owner feature first: among car owners (feature = 1), two made a claim while four did not, leading to an entropy for this subset of:

entropy = -(2/6)log2(2/6) - (4/6)log2(4/6) = 0.918
Among non-car owners (feature = 0), two made a claim and two did not, leading to an entropy of 1. The weighted entropy for splitting by car ownership is therefore:

weighted entropy = (6/10)(0.918) + (4/10)(1) = 0.951

and the information gain is 0.971 - 0.951 = 0.020.
We repeat this process by calculating the entropy that would result if the split were made on the College_degree feature. Doing so, we would find a weighted entropy of 0.551 and an information gain of 0.971 - 0.551 = 0.420. Because the information gain is maximized (equivalently, the weighted entropy is minimized) when the sample is first split by College_degree, this feature becomes the root node of the decision tree.
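Putting the two candidates together, the root-node choice can be sketched as follows (illustrative helper names; each split is described by its per-value (claims, no-claims) counts taken from the text above):

```python
from math import log2

def entropy(counts):
    """Entropy (bits) of a group given its class counts, e.g. (claims, no_claims)."""
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n)

def info_gain(base_counts, splits):
    """Information gain of a split; splits lists (claims, no_claims) per feature value."""
    total = sum(base_counts)
    weighted = sum(sum(group) / total * entropy(group) for group in splits)
    return entropy(base_counts) - weighted

base = (4, 6)  # 4 claims, 6 no claims overall
gains = {
    "Car_owner": info_gain(base, [(2, 4), (2, 2)]),        # owners, non-owners
    "College_degree": info_gain(base, [(0, 4), (4, 2)]),   # degree, no degree
}
root = max(gains, key=gains.get)
print(root, round(gains[root], 3))  # College_degree 0.42
```

The feature with the larger information gain (College_degree, 0.420 versus 0.020) is selected as the root, matching the reasoning above.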
For policyholders with a college degree (feature = 1), the split is already pure: all four of them made no claims (in other words, nobody with a college degree made a claim). This means that no further splits are required along this branch. The other branch (no college degree) is then split using the Car_owner feature, which is the only feature remaining.
The tree structure is therefore:

College_degree?
├─ yes (1): leaf — no claim (pure)
└─ no (0): split on Car_owner
   ├─ yes (1): leaf
   └─ no (0): leaf
Hello teacher,
1. There are two ways to compute impurity for a decision tree. Why doesn't this question use the Gini approach? In general, what wording in a question tells us to use the log (entropy) formula, and what wording tells us to use Gini?
2. How do I compute log2(x) on a calculator?