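Choosing which feature a decision tree should split on is called feature selection.
Feature selection looks for the feature whose split gives the samples on each side the best "destination": ideally the samples in the left branch belong to one class and those in the right branch to another, so that the samples within each branch belong to a single class as far as possible. A split node of this kind has high purity. A common measure of purity is information entropy.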
Suppose a dataset $D$ contains $N$ samples ($n=1,2,3,\dots,N$), each sample has $K$ attributes ($k=1,2,3,\dots,K$), and the samples fall into $C$ classes ($c=1,2,3,\dots,C$). The information entropy of $D$ is then defined as:
$$
Entropy(D)=-\sum_{c=1}^{C}p_c \log {p_c}=-\sum_{c=1}^{C}\frac{N_c}{N} \log \frac{N_c}{N}\tag{1}
$$
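In Eq. (1), $p_c$ is the proportion of class-$c$ samples in $D$ and $N_c$ is the number of class-$c$ samples. The smaller $Entropy(D)$ is, the higher the purity of $D$. The numerical results in the worked example correspond to the base-2 logarithm.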
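Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `entropy` and the toy label lists are assumptions made for this example, and the logarithm is taken in base 2.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy (base 2) of a list of class labels, following Eq. (1)."""
    n = len(labels)
    counts = Counter(labels)  # N_c for every class c that actually occurs
    # Classes with N_c = 0 never show up in the Counter, so log(0) is never evaluated.
    return sum(-(nc / n) * math.log2(nc / n) for nc in counts.values())

# Hypothetical toy labels: a pure node and a perfectly mixed two-class node.
pure = ["yes"] * 4
mixed = ["yes", "yes", "no", "no"]
print(entropy(pure), entropy(mixed))
```

A pure node yields entropy 0 and an evenly mixed two-class node yields entropy 1, which matches the statement that lower entropy means higher purity.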
Suppose a discrete attribute $A$ takes $M$ values ($m=1,2,3,\dots,M$). Splitting the sample set $D$ on attribute $A$ partitions it into $M$ subsets $D^m$, where subset $D^m$ contains $N_m$ samples. The entropy of each $D^m$ is computed with Eq. (1). Because the subsets differ in size, each one is weighted by $N_m/N$, which gives the information gain of splitting $D$ on attribute $A$:
$$
\begin{aligned}
Gain(D,A)&=Entropy(D)-\sum_{m=1}^{M}\frac{N_m}{N}Entropy(D^m)\\
&=Entropy(D)-\sum_{m=1}^{M}\frac{N_m}{N}\sum_{c=1}^{C}\left(-\frac{N_{mc}}{N_m}\log\frac{N_{mc}}{N_m}\right)
\end{aligned}\tag{2}
$$
In Eq. (2), $N_{mc}$ is the number of class-$c$ samples in subset $D^m$.
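Eq. (2) translates almost directly into code. The sketch below is illustrative only: the function names and the toy attribute columns `perfect` and `useless` are hypothetical and do not come from the article.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Base-2 information entropy of a list of labels, as in Eq. (1)."""
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(values, labels):
    """Gain(D, A) as in Eq. (2); values[i] is the value of attribute A for sample i."""
    n = len(labels)
    subsets = defaultdict(list)  # attribute value m -> labels of the subset D^m
    for v, y in zip(values, labels):
        subsets[v].append(y)
    # Weighted sum of subset entropies, with weights N_m / N.
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical toy data: one attribute separates the classes, the other does not.
labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]
useless = ["a", "b", "a", "b"]
print(information_gain(perfect, labels), information_gain(useless, labels))
```

An attribute that separates the classes completely recovers the full entropy of the parent as gain, while an attribute that leaves every subset as mixed as the parent has zero gain.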
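ID | Color | Root | Sound | Texture | Umbilicus | Touch | Good melon |
---|---|---|---|---|---|---|---|
1 | green | curly | muffled | clear | hollow | hard | yes |
2 | dark | curly | dull | clear | hollow | hard | yes |
3 | dark | curly | muffled | clear | hollow | hard | yes |
4 | green | curly | dull | clear | hollow | hard | yes |
5 | light | curly | muffled | clear | hollow | hard | yes |
6 | green | slightly curly | muffled | clear | slightly hollow | soft | yes |
7 | dark | slightly curly | muffled | slightly blurry | slightly hollow | soft | yes |
8 | dark | slightly curly | muffled | clear | slightly hollow | hard | yes |
9 | dark | slightly curly | dull | slightly blurry | slightly hollow | hard | no |
10 | green | straight | crisp | clear | flat | soft | no |
11 | light | straight | crisp | blurry | flat | hard | no |
12 | light | curly | muffled | blurry | flat | soft | no |
13 | green | slightly curly | muffled | slightly blurry | hollow | hard | no |
14 | light | slightly curly | dull | slightly blurry | hollow | hard | no |
15 | dark | slightly curly | muffled | clear | slightly hollow | soft | no |
16 | light | curly | muffled | blurry | flat | hard | no |
17 | green | curly | dull | slightly blurry | slightly hollow | hard | no |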
Take the watermelon dataset in Table 1 as an example. It contains 17 samples ($n=1,2,3,\dots,17$), each sample has 6 attributes ($k=1,2,3,\dots,6$), and the samples fall into 2 classes ($c\in\{\text{yes},\text{no}\}$).
1. Of the 17 samples, 8 are good melons and 9 are bad melons, so the information entropy of the dataset $D$ is:

$$
Entropy(D)=-\frac{8}{17} \log \frac{8}{17}-\frac{9}{17} \log \frac{9}{17}=0.9975\tag{3}
$$
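2. Compute the information gain of every attribute in the set {color, root, sound, texture, umbilicus, touch}. Take "touch" as an example; it has the two values {hard, soft}:
- $D^1$ (touch = hard): the 12 samples {1,2,3,4,5,8,9,11,13,14,16,17}, of which 6 are good melons {1,2,3,4,5,8} and 6 are bad melons {9,11,13,14,16,17};
- $D^2$ (touch = soft): the 5 samples {6,7,10,12,15}, of which 2 are good melons {6,7} and 3 are bad melons {10,12,15}.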
Compute the entropy of these two subsets with Eq. (1):

$$
\begin{aligned}
Entropy(D^1)&=-\frac{6}{12} \log \frac{6}{12}-\frac{6}{12} \log \frac{6}{12}=1.0000\\
Entropy(D^2)&=-\frac{2}{5} \log \frac{2}{5}-\frac{3}{5} \log \frac{3}{5}=0.9710
\end{aligned}
$$
3. Compute the information gain of the attribute "touch" with Eq. (2):

$$
\begin{aligned}
Gain(D,\text{touch})&=Entropy(D)-\sum_{m=1}^{M}\frac{N_m}{N}Entropy(D^m)\\
&=0.9975-\left(\frac{12}{17}\times1.0000+\frac{5}{17}\times0.9710\right)\\
&=0.0060
\end{aligned}
$$
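The arithmetic in steps 1 to 3 can be checked from the class counts alone. This is a small sketch under the assumption of base-2 logarithms; the helper name `entropy_from_counts` is made up for this example.

```python
import math

def entropy_from_counts(counts):
    """Base-2 entropy from a list of per-class sample counts; empty classes are skipped."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

h_root = entropy_from_counts([8, 9])   # whole dataset: 8 good, 9 bad
h_hard = entropy_from_counts([6, 6])   # touch = hard: 6 good, 6 bad
h_soft = entropy_from_counts([2, 3])   # touch = soft: 2 good, 3 bad

# Eq. (2): subtract the size-weighted subset entropies from the parent entropy.
gain_touch = h_root - (12 / 17 * h_hard + 5 / 17 * h_soft)
print(f"{h_root:.4f} {h_hard:.4f} {h_soft:.4f} {gain_touch:.4f}")
```

The gain is very small because both subsets are almost as mixed as the full dataset, so "touch" is a poor first split.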
4. Compute the information gain of the other attributes with Eq. (2):

$$
\begin{aligned}
Gain(D,\text{color})&=0.9975-[\frac{6}{17}\times Entropy(\text{color}=\text{green})+\frac{6}{17}\times Entropy(\text{color}=\text{dark})+\frac{5}{17}\times Entropy(\text{color}=\text{light})]\\
&=0.9975-[\frac{6}{17}\times(-(\frac{3}{6}\log\frac{3}{6}+\frac{3}{6}\log\frac{3}{6}))+\frac{6}{17}\times(-(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6}))+\frac{5}{17}\times(-(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5}))]\\
&=0.1081\\
Gain(D,\text{root})&=0.9975-[\frac{8}{17}\times Entropy(\text{root}=\text{curly})+\frac{7}{17}\times Entropy(\text{root}=\text{slightly curly})+\frac{2}{17}\times Entropy(\text{root}=\text{straight})]\\
&=0.9975-[\frac{8}{17}\times(-(\frac{5}{8}\log\frac{5}{8}+\frac{3}{8}\log\frac{3}{8}))+\frac{7}{17}\times(-(\frac{3}{7}\log\frac{3}{7}+\frac{4}{7}\log\frac{4}{7}))+\frac{2}{17}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\\
&=0.1427\\
Gain(D,\text{sound})&=0.9975-[\frac{10}{17}\times Entropy(\text{sound}=\text{muffled})+\frac{5}{17}\times Entropy(\text{sound}=\text{dull})+\frac{2}{17}\times Entropy(\text{sound}=\text{crisp})]\\
&=0.9975-[\frac{10}{17}\times(-(\frac{6}{10}\log\frac{6}{10}+\frac{4}{10}\log\frac{4}{10}))+\frac{5}{17}\times(-(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}))+\frac{2}{17}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\\
&=0.1408\\
Gain(D,\text{texture})&=0.9975-[\frac{9}{17}\times Entropy(\text{texture}=\text{clear})+\frac{5}{17}\times Entropy(\text{texture}=\text{slightly blurry})+\frac{3}{17}\times Entropy(\text{texture}=\text{blurry})]\\
&=0.9975-[\frac{9}{17}\times(-(\frac{7}{9}\log\frac{7}{9}+\frac{2}{9}\log\frac{2}{9}))+\frac{5}{17}\times(-(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5}))+\frac{3}{17}\times(-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3}))]\\
&=0.3806\\
Gain(D,\text{umbilicus})&=0.9975-[\frac{7}{17}\times Entropy(\text{umbilicus}=\text{hollow})+\frac{6}{17}\times Entropy(\text{umbilicus}=\text{slightly hollow})+\frac{4}{17}\times Entropy(\text{umbilicus}=\text{flat})]\\
&=0.9975-[\frac{7}{17}\times(-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}))+\frac{6}{17}\times(-(\frac{3}{6}\log\frac{3}{6}+\frac{3}{6}\log\frac{3}{6}))+\frac{4}{17}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\\
&=0.2892
\end{aligned}
$$
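The gains of all six attributes can be recomputed from Table 1 with a short script. This is a sketch under stated assumptions: the English attribute names and values are the translations used here, and `RAW`, `DATA`, `entropy`, and `gain` are illustrative names, not part of the original article.

```python
import math
from collections import Counter, defaultdict

COLS = ["color", "root", "sound", "texture", "umbilicus", "touch", "good"]
# Rows 1..17 of Table 1, one sample per line, in column order COLS.
RAW = """\
green,curly,muffled,clear,hollow,hard,yes
dark,curly,dull,clear,hollow,hard,yes
dark,curly,muffled,clear,hollow,hard,yes
green,curly,dull,clear,hollow,hard,yes
light,curly,muffled,clear,hollow,hard,yes
green,slightly curly,muffled,clear,slightly hollow,soft,yes
dark,slightly curly,muffled,slightly blurry,slightly hollow,soft,yes
dark,slightly curly,muffled,clear,slightly hollow,hard,yes
dark,slightly curly,dull,slightly blurry,slightly hollow,hard,no
green,straight,crisp,clear,flat,soft,no
light,straight,crisp,blurry,flat,hard,no
light,curly,muffled,blurry,flat,soft,no
green,slightly curly,muffled,slightly blurry,hollow,hard,no
light,slightly curly,dull,slightly blurry,hollow,hard,no
dark,slightly curly,muffled,clear,slightly hollow,soft,no
light,curly,muffled,blurry,flat,hard,no
green,curly,dull,slightly blurry,slightly hollow,hard,no
"""
DATA = [dict(zip(COLS, line.split(","))) for line in RAW.splitlines()]

def entropy(labels):
    """Base-2 information entropy, Eq. (1)."""
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())

def gain(rows, attr, target="good"):
    """Information gain of splitting rows on attr, Eq. (2)."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[target])
    parent = entropy([r[target] for r in rows])
    return parent - sum(len(g) / len(rows) * entropy(g) for g in groups.values())

gains = {a: gain(DATA, a) for a in COLS[:-1]}
for a, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{a:10s} {g:.4f}")
best = max(gains, key=gains.get)
print("best attribute:", best)
```

Sorting by gain puts "texture" first and "touch" last, in line with the hand calculation.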
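One special case appears in the calculation above: for some attribute values (root = straight, sound = crisp, texture = blurry, umbilicus = flat), all the corresponding samples are negative examples and the number of positive examples is 0. The entropy of such a subset is:

$$
E=-(1\times \log(1)+0\times \log(0))
$$

$\log(0)$ itself is undefined, but since $\lim_{x\to 0^+}x\log(x)=0$, the term $0\times\log(0)$ is taken to be 0, so the entropy in this case is 0.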
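In code, the limit has to be handled explicitly, because evaluating the logarithm of 0 raises an error in Python. The following sketch shows one possible way to do it; the helper name `plogp` is an assumption made for this example.

```python
import math

def plogp(p):
    """p * log2(p), using the convention 0 * log2(0) = 0 (the limit as p -> 0+)."""
    return 0.0 if p == 0 else p * math.log2(p)

# Numerically, x * log2(x) shrinks toward 0 as x approaches 0 from above.
for x in (1e-1, 1e-3, 1e-6, 1e-12):
    print(x, x * math.log2(x))

# Entropy of a subset whose samples all belong to one class (proportions 1 and 0).
entropy_pure = sum(-plogp(p) for p in (1.0, 0.0))
print(entropy_pure)
```

An alternative with the same effect is to skip classes whose count is 0 when summing, which is what a counting-based implementation does implicitly.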
5. Choose the split attribute with the largest information gain, that is, split on "texture". The sample distribution after the split:

Attribute: value | Samples | Good melons | Bad melons | Entropy |
---|---|---|---|---|
texture: clear | {1,2,3,4,5,6,8,10,15} | {1,2,3,4,5,6,8} | {10,15} | $E(\text{texture}=\text{clear})=-(\frac{7}{9}\log\frac{7}{9}+\frac{2}{9}\log\frac{2}{9})=0.7642$ |
texture: slightly blurry | {7,9,13,14,17} | {7} | {9,13,14,17} | $E(\text{texture}=\text{slightly blurry})=-(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5})=0.7219$ |
texture: blurry | {11,12,16} | {} | {11,12,16} | $E(\text{texture}=\text{blurry})=-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3})=0.0000$ |
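The partition produced by the "texture" split can be reproduced by grouping sample IDs by texture value. The sketch below keeps only the two columns it needs from Table 1 (texture and class label); the variable names are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

# (texture, good melon) for samples 1..17, copied from Table 1.
TEXTURE_LABEL = [
    ("clear", "yes"), ("clear", "yes"), ("clear", "yes"), ("clear", "yes"),
    ("clear", "yes"), ("clear", "yes"), ("slightly blurry", "yes"), ("clear", "yes"),
    ("slightly blurry", "no"), ("clear", "no"), ("blurry", "no"), ("blurry", "no"),
    ("slightly blurry", "no"), ("slightly blurry", "no"), ("clear", "no"),
    ("blurry", "no"), ("slightly blurry", "no"),
]

def entropy(labels):
    """Base-2 information entropy, Eq. (1)."""
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())

subsets = defaultdict(list)   # texture value -> sample IDs in that branch
labels = defaultdict(list)    # texture value -> class labels in that branch
for sample_id, (texture, label) in enumerate(TEXTURE_LABEL, start=1):
    subsets[texture].append(sample_id)
    labels[texture].append(label)

ent = {value: entropy(ys) for value, ys in labels.items()}
for value in subsets:
    print(f"{value:16s} {subsets[value]} entropy={ent[value]:.4f}")
```

The "blurry" branch has entropy 0, so it is already a leaf, while the other two branches still need further splitting.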
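The branch texture = clear contains the following 9 samples:

ID | Color | Root | Sound | Texture | Umbilicus | Touch | Good melon |
---|---|---|---|---|---|---|---|
1 | green | curly | muffled | clear | hollow | hard | yes |
2 | dark | curly | dull | clear | hollow | hard | yes |
3 | dark | curly | muffled | clear | hollow | hard | yes |
4 | green | curly | dull | clear | hollow | hard | yes |
5 | light | curly | muffled | clear | hollow | hard | yes |
6 | green | slightly curly | muffled | clear | slightly hollow | soft | yes |
8 | dark | slightly curly | muffled | clear | slightly hollow | hard | yes |
10 | green | straight | crisp | clear | flat | soft | no |
15 | dark | slightly curly | muffled | clear | slightly hollow | soft | no |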
$$
\begin{aligned}
Gain(D^{\text{texture}=\text{clear}},\text{color})&=0.7642-[\frac{4}{9}\times Entropy(\text{color}=\text{green})+\frac{4}{9}\times Entropy(\text{color}=\text{dark})+\frac{1}{9}\times Entropy(\text{color}=\text{light})]\\
&=0.7642-[\frac{4}{9}\times(-(\frac{3}{4}\log\frac{3}{4}+\frac{1}{4}\log\frac{1}{4}))+\frac{4}{9}\times(-(\frac{3}{4}\log\frac{3}{4}+\frac{1}{4}\log\frac{1}{4}))+\frac{1}{9}\times(-(\frac{1}{1}\log\frac{1}{1}+\frac{0}{1}\log\frac{0}{1}))]\\
&=0.0431\\
Gain(D^{\text{texture}=\text{clear}},\text{root})&=0.7642-[\frac{5}{9}\times Entropy(\text{root}=\text{curly})+\frac{3}{9}\times Entropy(\text{root}=\text{slightly curly})+\frac{1}{9}\times Entropy(\text{root}=\text{straight})]\\
&=0.7642-[\frac{5}{9}\times(-(\frac{5}{5}\log\frac{5}{5}+\frac{0}{5}\log\frac{0}{5}))+\frac{3}{9}\times(-(\frac{2}{3}\log\frac{2}{3}+\frac{1}{3}\log\frac{1}{3}))+\frac{1}{9}\times(-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}))]\\
&=0.4581\\
Gain(D^{\text{texture}=\text{clear}},\text{sound})&=0.7642-[\frac{6}{9}\times Entropy(\text{sound}=\text{muffled})+\frac{2}{9}\times Entropy(\text{sound}=\text{dull})+\frac{1}{9}\times Entropy(\text{sound}=\text{crisp})]\\
&=0.7642-[\frac{6}{9}\times(-(\frac{5}{6}\log\frac{5}{6}+\frac{1}{6}\log\frac{1}{6}))+\frac{2}{9}\times(-(\frac{2}{2}\log\frac{2}{2}+\frac{0}{2}\log\frac{0}{2}))+\frac{1}{9}\times(-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}))]\\
&=0.3309\\
Gain(D^{\text{texture}=\text{clear}},\text{umbilicus})&=0.7642-[\frac{5}{9}\times Entropy(\text{umbilicus}=\text{hollow})+\frac{3}{9}\times Entropy(\text{umbilicus}=\text{slightly hollow})+\frac{1}{9}\times Entropy(\text{umbilicus}=\text{flat})]\\
&=0.7642-[\frac{5}{9}\times(-(\frac{5}{5}\log\frac{5}{5}+\frac{0}{5}\log\frac{0}{5}))+\frac{3}{9}\times(-(\frac{2}{3}\log\frac{2}{3}+\frac{1}{3}\log\frac{1}{3}))+\frac{1}{9}\times(-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}))]\\
&=0.4581\\
Gain(D^{\text{texture}=\text{clear}},\text{touch})&=0.7642-[\frac{6}{9}\times Entropy(\text{touch}=\text{hard})+\frac{3}{9}\times Entropy(\text{touch}=\text{soft})]\\
&=0.7642-[\frac{6}{9}\times(-(\frac{6}{6}\log\frac{6}{6}+\frac{0}{6}\log\frac{0}{6}))+\frac{3}{9}\times(-(\frac{1}{3}\log\frac{1}{3}+\frac{2}{3}\log\frac{2}{3}))]\\
&=0.4581
\end{aligned}
$$
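The three attributes "root", "umbilicus", and "touch" all attain the maximum information gain, so any one of them can be chosen for the next split.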
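The three-way tie can be confirmed by recomputing the gains on the 9 samples of the texture = clear branch. This is a minimal sketch; `ROWS`, `ATTRS`, and the helper functions are hypothetical names, and the attribute values are the English translations used here.

```python
import math
from collections import Counter, defaultdict

ATTRS = ["color", "root", "sound", "umbilicus", "touch"]
# Samples 1,2,3,4,5,6,8,10,15 (texture = clear): the five remaining attributes plus the label.
ROWS = [
    ("green", "curly", "muffled", "hollow", "hard", "yes"),
    ("dark", "curly", "dull", "hollow", "hard", "yes"),
    ("dark", "curly", "muffled", "hollow", "hard", "yes"),
    ("green", "curly", "dull", "hollow", "hard", "yes"),
    ("light", "curly", "muffled", "hollow", "hard", "yes"),
    ("green", "slightly curly", "muffled", "slightly hollow", "soft", "yes"),
    ("dark", "slightly curly", "muffled", "slightly hollow", "hard", "yes"),
    ("green", "straight", "crisp", "flat", "soft", "no"),
    ("dark", "slightly curly", "muffled", "slightly hollow", "soft", "no"),
]

def entropy(labels):
    """Base-2 information entropy, Eq. (1)."""
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())

def gain(rows, col):
    """Information gain of splitting rows on column index col, Eq. (2)."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[col]].append(r[-1])
    parent = entropy([r[-1] for r in rows])
    return parent - sum(len(g) / len(rows) * entropy(g) for g in groups.values())

gains = {a: gain(ROWS, i) for i, a in enumerate(ATTRS)}
top = max(gains.values())
# Attributes whose gain equals the maximum up to floating-point noise.
tied = [a for a in ATTRS if abs(gains[a] - top) < 1e-12]
print({a: round(g, 4) for a, g in gains.items()})
print("tied for best:", tied)
```

How a tie is broken is an implementation choice; picking the first attribute in a fixed order is one simple, deterministic option.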
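The branch texture = slightly blurry contains the following 5 samples:

ID | Color | Root | Sound | Texture | Umbilicus | Touch | Good melon |
---|---|---|---|---|---|---|---|
7 | dark | slightly curly | muffled | slightly blurry | slightly hollow | soft | yes |
9 | dark | slightly curly | dull | slightly blurry | slightly hollow | hard | no |
13 | green | slightly curly | muffled | slightly blurry | hollow | hard | no |
14 | light | slightly curly | dull | slightly blurry | hollow | hard | no |
17 | green | curly | dull | slightly blurry | slightly hollow | hard | no |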
$$
\begin{aligned}
Gain(D^{\text{texture}=\text{slightly blurry}},\text{color})&=0.7219-[\frac{2}{5}\times Entropy(\text{color}=\text{green})+\frac{2}{5}\times Entropy(\text{color}=\text{dark})+\frac{1}{5}\times Entropy(\text{color}=\text{light})]\\
&=0.7219-[\frac{2}{5}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))+\frac{2}{5}\times(-(\frac{1}{2}\log\frac{1}{2}+\frac{1}{2}\log\frac{1}{2}))+\frac{1}{5}\times(-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}))]\\
&=0.3219\\
Gain(D^{\text{texture}=\text{slightly blurry}},\text{root})&=0.7219-[\frac{4}{5}\times Entropy(\text{root}=\text{slightly curly})+\frac{1}{5}\times Entropy(\text{root}=\text{curly})]\\
&=0.7219-[\frac{4}{5}\times(-(\frac{1}{4}\log\frac{1}{4}+\frac{3}{4}\log\frac{3}{4}))+\frac{1}{5}\times(-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1}))]\\
&=0.0729\\
Gain(D^{\text{texture}=\text{slightly blurry}},\text{sound})&=0.7219-[\frac{2}{5}\times Entropy(\text{sound}=\text{muffled})+\frac{3}{5}\times Entropy(\text{sound}=\text{dull})]\\
&=0.7219-[\frac{2}{5}\times(-(\frac{1}{2}\log\frac{1}{2}+\frac{1}{2}\log\frac{1}{2}))+\frac{3}{5}\times(-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3}))]\\
&=0.3219\\
Gain(D^{\text{texture}=\text{slightly blurry}},\text{umbilicus})&=0.7219-[\frac{3}{5}\times Entropy(\text{umbilicus}=\text{slightly hollow})+\frac{2}{5}\times Entropy(\text{umbilicus}=\text{hollow})]\\
&=0.7219-[\frac{3}{5}\times(-(\frac{1}{3}\log\frac{1}{3}+\frac{2}{3}\log\frac{2}{3}))+\frac{2}{5}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\\
&=0.1710\\
Gain(D^{\text{texture}=\text{slightly blurry}},\text{touch})&=0.7219-[\frac{1}{5}\times Entropy(\text{touch}=\text{soft})+\frac{4}{5}\times Entropy(\text{touch}=\text{hard})]\\
&=0.7219-[\frac{1}{5}\times(-(\frac{1}{1}\log\frac{1}{1}+\frac{0}{1}\log\frac{0}{1}))+\frac{4}{5}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\\
&=0.7219
\end{aligned}
$$

The attribute "touch" has the largest information gain, so "touch" is chosen for the next split. Both resulting subsets are pure, which is why the gain equals the full entropy of the node, 0.7219.
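The branch texture = blurry contains the following 3 samples:

ID | Color | Root | Sound | Texture | Umbilicus | Touch | Good melon |
---|---|---|---|---|---|---|---|
11 | light | straight | crisp | blurry | flat | hard | no |
12 | light | curly | muffled | blurry | flat | soft | no |
16 | light | curly | muffled | blurry | flat | hard | no |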
All samples in this node belong to the same class (bad melon), so no further split is needed.
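The whole procedure (compute gains, split on the best attribute, stop when a node is pure) can be written as a short recursive function. The sketch below is one possible ID3-style implementation and not necessarily what any particular library does. Its assumptions: ties are broken by attribute order, each attribute is used at most once per path, a node with no attributes left becomes a majority-vote leaf, and attribute values that do not occur in a node get no branch. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict

COLS = ["color", "root", "sound", "texture", "umbilicus", "touch", "good"]
RAW = """\
green,curly,muffled,clear,hollow,hard,yes
dark,curly,dull,clear,hollow,hard,yes
dark,curly,muffled,clear,hollow,hard,yes
green,curly,dull,clear,hollow,hard,yes
light,curly,muffled,clear,hollow,hard,yes
green,slightly curly,muffled,clear,slightly hollow,soft,yes
dark,slightly curly,muffled,slightly blurry,slightly hollow,soft,yes
dark,slightly curly,muffled,clear,slightly hollow,hard,yes
dark,slightly curly,dull,slightly blurry,slightly hollow,hard,no
green,straight,crisp,clear,flat,soft,no
light,straight,crisp,blurry,flat,hard,no
light,curly,muffled,blurry,flat,soft,no
green,slightly curly,muffled,slightly blurry,hollow,hard,no
light,slightly curly,dull,slightly blurry,hollow,hard,no
dark,slightly curly,muffled,clear,slightly hollow,soft,no
light,curly,muffled,blurry,flat,hard,no
green,curly,dull,slightly blurry,slightly hollow,hard,no
"""
DATA = [dict(zip(COLS, line.split(","))) for line in RAW.splitlines()]

def entropy(labels):
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())

def gain(rows, attr):
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r["good"])
    parent = entropy([r["good"] for r in rows])
    return parent - sum(len(g) / len(rows) * entropy(g) for g in groups.values())

def build(rows, attrs):
    labels = [r["good"] for r in rows]
    if len(set(labels)) == 1:      # pure node: stop and return the class as a leaf
        return labels[0]
    if not attrs:                  # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))   # first attribute wins a tie
    rest = [a for a in attrs if a != best]
    values = dict.fromkeys(r[best] for r in rows)    # values present, in order of appearance
    return {best: {v: build([r for r in rows if r[best] == v], rest) for v in values}}

def predict(tree, row):
    while isinstance(tree, dict):  # descend until a leaf (a class label) is reached
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

tree = build(DATA, COLS[:-1])
print(tree)
```

Because `predict` only follows branches that exist, a sample with an attribute value never seen at some node would raise a `KeyError`; a fuller implementation would add a fallback leaf for such values.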