Feature selection with information gain and a Python implementation

Definition of entropy

Before looking at information gain, we need to be clear about the definition of information entropy. Entropy originated in physics and was later introduced into information theory by Shannon. In information theory, entropy measures the amount of information, namely the average amount of information contained in each received message; here a message corresponds to a sample or a feature. Entropy can be interpreted as a measure of uncertainty: the more random or disordered a feature is, the larger its entropy. Entropy therefore quantifies how much information a feature carries, and it is widely used in machine learning.

Computing entropy

H(X) = -∑_i p(x_i) * log_b(p(x_i))

Here b is the base of the logarithm, usually 2 or the natural constant e, and p(x_i) is the probability of the i-th value of X in the sample.
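As a quick sanity check, here is a minimal sketch (the label vector y below is made-up data) that evaluates the formula with base 2:

import numpy as np
from collections import Counter

# Made-up label vector: three samples of one class, one of the other.
y = np.array([0, 0, 0, 1])

probs = np.array(list(Counter(y).values()), dtype=float) / len(y)   # [0.75, 0.25]
entropy = -np.sum(probs * np.log2(probs))
print(round(entropy, 4))   # 0.8113 = -(0.75*log2(0.75) + 0.25*log2(0.25))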

Information gain IG(Y|X)

Information gain measures how well an attribute X separates the target Y: when an attribute (feature) X is added, the resulting change in entropy is the information gain. The larger IG(Y|X) is, the more important X is for Y. The conditional entropy H(Y|X) is the entropy of Y given X, i.e. H(Y|X) = ∑_x p(x) * H(Y|X=x).

IG(Y|X)=H(Y)-H(Y|X)
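To make the definitions concrete, here is a minimal sketch on made-up data (x_col and y are illustrative only): H(Y|X) is the probability-weighted average of the label entropy within each value of the feature, and the gain is the drop from H(Y).

import numpy as np
from collections import Counter

def entropy(labels):
    probs = np.array(list(Counter(labels).values()), dtype=float) / len(labels)
    return -np.sum(probs * np.log2(probs))

x_col = np.array([0, 0, 0, 1])   # made-up discrete feature column
y = np.array([0, 0, 1, 1])       # made-up class labels

# H(Y|X) = sum over feature values v of P(X=v) * H(Y | X=v)
h_y_given_x = sum(np.mean(x_col == v) * entropy(y[x_col == v]) for v in np.unique(x_col))
gain = entropy(y) - h_y_given_x
print(round(gain, 4))   # 0.3113: H(Y) = 1.0, H(Y|X) ~= 0.6887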


Below is a Python implementation that computes the information gain of each feature in a dataset:

# -*- coding: utf-8 -*-

import numpy as np
from collections import Counter


class EntropyGain(object):
    """
    Information entropy of a dataset, covering both feature entropy and
    information gain. Used to select the best features: the larger the
    gain, the better the feature.
    Parameters
    -------------------
    X: array, the training dataset.
    Y: array, the class labels.
    alpha: float or int, the threshold for selecting the best features.
    Attributes
    -------------------
    x, y, alpha: as passed in via the parameters.
    Methods
    -------------------
    get_feature_class_label: map a feature's values to the class labels and
        compute the conditional entropy H(Y|X) for that feature.
    select_k_best_feature: select the best features according to alpha.
    Samples
    -------------------
    >> eg = EntropyGain(X, Y, alpha=10)
    >> eg.select_k_best_feature()
    [(0, 0.8321), (2, 0.7681), (5, 0.7163), ..., (1, 0.4908)]
    """
    def __init__(self, X, Y, alpha=0.80):
        self.x = X
        self.y = Y
        self.alpha = alpha

    def __prob_y(self, y_=None):
        """Return {value: probability} for a 1-D array (defaults to the class labels)."""
        if y_ is None:
            y_ = self.y
        if not y_.size:
            raise ValueError("The variable 'y_' is empty.")
        n_sample = len(y_)
        label_prob = {v: float(c) / n_sample for v, c in Counter(y_.flatten()).items()}
        return label_prob

    def __get_label_entropy(self, class_prob=None):
        """
        param class_prob: dict, the class-label probabilities returned by __prob_y.
        return: float, the entropy of the class labels.
        """
        if class_prob is None:
            class_prob = self.__prob_y(y_=self.y)
        return sum(-p * np.log2(p) for p in class_prob.values())

    def get_feature_class_label(self, feature_or_index):
        """
        parameter
            feature_or_index: column index of the feature in the dataset x.
        return:
            float, the conditional entropy H(Y|X) of the labels given that feature.
        """
        x_ = self.x[:, feature_or_index]
        # P(X = v) for every value v of the feature.
        feature_prob_dict = self.__prob_y(y_=x_)
        # H(Y | X = v): entropy of the labels restricted to samples where X = v.
        feature_entropy = {v: self.__get_label_entropy(self.__prob_y(y_=self.y[x_ == v]))
                           for v in np.unique(x_)}
        # H(Y|X) = sum_v P(X = v) * H(Y | X = v)
        return sum(feature_prob_dict[k] * entropy for k, entropy in feature_entropy.items())

    def select_k_best_feature(self):
        """
        return:
            list of (feature index, information gain) pairs, sorted by gain in
            descending order and truncated according to alpha.
        """
        row_num, col_num = self.x.shape
        index_map_entropy = dict()

        entropy_h0 = self.__get_label_entropy(self.__prob_y(y_=self.y))
        print("Initial entropy of the labels:", round(entropy_h0, 4))
        for _index_ in range(col_num):
            # IG(Y|X) = H(Y) - H(Y|X) for each feature column.
            index_map_entropy[_index_] = round(entropy_h0 - self.get_feature_class_label(_index_), 4)

        ranked = sorted(index_map_entropy.items(), key=lambda d: d[1], reverse=True)
        if isinstance(self.alpha, int) and self.alpha <= col_num:
            # alpha is a count: keep the top alpha features.
            return ranked[:self.alpha]
        if isinstance(self.alpha, float) and 0.0 < self.alpha <= 1:
            # alpha is a ratio: keep the top fraction of features.
            return ranked[:int(col_num * self.alpha)]
        raise ValueError("alpha must be an int <= n_features or a float in (0, 1].")
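A quick usage sketch of the class above (the random data is made up purely for illustration):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(200, 6))   # 200 samples, 6 discrete features
Y = rng.randint(0, 2, size=200)        # binary class labels

eg = EntropyGain(X, Y, alpha=3)        # keep the 3 features with the highest gain
print(eg.select_k_best_feature())      # list of (feature index, gain) pairs, highest first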