一、KNN算法核心思想和原理
1.1、怎么想出来的?
近朱者赤,近墨者黑!
距离决定一切、民主集中制
1.2、基本原理 —— 分类
- k个最近的邻居
- 民主集中制投票
- 分类表决与加权分类表决
1.3、基本原理 —— 回归
- 计算未知点的值
- 决策规则不同
- 均值法与加权均值法
1.4、如何选择K值?
- K太小导致“过拟合”(过分相信某个数据),容易把噪声学进来
- K太大导致“欠拟合”,决策效率低
- K不能太小也不能太大
- Fit = 最优拟合(找三五个熟悉的人问问),通过超参数调参实现 ~
1.5、距离的度量
- 明氏距离 Minkowski Distance
- p为距离的阶数,n为特征空间的维度
- p=1时,即曼哈顿距离;p=2时,即欧式距离
- p趋向于无穷时,为切比雪夫距离
- ·p=1时,曼哈顿距离 Manhattan Distance
- ·p=2时,欧式距离 Euclidean Distance
- 空间中两点的直线距离
1.6、特征归一化的重要性
简单来讲,就是统一坐标轴比例
二、代码实现 KNN 预测
KNN 预测的过程
- 1. 计算新样本点与已知样本点的距离
- 2. 按距离排序
- 3. 确定k值
- 4. 距离最近的k个点投票
若不使用scikit-learn:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
# 样本特征
data_X = [
[1.3, 6],
[3.5, 5],
[4.2, 2],
[5, 3.3],
[2, 9],
[5, 7.5],
[7.2, 4 ],
[8.1, 8],
[9, 2.5]
]
# 样本标记
data_y = [0,0,0,0,1,1,1,1,1]
# 训练集
X_train = np.array(data_X)
y_train = np.array(data_y)
# 新的样本点
data_new = np.array([4,5])
# 1. 计算新样本点与已知样本点的距离
distance = [np.sqrt(np.sum(data - data_new)**2) for data in X_train]
# 2. 按距离排序
sort_index = np.argsort(distance)
# 3. 确定k值
k = 5
# 4. 距离最近的k个点投票
first_k = [y_train[i] for i in sort_index[:k]]
predict_y = Counter(first_k).most_common(1)[0][0]
print(predict_y)
若使用sklearn:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# 样本特征
data_X = [
[1.3, 6],
[3.5, 5],
[4.2, 2],
[5, 3.3],
[2, 9],
[5, 7.5],
[7.2, 4 ],
[8.1, 8],
[9, 2.5]
]
# 样本标记
data_y = [0,0,0,0,1,1,1,1,1]
# 训练集
X_train = np.array(data_X)
y_train = np.array(data_y)
# 新的样本点
data_new = np.array([4,5])
# 创造类的实例
knn_classifier = KNeighborsClassifier(n_neighbors=5)
# fit
knn_classifier.fit(X_train,y_train)
# sklearn支持预测多个数据,而我们只有一个数据,所以需要将其转为二维
data_new.reshape(1,-1)
predict_y = knn_classifier.predict(data_new.reshape(1,-1))
print(predict_y)
三、划分数据集:训练集与预测集
为什么要划分数据集?
评价模型性能
防止过拟合
提升泛化能力
3.1、划分数据集代码实现
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
x, y = make_blobs(
n_samples = 300, # 样本总数
n_features = 2,
centers = 3,
cluster_std = 1, # 类內标准差
center_box = (-10, 10),
random_state = 233,
return_centers = False
)
plt.scatter(x[:,0], x[:,1], c = y,s = 15)
plt.show()
划分数据集
index = np.arange(20)
index
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
np.random.shuffle(index)
index
array([13, 16, 2, 19, 7, 14, 9, 0, 1, 11, 8, 6, 15, 10, 4, 18, 3, 12, 17, 5])
np.random.permutation(20)
array([12, 19, 6, 7, 11, 10, 4, 8, 16, 3, 2, 15, 18, 5, 9, 0, 1, 14, 13, 17])
np.random.seed(233)
shuffle = np.random.permutation(len(x))
shuffle
array([ 23, 86, 204, 287, 206, 170, 234, 94, 146, 180, 263, 22, 3, 264, 194, 290, 229, 177, 208, 202, 10, 188, 262, 120, 148, 121, 98, 160, 267, 136, 294, 2, 34, 142, 271, 133, 127, 12, 29, 49, 112, 218, 36, 57, 45, 11, 25, 151, 212, 289, 157, 19, 275, 176, 144, 82, 161, 77, 51, 152, 135, 16, 65, 189, 298, 279, 37, 187, 44, 210, 178, 165, 6, 162, 66, 32, 198, 43, 108, 211, 67, 119, 284, 90, 89, 56, 217, 158, 228, 248, 191, 47, 296, 123, 181, 200, 40, 87, 232, 97, 113, 122, 220, 153, 173, 68, 99, 61, 273, 269, 281, 209, 4, 110, 259, 95, 205, 288, 8, 283, 231, 291, 171, 111, 242, 216, 285, 54, 100, 38, 185, 235, 174, 201, 107, 223, 222, 196, 268, 114, 147, 166, 85, 39, 58, 256, 258, 74, 251, 15, 150, 137, 70, 91, 52, 14, 169, 21, 184, 207, 238, 128, 219, 125, 293, 134, 27, 265, 96, 270, 18, 109, 126, 203, 88, 249, 92, 213, 60, 227, 5, 59, 9, 138, 236, 280, 124, 199, 225, 149, 145, 246, 192, 102, 48, 73, 20, 31, 63, 237, 78, 62, 233, 118, 277, 28, 50, 64, 117, 197, 140, 7, 105, 252, 71, 190, 76, 103, 93, 183, 72, 0, 278, 79, 172, 214, 182, 292, 139, 260, 30, 195, 13, 244, 240, 297, 257, 245, 143, 186, 243, 266, 286, 168, 179, 81, 215, 129, 167, 106, 261, 42, 276, 69, 224, 253, 247, 155, 154, 17, 132, 24, 141, 239, 80, 101, 75, 159, 116, 46, 272, 226, 83, 156, 33, 115, 282, 299, 55, 250, 221, 254, 255, 41, 130, 104, 26, 53, 84, 274, 1, 163, 230, 35, 241, 164, 193, 175, 131, 295])
shuffle.shape
(300,)
train_size = 0.7
train_index = shuffle[:int(len(x) * train_size)]
test_index = shuffle[int(len(x) * train_size):]
train_index.shape, test_index.shape
((210,), (90,))
x[train_index].shape, y[train_index].shape
((210, 2), (210,))
x[test_index].shape, y[test_index].shape
((90, 2), (90,))
def my_train_test_split(x, y, train_size = 0.7, random_state = None):
if random_state:
np.random.seed(random_state)
shuffle = np.random.permutation(len(x))
train_index = shuffle[:int(len(x) * train_size)]
test_index = shuffle[int(len(x) * train_size):]
return x[train_index], x[test_index], y[train_index], y[test_index]
x_train, x_test, y_train, y_test = my_train_test_split(x, y, train_size = 0.7, random_state = 233)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
((210, 2), (90, 2), (210,), (90,))
plt.scatter(x_train[:, 0], x_train[:, 1], c = y_train, s = 15)
plt.show()
plt.scatter(x_test[:, 0], x_test[:, 1], c = y_test, s = 15)
plt.show()
3.2、sklearn划分数据集
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.7, random_state = 233)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
((210, 2), (90, 2), (210,), (90,))
from collections import Counter
Counter(y_test)
Counter({2: 34, 0: 25, 1: 31})
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.7, random_state = 233, stratify = y)
Counter(y_test)
Counter({2: 30, 0: 30, 1: 30})
四、模型评价
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# 1、加载数据集
iris = datasets.load_iris()
X = iris.data
y = iris.target
# 2、拆分数据集,首先需乱序处理
# 2.1、自己拆分不调包 ~
shuffle_index = np.random.permutation(len(y))
train_ratio = 0.8
train_size = int(len(y)*train_ratio)
train_index = shuffle_index[:train_size]
test_index = shuffle_index[train_size:]
X_train = X[train_index]
y_train = y[train_index]
X_test = X[test_index]
y_test = y[test_index]
# 2.2、调包 ~
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8,random_state=666)
# 3、预测
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)
# 若不关注预测结果只关注预测精度
# accuracy_score(X_test,y_test)
y_predict = knn_classifier.predict(X_test)
print(y_predict)
# 4、评价
accutacy = np.sum(y_predict == y_test) / len(y_test)
# 或使用
accuracy_score(y_test,y_predict)
五、超参数 Hyperpatameter
人为设置的参数 / 经验值 / 参数搜索
KNN的三个超参数:
k个最近的邻居
分类表决与加权分类表决
明氏距离中的p
首先加载数据
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
iris = load_iris()
x = iris.data
y = iris.target
x.shape, y.shape
((150, 4), (150,))
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=233, stratify=y)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
((105, 4), (45, 4), (105,), (45,))
5.1、超参数
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(
n_neighbors=3,
weights='distance',#'uniform',
p = 2
)
neigh.fit(x_train, y_train)
KNeighborsClassifier
KNeighborsClassifier(n_neighbors=3, weights='distance')
neigh.score(x_test, y_test)
0.9777777777777777
best_score = -1
best_n = -1
best_weight = ''
best_p = -1
for n in range(1, 20):
for weight in ['uniform', 'distance']:
for p in range(1, 7):
neigh = KNeighborsClassifier(
n_neighbors=n,
weights=weight,
p = p
)
neigh.fit(x_train, y_train)
score = neigh.score(x_test, y_test)
if score > best_score:
best_score = score
best_n = n
best_weight = weight
best_p = p
print("n_neighbors:", best_n)
print("weights:", best_weight)
print("p:", best_p)
print("score:", best_score)
n_neighbors: 5 weights: uniform p: 2 score: 1.0
5.2、sklearn 超参数搜索
from sklearn.model_selection import GridSearchCV
params = {
'n_neighbors': [n for n in range(1, 20)],
'weights': ['uniform', 'distance'],
'p': [p for p in range(1, 7)]
}
grid = GridSearchCV(
estimator=KNeighborsClassifier(),
param_grid=params,
n_jobs=-1
)
grid.fit(x_train, y_train)
GridSearchCV
GridSearchCV(estimator=KNeighborsClassifier(), n_jobs=-1, param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], 'p': [1, 2, 3, 4, 5, 6], 'weights': ['uniform', 'distance']})
estimator: KNeighborsClassifier
KNeighborsClassifier()
KNeighborsClassifier
KNeighborsClassifier()
grid.best_params_
{'n_neighbors': 9, 'p': 2, 'weights': 'uniform'}
grid.best_score_
0.961904761904762
grid.best_estimator_
KNeighborsClassifier
KNeighborsClassifier(n_neighbors=9)
grid.best_estimator_.predict(x_test)
array([2, 2, 0, 1, 1, 1, 2, 0, 2, 0, 0, 1, 0, 2, 1, 1, 0, 2, 2, 1, 0, 1, 1, 2, 2, 0, 0, 1, 1, 0, 2, 2, 0, 1, 1, 2, 1, 1, 0, 0, 0, 2, 0, 1, 1])
grid.best_estimator_.score(x_test, y_test)
0.9555555555555556
六、特征归一化
特征量纲不同。 为了消除数据特征量纲之间的影响,使得不同指标具有一定程度的可比性,能够同时反应每个指标的重要程度。
6.1、最值归一化方法
适用于数据分布在有限范围的情况。但受特殊数值影响很大。
X[:,0] = (X[:,0] - np.min(X[:,0])) / (np.max(X[:,0]) - np.min(X[:,0]))
X[:5,0]
array([0.22222222, 0.16666667, 0.11111111, 0.08333333, 0.19444444])
6.2、零均值归一化
X[:,0] = (X[:,0] - np.mean(X[:,0]))/np.std(X[:,0])
X[:5,0]
array([-0.90068117, -1.14301691, -1.38535265, -1.50652052, -1.02184904])
scikit-learn 中的StandardScaler
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
standard_scaler.fit(X)
standard_scaler.mean_
array([5.84333333, 3.05733333, 3.758 , 1.19933333])
standard_scaler.scale_
array([0.82530129, 0.43441097, 1.75940407, 0.75969263])
注意要重新赋值给X!
X = standard_scaler.transform(X)
** 测试集如何归一化?
不是用测试集的均值和标准差,而是用训练集的!
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,train_size=0.8,random_state=666)
standard_scaler = StandardScaler()
standard_scaler.fit(X_train)
X_train_standard = standard_scaler.transform(X_train)
X_test_standard = standard_scaler.transform(X_test)
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train_standard,y_train)
knn_classifier.score(X_test_standard, y_test)
七、KNN 回归任务实现
import numpy as np
import matplotlib.pyplot as plt
# 样本特征
data_X = [
[1.3, 6],
[3.5, 5],
[4.2, 2],
[5, 3.3],
[2, 9],
[5, 7.5],
[7.2, 4 ],
[8.1, 8],
[9, 2.5]
]
data_y = [0.1,0.3,0.5,0.7,0.9,1.1,1.3,1.5,1.7]
X_train = np.array(data_X)
y_train = np.array(data_y)
data_new = np.array([4,5])
plt.scatter(X_train[:,0],X_train[:,1],color='black')
plt.scatter(data_new[0], data_new[1],color='b', marker='^')
for i in range(len(y_train)):
plt.annotate(y_train[i], xy=X_train[i], xytext=(-15,-15), textcoords='offset points')
plt.show()
distances = [np.sqrt(np.sum((data - data_new)**2)) for data in X_train]
sort_index = np.argsort(distances)
k = 5
first_k = [y_train[i] for i in sort_index[:k]]
from collections import Counter
Counter(first_k).most_common(1)
predict_y = Counter(first_k).most_common(1)[0][0]
predict_y
0.3
k = 5
first_k = [y_train[i] for i in sort_index[:k]]
np.mean(first_k)
0.54
7.2、KNN回归 Scikit learn 实现
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
KNeighborsRegressor
KNeighborsRegressor()
predict_y = knn_reg.predict(data_new.reshape(1,-1))
predict_y
array([0.54])
7.3、Boston 数据集
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
import warnings
warnings.filterwarnings("ignore")
boston = load_boston()
x = boston.data
y = boston.target
x.shape, y.shape
((506, 13), (506,))
print(boston.DESCR)
.. _boston_dataset: Boston house prices dataset --------------------------- **Data Set Characteristics:** :Number of Instances: 506 :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target. :Attribute Information (in order): - CRIM per capita crime rate by town - ZN proportion of residential land zoned for lots over 25,000 sq.ft. - INDUS proportion of non-retail business acres per town - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) - NOX nitric oxides concentration (parts per 10 million) - RM average number of rooms per dwelling - AGE proportion of owner-occupied units built prior to 1940 - DIS weighted distances to five Boston employment centres - RAD index of accessibility to radial highways - TAX full-value property-tax rate per $10,000 - PTRATIO pupil-teacher ratio by town - B 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town - LSTAT % lower status of the population - MEDV Median value of owner-occupied homes in $1000's :Missing Attribute Values: None :Creator: Harrison, D. and Rubinfeld, D.L. This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter. The Boston house-price data has been used in many machine learning papers that address regression problems. .. topic:: References - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261. - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
x_train ,x_test, y_train, y_test = train_test_split(x, y ,train_size = 0.7, random_state=233)
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor(n_neighbors=5, weights='distance', p=2)
knn_reg.fit(x_train, y_train)
KNeighborsRegressor
KNeighborsRegressor(weights='distance')
knn_reg.score(x_test, y_test)
0.49308828546554706
归一化
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(x_train)
StandardScaler
StandardScaler()
x_train = standardScaler.transform(x_train)
x_test = standardScaler.transform(x_test)
knn_reg.fit(x_train, y_train)
KNeighborsRegressor
KNeighborsRegressor(weights='distance')
knn_reg.score(x_test, y_test)
0.8315777292735131
代码参考于
Chapter-04/4-7 特征归一化.ipynb · 梗直哥/Machine-Learning - Gitee.com