The K-Nearest Neighbor Algorithm: Core Idea
K-Nearest Neighbor (KNN) is a basic method for both classification and regression, and one of the commonly used supervised learning methods. In plain terms, KNN follows the old saying "birds of a feather flock together": to know what someone is like, look at the friends around them. In machine learning, to decide the class of an instance, look at the classes of the instances closest to it. For example, to figure out which district of Beijing you live in, check which district the K people nearest to you live in; with high probability you live in the same one.
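To make the "vote among the K nearest neighbors" idea concrete, here is a minimal sketch in plain NumPy (the helper name knn_predict and its arguments are my own illustration, not the scikit-learn API used later in this note):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]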
Distance Metrics
Euclidean Distance
The Euclidean distance is the familiar straight-line distance between two points taught in school geometry.
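For two points $x=(x_1,\dots,x_n)$ and $y=(y_1,\dots,y_n)$ (notation mine), it is

$$d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$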
Manhattan Distance
Driving from one intersection to another in Manhattan's grid of streets, the driving distance is clearly not the straight-line distance between the two points. That driving distance is the Manhattan distance, which is also called the city-block distance.
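With the same notation, the Manhattan distance sums the absolute coordinate differences:

$$d(x, y) = \sum_{i=1}^{n}\lvert x_i - y_i\rvert$$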
Chebyshev Distance
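The Chebyshev distance keeps only the largest coordinate difference:

$$d(x, y) = \max_{i}\lvert x_i - y_i\rvert$$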
Minkowski Distance
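The Minkowski distance generalizes the previous metrics with a parameter $p$: $p=1$ gives Manhattan, $p=2$ gives Euclidean, and $p\to\infty$ gives Chebyshev.

$$d(x, y) = \left(\sum_{i=1}^{n}\lvert x_i - y_i\rvert^{p}\right)^{1/p}$$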
Standardized Euclidean Distance
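The standardized Euclidean distance divides each coordinate difference by that feature's standard deviation $s_i$, so features measured on large scales do not dominate:

$$d(x, y) = \sqrt{\sum_{i=1}^{n}\frac{(x_i - y_i)^2}{s_i^{2}}}$$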
Cosine Distance
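The cosine distance compares the directions of two vectors rather than their magnitudes:

$$d(x, y) = 1 - \frac{x \cdot y}{\lVert x\rVert\,\lVert y\rVert}$$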
Hamming Distance
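The Hamming distance between two equal-length sequences is the number (or fraction) of positions at which they differ; it is intended for discrete features. For example, the strings 1011 and 1001 differ in exactly one position.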
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Distance Computation for Continuous and Discrete Features
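As a quick illustration, the continuous metrics above apply to numeric feature vectors, while the Hamming distance suits discrete sequences. A small sketch using scipy.spatial.distance (toy arrays of my own; SciPy ships alongside scikit-learn):

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))               # Euclidean: sqrt(3**2 + 2**2 + 0**2)
print(distance.cityblock(a, b))               # Manhattan: 3 + 2 + 0 = 5
print(distance.chebyshev(a, b))               # Chebyshev: max(3, 2, 0) = 3
print(distance.minkowski(a, b, p=3))          # Minkowski with p = 3
print(distance.seuclidean(a, b, np.ones(3)))  # Standardized Euclidean (unit variances here)
print(distance.cosine(a, b))                  # Cosine distance = 1 - cosine similarity

# Hamming distance compares two equal-length discrete sequences
s1 = np.array([1, 0, 1, 1])
s2 = np.array([1, 1, 0, 1])
print(distance.hamming(s1, s2))               # Fraction of differing positions: 2/4 = 0.5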
Load the Data
iris_data = load_iris()
iris_data.data[0:5, :]
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2]])
# Feature names
iris_data.feature_names
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
# Class labels
print(iris_data.target_names)
pd.DataFrame(iris_data.target).value_counts()
['setosa' 'versicolor' 'virginica']
0 50
1 50
2 50
dtype: int64
Basic Statistical Summary
X = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
y = iris_data.target
X.describe()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
plt.figure(figsize=(3,3))
sns.heatmap(X.corr(), annot=True)
[Figure: correlation heatmap of the four features]
# pairplot creates its own figure grid, so a preceding plt.figure() call has no effect
sns.pairplot(X)
[Figure: pairwise scatter plots of the four features]
Split the Dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=47)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
((120, 4), (120,), (30, 4), (30,))
Feature Engineering
Normalization / Standardization
# Fit the scaler on the training data only, then apply the same transformation to the test data
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
x_train.shape, x_test.shape
((120, 4), (30, 4))
Model Training
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(x_train, y_train)
KNeighborsClassifier()
Model Evaluation
y_pred = knn_model.predict(x_test)
y_pred == y_test
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, False, True, True, True,
True, True, True])
# Accuracy on the training set
knn_model.score(x_train, y_train)
0.9666666666666667
# Accuracy on the held-out test set
knn_model.score(x_test, y_test)
0.9666666666666667
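Beyond the plain accuracy score, a per-class breakdown can help; here is a small follow-up sketch with scikit-learn's classification_report and confusion_matrix (not part of the run above, so treat its output as illustrative and dependent on the split):

from sklearn.metrics import classification_report, confusion_matrix

# Precision/recall/F1 for each class and the confusion matrix on the same test split
print(classification_report(y_test, y_pred, target_names=iris_data.target_names))
print(confusion_matrix(y_test, y_pred))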