机器学习-基础分类算法-KNN详解

news2024/9/24 1:26:08

KNN-k近邻算法

k-Nearest Neighbors

  • 思想极度简单
  • 应用数学只是少
  • 效果好
  • 可以解释机器学习算法使用过程中的很多细节问题
  • 更完整的刻画机器学习应用的流程

image.png
image.png
image.png
创建简单测试用例

import numpy as np
import matplotlib.pyplot as plt
raw_data_X = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.424088941],
              [7.939820817, 0.791637231]
             ]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)
plt.scatter(X_train[y_train==0,0], X_train[y_train==0,1], color='g')
plt.scatter(X_train[y_train==1,0], X_train[y_train==1,1], color='r')
plt.show()

无标题.png
预测

x = np.array([8.093607318, 3.365731514])

plt.scatter(X_train[y_train==0,0], X_train[y_train==0,1], color='g')
plt.scatter(X_train[y_train==1,0], X_train[y_train==1,1], color='r')
plt.scatter(x[0], x[1], color='b')
plt.show()

无标题.png
KNN过程

from math import sqrt
distances = []
for x_train in X_train:
    d = sqrt(np.sum((x_train - x)**2))
    distances.append(d)

image.png
image.png

distances = [sqrt(np.sum((x_train - x)**2))
             for x_train in X_train]

数组排序返回索引

np.argsort(distances)
nearest = np.argsort(distances)

最近k个点相近的y坐标

k = 6
topK_y = [y_train[neighbor] for neighbor in nearest[:k]]
topK_y

image.png
统计

from collections import Counter
votes = Counter(topK_y)
votes.most_common(1)
predict_y = votes.most_common(1)[0][0]

image.png

封装

import numpy as np
from math import sqrt
from collections import Counter


def kNN_classify(k, X_train, y_train, x):

    assert 1 <= k <= X_train.shape[0], "k must be valid"
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must equal to the size of y_train"
    assert X_train.shape[1] == x.shape[0], \
        "the feature number of x must be equal to X_train"

    distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]
    nearest = np.argsort(distances)

    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)

    return votes.most_common(1)[0][0]

  • k近邻算法是非常特殊的,可以被认为是没有模型的算法
  • 为了和其他算法统一,可以认为训练数据集就是模型本身

使用scikit-learn中的kNN

raw_data_X = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.424088941],
              [7.939820817, 0.791637231]
             ]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)

x = np.array([8.093607318, 3.365731514])
from sklearn.neighbors import KNeighborsClassifier
kNN_classifier = KNeighborsClassifier(n_neighbors=6)
kNN_classifier.fit(X_train, y_train)
#预测
kNN_classifier.predict(x)
#转化为矩阵
X_predict = x.reshape(1, -1)
kNN_classifier.predict(X_predict)
y_predict = kNN_classifier.predict(X_predict)

image.png
优化封装的KNN

import numpy as np
from math import sqrt
from collections import Counter


class KNNClassifier:

    def __init__(self, k):
        """初始化kNN分类器"""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """根据训练数据集X_train和y_train训练kNN分类器"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k."

        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果向量"""
        assert self._X_train is not None and self._y_train is not None, \
                "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
                "the feature number of X_predict must be equal to X_train"

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """给定单个待预测数据x,返回x的预测结果值"""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"

        distances = [sqrt(np.sum((x_train - x) ** 2))
                     for x_train in self._X_train]
        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)

        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k=%d)" % self.k

image.png

判断机器学习算法的性能-训练数据分割测试数据

image.png
分一部分数据设置为测试数据
image.png
train test split
加载数据

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets 
iris = datasets.load_iris()
iris.keys()
X = iris.data
y = iris.target
X.shape
y.shape

image.png
**分离出一部分数据做训练,另外一部分数据做测试 **
将索引随机排列

shuffled_indexes = np.random.permutation(len(X))
shuffled_indexes

设置测试数据百分比
获取测试数据

test_ratio = 0.2
test_size = int(len(X) * test_ratio)
test_indexes = shuffled_indexes[:test_size]
train_indexes = shuffled_indexes[test_size:]
X_train = X[train_indexes]
y_train = y[train_indexes]
X_test = X[test_indexes]
y_test = y[test_indexes]

image.png
封装算法

import numpy as np
def train_test_split(X, y, test_ratio=0.2, seed=None):
    """将数据 X 和 y 按照test_ratio分割成X_train, X_test, y_train, y_test"""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ration must be valid"

    if seed:
        np.random.seed(seed)

    shuffled_indexes = np.random.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]

    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test

image.png
sklearn中的train_test_split
image.png

分类准确度

加载手写数字数据

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits()
digits.keys()
X = digits.data
X.shape
y = digits.target
y.shape

image.png
查看数据

some_digit = X[666]
some_digit_image = some_digit.reshape(8, 8)
import matplotlib
import matplotlib.pyplot as plt
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary)
plt.show()

image.png
预测准确率

from playML.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio=0.2)
from playML.kNN import KNNClassifier

my_knn_clf = KNNClassifier(k=3)
my_knn_clf.fit(X_train, y_train)
y_predict = my_knn_clf.predict(X_test)
sum(y_predict == y_test) / len(y_test)

image.png
封装自己的accuracy_score

import numpy as np


def accuracy_score(y_true, y_predict):
    '''计算y_true和y_predict之间的准确率'''
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"

    return sum(y_true == y_predict) / len(y_true)

image.png
scikit-learn中的accuracy_score

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)

超参数和模型参数

  • 超参数:在算法运行前需要决定的参数
  • 模型参数:算法过程中学习的参数

KNN算法没有模型参数
KNN算法中的K是典型的超参数
寻找好的超参数

  • 领域知识
  • 经验数值
  • 实验搜索

加载手写数字数据集

import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)

寻找最好的k、

best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
        
print("best_k =", best_k)
print("best_score =", best_score)

image.png
image.png
image.png
加入是否考虑距离

best_score = 0.0
best_k = -1
best_method = ""
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
        
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)

明可夫斯基距离
image.png

best_score = 0.0
best_k = -1
best_p = -1

for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_p = p
            best_score = score
        
print("best_k =", best_k)
print("best_p =", best_p)
print("best_score =", best_score)

网格搜索 Grid Search

定义搜索参数

param_grid = [
    {
        'weights': ['uniform'], 
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)], 
        'p': [i for i in range(1, 6)]
    }
]
knn_clf = KNeighborsClassifier()
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(knn_clf, param_grid)
grid_search.fit(X_train, y_train)

查看最佳分类器参数和准确度

grid_search.best_estimator_
grid_search.best_score_

image.png

更多的距离定义

  • 向量空间余弦相似度 Cosine Similarity
  • 调整余弦相似度 Adjusted Cosine Similarity
  • 皮尔森相关系数 Pearson Correlation Coefficient
  • Jaccard相似吸收 Jaccard Coeffcient

数据归一化 Feature Scaling

将所有的数据映射到同一尺度
最值归一化 normalization:把所有数据映射到0-1之间
image.png
适用于分布有明显边界的情况;受outlier影响较大

x = np.random.randint(0, 100, 100) 
(x - np.min(x)) / (np.max(x) - np.min(x))

image.png

X = np.random.randint(0, 100, (50, 2))
X = np.array(X, dtype=float)
X[:,0] = (X[:,0] - np.min(X[:,0])) / (np.max(X[:,0]) - np.min(X[:,0]))
X[:,1] = (X[:,1] - np.min(X[:,1])) / (np.max(X[:,1]) - np.min(X[:,1]))
X[:10,:]

image.png
均值方差归一化 standardization
数据分布没有明显的边界,有可能存在极端数据值
均值方差归一化:把所有数据归一到均值为0方差为1的分布中
image.png

X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype=float)
X2[:,0] = (X2[:,0] - np.mean(X2[:,0])) / np.std(X2[:,0])
X2[:,1] = (X2[:,1] - np.mean(X2[:,1])) / np.std(X2[:,1])

image.png

测试数据集的归一化

测试数据说模拟真实环境

  • 真实环境很有可能无法得到所有测试数据的均值和方差
  • 对数据的归一化也是算法的一部分

(x_test-mean_train)/std_train

scikit-learn中的Scaler

image.png
加载数据

import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

数据分割

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=666)

数据归一化

from sklearn.preprocessing import StandardScaler 
standardScalar = StandardScaler() 
standardScalar.fit(X_train)
standardScalar.transform(X_train)#归一化

image.png
归一化结果赋值

X_train = standardScalar.transform(X_train)
X_test_standard = standardScalar.transform(X_test) 

使用归一化后的数据进行knn分类

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test_standard, y_test)

实现自己的standardScaler

import numpy as np


class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """根据训练数据集X获得数据的均值和方差"""
        assert X.ndim == 2, "The dimension of X must be 2"

        self.mean_ = np.array([np.mean(X[:,i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:,i]) for i in range(X.shape[1])])

        return self

    def transform(self, X):
        """将X根据这个StandardScaler进行均值方差归一化处理"""
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.mean_ is not None and self.scale_ is not None, \
               "must fit before transform!"
        assert X.shape[1] == len(self.mean_), \
               "the feature number of X must be equal to mean_ and std_"

        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:,col] = (X[:,col] - self.mean_[col]) / self.scale_[col]
        return resX

有关k近邻算法

解决分类问题
天然可以解决多分类问题
思想简单,效果强大
最大缺点:效率低下
如果训练集有m个样本,n个特征,则预测每一个新的数据,需要O(m*n)
优化 使用树结构:KD-Tree,Ball-Tree
缺点2:高度数据相关
缺点3:预测结果不具有可解释性
维数灾难
随着维度的增加,“看似相近”的两个点之间的距离越来越大
解决方法:降维、

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1430616.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

单片机学习笔记---定时器和中断系统如何连起来工作

前面两节我们分别讲了中断系统和定时器&#xff0c;这节我们看看这两者连起来工作的原理。 说明&#xff1a;看这一节之前一定要先把前两节给看明白了再仔细琢磨这一节的每一张图&#xff01; 前两节&#xff1a; 单片机学习笔记---中断系统&#xff08;含外部中断&#xff…

Python基础知识:Python注释及print函数、input函数

在Python中&#xff0c;注释是对相应代码的解释&#xff0c;以增加代码的可读性&#xff0c;让用户能够更好地理解相应代码的含义。注释通过在相应代码后面加上“#”号来实现。比如以下代码 data.describe()#对数据集进行描述性分析 其中data.describe()为需要被执行的代码&a…

网络安全之漏洞扫描

漏洞是在硬件、软件、协议的具体实现或系统安全策略上存在的缺陷&#xff0c;从而可以使攻击者能够在未授权的情况下访问或破坏系统。这些缺陷、错误或不合理之处可能被有意或无意地利用&#xff0c;从而对一个组织的资产或运行造成不利影响&#xff0c;如信息系统被攻击或控制…

【高阶数据结构】红黑树

文章目录 前言什么是红黑树红黑树的性质红黑树结点的定义红黑树的插入情况一情况二情况三插入代码总结 验证是否为红黑树红黑树的删除 前言 前面我们学习了 AVL 树——高度平衡的二叉搜索树&#xff0c;AVL 树保证了结点的左右子树的高度差的绝对值不超过 1&#xff0c;也就是…

Nebula Siwi:基于图数据库的智能问答助手思路分析

本文重点分析 Nebula Siwi 智能问答思路&#xff0c;具体代码可参考[2]&#xff0c;使用的数据集为 Basketballplayer[3]。部分数据和 schema 如下所示&#xff1a; 一.智能问答可实现的功能 1.Nebula Siwi 源码整体结构 主要包括前段&#xff08;Vue&#xff09;和后端&#…

Unity3d C# 在WebGL平台加载并解析xml文件实现总结

前言 xml是可扩展标记语言&#xff0c;由一系列的元素、属性、值节点等构成的一个树形结构&#xff0c;除了可读性差一点&#xff0c;别的用于存储一些结构化的数据还是比较方便的。这个功能在Unity3d端的实现是比较方便快捷的&#xff1a; void GetXML1() {string filePath …

【力扣hot100】刷题笔记Day3

前言 以撒真是一不小心就玩太久了&#xff0c;终于解锁骨哥嘞&#xff0c;抓紧来刷题&#xff0c;今天是easy双指针&#xff01; 283. 移动零 - 力扣&#xff08;LeetCode&#xff09; 一个指针遍历&#xff0c;一个指针用于交换前面的0 class Solution(object):def moveZer…

简单说说mysql的日志

今天我们通过mysql日志了解mysqld的错误日志、慢查询日志、二进制日志&#xff0c;redolog, undolog等。揭示它们的作用和用途&#xff0c;让我们工作中更能驾驭mysql。 redo 日志 如果mysql事务提交后发生了宕机现象&#xff0c;那怎么保证数据的持久性与完整性&#xff1f;…

《计算机网络简易速速上手小册》第6章:网络性能优化(2024 最新版)

文章目录 6.1 带宽管理与 QoS - 让你的网络不再拥堵6.1.1 基础知识6.1.2 重点案例&#xff1a;提高远程办公的视频会议质量实现步骤环境准备Python 脚本示例注意事项 6.1.3 拓展案例1&#xff1a;智能家居系统的网络优化实现思路Python 脚本示例 6.1.4 拓展案例2&#xff1a;提…

挑战杯 LSTM的预测算法 - 股票预测 天气预测 房价预测

0 简介 今天学长向大家介绍LSTM基础 基于LSTM的预测算法 - 股票预测 天气预测 房价预测 这是一个较为新颖的竞赛课题方向&#xff0c;学长非常推荐&#xff01; &#x1f9ff; 更多资料, 项目分享&#xff1a; https://gitee.com/dancheng-senior/postgraduate 1 基于 Ke…

Megatron-LM源码系列(七):Distributed-Optimizer分布式优化器实现Part2

1. 使用入口 DistributedOptimizer类定义在megatron/optimizer/distrib_optimizer.py文件中。创建的入口是在megatron/optimizer/__init__.py文件中的get_megatron_optimizer函数中。根据传入的args.use_distributed_optimizer参数来判断是用DistributedOptimizer还是Float16O…

QSlider使用笔记

最近做项目使用到QSlider滑动条控件&#xff0c;在使用过的过程中&#xff0c;发现一个问题就是点滑动条上的一个位置&#xff0c;滑块并没有移动到鼠标点击的位置&#xff0c;体验感很差&#xff0c;于是研究了下&#xff0c;让鼠标点击后滑块移动到鼠标点击的位置。 1、event…

this指针详细总结 | static关键字 | 静态成员

文章目录 1.this指针引入2.this指针的特性3.静态成员3.1.C语言中static的基本用法3.2.C中的static关键字 1.this指针引入 class student { public:student(const string& name){ _name name; }void print(){// _name<>this->_name<>(*this)._name// 说一下…

【Linux】打包压缩跨系统/网络传输文件常用指令完结

Hello everybody!在今天的文章中我会把剩下的3-4个常用指令讲完&#xff0c;然后开始权限的讲解。那废话不多说&#xff0c;咱们直接进入正题&#xff01; 1.zip/unzip&tar命令 1.zip/unzip 在windows系统中&#xff0c;经常见到带有zip后缀的文件。那个东西就是压缩包。…

携程网首页案例制作

背景线性渐变 语法&#xff1a; background&#xff1a;linear-gradient&#xff08;起始方向&#xff0c;颜色1&#xff0c;颜色2&#xff0c;...&#xff09;&#xff1b; background&#xff1a;-webkit-linear-gradient&#xff08;left&#xff0c;red&#xff0c;blue&a…

使用Python的turtle模块实现简单的烟花效果

import turtle import random import math# 设置窗口大小 width, height 800, 600 screen turtle.Screen() screen.title("Fireworks Explosion") screen.bgcolor("black") screen.setup(width, height)# 定义烟花粒子类 class Particle(turtle.Turtle):…

ES6-let

一、基本语法 ES6 中的 let 关键字用于声明变量&#xff0c;并且具有块级作用域。 - 语法&#xff1a;let 标识符;let 标识符初始值; - 规则&#xff1a;1.不能重复声明let不允许在相同作用域内重复声明同一个变量2.不存在变量提升在同一作用域内&#xff0c;必须先声明才能试…

论文阅读-一种用于大规模分布式文件系统中基于深度强化学习的自适应元数据管理方案

名称&#xff1a; An Adaptive Metadata Management Scheme Based on Deep Reinforcement Learning for Large-Scale Distributed File Systems I. 引言 如今&#xff0c;大型集群文件系统的规模已达到PB甚至EB级别&#xff0c;由此产生的数据呈指数级增长。系统架构师不断设…

算法学习——华为机考题库7(HJ41 - HJ45)

算法学习——华为机考题库7&#xff08;HJ41 - HJ45&#xff09; HJ41 称砝码 描述 现有n种砝码&#xff0c;重量互不相等&#xff0c;分别为 m1,m2,m3…mn &#xff1b; 每种砝码对应的数量为 x1,x2,x3…xn 。现在要用这些砝码去称物体的重量(放在同一侧)&#xff0c;问能称…

STM32--揭秘中断(简易土货版)

抢占优先级响应优先级 视频学习--中断​​​​​​​