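I. The concept of machine learning
The key word in machine learning is "learning": the machine does not simply execute rules that we defined in advance
We let the machine learn, that is, acquire some predictive ability. To do this we give it a large amount of data, together with rules (an algorithm) for how it should treat that data
The end result is a model, and this model has some predictive ability
Machine learning automatically analyzes data to obtain a model, then uses the model to make predictions on unseen data
An early machine learning application: spam detection
Traditional approach: write rules that define spam and have the computer execute them; the input is an email, the output is whether it is spam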
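The iris dataset
Math prerequisites for the course: high-school mathematics, plus at least a passing level in undergraduate calculus, linear algebra, and probability theory
Typical applications of machine learning: image recognition, from simple binary classification problems to face recognition
The MNIST dataset
AlphaGo Zero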
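What the course covers: the underlying principles of the learning algorithms; we will also implement some of the algorithms in code
We will also use some real datasets to simulate solving practical problems
We run comparison experiments across different algorithms to see which performs better
We also run comparison experiments across different parameters of the same algorithm, tuning them repeatedly
You need to grasp the ideas behind the algorithms, rather than simply being told that, for example, logistic regression can solve classification problems
Overfitting and underfitting:
Calling libraries: there is no objection to using library implementations
The course includes: algorithm principles, from-scratch implementations of some algorithms, and use of the scikit-learn machine learning library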
Course environment setup:
Language: Python 3
Library: scikit-learn
Tools: NumPy, Matplotlib
IDE: Jupyter
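Some concepts: in a dataset, each column represents one feature of the samples and each row represents one sample
The sample features are usually denoted by X, which is a matrix
The target values of the samples are denoted by y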
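As a concrete illustration of the X / y convention, here is a minimal sketch using the iris data that these notes load later with `load_iris`. The variable names `n_samples` and `n_features` are only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # feature matrix: one row per sample, one column per feature
y = iris.target  # target vector: one label per sample

n_samples, n_features = X.shape
print(X.shape)             # (150, 4)
print(y.shape)             # (150,)
print(iris.feature_names)  # names of the four feature columns
print(X[0], y[0])          # features and label of the first sample
```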
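Classification tasks and regression tasks
Binary classification: cat or dog, spam or not spam, whether a credit loan is risky, whether a stock will rise or fall
Multiclass classification: digit recognition, image recognition, credit card risk levels
Many complex tasks can also be turned into multiclass classification tasks, for example playing Go
Regression tasks: predicting the price of a house from its features
Supervised learning and unsupervised learning
Supervised learning: the training data given to the machine comes with labels, in other words with answers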
II. The kNN algorithm
# In sklearn, fitting the data (building the model) is done in the fit method
import numpy as np
from math import sqrt
from collections import Counter


class Knn:
    def __init__(self, n_neighbor=3):  # n_neighbor is a hyperparameter
        self.X_train = None
        self.y_train = None
        self.n_neighbor = n_neighbor

    def fit(self, X_train, y_train):
        # Given X_train and y_train, "train" the model (kNN just stores the data)
        assert X_train.shape[0] == len(y_train)
        self.X_train = X_train
        self.y_train = y_train
        return self

    def predict(self, X):
        # Return the predicted label for every sample in X
        assert self.X_train is not None
        assert self.y_train is not None
        assert self.X_train.shape[1] == X.shape[1]
        return np.array([self._predict(x) for x in X])

    def _predict(self, x):
        # Predict the label of a single sample
        # Euclidean distance from x to every training sample
        distance = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self.X_train]
        # Indices of the n_neighbor closest training samples
        nearest = np.argsort(distance)[:self.n_neighbor]
        top_K = self.y_train[nearest]
        # Majority vote among the labels of those neighbours
        votes = Counter(top_K)
        y_predict = votes.most_common(1)[0][0]
        return y_predict

    def __repr__(self):
        # Report the actual hyperparameter value instead of a hard-coded 3
        return f"KnnClassifier(n_neighbor={self.n_neighbor})"


if __name__ == '__main__':
    knn = Knn()
    print(knn)
    raw_data_X = [[3.3935, 2.3312],
                  [3.1101, 1.7815]]
    raw_data_Y = [0, 1]  # 0 = benign, 1 = malignant
    X_train = np.array(raw_data_X)
    Y_train = np.array(raw_data_Y)
    knn.fit(X_train, Y_train)
    print(knn.predict(np.array([[2, 4], [1, 3], [3, 5]])))
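The per-sample distance loop in `_predict` can also be written with NumPy array operations. The sketch below is one possible variant, not part of the class above; the function name `knn_predict_one` and the toy data are made up for illustration.

```python
import numpy as np
from collections import Counter


def knn_predict_one(X_train, y_train, x, k=3):
    """Predict the label of one sample x by majority vote among its k nearest neighbours."""
    # Euclidean distance from x to every training row, computed in one vectorized step
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (illustrative only): two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(X_toy, y_toy, np.array([1.1, 1.0])))  # 0
print(knn_predict_one(X_toy, y_toy, np.array([5.1, 5.0])))  # 1
```

Subtracting `x` from the whole matrix relies on NumPy broadcasting, which avoids the Python-level loop over training samples.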
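III. Training on datasets (Jupyter style)
sklearn provides ready-made utilities for splitting and testing datasets (the built-in score method computes the accuracy directly)
1. A simple kNN example in Jupyter format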
#!/usr/bin/env python
# coding: utf-8
# In[1]:
# kNN: for a sample to be predicted, take the k points closest to it and let them vote by class; the class with the most votes is the prediction
# In[2]:
from sklearn.neighbors import KNeighborsClassifier
# In[3]:
knn_clf = KNeighborsClassifier()
# In[4]:
from sklearn.datasets import load_iris
# In[5]:
iris = load_iris()
# In[7]:
iris.data.shape
# In[8]:
iris.data
# In[9]:
X = iris.data
y = iris.target
knn_clf.fit(X,y)
# In[ ]:
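knn_clf.predict(X[:5])  # predict needs a 2-D array of samples; the first five rows of X serve here only as an example input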
2. A simple train/test split in Jupyter
# ### Implementing train_test_split by hand
# In[1]:
import numpy as np
from sklearn.datasets import load_iris
# In[2]:
iris = load_iris()
# In[3]:
X = iris.data
y = iris.target
# In[4]:
y
# In[5]:
shuffle_indexs = np.random.permutation(len(X))
# In[6]:
test_ratio = 0.2
test_size = int(len(X) * test_ratio)
test_indexs = shuffle_indexs[:test_size]
train_indexs = shuffle_indexs[test_size:]
# In[7]:
test_indexs.shape
# In[8]:
train_indexs.shape
# In[9]:
X_train = X[train_indexs]
y_train = y[train_indexs]
X_test = X[test_indexs]
y_test = y[test_indexs]
# In[10]:
len(X_train)
# In[11]:
len(y_train)
# In[12]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train,y_train)
# In[13]:
y_predict = knn_clf.predict(X_test)
# In[14]:
y_predict
# In[15]:
y_test
# In[16]:
np.sum(y_predict == y_test) / len(y_test)  # accuracy: fraction of test samples predicted correctly
# ### Using the train_test_split that ships with sklearn
# In[17]:
from sklearn.model_selection import train_test_split
# In[18]:
X_train, X_test, y_train, y_test = train_test_split(X, y)  # sklearn's train_test_split returns four arrays
# In[19]:
X_train.shape
# In[20]:
X_test.shape
# In[21]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train,y_train)
# In[22]:
knn_clf.predict(X_test)  # pass in the features of the samples to predict; returns the predicted labels
# In[23]:
knn_clf.score(X_test, y_test)  # pass in X_test and y_test; returns the prediction accuracy
# In[ ]:
KNeighborsClassifier(n_neighbors=6)
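The manual steps in the first half of this notebook can be gathered into one reusable function. The sketch below is an assumption about how that might look, not sklearn's implementation; the names `my_train_test_split` and `seed` are hypothetical.

```python
import numpy as np


def my_train_test_split(X, y, test_ratio=0.2, seed=None):
    """Shuffle the sample indices and split X, y into train and test parts."""
    assert X.shape[0] == y.shape[0], "X and y must have the same number of samples"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be between 0 and 1"

    rng = np.random.default_rng(seed)   # seeded generator for reproducible splits
    shuffled = rng.permutation(len(X))
    test_size = int(len(X) * test_ratio)

    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Demo on synthetic data with the same shape as iris (150 samples, 4 features)
X_demo = np.arange(600, dtype=float).reshape(150, 4)
y_demo = np.arange(150) % 3
X_tr, X_te, y_tr, y_te = my_train_test_split(X_demo, y_demo, test_ratio=0.2, seed=42)
print(X_tr.shape, X_te.shape)  # (120, 4) (30, 4)
```

Passing a fixed `seed` makes the split repeatable, which helps when comparing models across runs.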
3. Hyperparameters
# In[1]:
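# Hyperparameters: parameters that must be fixed before the algorithm runs
# Does kNN have other hyperparameters?
# weights: 'uniform' ignores distance when counting votes; 'distance' uses the distance as the voting weight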
# In[2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
# In[3]:
knn_clf = KNeighborsClassifier(weights='distance')  # each vote is weighted by the reciprocal of the distance
# In[4]:
from sklearn.model_selection import train_test_split
iris = load_iris()
X = iris.data
y = iris.target
X_train,X_test,y_train,y_test = train_test_split(X,y)
# In[5]:
# %%time  (Jupyter cell magic: reports how long the cell takes to run)
best_k = 0
best_score = 0.0
best_clf = None
for k in range(1, 21):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_k = k
        best_clf = knn_clf
print(best_k)
print(best_score)
print(best_clf)
# In[6]:
# %%time  (Jupyter cell magic: reports how long the cell takes to run)
best_k = 0
best_score = 0.0
best_clf = None
best_method = None
for weight in ['uniform', 'distance']:
    for k in range(1, 21):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=weight)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_clf = knn_clf
            best_method = weight
print(best_k)
print(best_score)
print(best_clf)
print(best_method)
# In[8]:
# %%time  (Jupyter cell magic: reports how long the cell takes to run)
best_k = 0
best_score = 0.0
best_clf = None
best_p = None
for p in range(1, 6):
    for k in range(1, 21):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_clf = knn_clf
            best_p = p

print(best_k)
print(best_score)
print(best_clf)
print(best_p)
# In[ ]:
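To see why `weights='distance'` can change a prediction, here is a small worked sketch of reciprocal-distance voting. The helper name `weighted_vote` and the three neighbours are invented for illustration, and sklearn's internal handling (for example of zero distances) is not reproduced here.

```python
from collections import defaultdict


def weighted_vote(distances, labels, weights="uniform"):
    """Return the winning label among neighbours under uniform or 1/distance voting."""
    scores = defaultdict(float)
    for d, label in zip(distances, labels):
        # uniform: every neighbour counts as 1; distance: closer neighbours count more
        scores[label] += 1.0 if weights == "uniform" else 1.0 / d
    return max(scores, key=scores.get)


# Three nearest neighbours: one very close point of class 0, two farther points of class 1
distances = [1.0, 3.0, 4.0]
labels = [0, 1, 1]

print(weighted_vote(distances, labels, "uniform"))   # 1  (two votes against one)
print(weighted_vote(distances, labels, "distance"))  # 0  (1/1 = 1.0 beats 1/3 + 1/4, about 0.583)
```

With uniform voting the two farther points outvote the close one; with reciprocal weights the single close neighbour wins.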
4.grid search
# In[1]:
from sklearn.datasets import load_digits
import numpy as np
from matplotlib import pyplot as plt
# In[2]:
digits = load_digits()
# In[4]:
print(digits.DESCR)
# In[5]:
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y)
# In[7]:
X_train[1000]
# In[8]:
y_train[1000]
# In[10]:
x = X_train[1000].reshape(8,-1)
plt.imshow(x,cmap=plt.cm.binary)
plt.show()
# In[18]:
x1 = np.arange(1,17).reshape(-1,4)
# In[19]:
x1
# ### Using grid search from sklearn
# In[20]:
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': list(range(1, 21))
    },
    {
        'weights': ['distance'],
        'n_neighbors': list(range(1, 21)),
        'p': list(range(1, 6))
    }
]  # build the parameter grid; each group of parameters goes in its own dictionary
# In[21]:
from sklearn.model_selection import GridSearchCV
# In[22]:
from sklearn.neighbors import KNeighborsClassifier
# In[23]:
knn_clf = KNeighborsClassifier()
# In[25]:
# %%time  (Jupyter cell magic: reports how long the cell takes to run)
# Search for the best parameters
grid_search = GridSearchCV(knn_clf, param_grid)
grid_search.fit(X_train, y_train)
# In[30]:
knn_clf = grid_search.best_estimator_
# In[27]:
grid_search.best_score_
# In[28]:
grid_search.best_params_
# In[32]:
knn_clf.score(X_test,y_test)
# In[34]:
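# %%time  (Jupyter cell magic: reports how long the cell takes to run)
# Search for the best parameters
# A larger verbose prints more detail; n_jobs sets how many CPU cores to use (-1 means all of them)
grid_search = GridSearchCV(knn_clf, param_grid, verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train)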
# In[35]:
grid_search.best_estimator_
# In[36]:
grid_search.best_params_
# In[ ]:
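Note that `best_score_` is the mean cross-validated score of the best parameter combination, computed on folds of the training data, whereas `score(X_test, y_test)` is measured on the held-out test set, so the two numbers generally differ. A reduced sketch on the iris data with a deliberately small grid follows; the grid values, `cv=5`, and `random_state=0` are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small illustrative grid: 3 values of k times 2 weighting schemes = 6 candidates
small_grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}

grid_search = GridSearchCV(KNeighborsClassifier(), small_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)           # best combination found by cross-validation
print(grid_search.best_score_)            # mean cross-validated accuracy of that combination
print(grid_search.score(X_test, y_test))  # accuracy of the refit best model on the test set
```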
5. Data normalization
import numpy as np
# In[3]:
X =np.random.randint(0,100,size=100)
# In[4]:
X = (X-np.min(X))/(np.max(X)-np.min(X))
# In[5]:
X =np.random.randint(0,100,size=100).reshape(-1,2)
# In[6]:
X = np.array(X, dtype=float)  # convert to float so the scaled values in [0, 1] are not truncated to integers
# In[11]:
X
# In[7]:
X[:,0] = (X[:,0]-np.min(X[:,0]))/(np.max(X[:,0])-np.min(X[:,0]))
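The cells above scale by the minimum and maximum. The notes at the end also mention mean-variance normalization and sklearn's scalers; a hedged sketch of both follows, using random data like the cells above. `StandardScaler` is sklearn's class, while the variable names and the seed are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(50, 2)).astype(float)

# Manual mean-variance normalization: subtract each column's mean, divide by its standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation with sklearn; fit learns the per-column mean and scale
scaler = StandardScaler()
scaler.fit(X)
X_sklearn = scaler.transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
print(X_sklearn.mean(axis=0))            # close to 0 for each column
print(X_sklearn.std(axis=0))             # close to 1 for each column
```

When a train/test split is involved, a common practice is to call `fit` on the training data only and reuse the same scaler to `transform` the test data.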
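Notes
① The k-nearest neighbors algorithm
② Scalers in sklearn
③ Mean-variance normalization (standardization)
④ Manhattan distance
⑤ Minkowski distance
⑥ Euclidean distance
⑦ Distance dominated by the time feature
⑧ Min-max normalization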
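The distances listed in these notes are related: the Minkowski distance with p = 1 is the Manhattan distance, and with p = 2 it is the Euclidean distance, which is what the `p` hyperparameter searched earlier controls. A small sketch (the function name `minkowski` is illustrative):

```python
import numpy as np


def minkowski(a, b, p=2):
    """Minkowski distance between two points: (sum |a_i - b_i|^p)^(1/p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


a, b = [0, 0], [3, 4]
print(minkowski(a, b, p=1))  # 7.0  Manhattan distance: |3| + |4|
print(minkowski(a, b, p=2))  # 5.0  Euclidean distance: sqrt(9 + 16)
```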