传统机器学习(七)支持向量机(2)sklearn中的svm

2 sklearn中的svm

2.1 LinearSVC及SVC参数详解

2.1.1 SVC参数

class sklearn.svm.SVC(*,
                      C=1.0, 
                      kernel='rbf', 
                      degree=3, 
                      gamma='scale', 
                      coef0=0.0, 
                      shrinking=True, 
                      probability=False, 
                      tol=0.001, 
                      cache_size=200, 
                      class_weight=None, 
                      verbose=False, 
                      max_iter=-1, 
                      decision_function_shape='ovr', 
                      break_ties=False, 
                      random_state=None
                     )

用于分类，用libsvm实现，参数如下：

C : 惩罚项，默认为1.0，C越大容错空间越小；C越小，容错空间越大
kernel : 核函数的类型，可选参数为：
- “linear” : 线性核函数
- “poly” : 多项式核函数
- “rbf” : 高斯核函数(默认)
- “sigmod” : sigmod核函数
- “precomputed” : 核矩阵，表示自己提前计算好核函数矩阵
degree : 多项式核函数的阶，默认为3，只对多项式核函数生效
gamma : 核函数系数，可选，float类型，默认为auto。只对’rbf’ ,’poly’ ,’sigmod’有效。如果gamma为auto，代表其值为样本特征数的倒数，即1/n_features
coef0 ：核函数中的独立项，float类型，可选，默认为0.0。只有对’poly’ 和,’sigmod’核函数有用，是指其中的参数c
probability : 是否启用概率估计，bool类型，可选参数，默认为False，这必须在调用fit()之前启用，并且会fit()方法速度变慢
tol ：svm停止训练的误差精度，float类型，可选参数，默认为1e^-3
cache_size ：内存大小，float，可选，默认200。指定训练所需要的内存，单位MB
class_weight：类别权重，dict类型或str类型，可选，默认None。给每个类别分别设置不同的惩罚参数C，如果没有，则会给所有类别都给C=1，即前面指出的C。如果给定参数’balance’，则使用y的值自动调整与输入数据中的类频率成反比的权重
max_iter：最大迭代次数，int类型，默认为-1，不限制
decision_function_shape ：决策函数类型，可选参数’ovo’和’ovr’，默认为’ovr’。’ovo’表示one vs one，’ovr’表示one vs rest。（多分类）
random_state ：数据洗牌时的种子值，int类型，可选，默认为None

2.1.2 LinearSVC参数

class sklearn.svm.LinearSVC(
    penalty='l2', 
    loss='squared_hinge', 
    *, 
    dual=True, 
    tol=0.0001, 
    C=1.0, 
    multi_class='ovr', 
    fit_intercept=True, 
    intercept_scaling=1, 
    class_weight=None, 
    verbose=0, 
    random_state=None, 
    max_iter=1000
)

LinearSVC（Linear Support Vector Classification）线性支持向量机，核函数是 linear，不是基于libsvm实现的
参数：

C：目标函数的惩罚系数C，默认C = 1.0；
loss：指定损失函数. squared_hinge(默认), squared_hinge
penalty ：惩罚方式，str类型，l1, l2
dual ：选择算法来解决对偶或原始优化问题。当nsamples>nfeatures时dual=false
tol ：svm结束标准的精度，默认是 1e - 3
multi_class：如果y输出类别包含多类，用来确定多类策略， ovr表示一对多，“crammer_singer”优化所有类别的一个共同的目标。如果选择“crammer_singer”，损失、惩罚和优化将会被被忽略。
max_iter : 要运行的最大迭代次数。int，默认1000

2.2 LinearSVC及SVC使用案例

2.2.1 鸢尾花数据集在线性SVM上的应用

线性SVM有两种方式实现：

LinearSVC(C=1)
SVC(C=1, kernel="linear")

import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')


from sklearn.svm import SVC
from sklearn.datasets import load_iris

# 选择鸢尾花数据集的两列，方便画图
iris = load_iris()
X = iris['data'][:, (2,3)]
y = iris['target']

# 只选择两个种类的鸢尾花数据
setosa_versi = (y == 0) | (y == 1)
X = X[setosa_versi]
y = y[setosa_versi]


# 训练，常数C越大，容错空间越小，上下支持面之间的距离越小；常数C越小，容错空间越大，上下支持面之间的距离越大
svm_clf = SVC(kernel='linear',C=100000000)
svm_clf.fit(X,y)

# 绘制决策边界
def plot_svc_decision_boundary(svm_clf,xmin,xmax,sv=True):
    # linearsvc.coef_是所有特征的权重
    w = svm_clf.coef_[0]
    # linearsvc.intercept_是截距
    b = svm_clf.intercept_[0]
    
    x0 = np.linspace(xmin, xmax, 200)
    # 决策边界方程为：w[0]x0 + w[1]x1 = 0
    # 这里计算x1的值
    decision_boundary = - w[0]/w[1] * x0 - b/w[1]
    # 支持面距离判别面之间的距离margin
    margin = 1 / w[1]
    # 上支持面，在这里是x1 + margin
    gutter_up = decision_boundary + margin
    # 下支持面，在这里是x1 - margin
    gutter_dw = decision_boundary - margin
    
    # 是否画出支持向量
    if sv:
        svs = svm_clf.support_vectors_  # 支持向量
        plt.scatter(svs[:,0],svs[:,1],s=180)
    plt.plot(x0,decision_boundary,'k-',linewidth=2)
    plt.plot(x0,gutter_up,'k--',linewidth=2)
    plt.plot(x0,gutter_dw,'k--',linewidth=2)

plt.figure(figsize=(12,8))
plot_svc_decision_boundary(svm_clf,0,5.5)
plt.plot(X[:,0][y==1],X[:,1][y==1],'bs')
plt.plot(X[:,0][y==0],X[:,1][y==0],'yo')
plt.axis([0,5.5,0,2])
plt.show()

在这里插入图片描述

# 在这里把C值调小一些，我们可以看到容错空间越大，上下支持面之间的距离越大
# 此时的支撑向量：指的是落在支持面上的样本，及支持面没支持住的样本
svm_clf_2 = SVC(kernel='linear',C=0.01)
svm_clf_2.fit(X,y)


plt.figure(figsize=(12,8))
plot_svc_decision_boundary(svm_clf_2,0,5.5)
plt.plot(X[:,0][y==1],X[:,1][y==1],'bs')
plt.plot(X[:,0][y==0],X[:,1][y==0],'yo')
plt.axis([0,5.5,0,2])
plt.show()

在这里插入图片描述

2.2.2 多项式核函数

这里可以手动使用 PolynomialFeature将数据升维，再用LinearSVC进行分类。也可以直接使用SVC指定多项式核函数，即：

1、用LinearSVC，需要用PolynomialFeatures升维

from sklearn.datasets import make_moons

# 使用sklearn自带的moon数据
X, y = make_moons(n_samples=100,noise=0.15,random_state=42)

def plot_dataset(X,y,axis):
    plt.plot(X[:,0][y == 0],X[:,1][y == 0],'bs')
    plt.plot(X[:,0][y == 1],X[:,1][y == 1],'go')
    plt.axis(axis)
    plt.grid(True,which='both')


plot_dataset(X,y,[-1.5,2.5,-1,1.5])
plt.show()

在这里插入图片描述

from sklearn.preprocessing import PolynomialFeatures

# 使用LinearSVC分类，使用pipeline封装一下
poly_pipeline = Pipeline(steps=[
    ("ploy_features",PolynomialFeatures(degree=3)),
    ("scaler",StandardScaler()),
    ("svm_clf",LinearSVC(C=10,loss='hinge'))
])

poly_pipeline.fit(X,y)

# 画出决策边界
def plot_pred(clf,axes):
    x0s = np.linspace(axes[0],axes[1],100)
    x1s = np.linspace(axes[2],axes[3],100)
    x0,x1 = np.meshgrid(x0s,x1s)
    # x0 和 x1 被拉成一列，然后拼接成10000行2列的矩阵，表示所有点
    X = np.c_[x0.ravel(),x1.ravel()]
    # 二维点集才可以用来预测
    y_pred = clf.predict(X).reshape(x0.shape)
    # 等高线
    plt.contourf(x0,x1,y_pred,alpha=0.2)

plot_pred(poly_pipeline,[-1.5,2.5,-1,1.5])
plot_dataset(X,y,[-1.5,2.5,-1,1.5])
plt.show()

在这里插入图片描述

2、用SVC() 指定kernel=poly

ploy_kernel_svm_clf1 = Pipeline(
    steps=[
        ("scaler",StandardScaler()),
        ("svm_clf",SVC(kernel='poly',degree=3,coef0=1,C=5))
    ]
)

ploy_kernel_svm_clf1.fit(X,y)


# 可以看到degree越大，分类效果越好，但是也容易造成模型过拟合、模型复杂度提高
ploy_kernel_svm_clf2 = Pipeline(
    steps=[
        ("scaler",StandardScaler()),
        ("svm_clf",SVC(kernel='poly',degree=10,coef0=100,C=5))
    ]
)

ploy_kernel_svm_clf2.fit(X,y)

plt.figure(figsize=(12,5))
plt.subplot(121)
plot_pred(ploy_kernel_svm_clf1,[-1.5,2.5,-1,1.5])
plot_dataset(X,y,[-1.5,2.5,-1,1.5])

plt.subplot(122)
plot_pred(ploy_kernel_svm_clf2,[-1.5,2.5,-1,1.5])
plot_dataset(X,y,[-1.5,2.5,-1,1.5])
plt.show()

在这里插入图片描述

2.2.3 高斯核函数

在这里插入图片描述

gamma1,gamma2 = 0.1,5
C1,C2 = 0.001,1000

params = (gamma1,C1),(gamma1,C2),(gamma2,C1),(gamma2,C2)

svm_clfs = []

for gamma,C in params:
    rbf_pipeline = Pipeline(
        steps=[
             ("scaler",StandardScaler()),
             ("svm_clf",SVC(kernel='rbf',gamma=gamma,C=C))
        ]
    )
    rbf_pipeline.fit(X,y)
    svm_clfs.append(rbf_pipeline)

plt.figure(figsize=(11,7))
for  i,svm_clf in enumerate(svm_clfs):
    plt.subplot(221+i)
    plot_pred(svm_clf,[-1.5,2.5,-1,1.5])
    plot_dataset(X,y,[-1.5,2.5,-1,1.5])
    gamma,C = params[i]
    plt.title("gamma = {},C = {}".format(gamma,C))

plt.show()