Kaggle House Price Prediction: Feature Engineering and Model Tuning


Contents

Part 1: Preparing the Kaggle dataset

Part 2: Dataset analysis

Part 3: Handling null values

Part 4: Filling null values

Part 5: Finding all string columns

Part 6: Instantiating the one-hot encoder

Part 7: Variance filtering

Part 8: Feature extraction

Part 9: Correlations between features

Part 10: Re-sorting by Pearson coefficient

Part 11: Drawing the heat map

Part 12: Splitting the dataset

Part 13: Grid search for hyperparameter tuning

Part 14: Training the model and comparing predictions with actuals

Part 15: Full source code


Part 1: Preparing the Kaggle dataset

Download the dataset (train.csv) from the Kaggle website.

Part 2: Dataset analysis

2-1 Analyzing the header row

table_head = "Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice"
head_list = table_head.split(',')
print(head_list, len(head_list))
['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'] 81

This shows there are 80 feature columns plus 1 label column (SalePrice), 81 in total.

2-2 Loading the dataset (reading the CSV)

import pandas as pd
import numpy as np

data_df = pd.read_csv("train.csv", sep=',')
print(data_df.head(), data_df.shape)
   Id  MSSubClass MSZoning  ...  SaleType  SaleCondition SalePrice
0   1          60       RL  ...        WD         Normal    208500
1   2          20       RL  ...        WD         Normal    181500
2   3          60       RL  ...        WD         Normal    223500
3   4          70       RL  ...        WD        Abnorml    140000
4   5          60       RL  ...        WD         Normal    250000

[5 rows x 81 columns] (1460, 81)

2-2-1 Showing all columns

# Show all columns
pd.set_option("display.max_columns", None)

2-2-2 Showing all rows

# Show all rows
pd.set_option("display.max_rows", None)

1460 × 80, i.e. 1460 rows and 80 columns (feature columns only, not counting the label).

Part 3: Handling null values

Columns in which more than one third of the values are null are dropped outright.

# Null handling
# print(data_df.isnull().sum())
# Drop columns with more than 1/3 null values
data_df.drop(columns=['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'],
             inplace=True)
print(data_df.shape)
LotFrontage       259
Alley            1369  -- drop
MasVnrType          8
BsmtQual           37
BsmtCond           37
BsmtExposure       38
BsmtFinType1       37
BsmtFinType2       38
Electrical          1
FireplaceQu       690  -- drop
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageQual         81
GarageCond         81
PoolQC           1453  -- drop
Fence            1179  -- drop
MiscFeature      1406  -- drop

The result:

(1460, 76)

That shape counts the feature columns plus the label column; the feature columns alone are:

(1460, 75)
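The five dropped columns above were read off the null counts by hand, but the one-third rule can also be applied programmatically. A minimal sketch on a toy DataFrame (train.csv itself is not bundled here, and the column names are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: column 'b' is more than 1/3 null, column 'a' is not
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0],
})

# Columns whose null count exceeds one third of the rows
threshold = len(df) / 3
to_drop = df.columns[df.isnull().sum() > threshold].tolist()
df = df.drop(columns=to_drop)

print(to_drop)   # ['b']
print(df.shape)  # (6, 1)
```

Applied to the training frame, the same expression should surface Alley, FireplaceQu, PoolQC, Fence and MiscFeature.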

Part 4: Filling null values

Null filling is handled separately for numeric and string columns:

Numeric columns: mean, mode, or median

String columns: mode

First, print the null counts per column and classify each missing column as numeric or string. The result:

LotFrontage       259  -- numeric
Alley            1369  -- dropped
MasVnrType          8  -- string
MasVnrArea          8  -- numeric
BsmtQual           37  -- string
BsmtCond           37  -- string
BsmtExposure       38  -- string
BsmtFinType1       37  -- string
BsmtFinType2       38  -- string
Electrical          1  -- string
FireplaceQu       690  -- dropped
GarageType         81  -- string
GarageYrBlt        81  -- numeric
GarageFinish       81  -- string
GarageQual         81  -- string
GarageCond         81  -- string
PoolQC           1453  -- dropped
Fence            1179  -- dropped
MiscFeature      1406  -- dropped
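Classifying the columns by eye works, but the numeric/string split can also be derived from the dtypes. A sketch on a toy frame (the column names are borrowed from the dataset purely for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: one numeric and one string column contain nulls
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 70.0],
    "BsmtQual": ["Gd", None, "TA"],
    "LotArea": [8450, 9600, 11250],  # complete numeric column
})

missing = df.columns[df.isnull().any()]
num_missing = [c for c in missing if pd.api.types.is_numeric_dtype(df[c])]
cat_missing = [c for c in missing if df[c].dtype == object]

print(num_missing)  # ['LotFrontage']
print(cat_missing)  # ['BsmtQual']
```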

4-1 Filling numeric columns

# 2 Null filling
# 2-1 Fill [numeric columns]
# (assignment is used instead of fillna(..., inplace=True), which is deprecated on column slices)
data_df['LotFrontage'] = data_df['LotFrontage'].fillna(data_df['LotFrontage'].mean())
data_df['GarageYrBlt'] = data_df['GarageYrBlt'].fillna(data_df['GarageYrBlt'].median())
data_df['MasVnrArea'] = data_df['MasVnrArea'].fillna(data_df['MasVnrArea'].median())
print(data_df.isnull().sum())

The result: the three numeric columns LotFrontage, GarageYrBlt and MasVnrArea are now fully filled.

Id                0
MSSubClass        0
MSZoning          0
LotFrontage       0
LotArea           0
Street            0
LotShape          0
LandContour       0
Utilities         0
LotConfig         0
LandSlope         0
Neighborhood      0
Condition1        0
Condition2        0
BldgType          0
HouseStyle        0
OverallQual       0
OverallCond       0
YearBuilt         0
YearRemodAdd      0
RoofStyle         0
RoofMatl          0
Exterior1st       0
Exterior2nd       0
MasVnrType        8
MasVnrArea        0
ExterQual         0
ExterCond         0
Foundation        0
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
BsmtFinSF1        0
BsmtFinType2     38
BsmtFinSF2        0
BsmtUnfSF         0
TotalBsmtSF       0
Heating           0
HeatingQC         0
CentralAir        0
Electrical        1
1stFlrSF          0
2ndFlrSF          0
LowQualFinSF      0
GrLivArea         0
BsmtFullBath      0
BsmtHalfBath      0
FullBath          0
HalfBath          0
BedroomAbvGr      0
KitchenAbvGr      0
KitchenQual       0
TotRmsAbvGrd      0
Functional        0
Fireplaces        0
GarageType       81
GarageYrBlt       0
GarageFinish     81
GarageCars        0
GarageArea        0
GarageQual       81
GarageCond       81
PavedDrive        0
WoodDeckSF        0
OpenPorchSF       0
EnclosedPorch     0
3SsnPorch         0
ScreenPorch       0
PoolArea          0
MiscVal           0
MoSold            0
YrSold            0
SaleType          0
SaleCondition     0
SalePrice         0
dtype: int64

4-2 Handling string columns

4-2-1 Getting the names of the columns with missing values

# 2-2 The remaining columns with nulls are all [string columns]; collect them in miss_col_list
# Get the names of the columns that still contain nulls
miss_col_list = data_df.columns[data_df.isnull().any()].tolist()
# print(data_df.isnull().any())
print(miss_col_list)
['MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']

4-2-2 Getting the values of the missing columns

Inspecting a single missing column's values:

One column on its own tells us little; viewing all of them makes the data easier to analyze.

# Get each missing column's values as an (n, 1) array
for i in miss_col_list:
    print(data_df[i].values.reshape(-1, 1))
    break
[['BrkFace']
 ['None']
 ['BrkFace']
 ...
 ['None']
 ['None']
 ['None']]

Inspecting all the missing columns' values, ahead of the mode fill:

# Collect each missing column's values as an (n, 1) array
miss_list = []

for i in miss_col_list:
    miss_list.append(data_df[i].values.reshape(-1, 1))  # any number of rows, one column
print(miss_list)

Viewed this way, the missing columns' values are much easier to reason about.

[array([['BrkFace'],
       ['None'],
       ['BrkFace'],
       ...,
       ['None'],
       ['None'],
       ['None']], dtype=object), array([['Gd'],
       ['Gd'],
       ['Gd'],
       ...,
       ['TA'],
       ['TA'],
       ['TA']], dtype=object), array([['TA'],
       ['TA'],
       ['TA'],
       ...,
       ['Gd'],
       ['TA'],
       ['TA']], dtype=object), array([['No'],
       ['Gd'],
       ['Mn'],
       ...,
       ['No'],
       ['Mn'],
       ['No']], dtype=object), array([['GLQ'],
       ['ALQ'],
       ['GLQ'],
       ...,
       ['GLQ'],
       ['GLQ'],
       ['BLQ']], dtype=object), array([['Unf'],
       ['Unf'],
       ['Unf'],
       ...,
       ['Unf'],
       ['Rec'],
       ['LwQ']], dtype=object), array([['SBrkr'],
       ['SBrkr'],
       ['SBrkr'],
       ...,
       ['SBrkr'],
       ['FuseA'],
       ['SBrkr']], dtype=object), array([['Attchd'],
       ['Attchd'],
       ['Attchd'],
       ...,
       ['Attchd'],
       ['Attchd'],
       ['Attchd']], dtype=object), array([['RFn'],
       ['RFn'],
       ['RFn'],
       ...,
       ['RFn'],
       ['Unf'],
       ['Fin']], dtype=object), array([['TA'],
       ['TA'],
       ['TA'],
       ...,
       ['TA'],
       ['TA'],
       ['TA']], dtype=object), array([['TA'],
       ['TA'],
       ['TA'],
       ...,
       ['TA'],
       ['TA'],
       ['TA']], dtype=object)]

4-2-3 Filling each column with its mode

# Fill each string column with its most frequent value
from sklearn.impute import SimpleImputer

for i in range(0, len(miss_list)):
    im_most = SimpleImputer(strategy='most_frequent')
    most = im_most.fit_transform(miss_list[i])
    data_df.loc[:, miss_col_list[i]] = most

print(data_df.isnull().sum())

Filling complete; checking the output confirms that no nulls remain.

Id               0
MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
LotShape         0
LandContour      0
Utilities        0
LotConfig        0
LandSlope        0
Neighborhood     0
Condition1       0
Condition2       0
BldgType         0
HouseStyle       0
OverallQual      0
OverallCond      0
YearBuilt        0
YearRemodAdd     0
RoofStyle        0
RoofMatl         0
Exterior1st      0
Exterior2nd      0
MasVnrType       0
MasVnrArea       0
ExterQual        0
ExterCond        0
Foundation       0
BsmtQual         0
BsmtCond         0
BsmtExposure     0
BsmtFinType1     0
BsmtFinSF1       0
BsmtFinType2     0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
Heating          0
HeatingQC        0
CentralAir       0
Electrical       0
1stFlrSF         0
2ndFlrSF         0
LowQualFinSF     0
GrLivArea        0
BsmtFullBath     0
BsmtHalfBath     0
FullBath         0
HalfBath         0
BedroomAbvGr     0
KitchenAbvGr     0
KitchenQual      0
TotRmsAbvGrd     0
Functional       0
Fireplaces       0
GarageType       0
GarageYrBlt      0
GarageFinish     0
GarageCars       0
GarageArea       0
GarageQual       0
GarageCond       0
PavedDrive       0
WoodDeckSF       0
OpenPorchSF      0
EnclosedPorch    0
3SsnPorch        0
ScreenPorch      0
PoolArea         0
MiscVal          0
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
dtype: int64

This completes the handling of all missing values.
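For reference, the reshape-plus-SimpleImputer loop above can be collapsed into plain pandas, filling each remaining string column with its mode in one pass. A sketch on a toy frame (column names borrowed from the dataset for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Electrical": ["SBrkr", None, "SBrkr", "FuseA"],
    "GarageType": ["Attchd", "Attchd", None, "Detchd"],
})

# Fill every column that still has nulls with its most frequent value
for col in df.columns[df.isnull().any()]:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df["Electrical"].tolist())  # ['SBrkr', 'SBrkr', 'SBrkr', 'FuseA']
print(df["GarageType"].tolist())  # ['Attchd', 'Attchd', 'Attchd', 'Detchd']
```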

Part 5: Finding all string columns

The goal is to find every string (object-dtype) column.

# Find all string columns
ob_feature = data_df.select_dtypes(include=['object']).columns.tolist()
print(ob_feature, len(ob_feature))
['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition'] 38

The result: there are 38 string columns.

They can be inspected in more detail as follows:

ob_df_data = data_df.loc[:, ob_feature]
print(ob_df_data.head())

Part 6: Instantiating the one-hot encoder

6-1 Instantiating the encoder and getting the generated column names

from sklearn.preprocessing import OneHotEncoder

# Instantiate the one-hot encoder
OneHot = OneHotEncoder()
# numpy ndarray
result = OneHot.fit_transform(ob_df_data).toarray()
# Get the generated column names
# (on scikit-learn >= 1.0 this method is renamed get_feature_names_out())
OneHotNames = OneHot.get_feature_names().tolist()
print(OneHotNames, len(OneHotNames))

The output lists the generated column names, 234 in total.

6-2 The one-hot-encoded DataFrame (string columns converted to numeric)

# The one-hot-encoded DataFrame
OneHot_df = pd.DataFrame(result, columns=OneHotNames)
print(OneHot_df.head())

The print shows the full one-hot-encoded DataFrame.

6-3 Dropping the original 38 string columns and concatenating

# Drop the original 38 string columns: 75 - 38 = 37 remain
data_df.drop(columns=ob_feature, inplace=True)
# Column-wise concat: 37 + 234 new one-hot columns = 271 feature columns + 1 label = 272
data_df = pd.concat([OneHot_df, data_df], axis=1)
print(data_df.head(), data_df.shape)  # (1460, 272)

The result is (1460, 272): 1460 rows and 272 columns.

That is far too many feature columns; the computation would be too expensive, so the dimensionality needs to be reduced.
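As an aside, pandas can produce the same expansion directly with get_dummies, which keeps human-readable column names (e.g. Street_Pave) instead of the positional x17_TA style; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# One-hot encode only the listed column; other columns pass through unchanged
encoded = pd.get_dummies(df, columns=["Street"])

print(encoded.columns.tolist())  # ['LotArea', 'Street_Grvl', 'Street_Pave']
print(encoded.shape)             # (3, 3)
```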

Part 7: Variance filtering

Variance filtering (drop features with variance below 0.1) to distill the data.

# Too many features -- computation too expensive
# Variance filtering
var_index = VarianceThreshold(threshold=0.1)
data = var_index.fit_transform(data_df)
# Get the indices of the retained columns
index = var_index.get_support(True).tolist()
# print(index)
data_df = data_df.iloc[:, index]
print(data_df.head(), data_df.shape)  # (1460, 84)

Dimensionality reduced from 272 to 84 columns:

(1460, 84)
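To make the 0.1 threshold concrete, here is VarianceThreshold on a tiny hand-made matrix: the constant column and the nearly constant column fall below the threshold and are removed, while the spread-out column survives.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column variances: 0.0, 0.0075, 125.0
X = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 1.0, 20.0],
    [0.0, 1.1, 30.0],
    [0.0, 1.0, 40.0],
])

selector = VarianceThreshold(threshold=0.1)
X_kept = selector.fit_transform(X)
kept_idx = selector.get_support(True).tolist()

print(kept_idx)      # [2] -- only the last column's variance exceeds 0.1
print(X_kept.shape)  # (4, 1)
```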

Part 8: Feature extraction

# Correlation analysis -- Pearson correlation coefficient
features = data_df.columns.tolist()
# print(features)

f_names = []  # columns with |pearsonr| > 0.5
# Compute each column's Pearson correlation against the last column (the label)
for i in range(0, len(features)):
    if abs(pearsonr(data_df[features[i]], data_df[features[-1]])[0]) > 0.5:
        f_names.append(features[i])
print(f_names, len(f_names))  # 14

Dimensionality reduced from 84 to 14 columns (SalePrice correlates perfectly with itself, so the label is included in the 14):

['x17_TA', 'x29_TA', 'x32_Unf', 'OverallQual', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'SalePrice'] 14
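The per-column pearsonr loop can also be written as a single vectorized pandas call with corrwith; a sketch on synthetic data (the column names here are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
price = rng.normal(200000, 50000, n)
df = pd.DataFrame({
    "quality": price / 50000 + rng.normal(0, 0.3, n),  # strongly correlated with price
    "noise": rng.normal(0, 1, n),                      # unrelated to price
    "SalePrice": price,
})

# Pearson correlation of every feature against the label, in one call
corr = df.drop(columns="SalePrice").corrwith(df["SalePrice"])
selected = corr[corr.abs() > 0.5].index.tolist()

print(selected)  # ['quality']
```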

Part 9: Correlations between features

f_names = []  # columns with |pearsonr| > 0.5
# The Pearson coefficients themselves
pear_num = []
# Compute each feature's Pearson correlation against the label column
for i in range(0, len(features) - 1):
    if abs(pearsonr(data_df[features[i]], data_df[features[-1]])[0]) > 0.5:
        f_names.append(features[i])
        pear_num.append(pearsonr(data_df[features[i]], data_df[features[-1]])[0])
# print(f_names, len(f_names))  # 13

# Inspect correlations between features
import matplotlib.pyplot as plt
import seaborn as sns

# Wrap the coefficients in a DataFrame
pear_dict = {
    'features': f_names,
    'pearData': pear_num
}
hotPear = pd.DataFrame(pear_dict)
print(hotPear)
        features  pearData
0         x17_TA -0.589044
1         x29_TA -0.519298
2        x32_Unf -0.513906
3    OverallQual  0.790982
4      YearBuilt  0.522897
5   YearRemodAdd  0.507101
6    TotalBsmtSF  0.613581
7       1stFlrSF  0.605852
8      GrLivArea  0.708624
9       FullBath  0.560664
10  TotRmsAbvGrd  0.533723
11    GarageCars  0.640409
12    GarageArea  0.623431

Part 10: Re-sorting by Pearson coefficient

# Sort the Pearson coefficients in descending order
hotPear.sort_values(by=['pearData'], ascending=False, inplace=True)
# Reset the index
hotPear.reset_index(drop=True, inplace=True)
print(hotPear)
        features  pearData
0    OverallQual  0.790982
1      GrLivArea  0.708624
2     GarageCars  0.640409
3     GarageArea  0.623431
4    TotalBsmtSF  0.613581
5       1stFlrSF  0.605852
6       FullBath  0.560664
7   TotRmsAbvGrd  0.533723
8      YearBuilt  0.522897
9   YearRemodAdd  0.507101
10       x32_Unf -0.513906
11        x29_TA -0.519298
12        x17_TA -0.589044

Part 11: Drawing the heat map

The heat map needs seaborn; if it is not installed yet, run pip install seaborn first.

# Draw the heat map
plt.figure(figsize=(12, 12))
sns.set(font_scale=0.8)
cor = np.corrcoef(data_df[f_names].values.T)
# Annotated correlation matrix
sns.heatmap(cor, cbar=False, annot=True, square=True, fmt='0.2f', yticklabels=f_names,
            xticklabels=f_names)
plt.show()

The heat map reveals pairs of features that are highly correlated with each other; one feature from each such pair should be discarded.

From the figure, three pairs have a correlation above 0.80:

GarageArea / GarageCars    0.88
TotRmsAbvGrd / GrLivArea   0.83
1stFlrSF / TotalBsmtSF     0.82

# Drop one feature from each highly correlated pair
f_names.append("SalePrice")
data_df = data_df[f_names]
data_df.drop(['GarageArea', 'TotRmsAbvGrd', '1stFlrSF'], inplace=True, axis=1)
print(data_df.head(), data_df.shape)

Dimensionality reduced from 14 to 11 columns; excluding the label, 10 feature columns remain.

(1460, 11)
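Reading the highly correlated pairs off the heat map by eye works, but they can also be listed programmatically from the correlation matrix; a sketch on synthetic data (column names invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
base = rng.normal(size=n)
df = pd.DataFrame({
    "GarageCars": base,
    "GarageArea": base + rng.normal(0, 0.3, n),  # near-duplicate of GarageCars
    "YearBuilt": rng.normal(size=n),             # independent feature
})

# Every unordered pair of columns with |correlation| above 0.8
corr = df.corr().abs()
cols = corr.columns
pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > 0.8
]

print(pairs)  # [('GarageCars', 'GarageArea')]
```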

Part 12: Splitting the dataset

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split into a feature array and a label array
feature_data = data_df.iloc[:, :-1].values
# print(type(feature_data))
label_data = np.ravel(data_df.iloc[:, -1].values)
# print(label_data,type(label_data))
X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.3, random_state=666)
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
# Reuse the training statistics on the test set; refitting on the test set would leak information
X_test_std = std.transform(X_test)
print(X_train_std)
[[ 0.78554988  1.          1.05431032 ... -0.97998523 -1.0304426
  -0.98880127]
 [ 0.78554988  1.          1.05431032 ... -1.1228905  -1.0304426
  -0.98880127]
 [ 0.78554988 -1.          1.05431032 ... -0.27879671 -1.0304426
  -2.28771502]
 ...
 [ 0.78554988  1.         -0.94848735 ... -0.20829678 -1.0304426
   0.31011248]
 [ 0.78554988  1.          1.05431032 ... -0.68845848 -1.0304426
  -2.28771502]
 [-1.27299365 -1.         -0.94848735 ... -0.1663779   0.80424788
   0.31011248]]
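To avoid data leakage, the scaler should be fitted on the training split only, and its learned statistics reused on the test split. Wrapping the scaler and the regressor in a Pipeline enforces this automatically; a sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=666)

# fit() scales with training statistics only; predict()/score() reuse them unchanged
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)

print(round(score, 3))  # R^2 on the held-out split
```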

Part 13: Grid search for hyperparameter tuning

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor()
param_list = [
    {
        'n_neighbors': list(range(1, 38)),
        'weights': ['uniform']
    },
    {
        'n_neighbors': list(range(1, 38)),
        'weights': ['distance'],
        'p': [i for i in range(1, 21)]
    }
]
grid = GridSearchCV(model, param_grid=param_list, cv=10)
grid.fit(X_train_std, y_train)
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
best_model = grid.best_estimator_

The output is below. The search takes quite a while to run, so be patient.

0.8368525468413448
{'n_neighbors': 13, 'p': 1, 'weights': 'distance'}
KNeighborsRegressor(n_neighbors=13, p=1, weights='distance')
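If the run is too slow, two standard levers (not used in the original run) are n_jobs=-1, which spreads the cross-validation fits across all CPU cores, and a smaller cv. A reduced sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X.sum(axis=1) + rng.normal(0, 0.1, 120)

param_list = [
    {"n_neighbors": list(range(1, 10)), "weights": ["uniform"]},
    {"n_neighbors": list(range(1, 10)), "weights": ["distance"], "p": [1, 2]},
]
# n_jobs=-1 parallelizes the fold fits across all cores
grid = GridSearchCV(KNeighborsRegressor(), param_grid=param_list, cv=5, n_jobs=-1)
grid.fit(X, y)

print(grid.best_params_)
```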

Part 14: Training the model and comparing predictions with actuals

best_model = KNeighborsRegressor(n_neighbors=13, p=1, weights='distance')
# Train the model
best_model.fit(X_train_std, y_train)
# Save the model
import joblib

joblib.dump(best_model, "PriceRegModel.model")

# Predict
y_predict = best_model.predict(X_test_std)
# Scatter plot of predictions vs. actual values
plt.scatter(y_test, y_predict, label="test")
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()],
         'k--',
         lw=3,
         label="predict"
         )
plt.show()

The model is saved to PriceRegModel.model, and the scatter plot compares predicted prices against actual prices.
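The scatter plot is a visual check; metrics such as RMSE and R² quantify the fit. A sketch with hand-made values (these are not the actual model's outputs):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical predictions vs. ground truth, for illustration only
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0, 260000.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(rmse)          # 10000.0
print(round(r2, 3))  # 0.968
```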

Part 15: Full source code

from sklearn.impute import SimpleImputer  # mode imputation
from sklearn.preprocessing import OneHotEncoder  # one-hot encoding
from sklearn.feature_selection import VarianceThreshold  # variance filtering
from scipy.stats import pearsonr  # Pearson correlation coefficient
import pandas as pd
import numpy as np

# Show all rows and columns
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
data_df = pd.read_csv("train.csv", sep=',')
# print(data_df.head(), data_df.shape)  # 1460*80

# Null handling
# print(data_df.isnull().sum())
# Drop columns with more than 1/3 null values
data_df.drop(columns=['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'],
             inplace=True)
# print(data_df.shape)

# 2 Null filling
# 2-1 Fill [numeric columns] (assignment avoids the deprecated fillna(..., inplace=True) on slices)
data_df['LotFrontage'] = data_df['LotFrontage'].fillna(data_df['LotFrontage'].mean())
data_df['GarageYrBlt'] = data_df['GarageYrBlt'].fillna(data_df['GarageYrBlt'].median())
data_df['MasVnrArea'] = data_df['MasVnrArea'].fillna(data_df['MasVnrArea'].median())
# print(data_df.isnull().sum())

# 2-2 The remaining columns with nulls are all [string columns]; collect them in miss_col_list
# Get the names of the columns that still contain nulls
miss_col_list = data_df.columns[data_df.isnull().any()].tolist()
# print(data_df.isnull().any())
# print(miss_col_list)

# Collect each missing column's values as an (n, 1) array
miss_list = []

for i in miss_col_list:
    miss_list.append(data_df[i].values.reshape(-1, 1))  # any number of rows, one column
# print(miss_list)

# Fill each string column with its most frequent value
for i in range(0, len(miss_list)):
    im_most = SimpleImputer(strategy='most_frequent')
    most = im_most.fit_transform(miss_list[i])
    data_df.loc[:, miss_col_list[i]] = most

# print(data_df.isnull().sum())

# Find all string columns
ob_feature = data_df.select_dtypes(include=['object']).columns.tolist()
# print(ob_feature, len(ob_feature))
ob_df_data = data_df.loc[:, ob_feature]
# print(ob_df_data.head())

# Instantiate the one-hot encoder
OneHot = OneHotEncoder()
# numpy ndarray
result = OneHot.fit_transform(ob_df_data).toarray()
# Get the generated column names (on scikit-learn >= 1.0 use get_feature_names_out())
OneHotNames = OneHot.get_feature_names().tolist()
# print(OneHotNames, len(OneHotNames))

# The one-hot-encoded DataFrame
OneHot_df = pd.DataFrame(result, columns=OneHotNames)
# print(OneHot_df.head())

# Drop the original 38 string columns: 75 - 38 = 37 remain
data_df.drop(columns=ob_feature, inplace=True)
# Column-wise concat: 37 + 234 new one-hot columns = 271 feature columns + 1 label = 272
data_df = pd.concat([OneHot_df, data_df], axis=1)
# print(data_df.head(), data_df.shape)  # (1460, 272)

# Too many features -- computation too expensive
# Variance filtering
var_index = VarianceThreshold(threshold=0.1)
data = var_index.fit_transform(data_df)
# Get the indices of the retained columns
index = var_index.get_support(True).tolist()
# print(index)
data_df = data_df.iloc[:, index]
# print(data_df.head(), data_df.shape)  # (1460, 84)

# Correlation analysis -- Pearson correlation coefficient
features = data_df.columns.tolist()
# print(features)

f_names = []  # columns with |pearsonr| > 0.5
# The Pearson coefficients themselves
pear_num = []
# Compute each feature's Pearson correlation against the label column
for i in range(0, len(features) - 1):
    if abs(pearsonr(data_df[features[i]], data_df[features[-1]])[0]) > 0.5:
        f_names.append(features[i])
        pear_num.append(pearsonr(data_df[features[i]], data_df[features[-1]])[0])
# print(f_names, len(f_names))  # 13

# Inspect correlations between features
import matplotlib.pyplot as plt
import seaborn as sns

# Wrap the coefficients in a DataFrame
pear_dict = {
    'features': f_names,
    'pearData': pear_num
}
hotPear = pd.DataFrame(pear_dict)
# print(hotPear)

# Sort the Pearson coefficients in descending order
hotPear.sort_values(by=['pearData'], ascending=False, inplace=True)
# Reset the index
hotPear.reset_index(drop=True, inplace=True)
# print(hotPear)

# # Draw the heat map
# plt.figure(figsize=(12, 12))
# sns.set(font_scale=0.8)
# cor = np.corrcoef(data_df[f_names].values.T)
# # Annotated correlation matrix
# sns.heatmap(cor, cbar=False, annot=True, square=True, fmt='0.2f', yticklabels=f_names,
#             xticklabels=f_names)
# # plt.show()

# Drop one feature from each highly correlated pair
f_names.append("SalePrice")
data_df = data_df[f_names]
data_df.drop(['GarageArea', 'TotRmsAbvGrd', '1stFlrSF'], inplace=True, axis=1)
# print(data_df.head(), data_df.shape)

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Split into a feature array and a label array
feature_data = data_df.iloc[:, :-1].values
# print(type(feature_data))
label_data = np.ravel(data_df.iloc[:, -1].values)
# print(label_data,type(label_data))
X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.3, random_state=666)
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)  # reuse training statistics; refitting on the test set would leak information
# print(X_train_std)

# model = KNeighborsRegressor()
# param_list = [
#     {
#         'n_neighbors': list(range(1, 38)),
#         'weights': ['uniform']
#     },
#     {
#         'n_neighbors': list(range(1, 38)),
#         'weights': ['distance'],
#         'p': [i for i in range(1, 21)]
#     }
# ]
# grid = GridSearchCV(model, param_grid=param_list, cv=10)
# grid.fit(X_train_std, y_train)
# print(grid.best_score_)
# print(grid.best_params_)
# print(grid.best_estimator_)
# best_model = grid.best_estimator_

best_model = KNeighborsRegressor(n_neighbors=13, p=1, weights='distance')
# Train the model
best_model.fit(X_train_std, y_train)
# Save the model
import joblib

joblib.dump(best_model, "PriceRegModel.model")

# Predict
y_predict = best_model.predict(X_test_std)
# Scatter plot of predictions vs. actual values
plt.scatter(y_test, y_predict, label="test")
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()],
         'k--',
         lw=3,
         label="predict"
         )
plt.show()

Predictions and actual values can also be compared with a line plot:

test_pre = pd.DataFrame({"test": y_test.tolist(),
                         "pre": y_predict.flatten()
                         })
test_pre.plot(figsize=(18, 10))
plt.show()
