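Cluster analysis of whether a car is worth buying:
1. Explanation of the data fields:
buying: purchase price
maint: maintenance cost
doors: number of doors
person: passenger capacity
lug_boot: luggage boot capacity
safety: safety rating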
2. Transforming the data
Map the string values to numbers so that they can be used as numeric features.
Loading the data:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
data = pd.read_csv('./car_data.txt')
data
X = data.reset_index(drop = True) # reset the row index, discarding the old one
X
X.to_csv('./car_data_new.csv',index = False) # save without the row index
pd.read_csv('./car_data_new.csv') # load the newly saved data
Data conversion:
for col in X.columns:
    print(col, X[col].unique())
Output:
buying ['vhigh' 'high' 'med' 'low']
maint ['2' '3' '4' '5more']
doors ['2' '4' 'more']
person ['small' 'med' 'big']
lug_boot ['low' 'med' 'high']
safety ['unacc' 'acc' 'vgood' 'good']
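The values printed above look shifted by one column relative to their labels: for instance, maint lists door counts ('2' to '5more') and safety lists acceptability classes ('unacc' to 'vgood'). One possible cause, which is an assumption and not something the source confirms, is that each data row has one more field than the header row. In that case pandas uses the first field as the row index, and reset_index(drop = True) then discards it. The sketch below reproduces that behavior on a made-up two-row sample; the sample text, the column name label, and the variables raw, demo and fixed are all hypothetical:

```python
import io
import pandas as pd

# Hypothetical sample: 6 header names, but 7 fields in every data row.
raw = (
    "buying,maint,doors,person,lug_boot,safety\n"
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,med,4,more,big,high,vgood\n"
)

demo = pd.read_csv(io.StringIO(raw))
# pandas uses the extra first field as the row index, so every label
# now sits one column to the left of the value it describes.
print(demo.index.tolist())      # the real 'buying' values ended up here
print(demo['buying'].tolist())  # these are actually the 'maint' values

# Passing one name per field keeps all seven columns aligned.
# 'label' is a made-up name for the seventh field.
names = ['buying', 'maint', 'doors', 'person', 'lug_boot', 'safety', 'label']
fixed = pd.read_csv(io.StringIO(raw), names=names, skiprows=1)
print(fixed.columns.tolist())
```

If this is what happened, the clustering below runs on six of the seven fields, with the original first field left out.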
Dictionary mapping:
X['buying'] = X['buying'].map({'vhigh':1,'high':2,'med':3,'low':4})
X['maint'] = X['maint'].map({'2':2,'3':3,'4':4,'5more':5})
X['doors'] = X['doors'].map({'2':2,'4':4,'more':5})
X['person'] = X['person'].map({'small':2,'med':5,'big':7})
X['lug_boot'] = X['lug_boot'].map({'low':1,'med':2,'high':3})
X['safety'] = X['safety'].map({'unacc':1,'acc':2,'vgood':3,'good':4})
X
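# Run the dictionary mapping only once: a second run turns every value into NaN, because the columns then hold numbers that no longer match the string keys of the dictionaries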
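A minimal sketch of why the second run fails, on a made-up Series (the names s, mapping, once and twice are illustrative): Series.map returns NaN for any value that is not a key of the dictionary, and after the first run the column holds integers where the original strings used to be.

```python
import pandas as pd

s = pd.Series(['low', 'med', 'high'])
mapping = {'low': 1, 'med': 2, 'high': 3}

once = s.map(mapping)      # strings -> integers
twice = once.map(mapping)  # integers are not keys of the dict -> all NaN
print(once.tolist())
print(twice.isna().all())

# One way to make the cell safe to re-run: map only while the column
# has not been converted to numbers yet.
if not pd.api.types.is_numeric_dtype(s):
    s = s.map(mapping)
```

With a guard like this, re-executing the cell leaves an already converted column untouched.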
Cluster modeling:
kmeans = KMeans(n_clusters = 3)
kmeans.fit(X)
y_= kmeans.predict(X)
silhouette_score(X,y_)
Score:
0.28526565681580135
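For context on what this number measures: for each sample, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. silhouette_score averages this over all samples, so the result lies between -1 and 1. The following is a pure-Python sketch of that definition on four made-up points, not the scikit-learn implementation; the function name mean_silhouette and the toy data are illustrative, and it assumes Euclidean distance and at least two points in every cluster:

```python
from math import dist

def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster (p itself excluded).
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            mean_dist[c] = sum(dist(p, q) for q in members) / len(members)
        a = mean_dist[labels[i]]                                     # own cluster
        b = min(d for c, d in mean_dist.items() if c != labels[i])  # nearest other cluster
        total += (b - a) / max(a, b)
    return total / len(points)

# Two tight pairs of points, far apart from each other.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
print(mean_silhouette(points, labels))  # close to 1: well-separated clusters
```

Read against this definition, a score of about 0.285 suggests that the three clusters found above are only weakly separated.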
Selecting the best number of clusters:
scores = []
for k in range(2,8):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    y_ = kmeans.predict(X)
    score = silhouette_score(X,y_)
    scores.append(score)
print(scores)
plt.plot(range(2,8),scores)
The scores list:
[0.3486833182368877, 0.28526565681580135, 0.2607003011258018, 0.23938352073818228, 0.23148278681018836, 0.2285215746743637]
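The best k can be read off this list programmatically. A small sketch using the scores reported above (the variable names ks, best_k and best_score are illustrative): the highest silhouette score is the first entry, which corresponds to k = 2.

```python
# Silhouette scores reported above, for k = 2..7.
scores = [0.3486833182368877, 0.28526565681580135, 0.2607003011258018,
          0.23938352073818228, 0.23148278681018836, 0.2285215746743637]
ks = list(range(2, 8))

# Pair each k with its score and keep the pair with the highest score.
best_k, best_score = max(zip(ks, scores), key=lambda pair: pair[1])
print(best_k, best_score)  # 2 0.3486833182368877
```

Because KMeans is not given a fixed random_state in the loop above, the exact scores may differ slightly between runs.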