Data Preprocessing and Exploratory Data Analysis (Part 1)

news2025/7/15 3:37:13

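Data Preprocessing

Data Cleaning

Handling missing values:

Detecting and handling outliers:

Encoding categorical features:

Feature Engineering

Creating new features:

Feature scaling:

Exploratory Data Analysis (EDA)

Visualization with Matplotlib

Plotting a histogram:

Plotting a box plot:

Plotting a scatter plot:

Building a Simple Machine Learning Model

Preparing the Data

Splitting into training and test sets:

Feature scaling:

Training the Model

Using a linear regression model:

Evaluating the Model

Computing performance metrics:

Hands-On Project

Project Steps

Worked Code Example

Generated Figures

Univariate analysis:

Multivariate analysis:

Bar chart:

Scatter plot:

Heatmap: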

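Data Preprocessing

Data preprocessing is one of the most important steps in a machine learning workflow. It covers data cleaning, feature engineering, and related tasks.

Data Cleaning

Handling missing values: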

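# Fill missing values with the median.
# Assign the result back: inplace=True on a selected column triggers a
# warning in recent pandas versions and may not update df.
df['Age'] = df['Age'].fillna(df['Age'].median())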
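
As a quick illustration of median imputation, here is a minimal sketch on a hypothetical toy DataFrame (the data and the names toy, n_missing, and age_median are made up for this example and are not from the project dataset). It counts the missing values first, then fills them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column name mirrors the snippet above.
toy = pd.DataFrame({'Age': [22.0, 35.0, np.nan, 58.0, np.nan, 41.0]})

# Count the missing values before imputing.
n_missing = int(toy['Age'].isna().sum())

# median() skips NaN by default, so it is computed from the four known ages.
age_median = toy['Age'].median()

# Fill the gaps and assign the result back to the column.
toy['Age'] = toy['Age'].fillna(age_median)

print(n_missing, age_median)
print(toy['Age'].tolist())
```

The median is often preferred over the mean here because a few extreme values shift it much less.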

Detecting and handling outliers:

# Detect outliers with the IQR (interquartile range) rule
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['Age'] >= lower_bound) & (df['Age'] <= upper_bound)]
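
To make the IQR rule concrete, the sketch below applies it to a small hypothetical series of ages (the values are invented for illustration). It also shows clipping as one possible alternative to dropping rows; which of the two is appropriate depends on the data and is a judgment call.

```python
import pandas as pd

# Hypothetical ages with one obviously extreme value.
ages = pd.Series([23, 25, 27, 29, 31, 33, 35, 120], name='Age')

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option 1: drop values outside [lower, upper] (between is inclusive).
filtered = ages[ages.between(lower, upper)]

# Option 2: keep every row but cap extreme values at the bounds.
clipped = ages.clip(lower, upper)

print(lower, upper)
print(len(ages), len(filtered))
print(clipped.max())
```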

Encoding categorical features:

# One-hot encode the column; drop_first=True drops the first category level
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
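
The effect of drop_first=True is easiest to see on a tiny example. The sketch below uses a hypothetical four-row DataFrame (invented values): with two categories, only one indicator column remains, because the dropped level is implied when the indicator is 0.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [30, 25, 41, 37],
})

# Categories are ordered alphabetically, so 'Female' is the dropped level.
encoded = pd.get_dummies(toy, columns=['Gender'], drop_first=True)

print(encoded.columns.tolist())
print(encoded['Gender_Male'].astype(int).tolist())
```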

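Feature Engineering

Creating new features:

# Combine base pay and bonus into a single total-pay feature
df['Total_Pay'] = df['Base_Pay'] + df['Bonus']

Feature scaling: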

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['Age', 'Salary']])
df[['Age', 'Salary']] = scaled_features
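
For reference, StandardScaler standardizes each column as z = (x - mean) / std, using the population standard deviation. The sketch below checks this against a manual NumPy computation on a hypothetical 4x2 array (the numbers are made up).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: first column plays the role of Age, second of Salary.
values = np.array([
    [20.0, 3000.0],
    [30.0, 4500.0],
    [40.0, 6000.0],
    [50.0, 9000.0],
])

scaled = StandardScaler().fit_transform(values)

# Manual standardization; np.std uses ddof=0 (population std) by default.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1 (up to floating-point error), so features on very different scales become comparable.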

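Exploratory Data Analysis (EDA)

The goal of EDA is to better understand the characteristics of a dataset. Visualization tools help with this process.

Visualization with Matplotlib

Plotting a histogram: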

import matplotlib.pyplot as plt

plt.hist(df['Age'], bins=20)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

Plotting a box plot:

df.boxplot(column='Salary')
plt.title('Salary Distribution')
plt.show()

Plotting a scatter plot:

plt.scatter(df['Age'], df['Salary'])
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()
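
Plots are often easier to interpret next to a few summary numbers. As a rough sketch on hypothetical toy data (values invented for illustration), describe() gives the statistics that a histogram and box plot visualize, and Series.corr gives the Pearson correlation that a scatter plot suggests.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the plots above.
toy = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [3000, 3600, 4100, 4800, 5200],
})

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = toy.describe()

# Pearson correlation between the two columns.
age_salary_corr = toy['Age'].corr(toy['Salary'])

print(summary.loc[['mean', '50%', 'std']])
print(round(age_salary_corr, 3))
```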

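Building a Simple Machine Learning Model

Now that the data has been cleaned, we can start building a machine learning model. We use linear regression as the example here.

Preparing the Data

Splitting into training and test sets: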

from sklearn.model_selection import train_test_split

X = df[['Age', 'Experience']]
y = df['Salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Feature scaling:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Training the Model

Using a linear regression model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
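
The scaling and fitting steps above can also be bundled into a single scikit-learn Pipeline, which fits the scaler on the training data only and reuses it at prediction time. This is a minimal sketch, not the article's own method: the data is synthetic, and the names demo and pipe, as well as the coefficients used to generate Salary, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): salary grows with experience and age, plus noise.
rng = np.random.default_rng(0)
n = 200
experience = rng.uniform(0, 30, n)
age = 22 + experience + rng.normal(0, 2, n)
salary = 3000 + 150 * experience + 20 * age + rng.normal(0, 200, n)
demo = pd.DataFrame({'Age': age, 'Experience': experience, 'Salary': salary})

X = demo[['Age', 'Experience']]
y = demo['Salary']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() fits the scaler on X_train only, then fits the regression on the
# scaled training data; score() applies the same scaler to X_test.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

print(pipe.score(X_test, y_test))  # R^2 on the held-out test set
```

Keeping the scaler inside the pipeline makes it harder to accidentally fit it on the test data.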

Evaluating the Model

Computing performance metrics:

from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R2 Score: {r2}')
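
To show what these two metrics measure, the sketch below recomputes them by hand on four hypothetical values (invented for illustration) and compares the results with scikit-learn. MSE is the mean of the squared residuals; R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# Mean squared error: average of squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

A lower MSE is better. An R2 close to 1 means the model explains most of the variance in the target, while an R2 near 0 means it does little better than always predicting the mean.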

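Hands-On Project

To consolidate what you have learned, try a small hands-on project. For example, take a dataset downloaded from Kaggle (as I do below), preprocess it, apply feature engineering and EDA, and finally train a simple machine learning model.

Project Steps

Data loading: load the data with Pandas.
Data cleaning: handle missing values and outliers.
Feature engineering: create new features and scale features.
EDA: visualize the data with Matplotlib.
Model training: train a model with Scikit-Learn.
Model evaluation: assess model performance with appropriate metrics.

Worked Code Example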

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
data_path = r'D:\机器学习\数据集：国家划分的生活成本\Cost_of_Living_Index_by_Country_2024.csv'
df = pd.read_csv(data_path)

# Show the first few rows
print(df.head())
# Check for missing values
print(df.isnull().sum())
# Basic summary statistics
print(df.describe())

# Visualization
# Univariate analysis: one histogram per numeric column
df.hist(bins=20, figsize=(12, 10), color='blue')
plt.tight_layout()  # Adjust subplot parameters so the subplots fill the figure area
plt.show()

# Multivariate analysis: correlation matrix of the numeric columns
numeric_df = df.select_dtypes(include=['float64', 'int64'])
corr_matrix = numeric_df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
# To save the figure, call savefig before plt.show(); afterwards the figure is empty
# plt.savefig('8.11Cost_of_Living_Index_by_Country_2024.png')
plt.show()

# Bar chart: Cost of Living Index of the top ten countries
# nlargest selects by value, so the result does not depend on the file's row order
top_10_countries = df.nlargest(10, 'Cost of Living Index')
plt.figure(figsize=(12, 6))
sns.barplot(x='Country', y='Cost of Living Index', data=top_10_countries)
plt.xticks(rotation=90)  # Rotate the x-axis labels
plt.title('Top 10 Countries by Cost of Living Index')
plt.show()

# Scatter plot: Cost of Living Index vs. Rent Index
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Cost of Living Index', y='Rent Index', data=df)
# Add a title
plt.title('Cost of Living Index vs Rent Index')
plt.show()

# Heatmap: correlations between all numeric indicators
plt.figure(figsize=(10, 6))
numeric1_df = df.select_dtypes(include=['float64', 'int64'])
sns.heatmap(numeric1_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()