数据增强和数据平衡

news2026/4/3 1:43:22

数据增强（Data Augmentation）

数据增强是一种通过对数据进行变换来增加数据多样性的方法。它可以提高模型的泛化能力，特别是在图像和文本处理任务中。

原理

通过数据增强技术，生成更多样本，提高模型的泛化能力。例如，对于图像数据，可以进行旋转、翻转、裁剪、缩放等操作。

核心公式：

例如，图像旋转：

其中，M 是旋转矩阵，θ 是旋转角度。

生活场景案例：产品图像数据增强

假设我们有一个电子商务网站的产品图像数据集，每个产品只有一张图像。为了提高图像分类模型的性能，我们将使用数据增强技术生成更多样本。

数据描述

product_image：产品图像

Python代码

from PIL import Image, ImageOps
import matplotlib.pyplot as plt
import numpy as np
import random

# 加载图像
image_path = './pic1595051357.jpg'  # 图像路径
image = Image.open(image_path)

# 定义数据增强函数
def augment_data(image):
    # 随机旋转
    angle = random.randint(-30, 30)
    rotated_image = image.rotate(angle)

    # 随机水平翻转
    if random.choice([True, False]):
        flipped_image = ImageOps.mirror(rotated_image)
    else:
        flipped_image = rotated_image

    return flipped_image

# 数据增强示例
num_samples = 4
plt.figure(figsize=(12, 8))

for i in range(num_samples):
    augmented_image = augment_data(image)
    plt.subplot(2, num_samples//2, i+1)
    plt.imshow(np.array(augmented_image))
    plt.axis('off')

plt.tight_layout()
plt.show()

代码解析

加载图像：使用 Pillow (PIL) 加载图像，将图像从文件读取到内存中。
定义数据增强函数：

- 随机旋转：生成一个随机角度，对图像进行旋转。
- 随机水平翻转：随机决定是否对图像进行水平翻转。

数据增强示例：生成4个增强后的图像样本，并使用 Matplotlib 进行展示。

图像数据增强步骤详解

随机旋转：

- 使用 random.randint(-30, 30) 生成一个随机角度。
- 使用 image.rotate(angle) 对图像进行旋转。

随机水平翻转：

- 使用 random.choice([True, False]) 随机选择是否进行水平翻转。
- 如果选择为 True，使用 ImageOps.mirror(rotated_image) 进行水平翻转。

通过这个生活场景的案例，我们可以看到数据增强如何帮助我们生成更多样本，从而提高模型的泛化能力。这在电子商务等领域的图像分类任务中尤为重要。

数据平衡（Data Balancing）

处理类别不平衡问题，可以使用过采样（如SMOTE）、欠采样等方法。

原理

通过过采样、欠采样等方法平衡类别分布，提高模型在少数类上的表现。

核心公式

过采样（SMOTE）：通过在少数类样本之间插值生成新的样本。对于两个少数类样本 xi 和 xj，生成新样本 xnew：

其中，λ 是介于 0 和 1 之间的随机数。

Python案例

数据平衡是数据预处理中的一个重要步骤，特别是在处理分类问题时。如果一个类别的样本数量远多于其他类别，会导致分类器偏向于多数类别，从而影响模型的性能。常见的数据平衡方法包括过采样、欠采样和合成少数类过采样技术（SMOTE）。

下面是一个基于SMOTE的案例，展示如何使用Python代码进行数据平衡，并绘制相关图形。

假设我们有一个客户满意度调查数据集，包含客户的年龄、消费金额、和满意度（满意/不满意）。

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from collections import Counter

# 生成示例不平衡数据集
np.random.seed(42)
ages = np.random.randint(18, 70, 1000)
spending = np.random.randint(500, 10000, 1000)
satisfaction = np.where(spending > 5000, 1, 0)
# 人为制造类别不平衡
satisfaction[:950] = 0

data = pd.DataFrame({'Age': ages, 'Spending': spending, 'Satisfaction': satisfaction})

# 查看原始数据集的类别分布
print(f"Original dataset shape: {Counter(data['Satisfaction'])}")

# 可视化原始数据集
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x=data['Age'], y=data['Spending'], hue=data['Satisfaction'], palette='viridis', alpha=0.6)
plt.title('Original Dataset')
plt.xlabel('Age')
plt.ylabel('Spending')

# 进行SMOTE过采样
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(data[['Age', 'Spending']], data['Satisfaction'])

# 将过采样后的数据转换为 DataFrame
resampled_data = pd.DataFrame(X_res, columns=['Age', 'Spending'])
resampled_data['Satisfaction'] = y_res

# 查看过采样后的数据类别分布
print(f"Resampled dataset shape: {Counter(y_res)}")

# 可视化过采样后的数据集
plt.subplot(1, 2, 2)
sns.scatterplot(x='Age', y='Spending', hue='Satisfaction', data=resampled_data, palette='viridis', alpha=0.6)
plt.title('SMOTE Resampled Dataset')
plt.xlabel('Age')
plt.ylabel('Spending')

plt.tight_layout()
plt.show()

# 数据平衡前后的类别分布直方图
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

sns.histplot(data['Satisfaction'], ax=axes[0], bins=2, kde=False)
axes[0].set_title('Original Dataset Class Distribution')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Frequency')

sns.histplot(y_res, ax=axes[1], bins=2, kde=False)
axes[1].set_title('SMOTE Resampled Dataset Class Distribution')
axes[1].set_xlabel('Class')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

代码解析

加载和生成数据：模拟一个不平衡的客户满意度数据集，其中包括客户年龄、消费金额和满意度。
定义数据增强函数：使用 SMOTE 方法进行过采样，平衡满意度类别的分布。
数据增强示例：使用 SMOTE 方法生成新的样本，使得满意和不满意的样本数量相等，并使用 Matplotlib 和 Seaborn 可视化原始和增强后的数据集。