[Machine Learning - Linear Regression - 3] Simple Linear Regression Explained: Concepts, Principles, and Implementation

2026/2/17 12:08:29

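Linear regression is one of the most basic and most widely used algorithms in machine learning. As a cornerstone of predictive analytics, simple linear regression is a natural starting point for understanding more complex models. Whether you are new to machine learning or an experienced practitioner shoring up the fundamentals, it is worth understanding well. This article covers the concept of simple linear regression, the mathematics behind it, and how to implement it in Python.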

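1. What Is Simple Linear Regression?

Simple linear regression is a statistical method for modeling and describing the linear relationship between two continuous variables. It assumes that the dependent variable (the target) is linearly related to a single independent variable (the predictor).

Core idea: find the best-fit straight line, the one that best describes the relationship between the two variables. The equation of this line can be written as: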

y = β₀ + β₁x + ε

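where:

y is the dependent variable (the value we want to predict)
x is the independent variable (the feature used for prediction)
β₀ is the intercept (the value of y where the line crosses the y-axis, i.e. at x = 0)
β₁ is the slope (the change in y for each one-unit change in x)
ε is the error term (random variation the model cannot explain)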

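2. The Mathematics of Simple Linear Regression

2.1 The Least Squares Method

At the heart of simple linear regression is ordinary least squares (OLS), which is used to estimate the regression coefficients β₀ and β₁. OLS looks for the coefficients that minimize the **residual sum of squares (RSS)**.

A residual is the difference between an observed value (yᵢ) and the corresponding predicted value (ŷᵢ). Summing the squared residuals over all observations gives: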

RSS = Σ(yᵢ - ŷᵢ)² = Σ(yᵢ - (β₀ + β₁xᵢ))²
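
To make the RSS definition concrete, here is a minimal sketch that evaluates it for two candidate lines on a small toy dataset. The data and the helper name `rss` are made up for illustration and are not part of the original article.

```python
import numpy as np


def rss(beta0, beta1, x, y):
    """Residual sum of squares for the line y_hat = beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (beta0 + beta1 * x)
    return float(np.sum(residuals ** 2))


# Toy data, invented for this illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

rss_guess = rss(1.0, 1.0, x, y)  # an arbitrary candidate line
rss_ols = rss(2.2, 0.6, x, y)    # the least-squares line for this data

print(f"RSS of the arbitrary line:     {rss_guess:.2f}")
print(f"RSS of the least-squares line: {rss_ols:.2f}")
```

The least-squares line has the smaller RSS, and no other choice of β₀ and β₁ gives a lower value on this data.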

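2.2 Computing the Coefficients

Setting the partial derivatives of RSS with respect to β₀ and β₁ to zero and solving the two resulting equations gives closed-form solutions for both coefficients:

Formula for the slope (β₁):

β₁ = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / Σ(xᵢ - x̄)²

Formula for the intercept (β₀):

β₀ = ȳ - β₁x̄

where x̄ and ȳ are the sample means of x and y. The slope is undefined when every xᵢ is identical, because the denominator is then zero.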
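
As a worked example of these two formulas, the sketch below applies them step by step to five made-up data points (the numbers are illustrative only, not from the original article). The intermediate sums are small enough to check by hand.

```python
import numpy as np

# Toy data, invented for this illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_bar, y_bar = x.mean(), y.mean()          # 3.0 and 4.0

# Numerator and denominator of the slope formula
s_xy = np.sum((x - x_bar) * (y - y_bar))   # 4 + 0 + 0 + 0 + 2 = 6
s_xx = np.sum((x - x_bar) ** 2)            # 4 + 1 + 0 + 1 + 4 = 10

beta1 = s_xy / s_xx                        # 6 / 10 = 0.6
beta0 = y_bar - beta1 * x_bar              # 4 - 0.6 * 3 = 2.2

print(f"beta1 = {beta1:.2f}, beta0 = {beta0:.2f}")  # beta1 = 0.60, beta0 = 2.20
```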

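2.3 Model Evaluation Metrics

Once the model is fitted, we need to evaluate its performance:

R-squared (R²): the proportion of the variance in y that the model explains. For an OLS fit with an intercept, evaluated on the data it was fitted to, R² lies between 0 and 1, and values closer to 1 indicate a better fit. On held-out data it can be negative.
```
R² = 1 - (RSS/TSS)
```
where TSS is the total sum of squares: TSS = Σ(yᵢ - ȳ)²
Mean squared error (MSE): the average of the squared differences between predicted and actual values
```
MSE = RSS/n
```
Root mean squared error (RMSE): the square root of MSE, expressed in the same units as the target variable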
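
The three metrics can be computed directly from their definitions. The following is a minimal sketch; the helper name `regression_metrics` and the toy values are assumptions made for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute R², MSE and RMSE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    mse = rss / y_true.size
    return {"r2": 1 - rss / tss, "mse": mse, "rmse": np.sqrt(mse)}


# Toy values: y_pred comes from the line y_hat = 2.2 + 0.6 * x at x = 1..5
y_true = [2, 4, 5, 4, 5]
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]

metrics = regression_metrics(y_true, y_pred)
print({name: round(float(value), 3) for name, value in metrics.items()})
# {'r2': 0.6, 'mse': 0.48, 'rmse': 0.693}
```

Here RSS = 2.4 and TSS = 6, so R² = 1 - 2.4/6 = 0.6 and MSE = 2.4/5 = 0.48.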

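3. Assumptions of Simple Linear Regression

For the model to be reliable, simple linear regression relies on the following key assumptions:

Linearity: the independent and dependent variables are linearly related
Independence: observations are independent of one another (pay particular attention with time-series data)
Homoscedasticity: the variance of the residuals is constant (it should not change as the predicted value grows)
Normality: for any fixed value of x, y is normally distributed (equivalently, the error term ε is normally distributed)
No multicollinearity: satisfied automatically in simple linear regression, since there is only one independent variable
No autocorrelation: the residuals should not be correlated with one another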
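
Several of these assumptions can be checked roughly from the residuals of a fitted line. The sketch below is one possible set of quick numeric checks using only NumPy, not a substitute for formal tests or residual plots. The function name `residual_diagnostics`, the simulated data, and the choice of statistics are all assumptions made for illustration.

```python
import numpy as np


def residual_diagnostics(x, y):
    """Rough numeric checks on the residuals of a fitted straight line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # np.polyfit returns the highest-degree coefficient first: (slope, intercept)
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    resid = y - fitted

    # Durbin-Watson statistic on residuals in data order:
    # values near 2 suggest little first-order autocorrelation
    durbin_watson = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

    # Crude homoscedasticity signal: correlation between |residual| and fitted value
    spread_corr = np.corrcoef(np.abs(resid), fitted)[0, 1]

    # Sample skewness and excess kurtosis: both near 0 for normal residuals
    z = (resid - resid.mean()) / resid.std()
    skewness = np.mean(z ** 3)
    excess_kurtosis = np.mean(z ** 4) - 3.0

    return {
        "mean_residual": float(resid.mean()),
        "durbin_watson": float(durbin_watson),
        "abs_resid_vs_fitted_corr": float(spread_corr),
        "skewness": float(skewness),
        "excess_kurtosis": float(excess_kurtosis),
    }


# Simulated data that satisfies the assumptions by construction
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3 + 2 * x + rng.normal(0, 2, size=200)

diagnostics = residual_diagnostics(x, y)
for name, value in diagnostics.items():
    print(f"{name}: {value:.3f}")
```

On data that violates an assumption, the matching number moves away from its reference value. For instance, noise that grows with x tends to push the correlation between absolute residuals and fitted values away from zero.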

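4. Implementing Simple Linear Regression in Python

Now let's implement simple linear regression from scratch in Python and then verify the result with scikit-learn.

4.1 Preparing the Data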

import numpy as np
import matplotlib.pyplot as plt

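# Generate simulated data
np.random.seed(42)
x = np.random.rand(100) * 10  # independent variable: uniform random numbers in [0, 10)
y = 2 * x + 3 + np.random.randn(100) * 2  # dependent variable: linear relationship plus noise

# Visualize the data
plt.scatter(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter plot of the raw data')
plt.show()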

4.2 Implementing Simple Linear Regression from Scratch

import numpy as np
import matplotlib.pyplot as plt


class SimpleLinearRegression:
    def __init__(self):
        self.intercept_ = None  # β₀
        self.coef_ = None  # β₁

    def fit(self, x, y):
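        # Compute the sample means
        x_mean = np.mean(x)
        y_mean = np.mean(y)

        # Sum of cross-deviations and sum of squared deviations of x.
        # These are n times the covariance and the variance; the factor n cancels in the ratio.
        cov = np.sum((x - x_mean) * (y - y_mean))
        var = np.sum((x - x_mean) ** 2)

        # Compute the coefficients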
        self.coef_ = cov / var
        self.intercept_ = y_mean - self.coef_ * x_mean

    def predict(self, x):
        return self.intercept_ + self.coef_ * x

    def score(self, x, y):
        # Compute R-squared
        y_pred = self.predict(x)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)


# Generate simulated data
np.random.seed(42)
x = np.random.rand(100) * 10  # independent variable: uniform random numbers in [0, 10)
y = 2 * x + 3 + np.random.randn(100) * 2  # dependent variable: linear relationship plus noise

# Use our implementation
slr = SimpleLinearRegression()
slr.fit(x, y)

print(f"Intercept (β₀): {slr.intercept_:.2f}")
print(f"Slope (β₁): {slr.coef_:.2f}")
print(f"R-squared: {slr.score(x, y):.2f}")

# Plot the regression line
plt.scatter(x, y)
plt.plot(x, slr.predict(x), color='red')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple linear regression implemented by hand')
plt.show()
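
The `fit` method above divides by the sum of squared deviations of x, which is zero when every x value is the same. The following is a minimal sketch of a more defensive variant. The function name `fit_simple_linear` and the specific checks are assumptions for illustration, not part of the original implementation.

```python
import numpy as np


def fit_simple_linear(x, y):
    """Return (intercept, slope) from the closed-form OLS solution, with input checks."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()

    if x.size != y.size:
        raise ValueError("x and y must have the same number of samples")
    if x.size < 2:
        raise ValueError("at least two samples are required")
    if x.max() == x.min():
        # All x values are identical, so the denominator of the slope is zero
        raise ValueError("x has zero variance; the slope is undefined")

    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return float(intercept), float(slope)


intercept, slope = fit_simple_linear([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")

try:
    fit_simple_linear([3, 3, 3], [1, 2, 3])
except ValueError as err:
    print("rejected:", err)
```

Raising an explicit error is one design choice among several. Without the check, NumPy would emit a runtime warning and return `nan` or `inf` coefficients, which is easier to overlook.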

4.3 Implementation with scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Generate simulated data
np.random.seed(42)
x = np.random.rand(100) * 10  # independent variable: uniform random numbers in [0, 10)
y = 2 * x + 3 + np.random.randn(100) * 2  # dependent variable: linear relationship plus noise

# Reshape x to (n_samples, n_features), the 2-D shape scikit-learn expects
x_reshaped = x.reshape(-1, 1)

# Create and fit the model
model = LinearRegression()
model.fit(x_reshaped, y)

# Print the fitted coefficients
print(f"Intercept (β₀): {model.intercept_:.2f}")
print(f"Slope (β₁): {model.coef_[0]:.2f}")

# Predict
y_pred = model.predict(x_reshaped)

# Evaluate
print(f"Mean squared error (MSE): {mean_squared_error(y, y_pred):.2f}")
print(f"R-squared: {r2_score(y, y_pred):.2f}")

# Visualize
plt.scatter(x, y)
plt.plot(x, y_pred, color='red')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple linear regression with scikit-learn')
plt.show()
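
The closed-form coefficients can also be cross-checked without scikit-learn. The sketch below compares them against two NumPy routines, `np.polyfit` and `np.linalg.lstsq`, on freshly simulated data. It uses `np.random.default_rng` rather than `np.random.seed`, so the numbers will not match the listings above; that choice and the variable names are assumptions for illustration.

```python
import numpy as np

# Simulated data with the same structure as in the article (true slope 2, intercept 3)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 3 + rng.normal(0, 2, size=100)

# 1) Closed-form solution from section 2.2
slope_cf = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_cf = y.mean() - slope_cf * x.mean()

# 2) Degree-1 polynomial fit; coefficients come back highest degree first
slope_pf, intercept_pf = np.polyfit(x, y, deg=1)

# 3) General least-squares solver on the design matrix [1, x]
design = np.column_stack([np.ones_like(x), x])
(intercept_ls, slope_ls), *_ = np.linalg.lstsq(design, y, rcond=None)

print(f"closed form: intercept = {intercept_cf:.4f}, slope = {slope_cf:.4f}")
print(f"np.polyfit:  intercept = {intercept_pf:.4f}, slope = {slope_pf:.4f}")
print(f"np.lstsq:    intercept = {intercept_ls:.4f}, slope = {slope_ls:.4f}")

agree = np.allclose(
    [intercept_cf, slope_cf, intercept_cf, slope_cf],
    [intercept_pf, slope_pf, intercept_ls, slope_ls],
)
print("all three methods agree:", agree)
```

All three approaches minimize the same RSS, so they should agree up to floating-point error.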