Spatial Data Analysis（一）：线性回归

来源：https://github.com/Ziqi-Li/GEO4162C/tree/main

在此示例中，我们将介绍如何在 python 中拟合线性回归模型。我们将使用的数据集是 2020 年县级选举投票数据以及来自 ACS 的社会经济数据。

import pandas as pd
import geopandas as gpd

voting_url = "https://raw.githubusercontent.com/Ziqi-Li/GEO4162C/main/data/voting_2021.csv"
voting = pd.read_csv(voting_url)

voting.shape

(3108, 22)

voting.head()

	county_id	state	county	NAME	proj_X	proj_Y	total_pop	new_pct_dem	sex_ratio	pct_black	...	median_income	pct_65_over	pct_age_18_29	gini	pct_manuf	ln_pop_den	pct_3rd_party	turn_out	pct_fb	pct_uninsured
0	17051	17	51	Fayette County, Illinois	597980.182064	1.796863e+06	21565	18.479911	113.6	4.7	...	46650	18.8	14.899142	0.4373	14.9	3.392748	1.740255	59.066079	1.3	8.2
1	17107	17	107	Logan County, Illinois	559815.813333	1.920479e+06	29003	29.593095	97.2	6.9	...	57308	18.0	17.256835	0.4201	12.4	3.847163	2.392057	57.395734	1.6	4.5
2	17165	17	165	Saline County, Illinois	650277.214895	1.660710e+06	23994	25.605949	96.9	2.6	...	44090	19.9	13.586730	0.4692	8.7	4.128731	1.572384	59.078533	1.0	4.2
3	17097	17	97	Lake County, Illinois	654006.840698	2.174577e+06	701473	62.275888	99.8	6.8	...	89427	13.7	15.823132	0.4847	16.3	7.308582	1.960338	71.156446	18.7	6.8
4	17127	17	127	Massac County, Illinois	640398.986258	1.599902e+06	14219	25.662005	89.5	5.8	...	47481	20.8	12.370772	0.4097	7.4	4.067788	1.335682	62.372974	1.0	5.4

5 rows × 22 columns

shp_url = "https://raw.githubusercontent.com/Ziqi-Li/GEO4162C/main/data/County_shp_2018/County_shp_2018.shp"
shp = gpd.read_file(shp_url)

shp.plot()

<Axes: >

在这里插入图片描述

shp.head()

	STATEFP	COUNTYFP	COUNTYNS	AFFGEOID	GEOID	NAME	LSAD	ALAND	AWATER	geoid_int	geometry
0	39	071	01074048	0500000US39071	39071	Highland	06	1432479992	12194983	39071	POLYGON ((1037181.007 1847418.019, 1034790.811...
1	06	003	01675840	0500000US06003	06003	Alpine	06	1912292630	12557304	6003	POLYGON ((-2057214.391 1981928.249, -2051788.3...
2	12	033	00295737	0500000US12033	12033	Escambia	06	1701544502	563927612	12033	POLYGON ((797106.710 901987.747, 797133.592 90...
3	17	101	00424252	0500000US17101	17101	Lawrence	06	963936864	5077783	17101	POLYGON ((697325.783 1756964.285, 694889.230 1...
4	28	153	00695797	0500000US28153	28153	Wayne	06	2099745573	7255476	28153	POLYGON ((664815.315 992438.494, 664464.412 99...

使用公共键合并两个 DataFrame。

shp_df = shp.merge(voting,left_on="geoid_int",right_on="county_id")

制作一张关于人们如何投票的地图。非常熟悉的蓝红色景观。

shp_df.plot(column="new_pct_dem",cmap="bwr_r",vmin=0,vmax=100,legend=True,figsize=(10,10))

<Axes: >

在这里插入图片描述

线性回归

将使用“statsmodels”库来估计线性回归模型。

import statsmodels.formula.api as smf

shp_df.columns

Index(['STATEFP', 'COUNTYFP', 'COUNTYNS', 'AFFGEOID', 'GEOID', 'NAME_x',
       'LSAD', 'ALAND', 'AWATER', 'geoid_int', 'geometry', 'county_id',
       'state', 'county', 'NAME_y', 'proj_X', 'proj_Y', 'total_pop',
       'new_pct_dem', 'sex_ratio', 'pct_black', 'pct_hisp', 'pct_bach',
       'median_income', 'pct_65_over', 'pct_age_18_29', 'gini', 'pct_manuf',
       'ln_pop_den', 'pct_3rd_party', 'turn_out', 'pct_fb', 'pct_uninsured'],
      dtype='object')

import matplotlib.pyplot as plt

做一个简单的散点图，我们可以看到一点正相关。

plt.scatter(shp_df.pct_bach, shp_df.new_pct_dem)
plt.xlabel("pct_bach")
plt.ylabel("pct_dem")

Text(0, 0.5, 'pct_dem')

在这里插入图片描述

from scipy.stats import pearsonr
pearsonr(shp_df.pct_bach, shp_df.new_pct_dem)

PearsonRResult(statistic=0.533176728506252, pvalue=7.032775736251e-228)

现在让我们指定一个简单的模型：

Model_1: new_pct_dem ~ 1 + pct_bach

new_pct_dem: 因变量，投票给民主党的人的百分比

pct_bach: 自变量，拥有学士学位的人的百分比

1: 是添加截距，这个是经常需要的

model_1 = smf.ols(formula='new_pct_dem ~ 1 + pct_bach', data=shp_df).fit()

model_1.summary()

OLS Regression Results
Dep. Variable:	new_pct_dem	R-squared:	0.284
Model:	OLS	Adj. R-squared:	0.284
Method:	Least Squares	F-statistic:	1234.
Date:	Fri, 17 Nov 2023	Prob (F-statistic):	7.03e-228
Time:	16:07:30	Log-Likelihood:	-12560.
No. Observations:	3108	AIC:	2.512e+04
Df Residuals:	3106	BIC:	2.514e+04
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	13.9652	0.618	22.615	0.000	12.754	15.176
pct_bach	0.9055	0.026	35.124	0.000	0.855	0.956

Omnibus:	376.503	Durbin-Watson:	1.927
Prob(Omnibus):	0.000	Jarque-Bera (JB):	603.255
Skew:	0.845	Prob(JB):	1.01e-131
Kurtosis:	4.342	Cond. No.	60.0

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

这给了我们回归函数：

pct_dem = 13.9652 + 0.9055*pct_bach

解释：

Bach的%与DEM的%呈正相关。
增加1%Bach将增加0.9055%的人投票给DEM。
截距和 % Bach 的 p 值都非常小 (<0.05)，表明系数/发现的关系在统计上都是显着的。
R2 值 = 0.284 表明该模型具有一定程度的解释力。请记住 0<=R2<=1。

import numpy as np
np.sqrt(0.284)

0.532916503778969

R2 的平方根也是皮尔逊相关系数。

现在让我们指定一个更复杂的模型：

Model_2: new_pct_dem ~ 1 + pct_bach + median_income + ln_pop_den

new_pct_dem: 因变量，投票给民主党的人的百分比

pct_bach: 自变量，拥有学士学位的人的百分比
median_income: 自变量，收入中位数

ln_pop_den: 自变量，人口密度的对数尺度

1: 是添加截距，这个是经常需要的

model_2 = smf.ols(formula='new_pct_dem ~ 1 + pct_bach + median_income + ln_pop_den', data=shp_df).fit()

model_2.summary()

OLS Regression Results
Dep. Variable:	new_pct_dem	R-squared:	0.433
Model:	OLS	Adj. R-squared:	0.432
Method:	Least Squares	F-statistic:	789.2
Date:	Fri, 17 Nov 2023	Prob (F-statistic):	0.00
Time:	16:07:30	Log-Likelihood:	-12199.
No. Observations:	3108	AIC:	2.441e+04
Df Residuals:	3104	BIC:	2.443e+04
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	17.3905	0.898	19.361	0.000	15.629	19.152
pct_bach	0.9890	0.034	29.145	0.000	0.923	1.056
median_income	-0.0004	2.23e-05	-15.804	0.000	-0.000	-0.000
ln_pop_den	3.5657	0.142	25.074	0.000	3.287	3.845

Omnibus:	483.110	Durbin-Watson:	1.926
Prob(Omnibus):	0.000	Jarque-Bera (JB):	843.802
Skew:	1.002	Prob(JB):	5.90e-184
Kurtosis:	4.580	Cond. No.	2.25e+05

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.25e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

这给了我们新的回归函数：

pct_dem = 17.3905 + 0.9890 * pct_bach + -0.0004 *median_income + 3.5657 * ln_pop_den

解释：

%Bach，人口密度与%DEM呈正相关。
收入中位数与 % DEM 呈负相关。
增加 1% Bach将增加 0.9890% 的人投票给 DEM。
收入中位数每增加 1 美元，投票给 DEM 的人数就会减少 0.0004%。这也意味着增加 10,000 美元将减少 4%。
所有变量的 p 值都非常小 (<0.05)，表明所有系数/发现的关系在统计上都是显着的。
R2 值 = 0.433，表明该模型与模型 1 相比具有更好的解释力。

诊断图

线性回归的四个假设是什么？提示：线

残差与拟合图

#get the predicted value
pred = model_2.fittedvalues

#get the residual
resid = model_2.resid

shp_df.new_pct_dem

0       19.505057
1       66.111111
2       42.319648
3       22.505948
4       36.491793
          ...    
3103    28.144652
3104    63.766385
3105    58.965517
3106    32.751442
3107    46.070656
Name: new_pct_dem, Length: 3108, dtype: float64

plt.scatter(shp_df.new_pct_dem,pred)
plt.xlabel("True % DEM")
plt.ylabel("Predicted % DEM")

Text(0, 0.5, 'Predicted % DEM')

在这里插入图片描述

您可以看到任何明显的问题吗？

plt.scatter(pred, resid,alpha=0.5)

plt.axhline(y = 0.0, color = 'black', linestyle = '--')

plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.title("Model 2")

Text(0.5, 1.0, 'Model 2')

在这里插入图片描述

检查残余正态性。是否歪斜？

plt.hist(resid)

(array([1.000e+00, 9.300e+01, 8.650e+02, 1.137e+03, 6.410e+02, 2.330e+02,
        9.500e+01, 3.200e+01, 1.000e+01, 1.000e+00]),
 array([-38.91209715, -28.29712822, -17.6821593 ,  -7.06719038,
          3.54777854,  14.16274746,  24.77771638,  35.3926853 ,
         46.00765422,  56.62262314,  67.23759207]),
 <BarContainer object of 10 artists>)

在这里插入图片描述

然后我们来画一个 Q-Q 图。如果所有蓝点（残差）都位于代表完美正态分布的直线上，则期望是这样。很明显，残差不正常。

import scipy.stats as stats

stats.probplot(resid, dist="norm", plot=plt)

((array([-3.51125824, -3.26814139, -3.13373139, ...,  3.13373139,
          3.26814139,  3.51125824]),
  array([-38.91209715, -27.53724546, -25.76653482, ...,  52.7401659 ,
          53.71192504,  67.23759207])),
 (11.948687014006548, 8.341427973378232e-12, 0.9739250994350543))

在这里插入图片描述

完整模型

让我们在模型中添加更多变量

shp_df.columns

Index(['STATEFP', 'COUNTYFP', 'COUNTYNS', 'AFFGEOID', 'GEOID', 'NAME_x',
       'LSAD', 'ALAND', 'AWATER', 'geoid_int', 'geometry', 'county_id',
       'state', 'county', 'NAME_y', 'proj_X', 'proj_Y', 'total_pop',
       'new_pct_dem', 'sex_ratio', 'pct_black', 'pct_hisp', 'pct_bach',
       'median_income', 'pct_65_over', 'pct_age_18_29', 'gini', 'pct_manuf',
       'ln_pop_den', 'pct_3rd_party', 'turn_out', 'pct_fb', 'pct_uninsured'],
      dtype='object')

formula = 'new_pct_dem ~ 1 + sex_ratio + pct_black + pct_hisp + pct_bach + median_income + pct_65_over + pct_age_18_29 + gini + pct_manuf + ln_pop_den + pct_3rd_party + turn_out + pct_fb + pct_uninsured'

model_3 = smf.ols(formula=formula, data=shp_df).fit()

model_3.summary()

OLS Regression Results
Dep. Variable:	new_pct_dem	R-squared:	0.673
Model:	OLS	Adj. R-squared:	0.672
Method:	Least Squares	F-statistic:	455.1
Date:	Fri, 17 Nov 2023	Prob (F-statistic):	0.00
Time:	16:07:34	Log-Likelihood:	-11342.
No. Observations:	3108	AIC:	2.271e+04
Df Residuals:	3093	BIC:	2.280e+04
Df Model:	14
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-25.5366	4.843	-5.272	0.000	-35.033	-16.040
sex_ratio	0.0376	0.017	2.162	0.031	0.004	0.072
pct_black	0.5523	0.014	39.052	0.000	0.525	0.580
pct_hisp	0.3120	0.020	15.745	0.000	0.273	0.351
pct_bach	0.6819	0.039	17.542	0.000	0.606	0.758
median_income	-0.0002	2.58e-05	-8.428	0.000	-0.000	-0.000
pct_65_over	0.1804	0.058	3.103	0.002	0.066	0.294
pct_age_18_29	0.1371	0.067	2.036	0.042	0.005	0.269
gini	18.0909	6.040	2.995	0.003	6.249	29.933
pct_manuf	0.0487	0.030	1.646	0.100	-0.009	0.107
ln_pop_den	2.6681	0.154	17.328	0.000	2.366	2.970
pct_3rd_party	4.3480	0.234	18.582	0.000	3.889	4.807
turn_out	0.2263	0.026	8.750	0.000	0.176	0.277
pct_fb	0.1392	0.051	2.707	0.007	0.038	0.240
pct_uninsured	-0.3402	0.044	-7.683	0.000	-0.427	-0.253

Omnibus:	1000.163	Durbin-Watson:	1.910
Prob(Omnibus):	0.000	Jarque-Bera (JB):	8501.905
Skew:	1.283	Prob(JB):	0.00
Kurtosis:	10.686	Cond. No.	2.34e+06

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.34e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

该模型得到了进一步改进，R2 为 0.673。

#get the predicted value
pred_3 = model_3.fittedvalues

#get the residual
resid_3 = model_3.resid

plt.scatter(shp_df.new_pct_dem,pred_3)
plt.xlabel("True % DEM")
plt.ylabel("Predicted % DEM")

Text(0, 0.5, 'Predicted % DEM')

在这里插入图片描述

让我们重新制作这些诊断图，现在它们看起来好多了！

plt.scatter(pred_3, resid_3,alpha=0.5)

plt.axhline(y = 0.0, color = 'r', linestyle = '-')

plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.title("Model 3")

Text(0.5, 1.0, 'Model 3')

在这里插入图片描述

plt.hist(resid_3)

(array([3.000e+00, 1.200e+01, 5.960e+02, 1.866e+03, 5.510e+02, 6.300e+01,
        6.000e+00, 7.000e+00, 3.000e+00, 1.000e+00]),
 array([-47.50438204, -34.08283911, -20.66129618,  -7.23975325,
          6.18178968,  19.60333261,  33.02487554,  46.44641847,
         59.8679614 ,  73.28950433,  86.71104726]),
 <BarContainer object of 10 artists>)

在这里插入图片描述

QQ 图表明，除了一些极值之外，大多数点都在直线上。

import scipy.stats as stats
stats.probplot(resid_3, dist="norm", plot=plt)

((array([-3.51125824, -3.26814139, -3.13373139, ...,  3.13373139,
          3.26814139,  3.51125824]),
  array([-47.50438204, -43.62464376, -38.82786172, ...,  68.71420089,
          72.06270506,  86.71104726])),
 (8.974562038717135, 3.0034852088946733e-12, 0.963803701208645))

在这里插入图片描述