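GDP Analysis
Contents
- GDP Analysis
- 1 Analysis process and goals
- 1.1 Data sources
- 1.2 Getting familiar with the data
- 2 GDP data analysis by country and region
- 2.2 Cleaning the data
- 2.3 Setting the analysis goals
- 3 GDP analysis of major countries
- 3.1 GDP trends of major countries
- 3.2 GDP comparison from 1990 onward
- 4 China GDP analysis
- 4.1 GDP change from 1990 onward
- 4.2 Years in which China's GDP grew by more than 10%
- 4.2.1 Approaches to computing the yearly growth rate
- 4.2.2 Getting the years with growth above 10%
- 4.3 Highest cumulative growth over 5 consecutive years
- 4.3.1 rolling: the moving-window method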
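1 Analysis process and goals
- Data sources
- Getting familiar with the data
- Analysis process
- Presenting the results
- Techniques used and code implementation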
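1.1 Data sources
- Data collected inside the company: web front ends, mini programs, Android or iOS apps, smart devices (smart electricity meters, temperature sensors, etc.)
- Open data platforms: the National Bureau of Statistics, World Bank data, etc.
- Third-party datasets: competition platforms such as Kaggle
- Third-party data scraped with web crawlers
- The data may be combined from multiple sources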
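1.2 Getting familiar with the data
- Display the data with a tool
- Check the data fields
- When there are several data sources, examine how they relate to each other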
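As a rough illustration of these three steps, the sketch below inspects a small table and relates it to a second one through a shared key. Both tables, their column names (`code`, `gdp`, `region`) and all values are made up for the example; they are not part of the GDP dataset used later.

```python
import pandas as pd

# Toy tables standing in for two data sources (all values are made up).
gdp = pd.DataFrame({"code": ["AAA", "BBB", "CCC"], "gdp": [1.0, 2.0, None]})
regions = pd.DataFrame({"code": ["AAA", "BBB"], "region": ["North", "South"]})

# Display a sample, then check the size, field names and field types.
print(gdp.head())
shape = gdp.shape
fields = list(gdp.columns)
print(shape, fields)
print(gdp.dtypes)

# Count missing values per field.
missing = gdp.isna().sum()

# Relate the two sources through their shared key. indicator=True adds a
# "_merge" column that records which rows found a match.
joined = gdp.merge(regions, on="code", how="left", indicator=True)
unmatched = joined.loc[joined["_merge"] == "left_only", "code"].tolist()
print(unmatched)
```

Rows listed in `unmatched` exist in the first source but have no counterpart in the second, which is usually the first thing worth knowing before combining sources.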
2 GDP data analysis by country and region
import pandas as pd
import numpy as np
%matplotlib inline
# Read the CSV file
fpath = r'data\GDP.csv'
with open(fpath) as f:
    pdata = pd.read_csv(f)
pdata
Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.791961e+09 | 2.498933e+09 | 2.467704e+09 | 2.584464e+09 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Afghanistan | AFG | GDP (current US$) | NY.GDP.MKTP.CD | 5.377778e+08 | 5.488889e+08 | 5.466667e+08 | 7.511112e+08 | 8.000000e+08 | 1.006667e+09 | ... | 1.019053e+10 | 1.248694e+10 | 1.593680e+10 | 1.793024e+10 | 2.053654e+10 | 2.004633e+10 | 2.005019e+10 | 1.921556e+10 | 1.946902e+10 | NaN |
2 | Angola | AGO | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 8.417803e+10 | 7.549238e+10 | 8.247091e+10 | 1.041160e+11 | 1.153980e+11 | 1.249120e+11 | 1.267770e+11 | 1.029620e+11 | 9.533511e+10 | NaN |
3 | Albania | ALB | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.288135e+10 | 1.204421e+10 | 1.192695e+10 | 1.289087e+10 | 1.231978e+10 | 1.277628e+10 | 1.322824e+10 | 1.133526e+10 | 1.186387e+10 | NaN |
4 | Andorra | AND | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.007353e+09 | 3.660531e+09 | 3.355695e+09 | 3.442063e+09 | 3.164615e+09 | 3.281585e+09 | 3.350736e+09 | 2.811489e+09 | 2.858518e+09 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
259 | Kosovo | XKX | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5.687488e+09 | 5.653793e+09 | 5.829934e+09 | 6.649291e+09 | 6.473725e+09 | 7.072092e+09 | 7.386891e+09 | 6.440501e+09 | 6.649889e+09 | NaN |
260 | Yemen, Rep. | YEM | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.691085e+10 | 2.513027e+10 | 3.090675e+10 | 3.272642e+10 | 3.539315e+10 | 4.041523e+10 | 4.322858e+10 | 3.773392e+10 | 2.731761e+10 | NaN |
261 | South Africa | ZAF | GDP (current US$) | NY.GDP.MKTP.CD | 7.575248e+09 | 7.972841e+09 | 8.497830e+09 | 9.423212e+09 | 1.037379e+10 | 1.133417e+10 | ... | 2.871000e+11 | 2.972170e+11 | 3.752980e+11 | 4.168780e+11 | 3.963330e+11 | 3.668100e+11 | 3.511190e+11 | 3.176110e+11 | 2.954560e+11 | NaN |
262 | Zambia | ZMB | GDP (current US$) | NY.GDP.MKTP.CD | 7.130000e+08 | 6.962857e+08 | 6.931429e+08 | 7.187143e+08 | 8.394286e+08 | 1.082857e+09 | ... | 1.791086e+10 | 1.532834e+10 | 2.026556e+10 | 2.346010e+10 | 2.550337e+10 | 2.804546e+10 | 2.715063e+10 | 2.115439e+10 | 2.106399e+10 | NaN |
263 | Zimbabwe | ZWE | GDP (current US$) | NY.GDP.MKTP.CD | 1.052990e+09 | 1.096647e+09 | 1.117602e+09 | 1.159512e+09 | 1.217138e+09 | 1.311436e+09 | ... | 4.415703e+09 | 8.621574e+09 | 1.014186e+10 | 1.209845e+10 | 1.424249e+10 | 1.545177e+10 | 1.589105e+10 | 1.630467e+10 | 1.661996e+10 | NaN |
264 rows × 62 columns
2.2 Cleaning the data
Inspect the data and drop the columns that are not needed.
pdata.columns
Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
'1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
'1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
'1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
'1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
'1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
'2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
'2014', '2015', '2016', '2017'],
dtype='object')
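# Drop the columns that are not needed for the analysis
pdata = pdata.drop(['Country Code','Indicator Name', 'Indicator Code'], axis=1)
# Use the country name as the index
pdata = pdata.set_index('Country Name')
# Reshape from wide to long: the year columns become an inner index level
pdata = pdata.stack()
pdata = pd.DataFrame(pdata)
pdata.columns = ['GDP']
pdata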
Country Name | Year | GDP |
---|---|---|
Aruba | 1994 | 1.330168e+09 |
Aruba | 1995 | 1.320670e+09 |
Aruba | 1996 | 1.379888e+09 |
Aruba | 1997 | 1.531844e+09 |
Aruba | 1998 | 1.665363e+09 |
... | ... | ... |
Zimbabwe | 2012 | 1.424249e+10 |
Zimbabwe | 2013 | 1.545177e+10 |
Zimbabwe | 2014 | 1.589105e+10 |
Zimbabwe | 2015 | 1.630467e+10 |
Zimbabwe | 2016 | 1.661996e+10 |
11507 rows × 1 columns
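To see what the reshaping step does on a smaller scale, here is a minimal sketch with a made-up two-country table. The explicit `dropna()` is an addition of this sketch: it removes missing years regardless of how the installed pandas version's `stack()` treats NaN by default.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per country, one column per year (values made up).
wide = pd.DataFrame(
    {"2015": [np.nan, 3.0], "2016": [1.0, 4.0]},
    index=pd.Index(["A", "B"], name="Country Name"),
)

# stack() moves the year columns into an inner index level, giving one row
# per (country, year) pair. dropna() then removes the years without a value.
long = wide.stack().dropna().to_frame("GDP")
print(long)

# Selecting by the outer level returns all years of one country.
b_rows = long.loc["B"]
print(b_rows)
```

After this step, `long.loc["B"]` is indexed by year only, which is the shape the plotting code in the following sections relies on.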
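2.3 Setting the analysis goals
- How GDP has changed in the major countries
- How GDP has changed in the major countries from 1990 onward
- China's yearly GDP and cumulative GDP from 1990 onward
- Years from 1990 onward in which China's GDP grew by more than 10%
- The 5 consecutive years in which China's GDP growth was highest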
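3 GDP analysis of major countries
Selected countries: ['China', 'Japan','United States', 'Germany', 'France', 'United Kingdom']
3.1 GDP trends of major countries
Question: which chart type best represents a trend in the data? A line chart.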
import matplotlib
import matplotlib.pyplot as plt
# Use a font that can render Chinese characters in labels
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
# Keep the minus sign readable with this font
matplotlib.rcParams['axes.unicode_minus']=False
# Set the figure size
plt.figure(1,figsize=(15, 4))
countrys = ['China', 'Japan','United States', 'Germany', 'France', 'United Kingdom']
for c in countrys:
    # One line per country, indexed by year
    plt.plot(pdata.loc[c])
plt.title("GDP growth trends of major countries")
plt.legend(countrys)
_ = plt.xticks(rotation=90)
3.2 GDP comparison from 1990 onward
import matplotlib.pyplot as plt
plt.figure(1,figsize=(15, 4))
countrys = ['China', 'Japan','United States', 'Germany', 'France', 'United Kingdom']
for c in countrys:
    # Select the country, then slice the years from 1990 onward
    plt.plot(pdata.loc[c]['1990':])
plt.title("GDP growth trends of major countries, 1990-2016")
plt.legend(countrys)
_ = plt.xticks(rotation=90)
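The year selection above is a label-based slice on an index of year strings. A minimal sketch of how such slices behave, using a made-up series and `.loc` for explicitness:

```python
import pandas as pd

# Toy series indexed by year strings (values made up).
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["1988", "1989", "1990", "1991"])

# Label slices select by index label, and both ends are inclusive.
from_1990 = s.loc["1990":]
window = s.loc["1989":"1990"]
print(from_1990)
print(window)
```

Unlike positional slices, the end label is part of the result, so `"1989":"1990"` returns two rows.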
4 China GDP analysis
4.1 GDP change from 1990 onward
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['SimHei']
_ = plt.figure(1,figsize=(15, 4))
# Countries to plot
countrys = ['China']
# Scale factor: 10000*10000*10 = 1e9, so values are plotted in billions of US$
base1 = 10000*10000*10
for c in countrys:
    # China, years from 1990 onward
    data = pdata.loc[c]['1990':]
    plt.plot(data/base1, label='Yearly value')
    # Cumulative value
    plt.plot(data.cumsum()/base1,label='Cumulative GDP', color='r')
plt.legend()
_ = plt.xticks(rotation=90)
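The red curve in this plot comes from `cumsum()`, which replaces each value with the running total up to and including that year. A small sketch with made-up values, which also checks the scale factor used above:

```python
import pandas as pd

# Toy yearly values (made up).
yearly = pd.Series([10.0, 20.0, 30.0], index=["1990", "1991", "1992"])

# Each entry becomes the sum of all values up to and including that year.
cumulative = yearly.cumsum()
print(cumulative.tolist())  # [10.0, 30.0, 60.0]

# The divisor used for plotting equals one billion.
base = 10000 * 10000 * 10
print(base == 10**9)  # True
```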
4.2 Years in which China's GDP grew by more than 10%
Question: which technique can be used to compute the growth rate for each year?
# Work on a copy so that adding a column later does not touch pdata
gdp = pdata.loc['China']['1990':].copy()
gdp
Year | GDP |
---|---|
1990 | 3.608580e+11 |
1991 | 3.833730e+11 |
1992 | 4.269160e+11 |
1993 | 4.447310e+11 |
1994 | 5.643250e+11 |
1995 | 7.345480e+11 |
1996 | 8.637470e+11 |
1997 | 9.616040e+11 |
1998 | 1.029040e+12 |
1999 | 1.094000e+12 |
2000 | 1.211350e+12 |
2001 | 1.339400e+12 |
2002 | 1.470550e+12 |
2003 | 1.660290e+12 |
2004 | 1.955350e+12 |
2005 | 2.285970e+12 |
2006 | 2.752130e+12 |
2007 | 3.552180e+12 |
2008 | 4.598210e+12 |
2009 | 5.109950e+12 |
2010 | 6.100620e+12 |
2011 | 7.572550e+12 |
2012 | 8.560550e+12 |
2013 | 9.607220e+12 |
2014 | 1.048240e+13 |
2015 | 1.106470e+13 |
2016 | 1.119910e+13 |
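4.2.1 Approaches to computing the yearly growth rate
Approach 1: iterate in a loop: (second year - first year) / first year
Approach 2: compute with numpy
- First-year data tmp1: [start : end-1]
- Second-year data tmp2: [second year : end]
- Result: (tmp2 - tmp1) / tmp1 * 100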
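Approach 1 is not implemented in the text, so here is a minimal sketch of it in plain Python. The helper name `growth_rates` is hypothetical; the sample values are the first three rows of China's GDP table above.

```python
def growth_rates(values):
    """Return year-over-year growth in percent; the first entry is 0."""
    rates = [0.0]
    # Pair each year with the following one: (1990, 1991), (1991, 1992), ...
    for prev, curr in zip(values, values[1:]):
        rates.append((curr - prev) / prev * 100)
    return rates

# China's GDP for 1990, 1991 and 1992, taken from the table above.
sample = [3.608580e11, 3.833730e11, 4.269160e11]
rates = growth_rates(sample)
print([round(r, 2) for r in rates])  # [0.0, 6.24, 11.36]
```

The loop is easy to read but slow on large data, which is why the vectorized Approach 2 is the one used next.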
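# Get the "first year" and "second year" series, in millions of US$
tmp1 = gdp.loc[:'2015']['GDP']/1000000
tmp2 = gdp.loc['1991':]['GDP']/1000000
# Convert to integers
tmp1 = tmp1.astype('i')
tmp2 = tmp2.astype('i')
# Compute the growth rate in percent
index = (tmp2.values-tmp1.values)/tmp1.values*100
# Insert 0 for the first year
grow = np.insert(index, 0, 0)
# Add the result as a new column
gdp['grow'] = grow
gdp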
Year | GDP | grow |
---|---|---|
1990 | 3.608580e+11 | 0.000000 |
1991 | 3.833730e+11 | 6.239296 |
1992 | 4.269160e+11 | 11.357868 |
1993 | 4.447310e+11 | 4.172952 |
1994 | 5.643250e+11 | 26.891312 |
1995 | 7.345480e+11 | 30.164001 |
1996 | 8.637470e+11 | 17.588912 |
1997 | 9.616040e+11 | 11.329359 |
1998 | 1.029040e+12 | 7.012866 |
1999 | 1.094000e+12 | 6.312680 |
2000 | 1.211350e+12 | 10.726691 |
2001 | 1.339400e+12 | 10.570851 |
2002 | 1.470550e+12 | 9.791698 |
2003 | 1.660290e+12 | 12.902655 |
2004 | 1.955350e+12 | 17.771594 |
2005 | 2.285970e+12 | 16.908482 |
2006 | 2.752130e+12 | 20.392219 |
2007 | 3.552180e+12 | 29.070211 |
2008 | 4.598210e+12 | 29.447551 |
2009 | 5.109950e+12 | 11.129113 |
2010 | 6.100620e+12 | 19.387078 |
2011 | 7.572550e+12 | 24.127548 |
2012 | 8.560550e+12 | 13.047124 |
2013 | 9.607220e+12 | 12.226668 |
2014 | 1.048240e+13 | 9.109607 |
2015 | 1.106470e+13 | 5.555026 |
2016 | 1.119910e+13 | 1.214674 |
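As an alternative to the manual shift-and-divide above, pandas has a built-in `pct_change()` that computes the same ratio. This is a different technique from the one used in the text, shown here only for comparison, on the first four GDP values from the table. Because it skips the integer conversion, the results can differ from the table in the later decimal places.

```python
import pandas as pd

# China's GDP for 1990-1993, taken from the table above.
gdp_sample = pd.Series(
    [3.608580e11, 3.833730e11, 4.269160e11, 4.447310e11],
    index=["1990", "1991", "1992", "1993"],
)

# pct_change() returns (current - previous) / previous as a fraction.
grow = gdp_sample.pct_change() * 100

# The first year has no predecessor, so its NaN is replaced with 0.
grow = grow.fillna(0)
print(grow.round(2))
```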
4.2.2 Getting the years with growth above 10%
vals = gdp[gdp.grow > 10]
vals
Year | GDP | grow |
---|---|---|
1992 | 4.269160e+11 | 11.357868 |
1994 | 5.643250e+11 | 26.891312 |
1995 | 7.345480e+11 | 30.164001 |
1996 | 8.637470e+11 | 17.588912 |
1997 | 9.616040e+11 | 11.329359 |
2000 | 1.211350e+12 | 10.726691 |
2001 | 1.339400e+12 | 10.570851 |
2003 | 1.660290e+12 | 12.902655 |
2004 | 1.955350e+12 | 17.771594 |
2005 | 2.285970e+12 | 16.908482 |
2006 | 2.752130e+12 | 20.392219 |
2007 | 3.552180e+12 | 29.070211 |
2008 | 4.598210e+12 | 29.447551 |
2009 | 5.109950e+12 | 11.129113 |
2010 | 6.100620e+12 | 19.387078 |
2011 | 7.572550e+12 | 24.127548 |
2012 | 8.560550e+12 | 13.047124 |
2013 | 9.607220e+12 | 12.226668 |
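4.3 Highest cumulative growth over 5 consecutive years
Problem: growth over 5 consecutive years = year 1 + year 2 + year 3 + ... + year 5
How do we compute these sums and find the highest one?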
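Before turning to pandas, the idea can be sketched in plain Python: slide a window of fixed size over the values, sum each window, and keep the position of the largest sum. The helper name `window_sums` and the sample values are made up for this sketch.

```python
def window_sums(values, size):
    """Sum of each run of `size` consecutive values, oldest window first."""
    return [sum(values[i:i + size]) for i in range(len(values) - size + 1)]

# Toy growth rates (made up).
rates = [1, 2, 3, 4, 5, 6, 7]
sums = window_sums(rates, 5)
print(sums)  # [15, 20, 25]

# Position of the best window, and the position of its last element.
best = max(range(len(sums)), key=sums.__getitem__)
end_pos = best + 5 - 1
print(best, end_pos)  # 2 6
```

The `rolling` method introduced next does the same windowing, labelling each sum by the last element of its window.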
4.3.1 rolling: the moving-window method
Use cases: finance, stock analysis, statistics, and similar fields
# Maximum of every 2 consecutive values
tmp = pd.Series([1,2,3,1,1,0])
tmp.rolling(2).max()
0 NaN
1 2.0
2 3.0
3 3.0
4 1.0
5 1.0
dtype: float64
tmp = pd.Series([1,2,3,1,1,0])
print(tmp.rolling(2).min_periods)
None
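The `None` above means `min_periods` was not set. For an integer window, pandas then requires a full window before it produces a value, which is why the first result earlier is NaN. A short sketch of the difference when `min_periods` is lowered:

```python
import pandas as pd

tmp = pd.Series([1, 2, 3, 1, 1, 0])

# Default: a result needs 2 observations, so position 0 is NaN.
strict = tmp.rolling(2).max()

# min_periods=1: a result is produced as soon as 1 observation is available.
relaxed = tmp.rolling(2, min_periods=1).max()

print(strict.tolist())
print(relaxed.tolist())  # [1.0, 2.0, 3.0, 3.0, 1.0, 1.0]
```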
tmp = gdp.rolling(5).sum()
tmp
Year | GDP | grow |
---|---|---|
1990 | NaN | NaN |
1991 | NaN | NaN |
1992 | NaN | NaN |
1993 | NaN | NaN |
1994 | 2.180203e+12 | 48.661428 |
1995 | 2.553893e+12 | 78.825430 |
1996 | 3.034267e+12 | 90.175045 |
1997 | 3.568955e+12 | 90.146536 |
1998 | 4.153264e+12 | 92.986450 |
1999 | 4.682939e+12 | 72.407818 |
2000 | 5.159741e+12 | 52.970508 |
2001 | 5.635394e+12 | 45.952447 |
2002 | 6.144340e+12 | 44.414785 |
2003 | 6.775590e+12 | 50.304575 |
2004 | 7.636940e+12 | 61.763489 |
2005 | 8.711560e+12 | 67.945280 |
2006 | 1.012429e+13 | 77.766648 |
2007 | 1.220592e+13 | 97.045161 |
2008 | 1.514384e+13 | 113.590056 |
2009 | 1.829844e+13 | 106.947575 |
2010 | 2.211309e+13 | 109.426172 |
2011 | 2.693351e+13 | 113.161501 |
2012 | 3.194188e+13 | 97.138414 |
2013 | 3.695089e+13 | 79.917531 |
2014 | 4.232334e+13 | 77.898025 |
2015 | 4.728742e+13 | 64.065972 |
2016 | 5.091397e+13 | 41.153098 |
tmp[tmp['grow']==tmp.grow.max()]
Year | GDP | grow |
---|---|---|
2008 | 1.514384e+13 | 113.590056 |
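If only the year is needed rather than the whole row, `idxmax()` returns the index label of the largest value directly. A minimal sketch with made-up growth rates and a 3-year window to keep it short:

```python
import pandas as pd

# Toy growth rates indexed by year (values made up).
grow = pd.Series(
    [5.0, 1.0, 4.0, 6.0, 2.0, 3.0],
    index=["2001", "2002", "2003", "2004", "2005", "2006"],
)

# Each rolling sum is labelled with the last year of its window.
rolled = grow.rolling(3).sum()

# idxmax() skips the leading NaN entries and returns the label of the maximum.
best_year = rolled.idxmax()
best_value = rolled.max()
print(best_year, best_value)  # 2005 12.0
```

Note that the label marks the end of the window: in this toy case the best 3-year run is 2003 to 2005. By the same reading, the 2008 row above corresponds to the 5-year window 2004 to 2008.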