Pandas: Combining Data from Multiple Tables
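This section uses the data file data/air_quality_no2_long.csv, available at: pandas案例和教程所使用的数据-机器学习文档类资源-CSDN文库 (the data used in the pandas examples and tutorials, hosted on the CSDN library)
Loading the data
- NO2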
import pandas as pd
air_quality_no2 = pd.read_csv("data/air_quality_no2_long.csv",
                              parse_dates=True)
air_quality_no2 = air_quality_no2[["date.utc", "location",
                                   "parameter", "value"]]
air_quality_no2
date.utc location parameter value
0 2019-06-21 00:00:00+00:00 FR04014 no2 20.0
1 2019-06-20 23:00:00+00:00 FR04014 no2 21.8
2 2019-06-20 22:00:00+00:00 FR04014 no2 26.5
3 2019-06-20 21:00:00+00:00 FR04014 no2 24.9
4 2019-06-20 20:00:00+00:00 FR04014 no2 21.4
... ... ... ... ...
2063 2019-05-07 06:00:00+00:00 London Westminster no2 26.0
2064 2019-05-07 04:00:00+00:00 London Westminster no2 16.0
2065 2019-05-07 03:00:00+00:00 London Westminster no2 19.0
2066 2019-05-07 02:00:00+00:00 London Westminster no2 19.0
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
[2068 rows x 4 columns]
- PM2.5
air_quality_pm25 = pd.read_csv("data/air_quality_pm25_long.csv",
                               parse_dates=True)
air_quality_pm25 = air_quality_pm25[["date.utc", "location",
                                     "parameter", "value"]]
air_quality_pm25
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
... ... ... ... ...
1105 2019-05-07 06:00:00+00:00 London Westminster pm25 9.0
1106 2019-05-07 04:00:00+00:00 London Westminster pm25 8.0
1107 2019-05-07 03:00:00+00:00 London Westminster pm25 8.0
1108 2019-05-07 02:00:00+00:00 London Westminster pm25 8.0
1109 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0
[1110 rows x 4 columns]
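A note on the loading code above: parse_dates=True only asks read_csv to try parsing the index, so the date.utc column is still read as text. The sketch below uses a hypothetical two-row sample (not the real file) to compare that with naming the column explicitly. Whether the conversion is needed depends on what you do with the column afterwards.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the CSV files above.
csv_text = (
    "date.utc,location,parameter,value\n"
    "2019-06-21 00:00:00+00:00,FR04014,no2,20.0\n"
    "2019-06-20 23:00:00+00:00,FR04014,no2,21.8\n"
)

# parse_dates=True only tries to parse the index, so the column stays text.
raw = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# Naming the column explicitly converts it to a timezone-aware datetime.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date.utc"])

print(raw["date.utc"].dtype)
print(parsed["date.utc"].dtype)
```

Because the timestamps are in ISO format, sorting the text column still gives chronological order, but datetime operations such as resampling need the parsed version.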
Concatenating data (concat)
- Concatenate two tables that have the same structure into a single table
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)
air_quality
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
... ... ... ... ...
2063 2019-05-07 06:00:00+00:00 London Westminster no2 26.0
2064 2019-05-07 04:00:00+00:00 London Westminster no2 16.0
2065 2019-05-07 03:00:00+00:00 London Westminster no2 19.0
2066 2019-05-07 02:00:00+00:00 London Westminster no2 19.0
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
[3178 rows x 4 columns]
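concat joins the two tables. By default it concatenates along the rows (axis=0, so the number of rows grows); it can also be set to concatenate along the columns (axis=1, so the number of columns grows).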
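To make the two directions concrete, here is a minimal sketch on two tiny hand-made tables (the values are illustrative, not read from the CSV files). It also shows two optional arguments of concat that the text above does not use: ignore_index, which renumbers the result, and keys, which records the table each row came from.

```python
import pandas as pd

pm25 = pd.DataFrame({"location": ["BETR801", "BETR801"], "value": [18.0, 6.5]})
no2 = pd.DataFrame({"location": ["FR04014", "FR04014"], "value": [20.0, 21.8]})

# axis=0 (default): stack rows; original index labels are kept (0, 1, 0, 1).
by_rows = pd.concat([pm25, no2], axis=0)

# ignore_index=True renumbers the result 0..n-1 instead.
renumbered = pd.concat([pm25, no2], axis=0, ignore_index=True)

# keys=... adds an outer index level recording each row's source table.
labelled = pd.concat([pm25, no2], keys=["PM25", "NO2"])

# axis=1: place the tables side by side, aligned on the index.
by_cols = pd.concat([pm25, no2], axis=1)

print(by_rows.shape, by_cols.shape)
```

The repeated index labels in by_rows are the same effect that makes the concatenated air_quality table end at index 2067 although it has 3178 rows.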
- Sort by date
air_quality = air_quality.sort_values("date.utc")
air_quality.head()
date.utc location parameter value
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
1003 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
100 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
1098 2019-05-07 01:00:00+00:00 BETR801 no2 50.5
1109 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0
Joining tables on a shared identifier (merge)
stations_coord = pd.read_csv("data/air_quality_stations.csv")
stations_coord
location coordinates.latitude coordinates.longitude
0 BELAL01 51.23619 4.38522
1 BELHB23 51.17030 4.34100
2 BELLD01 51.10998 5.00486
3 BELLD02 51.12038 5.02155
4 BELR833 51.32766 4.36226
.. ... ... ...
61 Southend-on-Sea 51.54420 0.67841
62 Southwark A2 Old Kent Road 51.48050 -0.05955
63 Thurrock 51.47707 0.31797
64 Tower Hamlets Roadside 51.52253 -0.04216
65 Groton Fort Griswold 41.35360 -72.07890
[66 rows x 3 columns]
air_quality_merge = pd.merge(air_quality, stations_coord, how="left", on="location")
air_quality_merge
date.utc location parameter value \
0 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
1 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
2 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
3 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
4 2019-05-07 01:00:00+00:00 BETR801 no2 50.5
... ... ... ... ...
4177 2019-06-20 23:00:00+00:00 FR04014 no2 21.8
4178 2019-06-20 23:00:00+00:00 FR04014 no2 21.8
4179 2019-06-21 00:00:00+00:00 London Westminster pm25 7.0
4180 2019-06-21 00:00:00+00:00 FR04014 no2 20.0
4181 2019-06-21 00:00:00+00:00 FR04014 no2 20.0
coordinates.latitude coordinates.longitude
0 51.49467 -0.13193
1 48.83724 2.39390
2 48.83722 2.39390
3 51.20966 4.43182
4 51.20966 4.43182
... ... ...
4177 48.83724 2.39390
4178 48.83722 2.39390
4179 51.49467 -0.13193
4180 48.83724 2.39390
4181 48.83722 2.39390
[4182 rows x 6 columns]
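With merge, the matching coordinates from the stations_coord table are added to each row of the air_quality table. Both tables contain a column named location, which serves as the key for combining the information. Because how="left" is used, every row of air_quality (the left table) is kept. Note that the result has 4182 rows rather than 3178: judging from the output, FR04014 matches two rows in the stations table (latitudes 48.83724 and 48.83722), so each of its measurements appears twice. The merge function supports several join options, similar to database-style operations.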
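The following is a small sketch of how the join type and duplicate keys affect the row count, using hand-made tables rather than the real files. The "Unknown" station and the shortened latitude column name are made up for illustration. The validate argument of merge is one way to catch unexpected duplicate keys early.

```python
import pandas as pd

measurements = pd.DataFrame({
    "location": ["FR04014", "BETR801", "Unknown"],
    "value": [25.0, 50.5, 1.0],
})
# FR04014 is listed twice on purpose, with slightly different coordinates.
stations = pd.DataFrame({
    "location": ["FR04014", "FR04014", "BETR801"],
    "latitude": [48.83724, 48.83722, 51.20966],
})

# how="left" keeps every row of the left table; unmatched keys get NaN.
left = pd.merge(measurements, stations, how="left", on="location")

# how="inner" keeps only keys that are present in both tables.
inner = pd.merge(measurements, stations, how="inner", on="location")

# validate="many_to_one" raises if keys in the right table are not unique.
try:
    pd.merge(measurements, stations, how="left", on="location",
             validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

print(len(measurements), len(left), len(inner), duplicate_keys_detected)
```

Here the left merge returns more rows than the left table has, because one key matches two rows on the right. If that is not intended, removing the duplicates from the lookup table before merging is one option.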
air_quality_parameters = pd.read_csv("data/air_quality_parameters.csv")
air_quality_merge2 = pd.merge(air_quality_merge, air_quality_parameters,
                              how="left", left_on="parameter", right_on="id")
air_quality_merge2
date.utc location parameter value \
0 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
1 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
2 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
3 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
4 2019-05-07 01:00:00+00:00 BETR801 no2 50.5
... ... ... ... ...
4177 2019-06-20 23:00:00+00:00 FR04014 no2 21.8
4178 2019-06-20 23:00:00+00:00 FR04014 no2 21.8
4179 2019-06-21 00:00:00+00:00 London Westminster pm25 7.0
4180 2019-06-21 00:00:00+00:00 FR04014 no2 20.0
4181 2019-06-21 00:00:00+00:00 FR04014 no2 20.0
coordinates.latitude coordinates.longitude id \
0 51.49467 -0.13193 no2
1 48.83724 2.39390 no2
2 48.83722 2.39390 no2
3 51.20966 4.43182 pm25
4 51.20966 4.43182 no2
... ... ... ...
4177 48.83724 2.39390 no2
4178 48.83722 2.39390 no2
4179 51.49467 -0.13193 pm25
4180 48.83724 2.39390 no2
4181 48.83722 2.39390 no2
...
4179 Particulate matter less than 2.5 micrometers i... PM2.5
4180 Nitrogen Dioxide NO2
4181 Nitrogen Dioxide NO2
[4182 rows x 9 columns]
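In the merge above the key columns have different names in the two tables (parameter on the left, id on the right), which is why left_on and right_on are used instead of on. A minimal sketch with hand-made tables follows. The name column of the lookup table is an assumption, since the output above truncates the header of that part.

```python
import pandas as pd

measurements = pd.DataFrame({
    "parameter": ["no2", "pm25", "no2"],
    "value": [23.0, 12.5, 25.0],
})
# Hypothetical lookup table; the "name" column is assumed for illustration.
parameters = pd.DataFrame({
    "id": ["no2", "pm25"],
    "name": ["NO2", "PM2.5"],
})

# The key columns are named differently, so each side is given explicitly.
merged = pd.merge(measurements, parameters,
                  how="left", left_on="parameter", right_on="id")

# Both key columns survive the merge; drop the redundant one if desired.
tidy = merged.drop(columns="id")
print(tidy)
```

This also explains why the merged output above contains both a parameter column and an id column holding the same values.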
Remember
- Multiple tables can be concatenated with the concat function, either column-wise or row-wise.
- For database-like merging/joining of tables, use the merge function.
Reference
- How to combine data from multiple tables? — pandas 1.5.2 documentation (pydata.org)