Pandas: reshaping the layout of tables
This section uses the data file data/titanic.csv, available from the CSDN library resource "pandas案例和教程所使用的数据" (data used in the pandas examples and tutorials).
- Import the data
import pandas as pd
titanic = pd.read_csv("data/titanic.csv")
titanic.head()
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Sorting table rows
- Sort the Titanic passengers by age
titanic.sort_values(by="Age").head()
PassengerId Survived Pclass Name Sex \
803 804 1 3 Thomas, Master. Assad Alexander male
755 756 1 2 Hamalainen, Master. Viljo male
644 645 1 3 Baclini, Miss. Eugenie female
469 470 1 3 Baclini, Miss. Helene Barbara female
78 79 1 2 Caldwell, Master. Alden Gates male
Age SibSp Parch Ticket Fare Cabin Embarked
803 0.42 0 1 2625 8.5167 NaN C
755 0.67 1 1 250649 14.5000 NaN S
644 0.75 2 1 2666 19.2583 NaN C
469 0.75 2 1 2666 19.2583 NaN C
78 0.83 0 2 248738 29.0000 NaN S
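- Sort the Titanic data by cabin class and age, in descending order
# Sort by Pclass first, then by Age within each class, both descending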
sorted_agepclass = titanic.sort_values(by=['Pclass', 'Age'], ascending=False)
sorted_agepclass.head()
PassengerId Survived Pclass Name Sex Age \
851 852 0 3 Svensson, Mr. Johan male 74.0
116 117 0 3 Connors, Mr. Patrick male 70.5
280 281 0 3 Duane, Mr. Frank male 65.0
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0
326 327 0 3 Nysveen, Mr. Johan Hansen male 61.0
SibSp Parch Ticket Fare Cabin Embarked
851 0 0 347060 7.7750 NaN S
116 0 0 370369 7.7500 NaN Q
280 0 0 336439 7.7500 NaN Q
483 0 0 4134 9.5875 NaN S
326 0 0 345364 6.2375 NaN S
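DataFrame.sort_values() sorts the rows of a table by the specified column(s). The index labels stay attached to their rows, so the index follows the new row order.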
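As a minimal sketch of the points above, the following example sorts a small made-up table with a different direction per column and shows that the index labels travel with the rows. The `passengers` frame and its values are hypothetical and are not taken from titanic.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from titanic.csv
passengers = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Pclass": [3, 1, 3, 1],
        "Age": [22.0, 38.0, None, 35.0],
    }
)

# Pclass ascending, Age descending; a missing Age goes last within its class
by_class_then_age = passengers.sort_values(
    by=["Pclass", "Age"], ascending=[True, False], na_position="last"
)
print(by_class_then_age)

# The original index labels moved with the rows; reset them if a
# fresh 0..n-1 index is preferred
renumbered = by_class_then_age.reset_index(drop=True)
print(renumbered)
```

Passing a list to `ascending` gives each sort key its own direction, which a single `ascending=False` cannot express.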
Long to wide table format
We focus on the $NO_2$ measurements. First, load the air quality data, which is stored in long format:
air_quality = pd.read_csv(
"data/air_quality_long.csv", index_col="date.utc", parse_dates=True
)
air_quality.head()
city country location parameter value unit
date.utc
2019-06-18 06:00:00+00:00 Antwerpen BE BETR801 pm25 18.0 µg/m³
2019-06-17 08:00:00+00:00 Antwerpen BE BETR801 pm25 6.5 µg/m³
2019-06-17 07:00:00+00:00 Antwerpen BE BETR801 pm25 18.5 µg/m³
2019-06-17 06:00:00+00:00 Antwerpen BE BETR801 pm25 16.0 µg/m³
2019-06-17 05:00:00+00:00 Antwerpen BE BETR801 pm25 7.5 µg/m³
# Keep only the no2 measurements
no2 = air_quality[air_quality["parameter"] == "no2"]
# Sort by time, group by location, and keep the first two measurements of each group
no2_subset = no2.sort_index().groupby(['location']).head(2)
no2_subset.head()
city country location parameter \
date.utc
2019-04-09 01:00:00+00:00 Antwerpen BE BETR801 no2
2019-04-09 01:00:00+00:00 Paris FR FR04014 no2
2019-04-09 02:00:00+00:00 London GB London Westminster no2
2019-04-09 02:00:00+00:00 Antwerpen BE BETR801 no2
2019-04-09 02:00:00+00:00 Paris FR FR04014 no2
value unit
date.utc
2019-04-09 01:00:00+00:00 22.5 µg/m³
2019-04-09 01:00:00+00:00 24.4 µg/m³
2019-04-09 02:00:00+00:00 67.0 µg/m³
2019-04-09 02:00:00+00:00 53.5 µg/m³
2019-04-09 02:00:00+00:00 27.4 µg/m³
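The `groupby(...).head(2)` step used above does not aggregate anything: it keeps the first two rows of each group and leaves the surviving rows in their existing order. A minimal sketch with made-up data (the `readings` frame is hypothetical):

```python
import pandas as pd

# Hypothetical readings with an unsorted integer index standing in for timestamps
readings = pd.DataFrame(
    {"location": ["X", "Y", "X", "Y", "X"], "value": [1, 2, 3, 4, 5]},
    index=[3, 1, 2, 5, 4],
)

# Sort by index first, then keep the first two rows of every location
first_two = readings.sort_index().groupby("location").head(2)
print(first_two)
```

Location "X" has three rows, so its last one (index 4) is dropped; the remaining rows stay sorted by index rather than being regrouped by location.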
- Put the values of the three stations in separate columns next to each other
no2_subset.pivot(columns="location", values="value")
location BETR801 FR04014 London Westminster
date.utc
2019-04-09 01:00:00+00:00 22.5 24.4 NaN
2019-04-09 02:00:00+00:00 53.5 27.4 67.0
2019-04-09 03:00:00+00:00 NaN NaN 67.0
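pivot() purely reshapes the data: it expects a single value for each index/column combination. Because pandas supports plotting multiple columns out of the box (see the plotting tutorial), converting from long to wide format lets you plot the different time series at the same time: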
no2.pivot(columns="location", values="value").plot()
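Because pivot() only reshapes, it cannot decide what to do when two rows land in the same cell. The sketch below uses made-up data (the `long_df` frame is hypothetical) to show the resulting error and one possible way around it, here simply keeping the first duplicate.

```python
import pandas as pd

# Hypothetical long-format readings; station "X" has two values at "01:00"
long_df = pd.DataFrame(
    {
        "time": ["01:00", "01:00", "01:00", "02:00"],
        "location": ["X", "X", "Y", "Y"],
        "value": [10.0, 20.0, 5.0, 7.0],
    }
)

try:
    long_df.pivot(index="time", columns="location", values="value")
    pivot_failed = False
except ValueError as err:
    # pivot() refuses to reshape when an index/column pair is not unique
    pivot_failed = True
    print("pivot() failed:", err)

# One option (an assumption, not the only choice): keep the first duplicate
deduplicated = long_df.drop_duplicates(subset=["time", "location"], keep="first")
wide = deduplicated.pivot(index="time", columns="location", values="value")
print(wide)
```

If the duplicates should be combined rather than discarded, an aggregating reshape is the better fit.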
Pivot tables
- Compute the mean concentration of $NO_2$ and $PM_{2.5}$ at each station
air_quality.pivot_table(values="value", index="location", columns="parameter", aggfunc="mean")
parameter no2 pm25
location
BETR801 26.950920 23.169492
FR04014 29.374284 NaN
London Westminster 29.740050 13.443568
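pivot() only rearranges the data. When several values have to be combined into one cell, use pivot_table(), which applies an aggregation function (aggfunc) to them. Pivot tables are a well-known concept from spreadsheet software. If you are also interested in the row/column margins (subtotals) for each variable, set the margins parameter to True: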
air_quality.pivot_table(
    values="value",
    index="location",
    columns="parameter",
    aggfunc="mean",
    margins=True,
)
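To make the effect of margins=True concrete, here is a small sketch on made-up data (the `readings` frame is hypothetical). The "All" row and column are computed from the underlying rows with the same aggfunc, so they are not averages of the cell averages.

```python
import pandas as pd

# Hypothetical readings: "X" has two no2 values, "Y" has one and no pm25
readings = pd.DataFrame(
    {
        "location": ["X", "X", "X", "Y"],
        "parameter": ["no2", "no2", "pm25", "no2"],
        "value": [10.0, 20.0, 4.0, 60.0],
    }
)

table = readings.pivot_table(
    values="value",
    index="location",
    columns="parameter",
    aggfunc="mean",
    margins=True,  # adds an "All" row and an "All" column
)
print(table)
```

In this example the no2 cell means are 15.0 and 60.0, but the "All" entry for no2 is 30.0, the mean of the three raw values 10, 20 and 60.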
Wide to long format
no2_pivoted = no2.pivot(columns="location", values="value").reset_index()
no2_pivoted.head()
location date.utc BETR801 FR04014 London Westminster
0 2019-04-09 01:00:00+00:00 22.5 24.4 NaN
1 2019-04-09 02:00:00+00:00 53.5 27.4 67.0
2 2019-04-09 03:00:00+00:00 54.5 34.2 67.0
3 2019-04-09 04:00:00+00:00 34.5 48.5 41.0
4 2019-04-09 05:00:00+00:00 46.5 59.5 41.0
- Collect all $NO_2$ measurements in a single column (long format)
no_2 = no2_pivoted.melt(id_vars="date.utc")
no_2.head()
date.utc location value
0 2019-04-09 01:00:00+00:00 BETR801 22.5
1 2019-04-09 02:00:00+00:00 BETR801 53.5
2 2019-04-09 03:00:00+00:00 BETR801 54.5
3 2019-04-09 04:00:00+00:00 BETR801 34.5
4 2019-04-09 05:00:00+00:00 BETR801 46.5
no_2 = no2_pivoted.melt(
id_vars="date.utc",
value_vars=["BETR801", "FR04014", "London Westminster"],
value_name="NO_2",
var_name="id_location",
)
no_2.head()
date.utc id_location NO_2
0 2019-04-09 01:00:00+00:00 BETR801 22.5
1 2019-04-09 02:00:00+00:00 BETR801 53.5
2 2019-04-09 03:00:00+00:00 BETR801 54.5
3 2019-04-09 04:00:00+00:00 BETR801 34.5
4 2019-04-09 05:00:00+00:00 BETR801 46.5
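pandas.melt() converts a DataFrame from wide to long format: the column headers become the values of a new variable column, and the cell values are collected in a single value column.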
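The relationship between melt() and pivot() can be sketched as a round trip on a small made-up table. The `wide` frame and the column names "time", "location" and "value" are hypothetical choices for this illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per station
wide = pd.DataFrame(
    {
        "time": ["01:00", "02:00"],
        "X": [10.0, 11.0],
        "Y": [5.0, 7.0],
    }
)

# Wide -> long: the headers "X" and "Y" become values of the "location" column
long_df = wide.melt(id_vars="time", var_name="location", value_name="value")
print(long_df)

# Long -> wide again: pivot() undoes the melt
restored = long_df.pivot(index="time", columns="location", values="value")
restored = restored.reset_index()
restored.columns.name = None  # drop the leftover "location" label on the columns
print(restored)
```

The round trip works here because every time/location pair occurs exactly once in the long table.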
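Remember
- sort_values sorts a table by one or more columns.
- pivot() purely restructures the data, while pivot_table() also supports aggregation.
- melt() is the inverse of pivot(): it turns a wide table into a long one.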
[Reference]
How to reshape the layout of tables? — pandas 1.5.2 documentation (pydata.org)