“Pandas数据处理与分析：实用技巧与应用“

# 开篇

1. pandas的series的了解

1.1 pd.Series 创建

1.2 pd.series 的索引使用

1.3 pd.series 之字典/索引

1.4 pandas 转换数据类型

1.5 pandas 通过索引或者通过位置来取值

1.6 pandas 指定行取值

1.7 pands之Series 切片和索引

1.8 pands之Series 的索引和值

2. pandas读取外部数据/sql/mongodb

2.1 read_csv 读取csv文件

3. pandas的dataFrame的创建

3.1 pd.DataFrame 的创建

3.2 DataFrame指定行索引，指定列索引

3.3 DataFrame切片操作

3.4 字典转成DataFrame

4. Dataframe的描述信息

4.1 sort_values 排序

4.2 指定前2名排序

5. dataFrame的索引

5.1 t[:5]

5.2 t["W"]

5.3 loc & iloc

5.4 t.loc[:, "Y"]

5.5 t.loc[["a","b"]]

5.6 t.loc[:,["X","Y"]]

5.7 t.loc[["a","b"],["X","Y"]]

5.8 t.iloc[1,:]

5.9 t.iloc[:,2]

5.10 t.iloc[:,[2,1]]

5.11 t.iloc[[0,2],[2,1]]

5.12 t.iloc[1:,:2]

5.13 t.iloc[1:,:2] = np.nan

6. bool索引和缺失数据的处理

6.1 t[(10 < t["W"]) & (t["W"] < 20)]

6.2 t[(6 == t["W"]) | (t["W"] == 9)]

6.3 t2["name"].str.split("/")

6.4 转换数据类型为list列表

6.5 缺失数据的处理

6.6 pd.isnull(t1) & pd.notnull(t1)

6.7 t1[pd.notnull(t1["W"])]

6.8 dropna 删除NaN的行

6.9 t1.fillna(0)

6.10 t1.fillna(t1.mean())

6.11 t1["X"].fillna(t1["X"].mean())

7. pandas中的常用统计方法

描述性统计：

7.1 df.describe()

7.2 df.mean()

7.3 df.median()

7.4 df.std()

7.5 df.var()

7.6 df.min() 和 df.max()

7.7 df.sum()

7.8 df.count()

7.9 df.quantile(q)

聚合和分组统计：

7.10 df.groupby('column').mean()

7.10.1 分组统计案例

7.11 df.groupby('column').agg({'col1': 'mean', 'col2': 'sum'})

7.12 df.pivot_table(values='col1', index='col2', columns='col3', aggfunc='mean')

其他常用统计方法：

7.13 df.corr()

7.14 df.cov()

7.15 df.value_counts()

7.16 df.unique()

7.17 df.nunique()

7.18 df.mode()

# 数据的合并和分组聚合

8. 离散化及其在 Pandas 中的实现方法

9. 数据合并

9.1 t1.join(t2)

9.2 merge方法

9.3 t1.merge(t2, on="B")

9.4 t1.merge(t2, how="outer", on="B")

9.5 how="left" & how="right"

10. 数据分组聚合

11. 数据的索引学习

11.1 index & reindex

11.2 t1.set_index("A")

11.3 t1.set_index("A").index

11.4 t1.set_index("A", drop=False)

11.5 t1["A"].unique()

11.6 a.set_index(["c", "d"]

索引练习：

11.7 b.swaplevel()

11.8 series.Series ：索引类型

11.9 frame.DataFrame

11.10 c["one"] & c["one"]["bar"]

11.11 b.loc["one"].loc["bar"]

11.12 b.swaplevel().loc["bar"]

# 开篇

pandas可以读取SQL；pd.read_sql(sql_sentence,connection；

类型转换： map(int, a)：

map是一个 Python 内置函数；
map() 的使用示例，用于将一个可迭代对象 a 中的每个元素都应用 int() 函数进行转换；
在 Python 中，a 可以是任何可迭代对象，包括但不限于：

列表（List）: 最常见的可迭代对象之一，包含多个元素的有序集合。
元组（Tuple）: 与列表类似，但元组是不可变的序列。
集合（Set）: 无序且不重复的元素集合。
字符串（String）: 字符的有序集合。
字典（Dictionary）: 包含键值对的集合，map() 可以应用于字典的键或值。

1. pandas的series的了解

pandas的使用：

numpy帮我们处理的是数值型的数据，无法处理除数值型之外的类型，而pandas除了处理数值之外（基于numpy），还能帮我们处理其它类型的数据；
pandas的常用数据类型：

Series一维，带标签数组；

DataFrame二维，Series容器；

1.1 pd.Series 创建

import pandas as pd
import numpy as np
import string

# t = pd.Series(np.arange(10))
t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))  # 创建Series
print(t)
print("*" * 100)
print(type(t))

1.2 pd.series 的索引使用

索引默认0开始；

t = pd.Series([1, 2, 3, 1, 2, 3])
print(type(t))
print(t)

索引默认0开始，但是可以手动指定索引；

t = pd.Series([1, 2, 3, 1, 2, 3], index=list("abcdef"))
print(type(t))
print("*" * 100)
print(t)

1.3 pd.series 之字典/索引

通过字典创建一个Series，其中的索引就是字典的键；

重新给其指定其它索引之后，如果能够对应上，就取其值，如果不能，就为Nan；

temp_dict = {"name": "yiyi", "age": 30, "tel": 1186}
t3 = pd.Series(temp_dict)
print(t3)

1.4 pandas 转换数据类型

temp_dict = {"name": "yiyi", "age": 30, "tel": 1186}
t3 = pd.Series(temp_dict)  # 创建Series
print(t3)
print("*" * 100)

t = pd.Series([1, 2, 3, 1, 2, 3], index=list("abcdef"))
print(t.dtype)
print("*" * 100)
t4 = t.astype(float)  # 转换数据类型
print(t4)
print("*" * 100)
print(t4.dtype)

1.5 pandas 通过索引或者通过位置来取值

temp_dict = {"name": "yiyi", "age": 30, "tel": 1186}
t3 = pd.Series(temp_dict)  # 创建Series

print(t3["name"])  # 根据索引key获取数据
print(t3[0])  # 根据索引下标获取数据

1.6 pandas 指定行取值

temp_dict = {"name": "yiyi", "age": 30, "tel": 1186}
t3 = pd.Series(temp_dict)  # 创建Series
print(t3[:2])  # 获取前2个数据

t3[:2] ：代表2前面的所有行被取出；

t3[["key1","key2"]] ：根据key取值；

t3[[1, 2]] ：根据位置取值；

如果不存在该key，那么取出来的就会是NaN的值；

temp_dict = {"name": "yiyi", "age": 30, "tel": 1186}
t3 = pd.Series(temp_dict)  # 创建Series
print(t3[:2])  # 获取前2个数据
print("*" * 100)
print(t3[[1, 2]])  # 获取指定索引的数据
print("*" * 100)
print(t3[["age", "tel"]])  # 获取指定索引的数据

1.7 pands之Series 切片和索引

切片：直接传入start end或者步长即可；

索引：一个的时候直接传入序号或者index，多个的时候传入序号或者index的列表；

1. t[2:10:2] ：从2开始到10结束，步长为2

t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))  # 创建Series
print(t)
print("*" * 100)
print(t[2:10:2])  # 从2开始到10结束，步长为2

2. t[[2, 3, 6]] ：只获取2，3，6的数据

t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))  # 创建Series
print(t)
print("*" * 100)
print(t[[2, 3, 6]])  # 获取指定索引的数据

3. t["F"] ：根据key获取数据

t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))  # 创建Series
print(t)
print("*" * 100)
print(t["F"])  # 获取指定索引的数据

1.8 pands之Series 的索引和值

Series对象本质上有两个数组构成，一个数组构成对象的键（index，索引），一个数组构成对象的值（value）

键→值

ndarray 的很多方法都可以运用于series 类型，比如argmax，clip ；
series 具有where 方法，但是结果和ndarray 不同；

t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))  # 创建Series
print(t.index)  # 获取Series的索引
print(type(t.index))  # 获取Series的索引下标类型
print("*" * 100)
print(t.values)  # 获取Series的值
print(type(t.values))  # 获取Series的值类型

各种方法：

list(t.index)[:2] ：先获取所有的index，取2前面的所有行

t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))  # 创建Series
print(type(t.index))  # 获取Series的索引下标类型
print(len(t.index))  # 获取Series的索引下标长度
print(list(t.index))  # 获取Series的索引下标
print(list(t.index)[:2])  # 指定获取Series的索引下标

2. pandas读取外部数据/sql/mongodb

2.1 read_csv 读取csv文件

pd.read_csv("../youyube_video_data/GBvideos-pandas.csv") ：读取csv文件

除了cvs文件外，还可以读取SQL，MongoDB的文件内容：
- from pymongo import MongoClient 导包；
- pd.Series(t) 读取；
不同列索引的情况下，也不会影响这些数据转成DataFrame：
- df = pd.DataFrame(data) 转换

df = pd.read_csv("../youyube_video_data/GBvideos-pandas.csv")  # 读取csv文件
print(df)

3. pandas的dataFrame的创建

3.1 pd.DataFrame 的创建

DataFrame对象既有行索引，又有列索引：
- 行索引，表明不同行，横向索引，叫index，0轴，axis=0；
- 列索引，表明不同列，纵向索引，叫columns，1轴，axis=1；

t = pd.DataFrame(np.arange(12).reshape((3, 4)))
print(t)

3.2 DataFrame指定行索引，指定列索引

t = pd.DataFrame(np.arange(12).reshape((3, 4)), index=list("abc"), columns=list("WXYZ"))
print(t)

3.3 DataFrame切片操作

d1 = {"name": ["yiYi", "zhangSan"], "age": [24, 28], "tel": [1186, 1185]}
print(pd.DataFrame(d1))  # 创建DataFrame

3.4 字典转成DataFrame

d1 = [{"name": "yiYi", "age": 24, "tel": 1186}, {"name": "zhangSan", "age": 28}, {"name": "wangWu", "age": 30}]
print(d1)
print("*" * 100)
print(pd.DataFrame(d1))  # 将列表转换为DataFrame

4. Dataframe的描述信息

DataFrame的基础属性：

df.shape # 行数列数

df.dtypes # 列数据类型

df.ndim # 数据维度

df.index # 行索引

df.columns # 列索引

df.values # 对象值，二维ndarray数组
DataFrame整体情况查询：

df.head(3) # 显示头部几行，默认5

df.tail(3) # 显示末尾几行，默认5行

df.info() # 相关信息概览：行数，列数，列索引，列非空值个数，列类型，内存占用

df.describe() # 快速综合统计结果：计数，均值，标准差，最大值，四分位数，最小值

4.1 sort_values 排序

df.sort_values(by="W", ascending=False) ：排序，by 指定行索引，ascending=False 降序排序

t = pd.DataFrame(np.arange(12).reshape((3, 4)), index=list("abc"), columns=list("WXYZ"))
t[2:] = 1
print(t)
print("*" * 100)
t1 = t.sort_values(by="W")  # 根据W升序排查
print(t1)
print("*" * 100)
t2 = t.sort_values(by="W", ascending=False)  # 根据W降序排查
print(t2)

4.2 指定前2名排序

t = pd.DataFrame(np.arange(12).reshape((3, 4)), index=list("abc"), columns=list("WXYZ"))
t[2:] = 1
print(t)
print("*" * 100)
df = t.sort_values(by="W", ascending=False)
print(df.head(2))  # 获取前2个数据

5. dataFrame的索引

注意事项：

方括号写数字，表示取行，对行进行操作；

写字符串，表示的去列索引，对列进行操作；

5.1 t[:5]

t[:5] ：取前5行数据

t = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list("WXY"))
t[2:] = 1
print(t)
print("*" * 100)
# 取前5行数据
print(t[:5])

5.2 t["W"]

t["W"] ：指定列获取

t = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list("WXY"))
t[2:] = 1
print(t)
print("*" * 100)
# 指定列获取
print(t["W"])

5.3 `loc & iloc`

loc & iloc

t.loc[0, "Y"] ：获取指定行和列的数据

t.loc[0] & t.loc[0, :] ：取出指定行的数据

pandas 之loc ：

df.loc 通过标签索引行数据；
df.iloc 通过位置获取行数据；

t = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list("WXY"))
t[2:] = 1
print(t)
print("*" * 100)
print(t.loc[0, "Y"])  # 获取指定行和列的数据
print(type(t.loc[0, "Y"]))  # 获取指定行和列的数据类型
print("*" * 100)
print(t.loc[0])  # 取出指定行的数据
print(t.loc[0, :])  # 取出指定行的数据

5.4 t.loc[:, "Y"]

t.loc[:, "Y"] ：取出指定列的数据

t = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list("WXY"))
print(t)
print("*" * 100)
print(t.loc[:, "Y"])  # 取出指定列的数据

t.loc[] ：取多行多列：

5.5 t.loc[["a","b"]]

t.loc[["a","b"]] ：取a，b行

t = pd.DataFrame(np.arange(24).reshape((8, 3)), index=list("abcdefgh"), columns=list("WXY"))
print(t)
print("*" * 100)
print(t.loc[["a", "b"]])

5.6 t.loc[:,["X","Y"]]

t.loc[:,["X","Y"]] ：取X，Y两列

t = pd.DataFrame(np.arange(24).reshape((8, 3)), index=list("abcdefgh"), columns=list("WXY"))
print(t)
print("*" * 100)
print(t.loc[:, ["X", "Y"]])

5.7 t.loc[["a","b"],["X","Y"]]

t.loc[["a","b"],["X","Y"]] ：取a，b两行，X，Y两列

t = pd.DataFrame(np.arange(24).reshape((8, 3)), index=list("abcdefgh"), columns=list("WXY"))
print(t)
print("*" * 100)
print(t.loc[["a", "b"], ["X", "Y"]])

5.8 t.iloc[1,:]

t.iloc[1,:] ：只取第1行，和所有列，默认下标为0开始

print(t1.iloc[2:, 1]) ：从第3行开始获取，包含第三行；
print(t1.iloc[:2, 1]) ：获取3行前面的行，不包括第3行；

5.9 t.iloc[:,2]

t.iloc[:,2] ：取所有行，和第2列

5.10 t.iloc[:,[2,1]]

t.iloc[:,[2,1]] ：取所有行，和第2、第1列

5.11 t.iloc[[0,2],[2,1]]

t.iloc[[0,2],[2,1]] ：取0行和2行，取2列和1列

5.12 t.iloc[1:,:2]

t.iloc[1:,:2] ：取1行后所有的行，2列前全部的列，不包含第2列

5.13 t.iloc[1:,:2] = np.nan

t.iloc[1:,:2] = np.nan ：取1行后面的行，和2列前面的全部列将他们等于nan

6. bool索引和缺失数据的处理

6.1 t[(10 < t["W"]) & (t["W"] < 20)]

t[(10 < t["W"]) & (t["W"] < 20)] & 且：获取指定条件数据

t = pd.DataFrame(np.arange(24).reshape((8, 3)), index=list("abcdefgh"), columns=list("WXY"))
print(t)
print("*" * 100)
print(t[(10 < t["W"]) & (t["W"] < 20)])  # 获取指定条件数据

6.2 t[(6 == t["W"]) | (t["W"] == 9)]

t[(6 == t["W"]) | (t["W"] == 9)] | 或：获取指定行和列的数据

t = pd.DataFrame(np.arange(24).reshape((8, 3)), index=list("abcdefgh"), columns=list("WXY"))
print(t)
print("*" * 100)
print(t[(6 == t["W"]) | (t["W"] == 9)])  # 获取指定行和列的数据

6.3 t2["name"].str.split("/")

t2["name"].str.split("/") ：切割数据

t1 = [{"name": "yiYi/170/90", "age": 24, "tel": 1186}, {"name": "zhangSan/175/120", "age": 28},
      {"name": "wangWu/180/130", "age": 30}]
t2 = pd.DataFrame(t1)
print(t2)
print("*" * 100)
print(t2["name"].str.split("/"))  # 获取指定列的数据

6.4 转换数据类型为list列表

t2["name"].str.split("/").tolist() ：转换数据类型为list列表

t1 = [{"name": "yiYi/170/90", "age": 24, "tel": 1186}, {"name": "zhangSan/175/120", "age": 28},
      {"name": "wangWu/180/130", "age": 30}]
t2 = pd.DataFrame(t1)
print(t2)
print("*" * 100)
print(t2["name"].str.split("/").tolist())  # 转换数据类型为list

6.5 缺失数据的处理

处理NaN的数据；

注意：在计算平均值情况下，nan是不参与计算的，但是0会参与运算；

处理0的数据：t[t==0]=np.nan

6.6 pd.isnull(t1) & pd.notnull(t1)

pd.isnull(t1) & pd.notnull(t1) ：判断是否为空和不为空，返回bool类型

t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("WXYZ"))
t1[2:] = np.nan
print(t1)
print("*" * 100)
print(pd.isnull(t1))  # 判断是否为空
print("*" * 100)
print(pd.notnull(t1))  # 判断是否不为空

6.7 t1[pd.notnull(t1["W"])]

t1[pd.notnull(t1["W"])] ：找到t1数组中W这一列不为null的值，并返回行

6.8 dropna 删除NaN的行

t1.dropna(axis=0, how="any") & t1.dropna(axis=0, how="all") ：删除全部有nan的行、删除全部为nan的行

t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("WXYZ"))
t1["W"] = np.nan
t1[2:] = np.nan
print(t1)
print("*" * 100)
print(t1.dropna(axis=0, how="any"))  # 删除NaN的行，默认
print("*" * 100)
print(t1.dropna(axis=0, how="all"))  # 当前行全部为NaN时，删除该行

6.9 t1.fillna(0)

t1.fillna(0) ：填充NaN数据为0

t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("WXYZ"))
t1["W"] = np.nan
t1[2:] = np.nan
print(t1)
print("*" * 100)
print(t1.fillna(0))  # 填充NaN数据为0

6.10 t1.fillna(t1.mean())

t1.fillna(t1.mean()) ：不过，一般都会将均值填入到nan进行计算，统计

t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("WXYZ"))
t1["W"] = np.nan
t1[2:] = np.nan
print(t1)
print("*" * 100)
print(t1.fillna(t1.mean()))  # 填充NaN数据为平均值

6.11 t1["X"].fillna(t1["X"].mean())

t1["X"].fillna(t1["X"].mean()) ：仅对t1的X进行均值统计

注：当X列进行统计均值时，并不会将nan的值统计进行；

t1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("WXYZ"))
t1["W"] = np.nan
t1[2:] = np.nan
print(t1)
print("*" * 100)
print(t1["X"].fillna(t1["X"].mean()))  # 仅填充X列的NaN数据
print("*" * 100)
print(t1["X"].mean())  # 获取X列的平均值

7. pandas中的常用统计方法

描述性统计：

7.1 df.describe()

提供DataFrame的基本统计信息（计数、均值、标准差、最小值、四分位数和最大值）。

df.describe()

7.2 df.mean()

计算每一列的均值。

df.mean()

7.3 df.median()

计算每一列的中位数。

df.median()

7.4 df.std()

计算每一列的标准差。

df.std()

7.5 df.var()

计算每一列的方差。

df.var()

7.6 df.min() 和 df.max()

计算每一列的最小值和最大值。

df.min()
df.max()

7.7 df.sum()

计算每一列的总和。

df.sum()

7.8 df.count()

计算每一列的数量。

df.count()

7.9 df.quantile(q)

计算每一列的分位数。q 是分位数的值，可以是 0 到 1 之间的浮点数。

df.quantile(0.25)  # 计算第一四分位数

聚合和分组统计：

7.10 df.groupby('column').mean()

按某一列进行分组，并计算每组的均值。
也可以使用count() 来统计分组后的结果；

df.groupby('column').mean()

7.10.1 分组统计案例

t2 = pd.DataFrame({
    "W": [0, 0, 1, 1],
    "X": [1, 5, 9, 13],
    "Y": [2, 6, 10, 14],
    "Z": [3, 7, 11, 15]
})
print(t2)
print("*" * 100)
print(t2.groupby("W").mean())

结果：

   W   X   Y   Z
0  0   1   2   3
1  0   5   6   7
2  1   9  10  11
3  1  13  14  15
****************************************************************************************************
     X     Y     Z
W                  
0   3.0   4.0   5.0
1  11.0  12.0  13.0

这里，W=0 的组有两行，所以 X, Y, Z 列的均值分别是 (1+5)/2 = 3.0, (2+6)/2 = 4.0, (3+7)/2 = 5.0；W=1 的组有两行，所以 X, Y, Z 列的均值分别是 (9+13)/2 = 11.0, (10+14)/2 = 12.0, (11+15)/2 = 13.0。

7.11 df.groupby('column').agg({'col1': 'mean', 'col2': 'sum'})

按某一列进行分组，并对指定列应用多个聚合函数。

df.groupby('column').agg({'col1': 'mean', 'col2': 'sum'})

7.12 df.pivot_table(values='col1', index='col2', columns='col3', aggfunc='mean')

创建数据透视表。

df.pivot_table(values='col1', index='col2', columns='col3', aggfunc='mean')

其他常用统计方法：

7.13 df.corr()

计算列之间的相关系数。

df.corr()

7.14 df.cov()

计算列之间的协方差。

df.cov()

7.15 df.value_counts()

计算某一列中每个值的出现次数。

df['column'].value_counts()

7.16 df.unique()

获取某一列中的唯一值。

df['column'].unique()

7.17 df.nunique()

计算每一列中的唯一值数量。

df.nunique()

7.18 df.mode()

计算每一列的众数。

df.mode()

# 数据的合并和分组聚合

字符串离散化的案例：
- 大纲将字符串转换成数字，再进行统计的思路；
- 主要是思路，将字符串离散化后统计；
- 先把所有的字符串都置为零，然后把重复的值变成1或2或3或4这种数据，或出现过的为1，没出现过的为0，然后统计；
- 代码思路：
  - 首先找到大列表，就是df中的每个分裂的列表，把列表展开，进行list操作，然后去重，
  - 再把这个df做成一个数组的形状，有着行列索引，这个行列与电影的行列，分类是一样的，
  - 再对行数进行循环，当前出现过数据的地方都是1了，没有出现张的地方都是0，
  - 再zeros统计每个列中为0的结果，进行求和；最终得出结果；

8. 离散化及其在 Pandas 中的实现方法

参考：离散化及其在 Pandas 中的实现方法-CSDN博客

9. 数据合并

9.1 t1.join(t2)

t1.join(t2) ：横向合并

这时候会存在t1的所有行，但是t2没有的行全为NaN；

t1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list("AB"))
t2 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=list("CD"))
print(t1)
print("*" * 100)
print(t2)
print(t1.join(t2))  # 横向合并

9.2 merge方法

merge方法：按照指定的列把数据按照一定的方式合并到一起。

合并方式：
- 默认的合并方式是inner，即交集。
- merge outer，并集，用NaN补全。
- merge left，左边为准，用NaN补全。
- merge right，右边为准，用NaN补全。

9.3 t1.merge(t2, on="B")

t1.merge(t2, on="B") ：合并相同的列，并删除重复列

t1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list("AB"))
t2 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=list("CB"))
print(t1)
print("*" * 100)
print(t2)
print("*" * 100)
print(t1.merge(t2, on="B"))  # 合并相同的列，并删除重复列

9.4 t1.merge(t2, how="outer", on="B")

t1.merge(t2, how="outer", on="B") ：取出并合

t1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list("AB"))
t2 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=list("CB"))
print(t1)
print("*" * 100)
print(t2)
print("*" * 100)
print(t1.merge(t2, how="outer", on="B"))  # 取出并合

9.5 how="left" & how="right"

how="left" & how="right" ：左连接以t1为准，右连接以t2为准

t1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list("AB"))
t2 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=list("CB"))
print(t1)
print("*" * 100)
print(t2)
print("*" * 100)
print(t1.merge(t2, on="B", how="left"))
print("*" * 100)
print(t1.merge(t2, on="B", how="right"))

10. 数据分组聚合

参考目录7.10；

11. 数据的索引学习

11.1 index & reindex

t1.index.values & t1.index = ["a", "b", "c"] & t1.reindex(["c", "a", "d"])

t1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list("AB"))
print(t1)
print("*" * 100)
print(t1.index.values)  # 获取索引
print("*" * 100)
t1.index = ["a", "b", "c"]  # 修改索引
print(t1)
print("*" * 100)
print(t1.reindex(["c", "a", "d"]))  # 索引重排, 没有的索引会自动补NaN

11.2 t1.set_index("A")

t1.set_index("A") ：将某一列作为索引

注：设置的索引值是可以重复的；

t1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list("AB"))
print(t1)
print("*" * 100)
print(t1.set_index("A"))  # 将A列设置为索引

11.3 t1.set_index("A").index

t1.set_index("A").index ：返回索引

将A列设置为索引后，并返回索引值；

t1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list("AB"))
print(t1)
print("*" * 100)
print(t1.set_index("A"))  # 设置索引
print(t1.set_index("A").index)  # 索引返回

11.4 t1.set_index("A", drop=False)

t1.set_index("A", drop=False) ：将A列设置为索引, drop=False 表示不删除A列

t1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list("AB"))
print(t1)
print("*" * 100)
print(t1.set_index("A", drop=False))  # 将A列设置为索引, drop=False表示不删除A列

11.5 t1["A"].unique()

t1["A"].unique() ：获取A列的唯一值

t1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=list("AB"))
print(t1)
print("*" * 100)
print(t1.set_index("A", drop=False))  # 将A列设置为索引, drop=False表示不删除A列
print("*" * 100)
t1[1:] = 1
print(t1)
print("*" * 100)
print(t1["A"].unique())  # 获取A列的唯一值

len(t1.set_index("A").index) ：获取索引长度
list(t1.set_index("A").index) ：转换索引为list

11.6 a.set_index(["c", "d"]

a.set_index(["c", "d"] ：设置两个索引

a = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, - 1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo']
})
print(a)
print("*" * 100)
b = a.set_index(["c", "d"])  # 设置两个索引
print(b)

索引练习：

11.7 b.swaplevel()

b.swaplevel() ：c 和 d 交换索引位置

a = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, - 1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo']
})
print(a)
print("*" * 100)
b = a.set_index(["c", "d"])  # 设置两个索引
print(b)
print("*" * 100)
print(b.swaplevel())  # c 和 d 交换索引位置

   a  b    c    d
0  0  7  one  foo
1  1  6  one  bar
2  2  5  one  foo
3  3  4  two  bar
4  4  3  two  foo
5  5  2  two  bar
6  6  1  two  foo
****************************************************************************************************
         a  b
c   d        
one foo  0  7
    bar  1  6
    foo  2  5
two bar  3  4
    foo  4  3
    bar  5  2
    foo  6  1
****************************************************************************************************
         a  b
d   c        
foo one  0  7
bar one  1  6
foo one  2  5
bar two  3  4
foo two  4  3
bar two  5  2
foo two  6  1

type(b) ：

11.8 series.Series ：索引类型

series.Series ：索引类型

a = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, - 1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo']
})
b = a.set_index(["c", "d"])
c = b["a"]
print(c)
print("*" * 100)
print(type(c))

11.9 frame.DataFrame

frame.DataFrame ：索引类型

a = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, - 1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo']
})
b = a.set_index(["c", "d"])
c = b[["a", "b"]]
print(c)
print("*" * 100)
print(type(c))

11.10 c["one"] & c["one"]["bar"]

c["one"] & c["one"]["bar"] 根据复合索引取值

a = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, - 1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo']
})
b = a.set_index(["c", "d"])
c = b["a"]
print(c)
print("*" * 100)
print(c["one"])
print("*" * 100)
print(c["one"]["bar"])

c    d  
one  foo    0
     bar    1
     foo    2
two  bar    3
     foo    4
     bar    5
     foo    6
Name: a, dtype: int64
****************************************************************************************************
d
foo    0
bar    1
foo    2
Name: a, dtype: int64
****************************************************************************************************
1

11.11 b.loc["one"].loc["bar"]

b.loc["one"].loc["bar"] ：获取索引为one和bar的数据

a = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, - 1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo']
})
print(a)
print("*" * 100)
b = a.set_index(["c", "d"])
print(b)
print("*" * 100)
print(b.loc["one"].loc["bar"])  # 获取索引为one和bar的数据

   a  b    c    d
0  0  7  one  foo
1  1  6  one  bar
2  2  5  one  foo
3  3  4  two  bar
4  4  3  two  foo
5  5  2  two  bar
6  6  1  two  foo
****************************************************************************************************
         a  b
c   d        
one foo  0  7
    bar  1  6
    foo  2  5
two bar  3  4
    foo  4  3
    bar  5  2
    foo  6  1
****************************************************************************************************
a    1
b    6
Name: bar, dtype: int64

Process finished with exit code 0

11.12 b.swaplevel().loc["bar"]

b.swaplevel().loc["bar"] ：获取索引为bar的数据

a = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, - 1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo']
})
print(a)
print("*" * 100)
b = a.set_index(["c", "d"])
print(b)
print("*" * 100)
print(b.loc["one"].loc["bar"])  # 获取索引为one和bar的数据
print("*" * 100)
print(b.swaplevel().loc["bar"])  # 获取索引为bar的数据

   a  b    c    d
0  0  7  one  foo
1  1  6  one  bar
2  2  5  one  foo
3  3  4  two  bar
4  4  3  two  foo
5  5  2  two  bar
6  6  1  two  foo
****************************************************************************************************
         a  b
c   d        
one foo  0  7
    bar  1  6
    foo  2  5
two bar  3  4
    foo  4  3
    bar  5  2
    foo  6  1
****************************************************************************************************
a    1
b    6
Name: bar, dtype: int64
****************************************************************************************************
     a  b
c        
one  1  6
two  3  4
two  5  2

Process finished with exit code 0