DataFrame Row Index Operations and Reindexing

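I. DataFrame row index operations

1.1 Selecting data

1.1.1 Selecting data with loc

df.loc[ ] selects by label, not by integer position. An integer is accepted only when it is itself a label in the index; boolean arrays are accepted as well.

When you filter data with a label slice, both endpoints are included (closed on both sides).

Return types:

1. Selecting a single row or a single column returns a Series.

2. Selecting multiple rows or multiple columns returns a DataFrame.

3. Selecting a single element (the value at a given row and column) returns a scalar of that element's own type (an integer, a float, and so on).

DataFrame.loc[row_indexer, column_indexer]

Parameters:

- row_indexer: a row label, a list or slice of row labels, or a boolean array.

- column_indexer: a column label, a list or slice of column labels, or a boolean array.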

import pandas as pd

s = {
    'A':[1,2,3,4,5],
    'B':[6,7,8,9,10],
    'C':[11,12,13,14,15]
}

df = pd.DataFrame(s,index=['a','b','c','d','e'])
# Select a single row; the result is a Series
print(df.loc['a'])

A     1
B     6
C    11
Name: a, dtype: int64

# Slice rows by label ('d' is included)
print(df.loc['a' : 'd'])

   A  B   C
a  1  6  11
b  2  7  12
c  3  8  13
d  4  9  14

# Select a single element
print(df.loc['a','A'])

1

# Slice rows and columns together
print(df.loc['a':'d','A':'C'])

# Slice columns only (all rows)
print(df.loc[:,'A':'B'])

   A  B   C
a  1  6  11
b  2  7  12
c  3  8  13
d  4  9  14
   A   B
a  1   6
b  2   7
c  3   8
d  4   9
e  5  10
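The parameter list for loc also mentions boolean arrays, which the examples so far do not use. The following is a minimal sketch on the same data; the names mask, rows and picked, and the threshold 3, are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10], 'C': [11, 12, 13, 14, 15]},
    index=['a', 'b', 'c', 'd', 'e'],
)

# Boolean Series: True for the rows where column A is greater than 3
mask = df['A'] > 3
rows = df.loc[mask]                      # keeps rows 'd' and 'e', all columns

# A list of labels selects exactly those rows/columns, in the order given
picked = df.loc[['e', 'a'], ['C', 'A']]

print(rows)
print(picked)
```

A list of labels, unlike a slice, can reorder or skip rows, which is handy when the wanted labels are not contiguous.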

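1.1.2 Selecting data with iloc

iloc[ ] indexes by position: it selects rows and columns by their integer positions, not by their labels.

Slices here follow the usual Python convention again: the start is included and the end is excluded.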

DataFrame.iloc[row_indexer, column_indexer]

import pandas as pd

s = {
    'A':[1,2,3,4,5],
    'B':[6,7,8,9,10],
    'C':[11,12,13,14,15]
}

df = pd.DataFrame(s,index=['a','b','c','d','e'])

# Select a row
print(df.iloc[0])

A     1
B     6
C    11
Name: a, dtype: int64

# Select a column (all rows, first column)
print(df.iloc[:,0])

a    1
b    2
c    3
d    4
e    5
Name: A, dtype: int64

# Select a single value
print(df.iloc[1,0])

2

# Slice by position (the end position is excluded)
print(df.iloc[0:2,0:2])

   A  B
a  1  6
b  2  7
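To make the difference between the two slicing rules concrete, here is a small side-by-side sketch. It assumes the same kind of labelled DataFrame as above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]},
    index=['a', 'b', 'c', 'd', 'e'],
)

by_label = df.loc['a':'c']     # label slice: 'c' is included, so 3 rows
by_position = df.iloc[0:2]     # position slice: position 2 is excluded, so 2 rows
last_row = df.iloc[-1]         # negative positions count from the end

print(len(by_label), len(by_position))
print(last_row)
```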

1.2 Adding data

1.2.1 Adding a new row with loc

# Works like adding a column by direct assignment
import pandas as pd

s = {
    'A':[1,2,3,4,5],
    'B':[6,7,8,9,10],
    'C':[11,12,13,14,15]
}

df = pd.DataFrame(s,index=['a','b','c','d','e'])
# Assign to a label that does not exist yet; the scalar fills the whole new row
df.loc['g'] = 0
print(df)

   A   B   C
a  1   6  11
b  2   7  12
c  3   8  13
d  4   9  14
e  5  10  15
g  0   0   0
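A scalar fills every column with the same value. To give each column its own value, a list of the right length can be assigned instead. This is a small sketch under that assumption; the label 'c' and the values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [6, 7]}, index=['a', 'b'])

# The list is matched to the columns by position: A gets 3, B gets 8
df.loc['c'] = [3, 8]

print(df)
```

The list must have exactly one value per column; otherwise pandas raises an error instead of guessing.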

1.2.2 Adding a new row with iloc

# Same pattern as adding a column by direct assignment
import pandas as pd

s = {
    'A':[1,2,3,4,5],
    'B':[6,7,8,9,10],
    'C':[11,12,13,14,15]
}

df = pd.DataFrame(s)
# Note: iloc cannot add a row. It can only modify existing rows, even when the DataFrame uses the default integer index
df.iloc[4] = 0
print(df)

# Position 5 does not exist, so this line raises IndexError
df.iloc[5] = 0

   A  B   C
0  1  6  11
1  2  7  12
2  3  8  13
3  4  9  14
4  0  0   0

---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[69], line 16
12 df.iloc[4] = 0
13 print(df)
---> 16 df.iloc[5] = 0

~~~

~~~

~~~ (omitted here)
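One way to handle this in practice is to catch the IndexError and append with loc instead. The sketch below assumes a DataFrame with the default integer index, so len(df) is the next unused label; with a custom index you would pick the new label yourself.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

error = None
try:
    df.iloc[3] = 0            # position 3 does not exist yet
except IndexError as exc:
    error = exc
    print('iloc failed:', exc)

# loc enlarges the DataFrame when the label is new
df.loc[len(df)] = 0
print(df)
```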

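1.2.3 Concatenating with concat

pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

Parameters:

objs : a list or dict of the DataFrame or Series objects to concatenate.

axis : the axis to concatenate along. 0 or 'index' stacks row-wise; 1 or 'columns' joins column-wise.

join : how to handle the labels on the other axis. 'outer' (the default) keeps the union and 'inner' keeps the intersection. In other words, it decides whether rows or columns that appear in only some of the objects are kept or dropped.

how : this is not a parameter of pd.concat. It belongs to merge, where 'left', 'right', 'inner' and 'outer' select a left join, a right join, an intersection join or a union join.

ignore_index : if True, discard the original index of every object and generate a new one.

keys : used to build a hierarchical index in the result.

levels : the specific levels to use for the hierarchical index.

names : the names of the levels in the hierarchical index.

verify_integrity : if True, check the concatenated axis for duplicate labels and raise an error if any are found.

sort : if True, sort the non-concatenation axis when it is not already aligned.

copy : if True, copy the data.

# Simple row-wise concatenation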
import pandas as pd

s = {
    'A':[1,2,3],
    'B':[4,5,6]
}

s1  = {
    'A':[7,8,9],
    'B':[10,11,12]
}

df = pd.DataFrame(s,index=['a','b','c'])
df1 = pd.DataFrame(s1,index=['a','b','c'])

# Discard the original index and generate a new one
print(pd.concat([df,df1],axis=0,ignore_index=True))

   A   B
0  1   4
1  2   5
2  3   6
3  7  10
4  8  11
5  9  12

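# Keep the original index; the duplicate labels can cause surprises in later operations
print(pd.concat([df,df1],axis=0))

print()

# For example, this prints two rows, because the label 'a' appears twice
print(pd.concat([df,df1],axis=0).loc['a'])

print()

# In that case, use iloc to reach a single row by position
print(pd.concat([df,df1],axis=0).iloc[0])

   A   B
a  1   4
b  2   5
c  3   6
a  7  10
b  8  11
c  9  12

   A   B
a  1   4
a  7  10

A    1
B    4
Name: a, dtype: int64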
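Two of the parameters listed earlier, keys and verify_integrity, address exactly this duplicate-label problem. The sketch below shows one possible use of each; the key names 'first' and 'second' are arbitrary labels chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]}, index=['a', 'b', 'c'])

# keys adds an outer index level, so ('first', 'a') and ('second', 'a') stay distinct
labelled = pd.concat([df, df1], keys=['first', 'second'])
print(labelled.loc[('second', 'a')])

# verify_integrity=True raises instead of silently producing duplicate labels
duplicate_error = None
try:
    pd.concat([df, df1], verify_integrity=True)
except ValueError as exc:
    duplicate_error = exc
    print('concat refused:', exc)
```

Use keys when you need to remember which source each row came from, and ignore_index=True when the original labels carry no meaning.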

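- Special case

This is also a good place to see how the union and intersection joins behave:

import pandas as pd

# Concatenate two DataFrames whose column labels are completely different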
s = {
    'A':[1,2,3],
    'B':[4,5,6]
}

s1  = {
    'C':[7,8,9],
    'D':[10,11,12]
}

df = pd.DataFrame(s,index=['a','b','c'])
df1 = pd.DataFrame(s1,index=['a','b','c'])

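# Default: union (outer) of the columns
print(pd.concat([df,df1],axis=0))

print()

# Intersection (inner) of the columns
print(pd.concat([df,df1],axis=0,join='inner'))

     A    B    C     D
a  1.0  4.0  NaN   NaN
b  2.0  5.0  NaN   NaN
c  3.0  6.0  NaN   NaN
a  NaN  NaN  7.0  10.0
b  NaN  NaN  8.0  11.0
c  NaN  NaN  9.0  12.0

Empty DataFrame
Columns: []
Index: [a, b, c, a, b, c]

Because the two DataFrames have no column labels in common, the intersection is empty, so the second result is an empty DataFrame that keeps only the row index.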
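The empty result is the extreme case. When the inputs share some columns, join='inner' keeps just those. A minimal sketch, using two small hypothetical frames named left and right that have only column B in common:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
right = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

# Union of the columns: A, B and C, with NaN where a frame lacks the column
outer = pd.concat([left, right], ignore_index=True)

# Intersection of the columns: only B survives
inner = pd.concat([left, right], join='inner', ignore_index=True)

print(outer)
print(inner)
```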

# Simple column-wise concatenation
import pandas as pd

# Concatenate two DataFrames whose column labels are completely different
s = {
    'A':[1,2,3],
    'B':[4,5,6]
}

s1  = {
    'C':[7,8,9],
    'D':[10,11,12]
}

df = pd.DataFrame(s,index=['a','b','c'])
df1 = pd.DataFrame(s1,index=['a','b','c'])

print(pd.concat([df,df1],axis=1))

   A  B  C   D
a  1  4  7  10
b  2  5  8  11
c  3  6  9  12

- Special case

import pandas as pd

# Concatenate two DataFrames whose row indexes differ (df1 has an extra row 'd')
s = {
    'A':[1,2,3],
    'B':[4,5,6]
}

s1  = {
    'C':[7,8,9,0],
    'D':[10,11,12,0]
}

df = pd.DataFrame(s,index=['a','b','c'])
df1 = pd.DataFrame(s1,index=['a','b','c','d'])

# Default: union (outer) of the row labels
print(pd.concat([df,df1],axis=1))

print()

# Intersection (inner) of the row labels
print(pd.concat([df,df1],axis=1,join='inner'))

     A    B  C   D
a  1.0  4.0  7  10
b  2.0  5.0  8  11
c  3.0  6.0  9  12
d  NaN  NaN  0   0

   A  B  C   D
a  1  4  7  10
b  2  5  8  11
c  3  6  9  12

# Concatenating a DataFrame with a Series
import pandas as pd

s = {
    'A':[1,2,3],
    'B':[4,5,6]
}
df = pd.DataFrame(s)

# Set the Series' name attribute; it becomes the column label for this data
serie = pd.Series([7,8,9],name='C')

# Add as a column
print(pd.concat([df,serie],axis=1))

print()

# Add as rows
print(pd.concat([df,serie],axis=0))

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

     A    B    C
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  6.0  NaN
0  NaN  NaN  7.0
1  NaN  NaN  8.0
2  NaN  NaN  9.0

# Concatenating a DataFrame with a Series whose index does not match the DataFrame's row index
import pandas as pd

s = {
    'A':[1,2,3],
    'B':[4,5,6]
}
df = pd.DataFrame(s)
serie = pd.Series([7,8],name = 'g',index = ['A','B'])

# Add as a column
print(pd.concat([df,serie],axis=1))

print()

# Add as rows
print(pd.concat([df,serie],axis=0))

     A    B    g
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  6.0  NaN
A  NaN  NaN  7.0
B  NaN  NaN  8.0

     A    B    g
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  6.0  NaN
A  NaN  NaN  7.0
B  NaN  NaN  8.0

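You may wonder why the name of a Series becomes a column label for the whole Series rather than a label that could serve either axis. For now, it is fine to think of it as a column label; the example below shows why.

concat treats a Series as a column, which matches the column-oriented way a DataFrame is normally handled. That is why the Series cannot be attached directly as a row. Instead, convert the Series to a DataFrame first, transpose it with .T, and then concatenate; that should do it.

# Concatenating a DataFrame with a Series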
import pandas as pd

s = {
    'A':[1,2,3],
    'B':[4,5,6]
}
df = pd.DataFrame(s)

# Append s1 to df as a new row

s1 = pd.Series([7,8,9],name='a',index=['C','D','E'])

df1 = pd.DataFrame(s1).T
df2 = pd.concat([df,df1],axis=0)
print(df2)

     A    B    C    D    E
0  1.0  4.0  NaN  NaN  NaN
1  2.0  5.0  NaN  NaN  NaN
2  3.0  6.0  NaN  NaN  NaN
a  NaN  NaN  7.0  8.0  9.0
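In the example above the Series index (C, D, E) does not overlap df's columns, so the new row creates three extra columns. When the Series index matches the existing column labels, the same convert-and-transpose approach appends a clean row. This is a sketch under that assumption; the row label 'd' is invented for the example, and to_frame() is used here as an equivalent of pd.DataFrame(s1).

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

# The Series index matches df's column labels; its name becomes the row label
new_row = pd.Series([7, 8], index=['A', 'B'], name='d')

# to_frame() gives a one-column DataFrame; .T turns it into a one-row DataFrame
result = pd.concat([df, new_row.to_frame().T])
print(result)
```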

1.2.4 Deleting rows

As with deleting columns, use the drop method and pass in the row label (or labels) to delete.
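A short sketch of the three most common ways to call drop for rows. The DataFrame and the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=['a', 'b', 'c', 'd'])

one_removed = df.drop('b')               # drop a single row by label
two_removed = df.drop(['a', 'd'])        # drop several rows at once
same_thing = df.drop(index=['a', 'd'])   # index= makes the axis explicit

# drop returns a new DataFrame; df itself still has all four rows
print(one_removed)
print(two_removed)
print(len(df))
```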

II. Functions

2.1 Common statistical functions

count()	Count the non-null values
sum()	Sum
mean()	Mean
median()	Median
std()	Standard deviation
min()	Minimum
max()	Maximum
abs()	Absolute value
prod()	Product of all values
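A quick sketch of several of these functions on a small DataFrame. The data is made up; column A contains a missing value so that the effect on count() is visible. By default these reductions work column by column and skip missing values.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3, None], 'B': [4.0, 5.0, 6.0, 7.0]})

print(df.count())      # non-null values per column
print(df.sum())        # column sums, missing values skipped
print(df.mean())       # column means
print(df['B'].std())   # sample standard deviation (ddof=1 by default)
print(df.abs())        # element-wise; returns a DataFrame of the same shape
print(df.prod())       # product of the values in each column
```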

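2.2 Reindexing

reindex conforms a DataFrame to a new set of row labels or column labels while keeping each label matched to its data. It returns a new DataFrame rather than modifying the original. You can use it to reorder existing data. If a new label does not exist in the original DataFrame, every element under that label is filled with NaN.

DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=np.nan, limit=None, tolerance=None)

Parameters:

1. labels:

- Type: array or list, default None.

- Description: the new labels; axis decides whether they apply to the rows or the columns.

2. index:

- Type: array or list, default None.

- Description: the new row labels.

3. columns:

- Type: array or list, default None.

- Description: the new column labels.

4. axis:

- Type: integer or string, default None.

- Description: the axis to reindex. 0 or 'index' means rows; 1 or 'columns' means columns.

5. method:

- Type: string, default None.

- Description: how to fill the gaps created by new labels. Options include 'ffill' (forward fill) and 'bfill' (backward fill). This only applies when the existing index is monotonically increasing or decreasing.

6. copy:

- Type: boolean, default True.

- Description: whether to return a new DataFrame or Series even when the new index is the same as the old one.

7. level:

- Type: integer or level name, default None.

- Description: for a MultiIndex, the level to reindex on.

8. fill_value:

- Type: scalar, default np.nan.

- Description: the value to use for missing entries.

9. limit:

- Type: integer, default None.

- Description: the maximum number of consecutive entries to fill when method is used.

10. tolerance:

- Type: scalar or list-like, default None.

- Description: the maximum allowed distance between an original label and a new label for an inexact match.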

import pandas as pd

# Create a sample DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data, index=['a', 'b', 'c'])

# Reindex the rows
new_index = ['a', 'b', 'c', 'd']
df_reindexed = df.reindex(new_index)
print(df_reindexed)

# Reindex the columns
new_columns = ['A', 'B', 'C', 'D']
df_reindexed = df.reindex(columns=new_columns)
print(df_reindexed)

# Reindex the rows with forward fill
# The new index ['a', 'b', 'c', 'd'] contains 'd', which is not in the original index. With method='ffill', row 'd' is filled with the values of the preceding row.
new_index = ['a', 'b', 'c', 'd']
df_reindexed = df.reindex(new_index, method='ffill')
print(df_reindexed)

# Reindex the rows and fill the missing entries with a given value
new_index = ['a', 'b', 'c', 'd']
df_reindexed = df.reindex(new_index, fill_value=0)
print(df_reindexed)

     A    B    C
a  1.0  4.0  7.0
b  2.0  5.0  8.0
c  3.0  6.0  9.0
d  NaN  NaN  NaN

   A  B  C   D
a  1  4  7 NaN
b  2  5  8 NaN
c  3  6  9 NaN

   A  B  C
a  1  4  7
b  2  5  8
c  3  6  9
d  3  6  9

   A  B  C
a  1  4  7
b  2  5  8
c  3  6  9
d  0  0  0
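The examples above all add a new label. The reordering use mentioned in the description can be sketched as follows, under the assumption that every requested label already exists, so no NaN appears. The name reordered is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

# Reorder the rows and the columns in one call
reordered = df.reindex(index=['c', 'a', 'b'], columns=['B', 'A'])
print(reordered)

# reindex returns a new object; df keeps its original order
print(df)
```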