Python Cool Libraries Tour: Third-Party Library Pandas (118)

I. Usage in Detail

521. The pandas.DataFrame.drop_duplicates method

521-1. Syntax

521-2. Parameters

521-3. Function

521-4. Return value

521-5. Notes

521-6. Usage

521-6-1. Data preparation

521-6-2. Code example

521-6-3. Output

522. The pandas.DataFrame.duplicated method

522-1. Syntax

522-2. Parameters

522-3. Function

522-4. Return value

522-5. Notes

522-6. Usage

522-6-1. Data preparation

522-6-2. Code example

522-6-3. Output

523. The pandas.DataFrame.equals method

523-1. Syntax

523-2. Parameters

523-3. Function

523-4. Return value

523-5. Notes

523-6. Usage

523-6-1. Data preparation

523-6-2. Code example

523-6-3. Output

524. The pandas.DataFrame.filter method

524-1. Syntax

524-2. Parameters

524-3. Function

524-4. Return value

524-5. Notes

524-6. Usage

524-6-1. Data preparation

524-6-2. Code example

524-6-3. Output

525. The pandas.DataFrame.first method

525-1. Syntax

525-2. Parameters

525-3. Function

525-4. Return value

525-5. Notes

525-6. Usage

525-6-1. Data preparation

525-6-2. Code example

525-6-3. Output

II. Recommended Reading

1. Python Foundations Tour

2. Python Functions Tour

3. Python Algorithms Tour

4. Python Magic Tour

5. Blog Home Page

I. Usage in Detail

521. The pandas.DataFrame.drop_duplicates method

521-1. Syntax

# 521. The pandas.DataFrame.drop_duplicates method
pandas.DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)
Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes are ignored.

Parameters:
subset
column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns.

keep
{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to keep.

‘first’ : Drop duplicates except for the first occurrence.

‘last’ : Drop duplicates except for the last occurrence.

False : Drop all duplicates.

inplace
bool, default False
Whether to modify the DataFrame rather than creating a new one.

ignore_index
bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns:
DataFrame or None
DataFrame with duplicates removed or None if inplace=True.

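521-2. Parameters

521-2-1. subset (optional, default None): a single column label or a list of labels naming the columns in which to look for duplicates. If omitted, all columns are compared.

521-2-2. keep (optional, default 'first'): {'first', 'last', False}; determines which row of each group of duplicates is kept. Possible values:

'first': keep the first occurrence and drop the later ones.
'last': keep the last occurrence and drop the earlier ones.
False: drop every row that has a duplicate, keeping none of them.

521-2-3. inplace (optional, default False): a Boolean. If True, the duplicates are removed from the DataFrame in place instead of returning a new object, and the return value is None.

521-2-4. ignore_index (optional, default False): a Boolean. If True, the index of the returned DataFrame is reset to 0, 1, …, n - 1.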
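
The parameters above can be combined. The following is a minimal sketch, assuming a small made-up table (the column names name and score are hypothetical), that shows how subset, keep=False, and ignore_index change the result:

```python
import pandas as pd

# Hypothetical sample data: 'name' repeats, 'score' differs between some repeats.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Cid", "Cid"],
    "score": [90, 85, 70, 60, 60],
})

# Judge duplicates by 'name' only and keep the last occurrence of each name.
by_name_last = df.drop_duplicates(subset="name", keep="last")

# keep=False drops every row that has a duplicate (all columns are compared).
no_repeats = df.drop_duplicates(keep=False)

# ignore_index=True relabels the result 0..n-1 instead of keeping the old labels.
relabeled = df.drop_duplicates(subset=["name"], ignore_index=True)

print(by_name_last)
print(no_repeats)
print(relabeled)
```

Note that without ignore_index=True the surviving rows keep their original index labels, so the index may contain gaps.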

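521-3. Function

Removes duplicate rows from a DataFrame. You can specify which columns are considered when identifying duplicates, and you can choose to keep the first occurrence, keep the last occurrence, or drop all duplicated rows.

521-4. Return value

If inplace is False, the method returns a new DataFrame with the duplicate rows removed. If inplace is True, the method returns None and modifies the DataFrame in place.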
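
As a small illustration of the two return behaviors described above, the sketch below (with made-up data) compares the default call with an inplace=True call:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 3, 4]})

# Default (inplace=False): a new DataFrame comes back and df is untouched.
deduped = df.drop_duplicates()
rows_before = len(df)

# inplace=True: df itself is modified and, per the reference text above,
# the call returns None.
returned = df.drop_duplicates(inplace=True)
rows_after = len(df)

print(len(deduped), rows_before, rows_after, returned)
```

Because the in-place call returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None, which is a common mistake.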

521-5. Notes

None.

521-6. Usage

521-6-1. Data preparation

None.

521-6-2. Code example

# 521. The pandas.DataFrame.drop_duplicates method
import pandas as pd
# Create a sample DataFrame
data = {
    'A': [1, 1, 2, 2, 3, 3],
    'B': [4, 4, 5, 5, 6, 6]
}
df = pd.DataFrame(data)
# Print the original DataFrame
print("Original DataFrame:")
print(df)
# Remove duplicate rows (keep the first occurrence)
df_no_duplicates = df.drop_duplicates()
print("\nDuplicates removed (first occurrence kept):")
print(df_no_duplicates)
# Remove duplicate rows (keep the last occurrence)
df_no_duplicates_last = df.drop_duplicates(keep='last')
print("\nDuplicates removed (last occurrence kept):")
print(df_no_duplicates_last)
# Remove duplicate rows in place
df.drop_duplicates(inplace=True)
print("\nDuplicates removed in place:")
print(df)

521-6-3. Output

# 521. The pandas.DataFrame.drop_duplicates method
# Original DataFrame:
#    A  B
# 0  1  4
# 1  1  4
# 2  2  5
# 3  2  5
# 4  3  6
# 5  3  6
# 
# Duplicates removed (first occurrence kept):
#    A  B
# 0  1  4
# 2  2  5
# 4  3  6
# 
# Duplicates removed (last occurrence kept):
#    A  B
# 1  1  4
# 3  2  5
# 5  3  6
# 
# Duplicates removed in place:
#    A  B
# 0  1  4
# 2  2  5
# 4  3  6

522. The pandas.DataFrame.duplicated method

522-1. Syntax

# 522. The pandas.DataFrame.duplicated method
pandas.DataFrame.duplicated(subset=None, keep='first')
Return boolean Series denoting duplicate rows.

Considering certain columns is optional.

Parameters:
subset
column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns.

keep
{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to mark.

first : Mark duplicates as True except for the first occurrence.

last : Mark duplicates as True except for the last occurrence.

False : Mark all duplicates as True.

Returns:
Series
Boolean series for each duplicated rows.

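522-2. Parameters

522-2-1. subset (optional, default None): a single label or a list of labels naming the columns to check for duplicates. With None (the default), all columns are compared.

522-2-2. keep (optional, default 'first'): {'first', 'last', False}; determines which occurrence in each group of duplicates is left unmarked. Nothing is removed; rows are only flagged. Possible values:

'first': mark duplicates as True except for the first occurrence.
'last': mark duplicates as True except for the last occurrence.
False: mark all duplicates as True, including the first and last occurrences.

522-3. Function

Returns a Boolean Series that indicates, for each row, whether it is flagged as a duplicate. True means the row is flagged; False means the row is either unique or the one occurrence that keep leaves unmarked.

522-4. Return value

A Boolean Series with the same number of rows as the DataFrame; each element gives the duplicate status of the corresponding row.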
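
Because the result is a Boolean Series, it can be used directly as a mask. The sketch below, which reuses the sample data from the code example in this section, shows three typical uses: reproducing drop_duplicates, counting duplicates, and listing every copy of a duplicated row.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 2, 3, 4, 4],
    "B": ["x", "y", "y", "z", "x", "x"],
})

# True for every repeat after the first occurrence of a row.
mask = df.duplicated()

# Inverting the mask selects the same rows that drop_duplicates() keeps.
deduped = df[~mask]
same_as_drop = deduped.equals(df.drop_duplicates())

# Summing the mask counts the rows that would be dropped.
dup_count = int(mask.sum())

# keep=False flags every member of a duplicate group, so all copies are listed.
all_copies = df[df.duplicated(keep=False)]

print(same_as_drop)
print(dup_count)
print(all_copies)
```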

522-5. Notes

None.

522-6. Usage

522-6-1. Data preparation

None.

522-6-2. Code example

# 522. The pandas.DataFrame.duplicated method
import pandas as pd
data = {
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x']
}
df = pd.DataFrame(data)
# Check for duplicates across all columns
print(df.duplicated())
# Check for duplicates in column 'A' only
print(df.duplicated(subset=['A']))
# Flag every duplicate except the last occurrence
print(df.duplicated(keep='last'))
# Flag all duplicates
print(df.duplicated(keep=False))

522-6-3. Output

# 522. The pandas.DataFrame.duplicated method
# 0    False
# 1    False
# 2     True
# 3    False
# 4    False
# 5     True
# dtype: bool
# 0    False
# 1    False
# 2     True
# 3    False
# 4    False
# 5     True
# dtype: bool
# 0    False
# 1     True
# 2    False
# 3    False
# 4     True
# 5    False
# dtype: bool
# 0    False
# 1     True
# 2     True
# 3    False
# 4     True
# 5     True
# dtype: bool

523. The pandas.DataFrame.equals method

523-1. Syntax

# 523. The pandas.DataFrame.equals method
pandas.DataFrame.equals(other)
Test whether two objects contain the same elements.

This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.

The row/column index do not need to have the same type, as long as the values are considered equal. Corresponding columns and index must be of the same dtype.

Parameters:
other
Series or DataFrame
The other Series or DataFrame to be compared with the first.

Returns:
bool
True if all elements are the same in both objects, False otherwise.

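523-2. Parameters

523-2-1. other (required): the other Series or DataFrame to compare with the current one.

523-3. Function

Tests whether two DataFrames are equal. It compares the shape and all elements of the two objects and returns a single Boolean indicating whether they are identical. NaN values in the same position are treated as equal.

523-4. Return value

Returns True if the two DataFrames are identical (same elements, same labels, and the same dtype in corresponding columns); otherwise returns False.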
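
Two details from the description above are easy to miss: NaN values in the same position count as equal, and corresponding columns must share a dtype. A minimal sketch with made-up frames illustrates both:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"A": [1.0, np.nan, 3.0]})
right = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

# equals() treats NaN in the same location as equal.
nan_equal = left.equals(right)

# Element-wise == does not: NaN != NaN, so not every cell compares True.
elementwise_all = bool((left == right).all().all())

# Same values but different dtypes (int64 vs float64) are not equal.
as_int = pd.DataFrame({"A": [1, 2]})
as_float = pd.DataFrame({"A": [1.0, 2.0]})
dtype_equal = as_int.equals(as_float)

print(nan_equal)        # True
print(elementwise_all)  # False
print(dtype_equal)      # False
```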

523-5. Notes

None.

523-6. Usage

523-6-1. Data preparation

None.

523-6-2. Code example

# 523. The pandas.DataFrame.equals method
import pandas as pd
# Create two identical DataFrames
data1 = {
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
}
data2 = {
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Compare the two DataFrames
result = df1.equals(df2)
print(result)
# Create a different DataFrame
data3 = {
    'A': [1, 2, 4],
    'B': ['a', 'b', 'd']
}
df3 = pd.DataFrame(data3)
# Compare df1 with df3
result = df1.equals(df3)
print(result)

523-6-3. Output

# 523. The pandas.DataFrame.equals method
# True
# False

524. The pandas.DataFrame.filter method

524-1. Syntax

# 524. The pandas.DataFrame.filter method
pandas.DataFrame.filter(items=None, like=None, regex=None, axis=None)
Subset the dataframe rows or columns according to the specified index labels.

Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.

Parameters:
items
list-like
Keep labels from axis which are in items.

like
str
Keep labels from axis for which “like in label == True”.

regex
str (regular expression)
Keep labels from axis for which re.search(regex, label) == True.

axis
{0 or ‘index’, 1 or ‘columns’, None}, default None
The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, ‘columns’ for DataFrame. For Series this parameter is unused and defaults to None.

Returns:
same type as input object.

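524-2. Parameters

524-2-1. items (optional, default None): a list-like object (for example a list or tuple) giving the exact row or column labels to keep. Only the parts matching these labels are kept.

524-2-2. like (optional, default None): a string. A label is kept if it contains this string as a substring.

524-2-3. regex (optional, default None): a regular expression. A label is kept if re.search(regex, label) finds a match.

524-2-4. axis (optional, default None): {0 or 'index', 1 or 'columns', None}; the axis to filter on. 0 or 'index' filters row labels and 1 or 'columns' filters column labels. With None (the default), a DataFrame is filtered on its columns.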
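
The code example later in this section filters columns only. The following sketch, assuming a made-up frame with string row labels, shows the same three parameters applied to rows through axis=0:

```python
import pandas as pd

# Hypothetical data with string row labels so that label matching is visible.
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6]},
    index=["apple", "banana", "apricot"],
)

# like: keep rows whose label contains the substring 'ap'.
rows_like = df.filter(like="ap", axis=0)

# regex: keep rows whose label ends with 'a'.
rows_regex = df.filter(regex="a$", axis=0)

# items: keep exactly these labels; 'cherry' is not in the index and is skipped.
rows_items = df.filter(items=["banana", "cherry"], axis=0)

print(rows_like)
print(rows_regex)
print(rows_items)
```

In all three cases the match is made against the labels, never against the values stored in the cells.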

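524-3. Function

Selects rows or columns of a DataFrame whose labels match the given condition and returns the matching subset as a new DataFrame. Matching can be done by exact labels, by substring, or by regular expression. The filter is applied to the index labels, not to the data contents.

524-4. Return value

Returns a new, filtered DataFrame; the original DataFrame is not modified.

524-5. Notes

None.

524-6. Usage

524-6-1. Data preparation

None.

524-6-2. Code example

# 524. The pandas.DataFrame.filter method
import pandas as pd
# Sample data
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
    'D': [10, 11, 12]
}
df = pd.DataFrame(data)
# Filter columns with the items parameter
filtered_df_items = df.filter(items=['A', 'C'])
print(filtered_df_items, end='\n\n')
# Filter columns with the like parameter
filtered_df_like = df.filter(like='B')
print(filtered_df_like, end='\n\n')
# Filter columns with the regex parameter
filtered_df_regex = df.filter(regex='[CD]')
print(filtered_df_regex)

524-6-3. Output

# 524. The pandas.DataFrame.filter method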
#    A  C
# 0  1  7
# 1  2  8
# 2  3  9
# 
#    B
# 0  4
# 1  5
# 2  6
# 
#    C   D
# 0  7  10
# 1  8  11
# 2  9  12

525. The pandas.DataFrame.first method

525-1. Syntax

# 525. The pandas.DataFrame.first method
pandas.DataFrame.first(offset)
Select initial periods of time series data based on a date offset.

Deprecated since version 2.1: first() is deprecated and will be removed in a future version. Please create a mask and filter using .loc instead.

For a DataFrame with a sorted DatetimeIndex, this function can select the first few rows based on a date offset.

Parameters:
offset
str, DateOffset or dateutil.relativedelta
The offset length of the data that will be selected. For instance, ‘1ME’ will display all the rows having their index within the first month.

Returns:
Series or DataFrame
A subset of the caller.

Raises:
TypeError
If the index is not a DatetimeIndex.

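525-2. Parameters

525-2-1. offset (required): a string, DateOffset, or dateutil.relativedelta giving the length of the period to select, counted from the first index value. For example, '5D' means 5 days. Month-based offsets are also accepted; the reference text above writes one month as '1ME', and the older alias 'M' (as in '3M') is deprecated in recent pandas versions.

525-3. Function

Selects the rows from the start of the time index up to the given offset. The index must be a DatetimeIndex, otherwise a TypeError is raised. The method is deprecated since pandas 2.1; the recommended replacement is a mask applied with .loc.

525-4. Return value

Returns a DataFrame containing the rows from the start of the original DataFrame up to the given offset.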
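
Since the reference text recommends a mask with .loc in place of the deprecated first(), the sketch below shows one way to select the first three days without calling first(). It assumes a sorted DatetimeIndex, and the cutoff variable is a name introduced here for illustration only:

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=10)
df = pd.DataFrame({"A": range(10)}, index=dates)

# Everything strictly before (first timestamp + 3 days) lies in the first 3 days.
cutoff = df.index[0] + pd.Timedelta(days=3)
first3_days = df.loc[df.index < cutoff]

print(first3_days)
```

For this daily index the result covers 2024-01-01 through 2024-01-03, the same three rows that the code example in this section selects with '3D'.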

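525-5. Notes

None.

525-6. Usage

525-6-1. Data preparation

None.

525-6-2. Code example

# 525. The pandas.DataFrame.first method
import pandas as pd
import numpy as np
# Create sample data (random values, so the numbers differ from run to run)
dates = pd.date_range('2024-01-01', periods=10)
data = np.random.randn(10, 2)
df = pd.DataFrame(data, index=dates, columns=['A', 'B'])
# Use first() to extract the first 3 days of data
# (first() is deprecated since pandas 2.1, see the syntax section above)
first3_days = df.first('3D')
print(first3_days)

525-6-3. Output

# 525. The pandas.DataFrame.first method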
#                    A         B
# 2024-01-01 -0.619384  1.252433
# 2024-01-02 -0.556967  0.084537
# 2024-01-03  0.692299 -0.505099