Python Cool Libraries Tour: the Third-Party Library Pandas (013)

I. Usage in Detail

31. The pandas.read_feather function

31-1. Syntax

# 31. The pandas.read_feather function
pandas.read_feather(path, columns=None, use_threads=True, storage_options=None, dtype_backend=_NoDefault.no_default)
Load a feather-format object from the file path.

Parameters:
path : str, path object, or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.feather.

columns : sequence, default None
If not provided, all columns are read.

use_threads : bool, default True
Whether to parallelize reading using multiple threads.

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details; for more examples on storage options, refer to the pandas I/O user guide.

dtype_backend : {'numpy_nullable', 'pyarrow'}, default 'numpy_nullable'
Back-end data type applied to the resultant DataFrame (still experimental). Behaviour is as follows:

"numpy_nullable": returns nullable-dtype-backed DataFrame (default).

"pyarrow": returns pyarrow-backed nullable ArrowDtype DataFrame.

New in version 2.0.

Returns:
type of object stored in file

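31-2. Parameters

31-2-1. path (required): the Feather file to read. It can be a string, a path object, or a file-like object with a binary read() method; a string may also be a URL.

31-2-2. columns (optional, default None): a sequence of column names to read. With None (the default), every column in the file is read. Selecting only the columns you need reduces memory use.

31-2-3. use_threads (optional, default True): whether to parallelize reading across multiple threads. Any speedup depends on the data and the machine, and in some cases use_threads=False may perform better.

31-2-4. storage_options (optional, default None): extra options for a particular storage connection, such as S3 or Google Cloud Storage. For HTTP(S) URLs they are forwarded as request headers; for other URLs they are passed to fsspec.open. Most users reading local files can leave it unset.

31-2-5. dtype_backend (optional, new in pandas 2.0, still experimental): selects the dtype back end of the resulting DataFrame. 'numpy_nullable' returns columns with nullable pandas dtypes, and 'pyarrow' returns pyarrow-backed ArrowDtype columns. This is a user-facing option rather than an internal one; in the signature its default is the _NoDefault.no_default sentinel.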

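31-3. Purpose

Loads a Feather-format object from a file path.

31-4. Return value

Returns the type of object stored in the file, which in practice is a pandas.DataFrame.

31-5. Notes

Feather is a binary, column-oriented file format built on Apache Arrow and designed for fast reading and writing of data frames. It is typically much faster to read and write than text formats such as CSV, and it preserves column dtypes; the resulting file size depends on the data and on the compression used. This makes read_feather a good fit when large datasets need to be loaded quickly.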

31-6. Usage

31-6-1. Data preparation

None

31-6-2. Code example

# 31. The pandas.read_feather function
# Feather support requires the pyarrow library (fastparquet does not handle Feather files)
import pandas as pd
import numpy as np
# Create a simple DataFrame
df = pd.DataFrame({
    'A': np.random.randn(100),  # 100 samples from the standard normal distribution
    'B': np.random.randint(1, 100, 100)  # 100 random integers from 1 to 99
})
# Save it to a Feather file
file_path = 'example.feather'
try:
    df.to_feather(file_path)
    print(f"DataFrame saved to {file_path}")
except Exception as e:
    print(f"Error while saving the Feather file: {e}")
# Read the Feather file back
try:
    df_read = pd.read_feather(file_path)
    print("Feather file read successfully!")
    # Show the data that was read
    print(df_read.head())  # only the first rows, to keep the output short
except FileNotFoundError:
    print(f"File {file_path} not found; make sure it exists!")
except Exception as e:
    print(f"Error while reading the Feather file: {e}")

31-6-3. Output

# 31. The pandas.read_feather function
# (the data is random, so the values differ on every run)
# DataFrame saved to example.feather
# Feather file read successfully!
#           A   B
# 0 -0.425313  48
# 1 -1.915324  72
# 2 -0.391787  97
# 3 -0.014345  48
# 4  1.813109  53

32. The pandas.DataFrame.to_feather function

32-1. Syntax

# 32. The pandas.DataFrame.to_feather function
DataFrame.to_feather(path, **kwargs)
Write a DataFrame to the binary Feather format.

Parameters:
path : str, path object, file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If a string or a path, it will be used as Root Directory path when writing a partitioned dataset.

**kwargs
Additional keywords passed to pyarrow.feather.write_feather(). This includes the compression, compression_level, chunksize and version keywords.

Notes

This function writes the dataframe as a feather file. Requires a default index. For saving the DataFrame with your custom index use a method that supports custom indices e.g. to_parquet.

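32-2. Parameters

32-2-1. path (required): where to write the file. It can be a string, a path object, or a file-like object with a binary write() method; relative and absolute paths both work. An existing file at that path is overwritten.

32-2-2. **kwargs (optional): additional keyword arguments passed to pyarrow.feather.write_feather(). The pandas documentation names compression, compression_level, chunksize and version; their availability and exact behavior may vary with the PyArrow version:

32-2-2-1. compression (optional, default None): the compression codec, one of 'zstd', 'lz4' or 'uncompressed'. With None, PyArrow uses LZ4 for version 2 files when it is available and otherwise writes uncompressed data. 'snappy' is a Parquet codec and is not accepted for Feather.

32-2-2-2. compression_level (int, optional): the compression level, for codecs that support one. Higher values usually give a better compression ratio at the cost of more computation.

32-2-2-3. version (int, default 2): the Feather file version to write. Version 2 is the current format and version 1 is the legacy one; it rarely needs to be set unless you have a specific compatibility requirement.

32-2-2-4. chunksize (int, optional): for version 2 files, the maximum number of rows per internal Arrow record batch. Note that the write_feather() signature documented by PyArrow does not include a metadata keyword, so custom metadata cannot be attached this way.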

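32-3. Purpose

Writes a DataFrame to a file in the binary Feather format.

32-4. Return value

Returns None. The method exists for its side effect of writing the DataFrame to the given path; it does not produce a new DataFrame or any other object.

32-5. Notes

None

32-6. Usage

32-6-1. Data preparation

None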

32-6-2. Code example

# 32. The pandas.DataFrame.to_feather function
# Feather support requires the pyarrow library (fastparquet does not handle Feather files)
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})
# Save the DataFrame as a Feather file
df.to_feather('example.feather')
# Note: nothing is displayed here, because to_feather() returns None
# You can check the file system to confirm that the file was created
# Later, pd.read_feather() loads the data back
df_loaded = pd.read_feather('example.feather')
print(df_loaded)

32-6-3. Output

# 32. The pandas.DataFrame.to_feather function
#    A  B
# 0  1  a
# 1  2  b
# 2  3  c

33. The pandas.read_parquet function

33-1. Syntax

# 33. The pandas.read_parquet function
pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=_NoDefault.no_default, dtype_backend=_NoDefault.no_default, filesystem=None, filters=None, **kwargs)
Load a parquet object from the file path, returning a DataFrame.

Parameters:
path : str, path object or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be: file://localhost/path/to/tables or s3://bucket/partition_dir.

engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.

When using the 'pyarrow' engine and no storage options are provided and a filesystem is implemented by both pyarrow.fs and fsspec (e.g. “s3://”), then the pyarrow.fs filesystem is attempted first. Use the filesystem keyword with an instantiated fsspec filesystem if you wish to use its implementation.

columns : list, default None
If not None, only these columns will be read from the file.

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details; for more examples on storage options, refer to the pandas I/O user guide.

New in version 1.3.0.

use_nullable_dtypes : bool, default False
If True, use dtypes that use pd.NA as missing value indicator for the resulting DataFrame. (only applicable for the pyarrow engine) As new dtypes are added that support pd.NA in the future, the output with this option will change to use those dtypes. Note: this is an experimental option, and behaviour (e.g. additional support dtypes) may change without notice.

Deprecated since version 2.0.

dtype_backend : {'numpy_nullable', 'pyarrow'}, default 'numpy_nullable'
Back-end data type applied to the resultant DataFrame (still experimental). Behaviour is as follows:

"numpy_nullable": returns nullable-dtype-backed DataFrame (default).

"pyarrow": returns pyarrow-backed nullable ArrowDtype DataFrame.

New in version 2.0.

filesystem : fsspec or pyarrow filesystem, default None
Filesystem object to use when reading the parquet file. Only implemented for engine="pyarrow".

New in version 2.1.0.

filters : List[Tuple] or List[List[Tuple]], default None
To filter out data. Filter syntax: [[(column, op, val), …],…] where op is [==, =, >, >=, <, <=, !=, in, not in]. The innermost tuples are transposed into a set of filters applied through an AND operation. The outer list combines these sets of filters through an OR operation. A single list of tuples can also be used, meaning that no OR operation between set of filters is to be conducted.

Using this argument will NOT result in row-wise filtering of the final partitions unless engine="pyarrow" is also specified. For other engines, filtering is only performed at the partition level, that is, to prevent the loading of some row-groups and/or files.

New in version 2.1.0.

**kwargs
Any additional kwargs are passed to the engine.

Returns:
DataFrame

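33-2. Parameters

33-2-1. path (required): the Parquet data to read. It can be a string, a path object, or a file-like object with a binary read() method. A string may be a relative or absolute local path, a URL, or a directory that contains multiple partitioned Parquet files.

33-2-2. engine (optional, default 'auto'): the Parquet library used for reading. 'auto' uses the io.parquet.engine option, whose default behavior is to try 'pyarrow' and fall back to 'fastparquet' if pyarrow is unavailable.

33-2-3. columns (optional, default None): a list of column names to read. If given, only these columns are read, which can noticeably reduce memory use and load time.

33-2-4. storage_options (optional, default None): extra options for the storage connection, such as credentials or configuration settings. It is mainly used for Parquet files in cloud storage such as AWS S3 or Google Cloud Storage.

33-2-5. use_nullable_dtypes (optional, deprecated since pandas 2.0): if True, the result uses dtypes that represent missing values with pd.NA, such as pd.Int64Dtype() and pd.StringDtype(). It applies only to the pyarrow engine; new code should use dtype_backend instead.

33-2-6. dtype_backend (optional, new in pandas 2.0, still experimental): selects the dtype back end of the resulting DataFrame. 'numpy_nullable' returns columns with nullable pandas dtypes, and 'pyarrow' returns pyarrow-backed ArrowDtype columns. This is a user-facing option rather than an internal one.

33-2-7. filesystem (optional, default None, new in pandas 2.1.0): an fsspec or pyarrow filesystem object to use when reading the file. It is only implemented for engine='pyarrow'.

33-2-8. filters (optional, default None, new in pandas 2.1.0): a list of (column, op, val) tuples, or a list of such lists, used to skip data while reading. Tuples within one inner list are combined with AND, and the inner lists are combined with OR. Row-wise filtering is applied only with engine='pyarrow'; other engines filter at the partition level only, skipping whole row groups or files.

33-2-9. **kwargs (optional): any other keyword arguments are passed to the underlying Parquet engine. They differ between engines, so consult the documentation of the engine in use.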

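33-3. Purpose

Loads Parquet data from the given path and returns it as a pandas DataFrame.

33-4. Return value

A pandas DataFrame containing the data read from the Parquet file.

33-5. Notes

33-5-1. For large Parquet files, use the columns and filters parameters to cut down the amount of data loaded into memory and speed up reading.

33-5-2. If the Parquet file lives in cloud storage, make sure storage_options, or where needed the filesystem parameter, is set correctly so that the file can be accessed.

33-5-3. dtype_backend controls which family of dtypes the result uses, and use_nullable_dtypes is deprecated since pandas 2.0 in its favor. Neither needs to be set for ordinary use.

33-5-4. Parquet is a columnar file format well suited to storing large datasets and reading them efficiently. With read_parquet, data stored in Parquet files can be loaded into a pandas DataFrame for further analysis or processing.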

33-6. Usage

33-6-1. Data preparation

None

33-6-2. Code example

# 33. The pandas.read_parquet function
# This example passes engine='pyarrow' explicitly, so the pyarrow library must be installed
import pandas as pd
# Create a pandas DataFrame
data = {
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
# Path where the Parquet file will be saved
parquet_path = 'example.parquet'
# Save the DataFrame as a Parquet file
df.to_parquet(parquet_path, engine='pyarrow', compression='snappy')
print(f"Parquet file saved to: {parquet_path}")
# Read the Parquet file back
df_read = pd.read_parquet(parquet_path, engine='pyarrow')
# Show the DataFrame that was read, to verify the data
print("Contents of the Parquet file:")
print(df_read)

33-6-3. Output

# 33. The pandas.read_parquet function
# Parquet file saved to: example.parquet
# Contents of the Parquet file:
#    id     name  age         city
# 0   1    Alice   25     New York
# 1   2      Bob   30  Los Angeles
# 2   3  Charlie   35      Chicago
# 3   4    David   40      Houston