对Python的 pandas 库所有的内置元类、函数、子模块等全部浏览一遍,然后挑选一些重点学习一下。我安装的库版本号为1.3.5,如下:
>>> import pandas as pd
>>> pd.__version__
'1.3.5'
>>> print(pd.__doc__)
pandas - a powerful data analysis and manipulation library for Python
=====================================================================
**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.
Main Features
-------------
Here are just a few of the things that pandas does well:
- Easy handling of missing data in floating point as well as non-floating
point data.
- Size mutability: columns can be inserted and deleted from DataFrame and
higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned
to a set of labels, or the user can simply ignore the labels and let
`Series`, `DataFrame`, etc. automatically align the data for you in
computations.
- Powerful, flexible group by functionality to perform split-apply-combine
operations on data sets, for both aggregating and transforming data.
- Make it easy to convert ragged, differently-indexed data in other Python
and NumPy data structures into DataFrame objects.
- Intelligent label-based slicing, fancy indexing, and subsetting of large
data sets.
- Intuitive merging and joining data sets.
- Flexible reshaping and pivoting of data sets.
- Hierarchical labeling of axes (possible to have multiple labels per tick).
- Robust IO tools for loading data from flat files (CSV and delimited),
Excel files, databases, and saving/loading data from the ultrafast HDF5
format.
- Time series-specific functionality: date range generation and frequency
conversion, moving window statistics, date shifting and lagging.
百度翻译如下:
panda——一个强大的Python数据分析和操作库
=====================================================================
**pandas**是一个Python包,提供快速、灵活和富有表现力的数据设计用于同时处理“关系”或“标记”数据的结构简单直观。它旨在成为使用Python进行实际的**真实世界**数据分析。此外,它还有成为**最强大、最灵活的开源数据的更广泛目标任何语言的分析/操作工具**。它已经很好了朝着这个目标前进。主要特点
-------------
下面是pandas做得很好的几件事:-轻松处理浮点和非浮点丢失的数据点数据。
-大小可变性:可以从DataFrame和高维对象
-自动和显式数据对齐:对象可以显式对齐或者用户可以忽略这些标签,让`“系列”、“DataFrame”等自动为您对齐计算。
-功能强大、灵活的分组功能,可执行拆分应用组合对数据集的操作,用于聚合和转换数据。
-在其他Python中轻松转换不规则的、不同索引的数据将NumPy数据结构转换为DataFrame对象。
-基于标签的智能切片、花式索引和大型数据的子集数据集。
-直观地合并和连接数据集。
-数据集的灵活重塑和数据透视。
-轴的分层标签(每个刻度可能有多个标签)。
-用于从平面文件(CSV和分隔)加载数据的强大IO工具,Excel文件、数据库以及从超快HDF5保存/加载数据总体安排
-时间序列特定功能:日期范围生成和频率转换、移动窗口统计、日期偏移和滞后。
Python pandas库中包含有好几千的元类、库函数、子模块等等,真所谓“任凭弱水三千,我只取一瓢饮”,我先来全部罗列一遍,然后再挑几个重要的学习一番。
119个pandas库函数(包含元类、函数、子模块等):
>>> import pandas as pd
>>> funcs = [_ for _ in dir(pd) if not _.startswith('_')]
>>> len(funcs)
119
>>> for i,f in enumerate(funcs,1):
print(f'{f:18}',end='' if i%5 else '\n')
BooleanDtype Categorical CategoricalDtype CategoricalIndex DataFrame
DateOffset DatetimeIndex DatetimeTZDtype ExcelFile ExcelWriter
Flags Float32Dtype Float64Dtype Float64Index Grouper
HDFStore Index IndexSlice Int16Dtype Int32Dtype
Int64Dtype Int64Index Int8Dtype Interval IntervalDtype
IntervalIndex MultiIndex NA NaT NamedAgg
Period PeriodDtype PeriodIndex RangeIndex Series
SparseDtype StringDtype Timedelta TimedeltaIndex Timestamp
UInt16Dtype UInt32Dtype UInt64Dtype UInt64Index UInt8Dtype
api array arrays bdate_range compat
concat core crosstab cut date_range
describe_option errors eval factorize get_dummies
get_option infer_freq interval_range io isna
isnull json_normalize lreshape melt merge
merge_asof merge_ordered notna notnull offsets
option_context options pandas period_range pivot
pivot_table plotting qcut read_clipboard read_csv
read_excel read_feather read_fwf read_gbq read_hdf
read_html read_json read_orc read_parquet read_pickle
read_sas read_spss read_sql read_sql_query read_sql_table
read_stata read_table read_xml reset_option set_eng_float_format
set_option show_versions test testing timedelta_range
to_datetime to_numeric to_pickle to_timedelta tseries
unique util value_counts wide_to_long
按元类、函数、子模块等,先分个类:
import pandas as pd
funcs = [_ for _ in dir(pd) if not _.startswith('_')]
types = type(pd.DataFrame), type(pd.array), type(pd)
Names = 'Type','Function','Module','Other'
Types = {}
for f in funcs:
t = type(eval("pd."+f))
t = Names[-1 if t not in types else types.index(type(eval("pd."+f)))]
Types[t] = Types.get(t,[])+[f]
for n in Names:
print(f"\n\n【{n}】:{len(Types[n])}")
for i,f in enumerate(Types[n],1):
print(f'{f:18} ',end='' if i%5 or i==len(Types[n]) else '\n')
分类结果:
【Type】:40
BooleanDtype CategoricalDtype CategoricalIndex DataFrame DatetimeIndex
DatetimeTZDtype ExcelFile Flags Float32Dtype Float64Dtype
Float64Index Grouper HDFStore Index Int16Dtype
Int32Dtype Int64Dtype Int64Index Int8Dtype Interval
IntervalDtype IntervalIndex MultiIndex NamedAgg Period
PeriodDtype PeriodIndex RangeIndex Series SparseDtype
StringDtype Timedelta TimedeltaIndex Timestamp UInt16Dtype
UInt32Dtype UInt64Dtype UInt64Index UInt8Dtype option_context【Function】:56
array bdate_range concat crosstab cut
date_range eval factorize get_dummies infer_freq
interval_range isna isnull json_normalize lreshape
melt merge merge_asof merge_ordered notna
notnull period_range pivot pivot_table qcut
read_clipboard read_csv read_excel read_feather read_fwf
read_gbq read_hdf read_html read_json read_orc
read_parquet read_pickle read_sas read_spss read_sql
read_sql_query read_sql_table read_stata read_table read_xml
set_eng_float_format show_versions test timedelta_range to_datetime
to_numeric to_pickle to_timedelta unique value_counts
wide_to_long【Module】:12
api arrays compat core errors
io offsets pandas plotting testing
tseries util【Other】:11
Categorical DateOffset ExcelWriter IndexSlice NA
NaT describe_option get_option options reset_option
set_option
先给出56个库函数的原版帮助,有252K之多单篇博文放不下,只能以连载方式罗列:
A~G: Function01~09
Types['Function'][:9]
['array', 'bdate_range', 'concat', 'crosstab', 'cut', 'date_range', 'eval', 'factorize', 'get_dummies']
Function01
array(data: 'Sequence[object] | AnyArrayLike', dtype: 'Dtype | None' = None, copy: 'bool' = True) -> 'ExtensionArray'
Help on function array in module pandas.core.construction:
array(data: 'Sequence[object] | AnyArrayLike', dtype: 'Dtype | None' = None, copy: 'bool' = True) -> 'ExtensionArray'
Create an array.
Parameters
----------
data : Sequence of objects
The scalars inside `data` should be instances of the
scalar type for `dtype`. It's expected that `data`
represents a 1-dimensional array of data.
When `data` is an Index or Series, the underlying array
will be extracted from `data`.
dtype : str, np.dtype, or ExtensionDtype, optional
The dtype to use for the array. This may be a NumPy
dtype or an extension type registered with pandas using
:meth:`pandas.api.extensions.register_extension_dtype`.
If not specified, there are two possibilities:
1. When `data` is a :class:`Series`, :class:`Index`, or
:class:`ExtensionArray`, the `dtype` will be taken
from the data.
2. Otherwise, pandas will attempt to infer the `dtype`
from the data.
Note that when `data` is a NumPy array, ``data.dtype`` is
*not* used for inferring the array type. This is because
NumPy cannot represent all the types of data that can be
held in extension arrays.
Currently, pandas will infer an extension dtype for sequences of
==============================
Scalar Type Array Type
============================== =======================================
:class:`pandas.Interval` :class:`pandas.arrays.IntervalArray`
:class:`pandas.Period` :class:`pandas.arrays.PeriodArray`
:class:`datetime.datetime` :class:`pandas.arrays.DatetimeArray`
:class:`datetime.timedelta` :class:`pandas.arrays.TimedeltaArray`
:class:`int` :class:`pandas.arrays.IntegerArray`
:class:`float` :class:`pandas.arrays.FloatingArray`
:class:`str` :class:`pandas.arrays.StringArray` or
:class:`pandas.arrays.ArrowStringArray`
:class:`bool` :class:`pandas.arrays.BooleanArray`
=======================================
The ExtensionArray created when the scalar type is :class:`str` is determined by
``pd.options.mode.string_storage`` if the dtype is not explicitly given.
For all other cases, NumPy's usual inference rules will be used.
.. versionchanged:: 1.0.0
Pandas infers nullable-integer dtype for integer data,
string dtype for string data, and nullable-boolean dtype
for boolean data.
.. versionchanged:: 1.2.0
Pandas now also infers nullable-floating dtype for float-like
input data
copy : bool, default True
Whether to copy the data, even if not necessary. Depending
on the type of `data`, creating the new array may require
copying data, even if ``copy=False``.
Returns
-------
ExtensionArray
The newly created array.
Raises
------
ValueError
When `data` is not 1-dimensional.
See Also
--------
numpy.array : Construct a NumPy array.
Series : Construct a pandas Series.
Index : Construct a pandas Index.
arrays.PandasArray : ExtensionArray wrapping a NumPy array.
Series.array : Extract the array stored within a Series.
Notes
-----
Omitting the `dtype` argument means pandas will attempt to infer the
best array type from the values in the data. As new array types are
added by pandas and 3rd party libraries, the "best" array type may
change. We recommend specifying `dtype` to ensure that
1. the correct array type for the data is returned
2. the returned array type doesn't change as new extension types
are added by pandas and third-party libraries
Additionally, if the underlying memory representation of the returned
array matters, we recommend specifying the `dtype` as a concrete object
rather than a string alias or allowing it to be inferred. For example,
a future version of pandas or a 3rd-party library may include a
dedicated ExtensionArray for string data. In this event, the following
would no longer return a :class:`arrays.PandasArray` backed by a NumPy
array.
>>> pd.array(['a', 'b'], dtype=str)
<PandasArray>
['a', 'b']
Length: 2, dtype: str32
This would instead return the new ExtensionArray dedicated for string
data. If you really need the new array to be backed by a NumPy array,
specify that in the dtype.
>>> pd.array(['a', 'b'], dtype=np.dtype("<U1"))
<PandasArray>
['a', 'b']
Length: 2, dtype: str32
Finally, Pandas has arrays that mostly overlap with NumPy
* :class:`arrays.DatetimeArray`
* :class:`arrays.TimedeltaArray`
When data with a ``datetime64[ns]`` or ``timedelta64[ns]`` dtype is
passed, pandas will always return a ``DatetimeArray`` or ``TimedeltaArray``
rather than a ``PandasArray``. This is for symmetry with the case of
timezone-aware data, which NumPy does not natively support.
>>> pd.array(['2015', '2016'], dtype='datetime64[ns]')
<DatetimeArray>
['2015-01-01 00:00:00', '2016-01-01 00:00:00']
Length: 2, dtype: datetime64[ns]
>>> pd.array(["1H", "2H"], dtype='timedelta64[ns]')
<TimedeltaArray>
['0 days 01:00:00', '0 days 02:00:00']
Length: 2, dtype: timedelta64[ns]
Examples
--------
If a dtype is not specified, pandas will infer the best dtype from the values.
See the description of `dtype` for the types pandas infers for.
>>> pd.array([1, 2])
<IntegerArray>
[1, 2]
Length: 2, dtype: Int64
>>> pd.array([1, 2, np.nan])
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
>>> pd.array([1.1, 2.2])
<FloatingArray>
[1.1, 2.2]
Length: 2, dtype: Float64
>>> pd.array(["a", None, "c"])
<StringArray>
['a', <NA>, 'c']
Length: 3, dtype: string
>>> with pd.option_context("string_storage", "pyarrow"):
... arr = pd.array(["a", None, "c"])
...
>>> arr
<ArrowStringArray>
['a', <NA>, 'c']
Length: 3, dtype: string
>>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")])
<PeriodArray>
['2000-01-01', '2000-01-01']
Length: 2, dtype: period[D]
You can use the string alias for `dtype`
>>> pd.array(['a', 'b', 'a'], dtype='category')
['a', 'b', 'a']
Categories (2, object): ['a', 'b']
Or specify the actual dtype
>>> pd.array(['a', 'b', 'a'],
... dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True))
['a', 'b', 'a']
Categories (3, object): ['a' < 'b' < 'c']
If pandas does not infer a dedicated extension type a
:class:`arrays.PandasArray` is returned.
>>> pd.array([1 + 1j, 3 + 2j])
<PandasArray>
[(1+1j), (3+2j)]
Length: 2, dtype: complex128
As mentioned in the "Notes" section, new extension types may be added
in the future (by pandas or 3rd party libraries), causing the return
value to no longer be a :class:`arrays.PandasArray`. Specify the `dtype`
as a NumPy dtype if you need to ensure there's no future change in
behavior.
>>> pd.array([1, 2], dtype=np.dtype("int32"))
<PandasArray>
[1, 2]
Length: 2, dtype: int32
`data` must be 1-dimensional. A ValueError is raised when the input
has the wrong dimensionality.
>>> pd.array(1)
Traceback (most recent call last):
...
ValueError: Cannot pass scalar '1' to 'pandas.array'.
Function02
bdate_range(start=None, end=None, periods: 'int | None' = None, freq='B', tz=None, normalize: 'bool' = True, name: 'Hashable' = None, weekmask=None, holidays=None, closed=None, **kwargs) -> 'DatetimeIndex'
Help on function bdate_range in module pandas.core.indexes.datetimes:
bdate_range(start=None, end=None, periods: 'int | None' = None, freq='B', tz=None, normalize: 'bool' = True, name: 'Hashable' = None, weekmask=None, holidays=None, closed=None, **kwargs) -> 'DatetimeIndex'
Return a fixed frequency DatetimeIndex, with business day as the default
frequency.
Parameters
----------
start : str or datetime-like, default None
Left bound for generating dates.
end : str or datetime-like, default None
Right bound for generating dates.
periods : int, default None
Number of periods to generate.
freq : str or DateOffset, default 'B' (business daily)
Frequency strings can have multiples, e.g. '5H'.
tz : str or None
Time zone name for returning localized DatetimeIndex, for example
Asia/Beijing.
normalize : bool, default False
Normalize start/end dates to midnight before generating date range.
name : str, default None
Name of the resulting DatetimeIndex.
weekmask : str or None, default None
Weekmask of valid business days, passed to ``numpy.busdaycalendar``,
only used when custom frequency strings are passed. The default
value None is equivalent to 'Mon Tue Wed Thu Fri'.
holidays : list-like or None, default None
Dates to exclude from the set of valid business days, passed to
``numpy.busdaycalendar``, only used when custom frequency strings
are passed.
closed : str, default None
Make the interval closed with respect to the given frequency to
the 'left', 'right', or both sides (None).
**kwargs
For compatibility. Has no effect on the result.
Returns
-------
DatetimeIndex
Notes
-----
Of the four parameters: ``start``, ``end``, ``periods``, and ``freq``,
exactly three must be specified. Specifying ``freq`` is a requirement
for ``bdate_range``. Use ``date_range`` if specifying ``freq`` is not
desired.
To learn more about the frequency strings, please see `this link
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
Examples
--------
Note how the two weekend days are skipped in the result.
>>> pd.bdate_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-08'],
dtype='datetime64[ns]', freq='B')
Function03
concat(objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]', axis=0, join='outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool' = True) -> 'FrameOrSeriesUnion'
Help on function concat in module pandas.core.reshape.concat:
concat(objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]', axis=0, join='outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool' = True) -> 'FrameOrSeriesUnion'
Concatenate pandas objects along a particular axis with optional set logic
along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis,
which may be useful if the labels are the same (or overlapping) on
the passed axis number.
Parameters
----------
objs : a sequence or mapping of Series or DataFrame objects
If a mapping is passed, the sorted keys will be used as the `keys`
argument, unless it is passed, in which case the values will be
selected (see below). Any None objects will be dropped silently unless
they are all None in which case a ValueError will be raised.
axis : {0/'index', 1/'columns'}, default 0
The axis to concatenate along.
join : {'inner', 'outer'}, default 'outer'
How to handle indexes on other axis (or axes).
ignore_index : bool, default False
If True, do not use the index values along the concatenation axis. The
resulting axis will be labeled 0, ..., n - 1. This is useful if you are
concatenating objects where the concatenation axis does not have
meaningful indexing information. Note the index values on the other
axes are still respected in the join.
keys : sequence, default None
If multiple levels passed, should contain tuples. Construct
hierarchical index using the passed keys as the outermost level.
levels : list of sequences, default None
Specific levels (unique values) to use for constructing a
MultiIndex. Otherwise they will be inferred from the keys.
names : list, default None
Names for the levels in the resulting hierarchical index.
verify_integrity : bool, default False
Check whether the new concatenated axis contains duplicates. This can
be very expensive relative to the actual data concatenation.
sort : bool, default False
Sort non-concatenation axis if it is not already aligned when `join`
is 'outer'.
This has no effect when ``join='inner'``, which already preserves
the order of the non-concatenation axis.
.. versionchanged:: 1.0.0
Changed to not sort by default.
copy : bool, default True
If False, do not copy data unnecessarily.
Returns
-------
object, type of objs
When concatenating all ``Series`` along the index (axis=0), a
``Series`` is returned. When ``objs`` contains at least one
``DataFrame``, a ``DataFrame`` is returned. When concatenating along
the columns (axis=1), a ``DataFrame`` is returned.
See Also
--------
Series.append : Concatenate Series.
DataFrame.append : Concatenate DataFrames.
DataFrame.join : Join DataFrames using indexes.
DataFrame.merge : Merge DataFrames by indexes or columns.
Notes
-----
The keys, levels, and names arguments are all optional.
A walkthrough of how this method fits in with other tools for combining
pandas objects can be found `here
<https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html>`__.
Examples
--------
Combine two ``Series``.
>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0 a
1 b
0 c
1 d
dtype: object
Clear the existing index and reset it in the result
by setting the ``ignore_index`` option to ``True``.
>>> pd.concat([s1, s2], ignore_index=True)
0 a
1 b
2 c
3 d
dtype: object
Add a hierarchical index at the outermost level of
the data with the ``keys`` option.
>>> pd.concat([s1, s2], keys=['s1', 's2'])
s1 0 a
1 b
s2 0 c
1 d
dtype: object
Label the index keys you create with the ``names`` option.
>>> pd.concat([s1, s2], keys=['s1', 's2'],
... names=['Series name', 'Row ID'])
Series name Row ID
s1 0 a
1 b
s2 0 c
1 d
dtype: object
Combine two ``DataFrame`` objects with identical columns.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> df1
letter number
0 a 1
1 b 2
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
... columns=['letter', 'number'])
>>> df2
letter number
0 c 3
1 d 4
>>> pd.concat([df1, df2])
letter number
0 a 1
1 b 2
0 c 3
1 d 4
Combine ``DataFrame`` objects with overlapping columns
and return everything. Columns outside the intersection will
be filled with ``NaN`` values.
>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
... columns=['letter', 'number', 'animal'])
>>> df3
letter number animal
0 c 3 cat
1 d 4 dog
>>> pd.concat([df1, df3], sort=False)
letter number animal
0 a 1 NaN
1 b 2 NaN
0 c 3 cat
1 d 4 dog
Combine ``DataFrame`` objects with overlapping columns
and return only those that are shared by passing ``inner`` to
the ``join`` keyword argument.
>>> pd.concat([df1, df3], join="inner")
letter number
0 a 1
1 b 2
0 c 3
1 d 4
Combine ``DataFrame`` objects horizontally along the x axis by
passing in ``axis=1``.
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
... columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
letter number animal name
0 a 1 bird polly
1 b 2 monkey george
Prevent the result from including duplicate index values with the
``verify_integrity`` option.
>>> df5 = pd.DataFrame([1], index=['a'])
>>> df5
0
a 1
>>> df6 = pd.DataFrame([2], index=['a'])
>>> df6
0
a 2
>>> pd.concat([df5, df6], verify_integrity=True)
Traceback (most recent call last):
...
ValueError: Indexes have overlapping values: ['a']
Function04
crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: 'str' = 'All', dropna: 'bool' = True, normalize=False) -> 'DataFrame'
Help on function crosstab in module pandas.core.reshape.pivot:
crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: 'str' = 'All', dropna: 'bool' = True, normalize=False) -> 'DataFrame'
Compute a simple cross tabulation of two (or more) factors. By default
computes a frequency table of the factors unless an array of values and an
aggregation function are passed.
Parameters
----------
index : array-like, Series, or list of arrays/Series
Values to group by in the rows.
columns : array-like, Series, or list of arrays/Series
Values to group by in the columns.
values : array-like, optional
Array of values to aggregate according to the factors.
Requires `aggfunc` be specified.
rownames : sequence, default None
If passed, must match number of row arrays passed.
colnames : sequence, default None
If passed, must match number of column arrays passed.
aggfunc : function, optional
If specified, requires `values` be specified as well.
margins : bool, default False
Add row/column margins (subtotals).
margins_name : str, default 'All'
Name of the row/column that will contain the totals
when margins is True.
dropna : bool, default True
Do not include columns whose entries are all NaN.
normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
Normalize by dividing all values by the sum of values.
- If passed 'all' or `True`, will normalize over all values.
- If passed 'index' will normalize over each row.
- If passed 'columns' will normalize over each column.
- If margins is `True`, will also normalize margin values.
Returns
-------
DataFrame
Cross tabulation of the data.
See Also
--------
DataFrame.pivot : Reshape data based on column values.
pivot_table : Create a pivot table as a DataFrame.
Notes
-----
Any Series passed will have their name attributes used unless row or column
names for the cross-tabulation are specified.
Any input passed containing Categorical data will have **all** of its
categories included in the cross-tabulation, even if the actual data does
not contain any instances of a particular category.
In the event that there aren't overlapping indexes an empty DataFrame will
be returned.
Examples
--------
>>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
... "bar", "bar", "foo", "foo", "foo"], dtype=object)
>>> b = np.array(["one", "one", "one", "two", "one", "one",
... "one", "two", "two", "two", "one"], dtype=object)
>>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
... "shiny", "dull", "shiny", "shiny", "shiny"],
... dtype=object)
>>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
b one two
c dull shiny dull shiny
a
bar 1 2 1 0
foo 2 2 1 2
Here 'c' and 'f' are not represented in the data and will not be
shown in the output because dropna is True by default. Set
dropna=False to preserve categories with no data.
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar)
col_0 d e
row_0
a 1 0
b 0 1
>>> pd.crosstab(foo, bar, dropna=False)
col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0
Function05
cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise', ordered: bool = True)
Help on function cut in module pandas.core.reshape.tile:
cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise', ordered: bool = True)
Bin values into discrete intervals.
Use `cut` when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable. For example, `cut` could convert ages to groups of
age ranges. Supports binning into an equal number of bins, or a
pre-specified array of bins.
Parameters
----------
x : array-like
The input array to be binned. Must be 1-dimensional.
bins : int, sequence of scalars, or IntervalIndex
The criteria to bin by.
* int : Defines the number of equal-width bins in the range of `x`. The
range of `x` is extended by .1% on each side to include the minimum
and maximum values of `x`.
* sequence of scalars : Defines the bin edges allowing for non-uniform
width. No extension of the range of `x` is done.
* IntervalIndex : Defines the exact bins to be used. Note that
IntervalIndex for `bins` must be non-overlapping.
right : bool, default True
Indicates whether `bins` includes the rightmost edge or not. If
``right == True`` (the default), then the `bins` ``[1, 2, 3, 4]``
indicate (1,2], (2,3], (3,4]. This argument is ignored when
`bins` is an IntervalIndex.
labels : array or False, default None
Specifies the labels for the returned bins. Must be the same length as
the resulting bins. If False, returns only integer indicators of the
bins. This affects the type of the output container (see below).
This argument is ignored when `bins` is an IntervalIndex. If True,
raises an error. When `ordered=False`, labels must be provided.
retbins : bool, default False
Whether to return the bins or not. Useful when bins is provided
as a scalar.
precision : int, default 3
The precision at which to store and display the bins labels.
include_lowest : bool, default False
Whether the first interval should be left-inclusive or not.
duplicates : {default 'raise', 'drop'}, optional
If bin edges are not unique, raise ValueError or drop non-uniques.
ordered : bool, default True
Whether the labels are ordered or not. Applies to returned types
Categorical and Series (with Categorical dtype). If True,
the resulting categorical will be ordered. If False, the resulting
categorical will be unordered (labels must be provided).
.. versionadded:: 1.1.0
Returns
-------
out : Categorical, Series, or ndarray
An array-like object representing the respective bin for each value
of `x`. The type depends on the value of `labels`.
* True (default) : returns a Series for Series `x` or a
Categorical for all other inputs. The values stored within
are Interval dtype.
* sequence of scalars : returns a Series for Series `x` or a
Categorical for all other inputs. The values stored within
are whatever the type in the sequence is.
* False : returns an ndarray of integers.
bins : numpy.ndarray or IntervalIndex.
The computed or specified bins. Only returned when `retbins=True`.
For scalar or sequence `bins`, this is an ndarray with the computed
bins. If set `duplicates=drop`, `bins` will drop non-unique bin. For
an IntervalIndex `bins`, this is equal to `bins`.
See Also
--------
qcut : Discretize variable into equal-sized buckets based on rank
or based on sample quantiles.
Categorical : Array type for storing data that come from a
fixed set of values.
Series : One-dimensional array with axis labels (including time series).
IntervalIndex : Immutable Index implementing an ordered, sliceable set.
Notes
-----
Any NA values will be NA in the result. Out of bounds values will be NA in
the resulting Series or Categorical object.
Examples
--------
Discretize into three equal-sized bins.
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
... # doctest: +ELLIPSIS
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
... # doctest: +ELLIPSIS
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
array([0.994, 3. , 5. , 7. ]))
Discovers the same bins, but assign them specific labels. Notice that
the returned Categorical's categories are `labels` and is ordered.
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]),
... 3, labels=["bad", "medium", "good"])
['bad', 'good', 'medium', 'medium', 'good', 'bad']
Categories (3, object): ['bad' < 'medium' < 'good']
``ordered=False`` will result in unordered categories when labels are passed.
This parameter can be used to allow non-unique labels:
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
... labels=["B", "A", "B"], ordered=False)
['B', 'B', 'A', 'A', 'B', 'B']
Categories (2, object): ['A', 'B']
``labels=False`` implies you just want the bins back.
>>> pd.cut([0, 1, 1, 2], bins=4, labels=False)
array([0, 1, 1, 3])
Passing a Series as an input returns a Series with categorical dtype:
>>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
... index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, 3)
... # doctest: +ELLIPSIS
a (1.992, 4.667]
b (1.992, 4.667]
c (4.667, 7.333]
d (7.333, 10.0]
e (7.333, 10.0]
dtype: category
Categories (3, interval[float64, right]): [(1.992, 4.667] < (4.667, ...
Passing a Series as an input returns a Series with mapping value.
It is used to map numerically to intervals based on bins.
>>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
... index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, [0, 2, 4, 6, 8, 10], labels=False, retbins=True, right=False)
... # doctest: +ELLIPSIS
(a 1.0
b 2.0
c 3.0
d 4.0
e NaN
dtype: float64,
array([ 0, 2, 4, 6, 8, 10]))
Use `drop` optional when bins is not unique
>>> pd.cut(s, [0, 2, 4, 6, 10, 10], labels=False, retbins=True,
... right=False, duplicates='drop')
... # doctest: +ELLIPSIS
(a 1.0
b 2.0
c 3.0
d 3.0
e NaN
dtype: float64,
array([ 0, 2, 4, 6, 10]))
Passing an IntervalIndex for `bins` results in those categories exactly.
Notice that values not covered by the IntervalIndex are set to NaN. 0
is to the left of the first bin (which is closed on the right), and 1.5
falls between two bins.
>>> bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
>>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
[NaN, (0.0, 1.0], NaN, (2.0, 3.0], (4.0, 5.0]]
Categories (3, interval[int64, right]): [(0, 1] < (2, 3] < (4, 5]]
Function06
date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize: 'bool' = False, name: 'Hashable' = None, closed=None, **kwargs) -> 'DatetimeIndex'
Help on function date_range in module pandas.core.indexes.datetimes:
date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize: 'bool' = False, name: 'Hashable' = None, closed=None, **kwargs) -> 'DatetimeIndex'
Return a fixed frequency DatetimeIndex.
Returns the range of equally spaced time points (where the difference between any
two adjacent points is specified by the given frequency) such that they all
satisfy `start <[=] x <[=] end`, where the first one and the last one are, resp.,
the first and last time points in that range that fall on the boundary of ``freq``
(if given as a frequency string) or that are valid for ``freq`` (if given as a
:class:`pandas.tseries.offsets.DateOffset`). (If exactly one of ``start``,
``end``, or ``freq`` is *not* specified, this missing parameter can be computed
given ``periods``, the number of timesteps in the range. See the note below.)
Parameters
----------
start : str or datetime-like, optional
Left bound for generating dates.
end : str or datetime-like, optional
Right bound for generating dates.
periods : int, optional
Number of periods to generate.
freq : str or DateOffset, default 'D'
Frequency strings can have multiples, e.g. '5H'. See
:ref:`here <timeseries.offset_aliases>` for a list of
frequency aliases.
tz : str or tzinfo, optional
Time zone name for returning localized DatetimeIndex, for example
'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is
timezone-naive.
normalize : bool, default False
Normalize start/end dates to midnight before generating date range.
name : str, default None
Name of the resulting DatetimeIndex.
closed : {None, 'left', 'right'}, optional
Make the interval closed with respect to the given frequency to
the 'left', 'right', or both sides (None, the default).
**kwargs
For compatibility. Has no effect on the result.
Returns
-------
rng : DatetimeIndex
See Also
--------
DatetimeIndex : An immutable container for datetimes.
timedelta_range : Return a fixed frequency TimedeltaIndex.
period_range : Return a fixed frequency PeriodIndex.
interval_range : Return a fixed frequency IntervalIndex.
Notes
-----
Of the four parameters ``start``, ``end``, ``periods``, and ``freq``,
exactly three must be specified. If ``freq`` is omitted, the resulting
``DatetimeIndex`` will have ``periods`` linearly spaced elements between
``start`` and ``end`` (closed on both sides).
To learn more about the frequency strings, please see `this link
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
Examples
--------
**Specifying the values**
The next four examples generate the same `DatetimeIndex`, but vary
the combination of `start`, `end` and `periods`.
Specify `start` and `end`, with the default daily frequency.
>>> pd.date_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
Specify `start` and `periods`, the number of periods (days).
>>> pd.date_range(start='1/1/2018', periods=8)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
Specify `end` and `periods`, the number of periods (days).
>>> pd.date_range(end='1/1/2018', periods=8)
DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
'2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
dtype='datetime64[ns]', freq='D')
Specify `start`, `end`, and `periods`; the frequency is generated
automatically (linearly spaced).
>>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
'2018-04-27 00:00:00'],
dtype='datetime64[ns]', freq=None)
**Other Parameters**
Changed the `freq` (frequency) to ``'M'`` (month end frequency).
>>> pd.date_range(start='1/1/2018', periods=5, freq='M')
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
'2018-05-31'],
dtype='datetime64[ns]', freq='M')
Multiples are allowed
>>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
'2019-01-31'],
dtype='datetime64[ns]', freq='3M')
`freq` can also be specified as an Offset object.
>>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
'2019-01-31'],
dtype='datetime64[ns]', freq='3M')
Specify `tz` to set the timezone.
>>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
'2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
'2018-01-05 00:00:00+09:00'],
dtype='datetime64[ns, Asia/Tokyo]', freq='D')
`closed` controls whether to include `start` and `end` that are on the
boundary. The default includes boundary points on either end.
>>> pd.date_range(start='2017-01-01', end='2017-01-04', closed=None)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
dtype='datetime64[ns]', freq='D')
Use ``closed='left'`` to exclude `end` if it falls on the boundary.
>>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='left')
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
dtype='datetime64[ns]', freq='D')
Use ``closed='right'`` to exclude `start` if it falls on the boundary.
>>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='right')
DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
dtype='datetime64[ns]', freq='D')
Function07
eval(expr: 'str | BinOp', parser: 'str' = 'pandas', engine: 'str | None' = None, truediv=<no_default>, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)
Help on function eval in module pandas.core.computation.eval:
eval(expr: 'str | BinOp', parser: 'str' = 'pandas', engine: 'str | None' = None, truediv=<no_default>, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)
Evaluate a Python expression as a string using various backends.
The following arithmetic operations are supported: ``+``, ``-``, ``*``,
``/``, ``**``, ``%``, ``//`` (python engine only) along with the following
boolean operations: ``|`` (or), ``&`` (and), and ``~`` (not).
Additionally, the ``'pandas'`` parser allows the use of :keyword:`and`,
:keyword:`or`, and :keyword:`not` with the same semantics as the
corresponding bitwise operators. :class:`~pandas.Series` and
:class:`~pandas.DataFrame` objects are supported and behave as they would
with plain ol' Python evaluation.
Parameters
----------
expr : str
The expression to evaluate. This string cannot contain any Python
`statements
<https://docs.python.org/3/reference/simple_stmts.html#simple-statements>`__,
only Python `expressions
<https://docs.python.org/3/reference/simple_stmts.html#expression-statements>`__.
parser : {'pandas', 'python'}, default 'pandas'
The parser to use to construct the syntax tree from the expression. The
default of ``'pandas'`` parses code slightly different than standard
Python. Alternatively, you can parse an expression using the
``'python'`` parser to retain strict Python semantics. See the
:ref:`enhancing performance <enhancingperf.eval>` documentation for
more details.
engine : {'python', 'numexpr'}, default 'numexpr'
The engine used to evaluate the expression. Supported engines are
- None : tries to use ``numexpr``, falls back to ``python``
- ``'numexpr'``: This default engine evaluates pandas objects using
numexpr for large speed ups in complex expressions
with large frames.
- ``'python'``: Performs operations as if you had ``eval``'d in top
level python. This engine is generally not that useful.
More backends may be available in the future.
truediv : bool, optional
Whether to use true division, like in Python >= 3.
.. deprecated:: 1.0.0
local_dict : dict or None, optional
A dictionary of local variables, taken from locals() by default.
global_dict : dict or None, optional
A dictionary of global variables, taken from globals() by default.
resolvers : list of dict-like or None, optional
A list of objects implementing the ``__getitem__`` special method that
you can use to inject an additional collection of namespaces to use for
variable lookup. For example, this is used in the
:meth:`~DataFrame.query` method to inject the
``DataFrame.index`` and ``DataFrame.columns``
variables that refer to their respective :class:`~pandas.DataFrame`
instance attributes.
level : int, optional
The number of prior stack frames to traverse and add to the current
scope. Most users will **not** need to change this parameter.
target : object, optional, default None
This is the target object for assignment. It is used when there is
variable assignment in the expression. If so, then `target` must
support item assignment with string keys, and if a copy is being
returned, it must also support `.copy()`.
inplace : bool, default False
If `target` is provided, and the expression mutates `target`, whether
to modify `target` inplace. Otherwise, return a copy of `target` with
the mutation.
Returns
-------
ndarray, numeric scalar, DataFrame, Series, or None
The completion value of evaluating the given code or None if ``inplace=True``.
Raises
------
ValueError
There are many instances where such an error can be raised:
- `target=None`, but the expression is multiline.
- The expression is multiline, but not all them have item assignment.
An example of such an arrangement is this:
a = b + 1
a + 2
Here, there are expressions on different lines, making it multiline,
but the last line has no variable assigned to the output of `a + 2`.
- `inplace=True`, but the expression is missing item assignment.
- Item assignment is provided, but the `target` does not support
string item assignment.
- Item assignment is provided and `inplace=False`, but the `target`
does not support the `.copy()` method
See Also
--------
DataFrame.query : Evaluates a boolean expression to query the columns
of a frame.
DataFrame.eval : Evaluate a string describing operations on
DataFrame columns.
Notes
-----
The ``dtype`` of any objects involved in an arithmetic ``%`` operation are
recursively cast to ``float64``.
See the :ref:`enhancing performance <enhancingperf.eval>` documentation for
more details.
Examples
--------
>>> df = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
>>> df
animal age
0 dog 10
1 pig 20
We can add a new column using ``pd.eval``:
>>> pd.eval("double_age = df.age * 2", target=df)
animal age double_age
0 dog 10 20
1 pig 20 40
Function08
factorize(values, sort: 'bool' = False, na_sentinel: 'int | None' = -1, size_hint: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray | Index]'
Help on function factorize in module pandas.core.algorithms:
factorize(values, sort: 'bool' = False, na_sentinel: 'int | None' = -1, size_hint: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray | Index]'
Encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an
array when all that matters is identifying distinct values. `factorize`
is available as both a top-level function :func:`pandas.factorize`,
and as a method :meth:`Series.factorize` and :meth:`Index.factorize`.
Parameters
----------
values : sequence
A 1-D sequence. Sequences that aren't pandas objects are
coerced to ndarrays before factorization.
sort : bool, default False
Sort `uniques` and shuffle `codes` to maintain the
relationship.
na_sentinel : int or None, default -1
Value to mark "not found". If None, will not drop the NaN
from the uniques of the values.
.. versionchanged:: 1.1.2
size_hint : int, optional
Hint to the hashtable sizer.
Returns
-------
codes : ndarray
An integer ndarray that's an indexer into `uniques`.
``uniques.take(codes)`` will have the same values as `values`.
uniques : ndarray, Index, or Categorical
The unique valid values. When `values` is Categorical, `uniques`
is a Categorical. When `values` is some other pandas object, an
`Index` is returned. Otherwise, a 1-D ndarray is returned.
.. note ::
Even if there's a missing value in `values`, `uniques` will
*not* contain an entry for it.
See Also
--------
cut : Discretize continuous-valued array.
unique : Find the unique value in an array.
Examples
--------
These examples all show factorize as a top-level method like
``pd.factorize(values)``. The results are identical for methods like
:meth:`Series.factorize`.
>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0]...)
>>> uniques
array(['b', 'a', 'c'], dtype=object)
With ``sort=True``, the `uniques` will be sorted, and `codes` will be
shuffled so that the relationship is the maintained.
>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> codes
array([1, 1, 0, 2, 1]...)
>>> uniques
array(['a', 'b', 'c'], dtype=object)
Missing values are indicated in `codes` with `na_sentinel`
(``-1`` by default). Note that missing values are never
included in `uniques`.
>>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> codes
array([ 0, -1, 1, 2, 0]...)
>>> uniques
array(['b', 'a', 'c'], dtype=object)
Thus far, we've only factorized lists (which are internally coerced to
NumPy arrays). When factorizing pandas objects, the type of `uniques`
will differ. For Categoricals, a `Categorical` is returned.
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1]...)
>>> uniques
['a', 'c']
Categories (3, object): ['a', 'b', 'c']
Notice that ``'b'`` is in ``uniques.categories``, despite not being
present in ``cat.values``.
For all other pandas objects, an Index of the appropriate type is
returned.
>>> cat = pd.Series(['a', 'a', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1]...)
>>> uniques
Index(['a', 'c'], dtype='object')
If NaN is in the values, and we want to include NaN in the uniques of the
values, it can be achieved by setting ``na_sentinel=None``.
>>> values = np.array([1, 2, 1, np.nan])
>>> codes, uniques = pd.factorize(values) # default: na_sentinel=-1
>>> codes
array([ 0, 1, 0, -1])
>>> uniques
array([1., 2.])
>>> codes, uniques = pd.factorize(values, na_sentinel=None)
>>> codes
array([0, 1, 0, 2])
>>> uniques
array([ 1., 2., nan])
Function09
get_dummies(data, prefix=None, prefix_sep='_', dummy_na: 'bool' = False, columns=None, sparse: 'bool' = False, drop_first: 'bool' = False, dtype: 'Dtype | None' = None) -> 'DataFrame'
Help on function get_dummies in module pandas.core.reshape.reshape:
get_dummies(data, prefix=None, prefix_sep='_', dummy_na: 'bool' = False, columns=None, sparse: 'bool' = False, drop_first: 'bool' = False, dtype: 'Dtype | None' = None) -> 'DataFrame'
Convert categorical variable into dummy/indicator variables.
Parameters
----------
data : array-like, Series, or DataFrame
Data of which to get dummy indicators.
prefix : str, list of str, or dict of str, default None
String to append DataFrame column names.
Pass a list with length equal to the number of columns
when calling get_dummies on a DataFrame. Alternatively, `prefix`
can be a dictionary mapping column names to prefixes.
prefix_sep : str, default '_'
If appending prefix, separator/delimiter to use. Or pass a
list or dictionary as with `prefix`.
dummy_na : bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.
columns : list-like, default None
Column names in the DataFrame to be encoded.
If `columns` is None then all the columns with
`object` or `category` dtype will be converted.
sparse : bool, default False
Whether the dummy-encoded columns should be backed by
a :class:`SparseArray` (True) or a regular NumPy array (False).
drop_first : bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the
first level.
dtype : dtype, default np.uint8
Data type for new columns. Only a single dtype is allowed.
Returns
-------
DataFrame
Dummy-coded data.
See Also
--------
Series.str.get_dummies : Convert Series to dummy codes.
Examples
--------
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
a b
0 1 0
1 0 1
2 0 0
>>> pd.get_dummies(s1, dummy_na=True)
a b NaN
0 1 0 0
1 0 1 0
2 0 0 1
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
... 'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
C col1_a col1_b col2_a col2_b col2_c
0 1 1 0 0 1 0
1 2 0 1 1 0 0
2 3 1 0 0 0 1
>>> pd.get_dummies(pd.Series(list('abcaa')))
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
>>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
待续......
下一篇链接:
https://blog.csdn.net/boysoft2002/article/details/128428096