Python pandas has hundreds of library functions. Have you used them all? (1)


Let's browse through all of the built-in classes, functions, submodules, and other top-level names in Python's pandas library, then pick out a few key ones to study in depth. The version installed here is 1.3.5:

>>> import pandas as pd
>>> pd.__version__
'1.3.5'
>>> print(pd.__doc__)

pandas - a powerful data analysis and manipulation library for Python
=====================================================================

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations.
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data.
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects.
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets.
  - Intuitive merging and joining data sets.
  - Flexible reshaping and pivoting of data sets.
  - Hierarchical labeling of axes (possible to have multiple labels per tick).
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format.
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, date shifting and lagging.
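
Two of the features above, missing-data handling and automatic label alignment, are easy to demonstrate. Here is a minimal sketch (my own illustration, not part of the quoted documentation):

import pandas as pd

# Missing data: NaN is skipped by default in reductions
s1 = pd.Series([1.0, None, 3.0])
print(s1.sum())    # 4.0

# Automatic alignment: arithmetic matches values by index label, not position
s2 = pd.Series([10, 20], index=['b', 'c'])
s3 = pd.Series([1, 2], index=['a', 'b'])
print(s2 + s3)     # a -> NaN, b -> 21.0, c -> NaN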


In total there are 119 public top-level names in pandas (including classes, functions, submodules, etc.):

>>> import pandas as pd
>>> funcs = [_ for _ in dir(pd) if not _.startswith('_')]
>>> len(funcs)
119
>>> for i,f in enumerate(funcs,1):
	print(f'{f:18}', end='' if i%5 else '\n')

BooleanDtype      Categorical       CategoricalDtype  CategoricalIndex  DataFrame         
DateOffset        DatetimeIndex     DatetimeTZDtype   ExcelFile         ExcelWriter       
Flags             Float32Dtype      Float64Dtype      Float64Index      Grouper           
HDFStore          Index             IndexSlice        Int16Dtype        Int32Dtype        
Int64Dtype        Int64Index        Int8Dtype         Interval          IntervalDtype     
IntervalIndex     MultiIndex        NA                NaT               NamedAgg          
Period            PeriodDtype       PeriodIndex       RangeIndex        Series            
SparseDtype       StringDtype       Timedelta         TimedeltaIndex    Timestamp         
UInt16Dtype       UInt32Dtype       UInt64Dtype       UInt64Index       UInt8Dtype        
api               array             arrays            bdate_range       compat            
concat            core              crosstab          cut               date_range        
describe_option   errors            eval              factorize         get_dummies       
get_option        infer_freq        interval_range    io                isna              
isnull            json_normalize    lreshape          melt              merge             
merge_asof        merge_ordered     notna             notnull           offsets           
option_context    options           pandas            period_range      pivot             
pivot_table       plotting          qcut              read_clipboard    read_csv          
read_excel        read_feather      read_fwf          read_gbq          read_hdf          
read_html         read_json         read_orc          read_parquet      read_pickle       
read_sas          read_spss         read_sql          read_sql_query    read_sql_table    
read_stata        read_table        read_xml          reset_option      set_eng_float_format
set_option        show_versions     test              testing           timedelta_range   
to_datetime       to_numeric        to_pickle         to_timedelta      tseries           
unique            util              value_counts      wide_to_long      

First, let's sort the names into classes, functions, submodules, and so on:

import pandas as pd

funcs = [f for f in dir(pd) if not f.startswith('_')]
# Reference types: a class, a function, and a module, in that order
types = type(pd.DataFrame), type(pd.array), type(pd)
Names = 'Type', 'Function', 'Module', 'Other'
Types = {}

for f in funcs:
    t = type(getattr(pd, f))
    name = Names[types.index(t)] if t in types else Names[-1]
    Types[name] = Types.get(name, []) + [f]

for n in Names:
    print(f"\n\n【{n}】:{len(Types[n])}")
    for i, f in enumerate(Types[n], 1):
        print(f'{f:18} ', end='' if i % 5 or i == len(Types[n]) else '\n')

Classification results:

【Type】:40
BooleanDtype       CategoricalDtype   CategoricalIndex   DataFrame        DatetimeIndex      
DatetimeTZDtype    ExcelFile          Flags              Float32Dtype       Float64Dtype       
Float64Index       Grouper            HDFStore           Index              Int16Dtype         
Int32Dtype         Int64Dtype         Int64Index         Int8Dtype          Interval           
IntervalDtype      IntervalIndex      MultiIndex         NamedAgg           Period             
PeriodDtype        PeriodIndex        RangeIndex         Series             SparseDtype        
StringDtype        Timedelta          TimedeltaIndex     Timestamp          UInt16Dtype        
UInt32Dtype        UInt64Dtype        UInt64Index        UInt8Dtype         option_context     

【Function】:56
array              bdate_range        concat             crosstab           cut                
date_range         eval               factorize          get_dummies        infer_freq         
interval_range     isna               isnull             json_normalize     lreshape           
melt               merge              merge_asof         merge_ordered      notna              
notnull            period_range       pivot              pivot_table        qcut               
read_clipboard     read_csv           read_excel         read_feather       read_fwf           
read_gbq           read_hdf           read_html          read_json          read_orc           
read_parquet       read_pickle        read_sas           read_spss          read_sql           
read_sql_query     read_sql_table     read_stata         read_table         read_xml           
set_eng_float_format show_versions      test               timedelta_range    to_datetime        
to_numeric         to_pickle          to_timedelta       unique             value_counts       
wide_to_long       

【Module】:12
api                arrays             compat             core               errors             
io                 offsets            pandas             plotting           testing            
tseries            util               

【Other】:11
Categorical        DateOffset         ExcelWriter        IndexSlice         NA                 
NaT                describe_option    get_option         options            reset_option       
set_option   
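
As a cross-check, the standard-library inspect module can do the same classification more directly. A minimal sketch (my own addition, not from the original post); note its counts may differ slightly from the table above, because inspect.isclass also recognizes classes with custom metaclasses (such as Categorical), which the type()-comparison scheme files under Other:

import inspect
import pandas as pd

counts = {'Type': 0, 'Function': 0, 'Module': 0, 'Other': 0}
for name in (n for n in dir(pd) if not n.startswith('_')):
    obj = getattr(pd, name)
    if inspect.isclass(obj):
        counts['Type'] += 1
    elif inspect.isroutine(obj):      # plain and built-in functions
        counts['Function'] += 1
    elif inspect.ismodule(obj):
        counts['Module'] += 1
    else:
        counts['Other'] += 1

print(counts)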

Let's start with the original help text for all 56 library functions. At over 252 KB it is far too much for a single post, so it will be presented as a series:

A~G: Function01~09

Types['Function'][:9]
['array', 'bdate_range', 'concat', 'crosstab', 'cut', 'date_range', 'eval', 'factorize', 'get_dummies']

Function01

array(data: 'Sequence[object] | AnyArrayLike', dtype: 'Dtype | None' = None, copy: 'bool' = True) -> 'ExtensionArray'

Help on function array in module pandas.core.construction:

array(data: 'Sequence[object] | AnyArrayLike', dtype: 'Dtype | None' = None, copy: 'bool' = True) -> 'ExtensionArray'
    Create an array.
    
    Parameters
    ----------
    data : Sequence of objects
        The scalars inside `data` should be instances of the
        scalar type for `dtype`. It's expected that `data`
        represents a 1-dimensional array of data.
    
        When `data` is an Index or Series, the underlying array
        will be extracted from `data`.
    
    dtype : str, np.dtype, or ExtensionDtype, optional
        The dtype to use for the array. This may be a NumPy
        dtype or an extension type registered with pandas using
        :meth:`pandas.api.extensions.register_extension_dtype`.
    
        If not specified, there are two possibilities:
    
        1. When `data` is a :class:`Series`, :class:`Index`, or
           :class:`ExtensionArray`, the `dtype` will be taken
           from the data.
        2. Otherwise, pandas will attempt to infer the `dtype`
           from the data.
    
        Note that when `data` is a NumPy array, ``data.dtype`` is
        *not* used for inferring the array type. This is because
        NumPy cannot represent all the types of data that can be
        held in extension arrays.
    
        Currently, pandas will infer an extension dtype for sequences of
    
        ============================== =======================================
        Scalar Type                    Array Type
        ============================== =======================================
        :class:`pandas.Interval`       :class:`pandas.arrays.IntervalArray`
        :class:`pandas.Period`         :class:`pandas.arrays.PeriodArray`
        :class:`datetime.datetime`     :class:`pandas.arrays.DatetimeArray`
        :class:`datetime.timedelta`    :class:`pandas.arrays.TimedeltaArray`
        :class:`int`                   :class:`pandas.arrays.IntegerArray`
        :class:`float`                 :class:`pandas.arrays.FloatingArray`
        :class:`str`                   :class:`pandas.arrays.StringArray` or
                                       :class:`pandas.arrays.ArrowStringArray`
        :class:`bool`                  :class:`pandas.arrays.BooleanArray`
        ============================== =======================================
    
        The ExtensionArray created when the scalar type is :class:`str` is determined by
        ``pd.options.mode.string_storage`` if the dtype is not explicitly given.
    
        For all other cases, NumPy's usual inference rules will be used.
    
        .. versionchanged:: 1.0.0
    
           Pandas infers nullable-integer dtype for integer data,
           string dtype for string data, and nullable-boolean dtype
           for boolean data.
    
        .. versionchanged:: 1.2.0
    
            Pandas now also infers nullable-floating dtype for float-like
            input data
    
    copy : bool, default True
        Whether to copy the data, even if not necessary. Depending
        on the type of `data`, creating the new array may require
        copying data, even if ``copy=False``.
    
    Returns
    -------
    ExtensionArray
        The newly created array.
    
    Raises
    ------
    ValueError
        When `data` is not 1-dimensional.
    
    See Also
    --------
    numpy.array : Construct a NumPy array.
    Series : Construct a pandas Series.
    Index : Construct a pandas Index.
    arrays.PandasArray : ExtensionArray wrapping a NumPy array.
    Series.array : Extract the array stored within a Series.
    
    Notes
    -----
    Omitting the `dtype` argument means pandas will attempt to infer the
    best array type from the values in the data. As new array types are
    added by pandas and 3rd party libraries, the "best" array type may
    change. We recommend specifying `dtype` to ensure that
    
    1. the correct array type for the data is returned
    2. the returned array type doesn't change as new extension types
       are added by pandas and third-party libraries
    
    Additionally, if the underlying memory representation of the returned
    array matters, we recommend specifying the `dtype` as a concrete object
    rather than a string alias or allowing it to be inferred. For example,
    a future version of pandas or a 3rd-party library may include a
    dedicated ExtensionArray for string data. In this event, the following
    would no longer return a :class:`arrays.PandasArray` backed by a NumPy
    array.
    
    >>> pd.array(['a', 'b'], dtype=str)
    <PandasArray>
    ['a', 'b']
    Length: 2, dtype: str32
    
    This would instead return the new ExtensionArray dedicated for string
    data. If you really need the new array to be backed by a  NumPy array,
    specify that in the dtype.
    
    >>> pd.array(['a', 'b'], dtype=np.dtype("<U1"))
    <PandasArray>
    ['a', 'b']
    Length: 2, dtype: str32
    
    Finally, Pandas has arrays that mostly overlap with NumPy
    
      * :class:`arrays.DatetimeArray`
      * :class:`arrays.TimedeltaArray`
    
    When data with a ``datetime64[ns]`` or ``timedelta64[ns]`` dtype is
    passed, pandas will always return a ``DatetimeArray`` or ``TimedeltaArray``
    rather than a ``PandasArray``. This is for symmetry with the case of
    timezone-aware data, which NumPy does not natively support.
    
    >>> pd.array(['2015', '2016'], dtype='datetime64[ns]')
    <DatetimeArray>
    ['2015-01-01 00:00:00', '2016-01-01 00:00:00']
    Length: 2, dtype: datetime64[ns]
    
    >>> pd.array(["1H", "2H"], dtype='timedelta64[ns]')
    <TimedeltaArray>
    ['0 days 01:00:00', '0 days 02:00:00']
    Length: 2, dtype: timedelta64[ns]
    
    Examples
    --------
    If a dtype is not specified, pandas will infer the best dtype from the values.
    See the description of `dtype` for the types pandas infers for.
    
    >>> pd.array([1, 2])
    <IntegerArray>
    [1, 2]
    Length: 2, dtype: Int64
    
    >>> pd.array([1, 2, np.nan])
    <IntegerArray>
    [1, 2, <NA>]
    Length: 3, dtype: Int64
    
    >>> pd.array([1.1, 2.2])
    <FloatingArray>
    [1.1, 2.2]
    Length: 2, dtype: Float64
    
    >>> pd.array(["a", None, "c"])
    <StringArray>
    ['a', <NA>, 'c']
    Length: 3, dtype: string
    
    >>> with pd.option_context("string_storage", "pyarrow"):
    ...     arr = pd.array(["a", None, "c"])
    ...
    >>> arr
    <ArrowStringArray>
    ['a', <NA>, 'c']
    Length: 3, dtype: string
    
    >>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")])
    <PeriodArray>
    ['2000-01-01', '2000-01-01']
    Length: 2, dtype: period[D]
    
    You can use the string alias for `dtype`
    
    >>> pd.array(['a', 'b', 'a'], dtype='category')
    ['a', 'b', 'a']
    Categories (2, object): ['a', 'b']
    
    Or specify the actual dtype
    
    >>> pd.array(['a', 'b', 'a'],
    ...          dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True))
    ['a', 'b', 'a']
    Categories (3, object): ['a' < 'b' < 'c']
    
    If pandas does not infer a dedicated extension type a
    :class:`arrays.PandasArray` is returned.
    
    >>> pd.array([1 + 1j, 3 + 2j])
    <PandasArray>
    [(1+1j), (3+2j)]
    Length: 2, dtype: complex128
    
    As mentioned in the "Notes" section, new extension types may be added
    in the future (by pandas or 3rd party libraries), causing the return
    value to no longer be a :class:`arrays.PandasArray`. Specify the `dtype`
    as a NumPy dtype if you need to ensure there's no future change in
    behavior.
    
    >>> pd.array([1, 2], dtype=np.dtype("int32"))
    <PandasArray>
    [1, 2]
    Length: 2, dtype: int32
    
    `data` must be 1-dimensional. A ValueError is raised when the input
    has the wrong dimensionality.
    
    >>> pd.array(1)
    Traceback (most recent call last):
      ...
    ValueError: Cannot pass scalar '1' to 'pandas.array'.

Function02

bdate_range(start=None, end=None, periods: 'int | None' = None, freq='B', tz=None, normalize: 'bool' = True, name: 'Hashable' = None, weekmask=None, holidays=None, closed=None, **kwargs) -> 'DatetimeIndex'

Help on function bdate_range in module pandas.core.indexes.datetimes:

bdate_range(start=None, end=None, periods: 'int | None' = None, freq='B', tz=None, normalize: 'bool' = True, name: 'Hashable' = None, weekmask=None, holidays=None, closed=None, **kwargs) -> 'DatetimeIndex'
    Return a fixed frequency DatetimeIndex, with business day as the default
    frequency.
    
    Parameters
    ----------
    start : str or datetime-like, default None
        Left bound for generating dates.
    end : str or datetime-like, default None
        Right bound for generating dates.
    periods : int, default None
        Number of periods to generate.
    freq : str or DateOffset, default 'B' (business daily)
        Frequency strings can have multiples, e.g. '5H'.
    tz : str or None
        Time zone name for returning localized DatetimeIndex, for example
        Asia/Beijing.
    normalize : bool, default True
        Normalize start/end dates to midnight before generating date range.
    name : str, default None
        Name of the resulting DatetimeIndex.
    weekmask : str or None, default None
        Weekmask of valid business days, passed to ``numpy.busdaycalendar``,
        only used when custom frequency strings are passed.  The default
        value None is equivalent to 'Mon Tue Wed Thu Fri'.
    holidays : list-like or None, default None
        Dates to exclude from the set of valid business days, passed to
        ``numpy.busdaycalendar``, only used when custom frequency strings
        are passed.
    closed : str, default None
        Make the interval closed with respect to the given frequency to
        the 'left', 'right', or both sides (None).
    **kwargs
        For compatibility. Has no effect on the result.
    
    Returns
    -------
    DatetimeIndex
    
    Notes
    -----
    Of the four parameters: ``start``, ``end``, ``periods``, and ``freq``,
    exactly three must be specified.  Specifying ``freq`` is a requirement
    for ``bdate_range``.  Use ``date_range`` if specifying ``freq`` is not
    desired.
    
    To learn more about the frequency strings, please see `this link
    <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
    
    Examples
    --------
    Note how the two weekend days are skipped in the result.
    
    >>> pd.bdate_range(start='1/1/2018', end='1/08/2018')
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-08'],
              dtype='datetime64[ns]', freq='B')
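
The weekmask and holidays parameters above only take effect with a custom business-day frequency such as 'C'. A small sketch of that (my own example, not from the help text):

import pandas as pd

# Custom business days: Mondays, Wednesdays, and Fridays only,
# with one holiday excluded; weekmask/holidays require freq='C'
idx = pd.bdate_range(start='2018-01-01', end='2018-01-12',
                     freq='C', weekmask='Mon Wed Fri',
                     holidays=[pd.Timestamp('2018-01-05')])
print(idx)
# Expected: 2018-01-01, 2018-01-03, 2018-01-08, 2018-01-10, 2018-01-12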

Function03

concat(objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]', axis=0, join='outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool' = True) -> 'FrameOrSeriesUnion'

Help on function concat in module pandas.core.reshape.concat:

concat(objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]', axis=0, join='outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool' = True) -> 'FrameOrSeriesUnion'
    Concatenate pandas objects along a particular axis with optional set logic
    along the other axes.
    
    Can also add a layer of hierarchical indexing on the concatenation axis,
    which may be useful if the labels are the same (or overlapping) on
    the passed axis number.
    
    Parameters
    ----------
    objs : a sequence or mapping of Series or DataFrame objects
        If a mapping is passed, the sorted keys will be used as the `keys`
        argument, unless it is passed, in which case the values will be
        selected (see below). Any None objects will be dropped silently unless
        they are all None in which case a ValueError will be raised.
    axis : {0/'index', 1/'columns'}, default 0
        The axis to concatenate along.
    join : {'inner', 'outer'}, default 'outer'
        How to handle indexes on other axis (or axes).
    ignore_index : bool, default False
        If True, do not use the index values along the concatenation axis. The
        resulting axis will be labeled 0, ..., n - 1. This is useful if you are
        concatenating objects where the concatenation axis does not have
        meaningful indexing information. Note the index values on the other
        axes are still respected in the join.
    keys : sequence, default None
        If multiple levels passed, should contain tuples. Construct
        hierarchical index using the passed keys as the outermost level.
    levels : list of sequences, default None
        Specific levels (unique values) to use for constructing a
        MultiIndex. Otherwise they will be inferred from the keys.
    names : list, default None
        Names for the levels in the resulting hierarchical index.
    verify_integrity : bool, default False
        Check whether the new concatenated axis contains duplicates. This can
        be very expensive relative to the actual data concatenation.
    sort : bool, default False
        Sort non-concatenation axis if it is not already aligned when `join`
        is 'outer'.
        This has no effect when ``join='inner'``, which already preserves
        the order of the non-concatenation axis.
    
        .. versionchanged:: 1.0.0
    
           Changed to not sort by default.
    
    copy : bool, default True
        If False, do not copy data unnecessarily.
    
    Returns
    -------
    object, type of objs
        When concatenating all ``Series`` along the index (axis=0), a
        ``Series`` is returned. When ``objs`` contains at least one
        ``DataFrame``, a ``DataFrame`` is returned. When concatenating along
        the columns (axis=1), a ``DataFrame`` is returned.
    
    See Also
    --------
    Series.append : Concatenate Series.
    DataFrame.append : Concatenate DataFrames.
    DataFrame.join : Join DataFrames using indexes.
    DataFrame.merge : Merge DataFrames by indexes or columns.
    
    Notes
    -----
    The keys, levels, and names arguments are all optional.
    
    A walkthrough of how this method fits in with other tools for combining
    pandas objects can be found `here
    <https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html>`__.
    
    Examples
    --------
    Combine two ``Series``.
    
    >>> s1 = pd.Series(['a', 'b'])
    >>> s2 = pd.Series(['c', 'd'])
    >>> pd.concat([s1, s2])
    0    a
    1    b
    0    c
    1    d
    dtype: object
    
    Clear the existing index and reset it in the result
    by setting the ``ignore_index`` option to ``True``.
    
    >>> pd.concat([s1, s2], ignore_index=True)
    0    a
    1    b
    2    c
    3    d
    dtype: object
    
    Add a hierarchical index at the outermost level of
    the data with the ``keys`` option.
    
    >>> pd.concat([s1, s2], keys=['s1', 's2'])
    s1  0    a
        1    b
    s2  0    c
        1    d
    dtype: object
    
    Label the index keys you create with the ``names`` option.
    
    >>> pd.concat([s1, s2], keys=['s1', 's2'],
    ...           names=['Series name', 'Row ID'])
    Series name  Row ID
    s1           0         a
                 1         b
    s2           0         c
                 1         d
    dtype: object
    
    Combine two ``DataFrame`` objects with identical columns.
    
    >>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
    ...                    columns=['letter', 'number'])
    >>> df1
      letter  number
    0      a       1
    1      b       2
    >>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
    ...                    columns=['letter', 'number'])
    >>> df2
      letter  number
    0      c       3
    1      d       4
    >>> pd.concat([df1, df2])
      letter  number
    0      a       1
    1      b       2
    0      c       3
    1      d       4
    
    Combine ``DataFrame`` objects with overlapping columns
    and return everything. Columns outside the intersection will
    be filled with ``NaN`` values.
    
    >>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
    ...                    columns=['letter', 'number', 'animal'])
    >>> df3
      letter  number animal
    0      c       3    cat
    1      d       4    dog
    >>> pd.concat([df1, df3], sort=False)
      letter  number animal
    0      a       1    NaN
    1      b       2    NaN
    0      c       3    cat
    1      d       4    dog
    
    Combine ``DataFrame`` objects with overlapping columns
    and return only those that are shared by passing ``inner`` to
    the ``join`` keyword argument.
    
    >>> pd.concat([df1, df3], join="inner")
      letter  number
    0      a       1
    1      b       2
    0      c       3
    1      d       4
    
    Combine ``DataFrame`` objects horizontally along the x axis by
    passing in ``axis=1``.
    
    >>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
    ...                    columns=['animal', 'name'])
    >>> pd.concat([df1, df4], axis=1)
      letter  number  animal    name
    0      a       1    bird   polly
    1      b       2  monkey  george
    
    Prevent the result from including duplicate index values with the
    ``verify_integrity`` option.
    
    >>> df5 = pd.DataFrame([1], index=['a'])
    >>> df5
       0
    a  1
    >>> df6 = pd.DataFrame([2], index=['a'])
    >>> df6
       0
    a  2
    >>> pd.concat([df5, df6], verify_integrity=True)
    Traceback (most recent call last):
        ...
    ValueError: Indexes have overlapping values: ['a']

Function04

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: 'str' = 'All', dropna: 'bool' = True, normalize=False) -> 'DataFrame'

Help on function crosstab in module pandas.core.reshape.pivot:

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: 'str' = 'All', dropna: 'bool' = True, normalize=False) -> 'DataFrame'
    Compute a simple cross tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed.
    
    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows.
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns.
    values : array-like, optional
        Array of values to aggregate according to the factors.
        Requires `aggfunc` be specified.
    rownames : sequence, default None
        If passed, must match number of row arrays passed.
    colnames : sequence, default None
        If passed, must match number of column arrays passed.
    aggfunc : function, optional
        If specified, requires `values` be specified as well.
    margins : bool, default False
        Add row/column margins (subtotals).
    margins_name : str, default 'All'
        Name of the row/column that will contain the totals
        when margins is True.
    dropna : bool, default True
        Do not include columns whose entries are all NaN.
    normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
        Normalize by dividing all values by the sum of values.
    
        - If passed 'all' or `True`, will normalize over all values.
        - If passed 'index' will normalize over each row.
        - If passed 'columns' will normalize over each column.
        - If margins is `True`, will also normalize margin values.
    
    Returns
    -------
    DataFrame
        Cross tabulation of the data.
    
    See Also
    --------
    DataFrame.pivot : Reshape data based on column values.
    pivot_table : Create a pivot table as a DataFrame.
    
    Notes
    -----
    Any Series passed will have their name attributes used unless row or column
    names for the cross-tabulation are specified.
    
    Any input passed containing Categorical data will have **all** of its
    categories included in the cross-tabulation, even if the actual data does
    not contain any instances of a particular category.
    
    In the event that there aren't overlapping indexes an empty DataFrame will
    be returned.
    
    Examples
    --------
    >>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
    ...               "bar", "bar", "foo", "foo", "foo"], dtype=object)
    >>> b = np.array(["one", "one", "one", "two", "one", "one",
    ...               "one", "two", "two", "two", "one"], dtype=object)
    >>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
    ...               "shiny", "dull", "shiny", "shiny", "shiny"],
    ...              dtype=object)
    >>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
    b   one        two
    c   dull shiny dull shiny
    a
    bar    1     2    1     0
    foo    2     2    1     2
    
    Here 'c' and 'f' are not represented in the data and will not be
    shown in the output because dropna is True by default. Set
    dropna=False to preserve categories with no data.
    
    >>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
    >>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
    >>> pd.crosstab(foo, bar)
    col_0  d  e
    row_0
    a      1  0
    b      0  1
    >>> pd.crosstab(foo, bar, dropna=False)
    col_0  d  e  f
    row_0
    a      1  0  0
    b      0  1  0
    c      0  0  0
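
The normalize parameter is described above but not shown in the examples. A short sketch (my own addition):

import pandas as pd

a = ['x', 'x', 'y', 'y']
b = ['u', 'v', 'u', 'u']

# Raw frequency table, then row-wise proportions
print(pd.crosstab(a, b))
print(pd.crosstab(a, b, normalize='index'))
# row x -> 0.5 / 0.5, row y -> 1.0 / 0.0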

Function05

cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise', ordered: bool = True)

Help on function cut in module pandas.core.reshape.tile:

cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise', ordered: bool = True)
    Bin values into discrete intervals.
    
    Use `cut` when you need to segment and sort data values into bins. This
    function is also useful for going from a continuous variable to a
    categorical variable. For example, `cut` could convert ages to groups of
    age ranges. Supports binning into an equal number of bins, or a
    pre-specified array of bins.
    
    Parameters
    ----------
    x : array-like
        The input array to be binned. Must be 1-dimensional.
    bins : int, sequence of scalars, or IntervalIndex
        The criteria to bin by.
    
        * int : Defines the number of equal-width bins in the range of `x`. The
          range of `x` is extended by .1% on each side to include the minimum
          and maximum values of `x`.
        * sequence of scalars : Defines the bin edges allowing for non-uniform
          width. No extension of the range of `x` is done.
        * IntervalIndex : Defines the exact bins to be used. Note that
          IntervalIndex for `bins` must be non-overlapping.
    
    right : bool, default True
        Indicates whether `bins` includes the rightmost edge or not. If
        ``right == True`` (the default), then the `bins` ``[1, 2, 3, 4]``
        indicate (1,2], (2,3], (3,4]. This argument is ignored when
        `bins` is an IntervalIndex.
    labels : array or False, default None
        Specifies the labels for the returned bins. Must be the same length as
        the resulting bins. If False, returns only integer indicators of the
        bins. This affects the type of the output container (see below).
        This argument is ignored when `bins` is an IntervalIndex. If True,
        raises an error. When `ordered=False`, labels must be provided.
    retbins : bool, default False
        Whether to return the bins or not. Useful when bins is provided
        as a scalar.
    precision : int, default 3
        The precision at which to store and display the bins labels.
    include_lowest : bool, default False
        Whether the first interval should be left-inclusive or not.
    duplicates : {default 'raise', 'drop'}, optional
        If bin edges are not unique, raise ValueError or drop non-uniques.
    ordered : bool, default True
        Whether the labels are ordered or not. Applies to returned types
        Categorical and Series (with Categorical dtype). If True,
        the resulting categorical will be ordered. If False, the resulting
        categorical will be unordered (labels must be provided).
    
        .. versionadded:: 1.1.0
    
    Returns
    -------
    out : Categorical, Series, or ndarray
        An array-like object representing the respective bin for each value
        of `x`. The type depends on the value of `labels`.
    
        * True (default) : returns a Series for Series `x` or a
          Categorical for all other inputs. The values stored within
          are Interval dtype.
    
        * sequence of scalars : returns a Series for Series `x` or a
          Categorical for all other inputs. The values stored within
          are whatever the type in the sequence is.
    
        * False : returns an ndarray of integers.
    
    bins : numpy.ndarray or IntervalIndex.
        The computed or specified bins. Only returned when `retbins=True`.
        For scalar or sequence `bins`, this is an ndarray with the computed
        bins. If set `duplicates=drop`, `bins` will drop non-unique bin. For
        an IntervalIndex `bins`, this is equal to `bins`.
    
    See Also
    --------
    qcut : Discretize variable into equal-sized buckets based on rank
        or based on sample quantiles.
    Categorical : Array type for storing data that come from a
        fixed set of values.
    Series : One-dimensional array with axis labels (including time series).
    IntervalIndex : Immutable Index implementing an ordered, sliceable set.
    
    Notes
    -----
    Any NA values will be NA in the result. Out of bounds values will be NA in
    the resulting Series or Categorical object.
    
    Examples
    --------
    Discretize into three equal-sized bins.
    
    >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
    ... # doctest: +ELLIPSIS
    [(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
    Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
    
    >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
    ... # doctest: +ELLIPSIS
    ([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
    Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
    array([0.994, 3.   , 5.   , 7.   ]))
    
    Discovers the same bins, but assign them specific labels. Notice that
    the returned Categorical's categories are `labels` and is ordered.
    
    >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]),
    ...        3, labels=["bad", "medium", "good"])
    ['bad', 'good', 'medium', 'medium', 'good', 'bad']
    Categories (3, object): ['bad' < 'medium' < 'good']
    
    ``ordered=False`` will result in unordered categories when labels are passed.
    This parameter can be used to allow non-unique labels:
    
    >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
    ...        labels=["B", "A", "B"], ordered=False)
    ['B', 'B', 'A', 'A', 'B', 'B']
    Categories (2, object): ['A', 'B']
    
    ``labels=False`` implies you just want the bins back.
    
    >>> pd.cut([0, 1, 1, 2], bins=4, labels=False)
    array([0, 1, 1, 3])
    
    Passing a Series as an input returns a Series with categorical dtype:
    
    >>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
    ...               index=['a', 'b', 'c', 'd', 'e'])
    >>> pd.cut(s, 3)
    ... # doctest: +ELLIPSIS
    a    (1.992, 4.667]
    b    (1.992, 4.667]
    c    (4.667, 7.333]
    d     (7.333, 10.0]
    e     (7.333, 10.0]
    dtype: category
    Categories (3, interval[float64, right]): [(1.992, 4.667] < (4.667, ...
    
    Passing a Series as an input returns a Series with mapping value.
    It is used to map numerically to intervals based on bins.
    
    >>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
    ...               index=['a', 'b', 'c', 'd', 'e'])
    >>> pd.cut(s, [0, 2, 4, 6, 8, 10], labels=False, retbins=True, right=False)
    ... # doctest: +ELLIPSIS
    (a    1.0
     b    2.0
     c    3.0
     d    4.0
     e    NaN
     dtype: float64,
     array([ 0,  2,  4,  6,  8, 10]))
    
    Use `drop` optional when bins is not unique
    
    >>> pd.cut(s, [0, 2, 4, 6, 10, 10], labels=False, retbins=True,
    ...        right=False, duplicates='drop')
    ... # doctest: +ELLIPSIS
    (a    1.0
     b    2.0
     c    3.0
     d    3.0
     e    NaN
     dtype: float64,
     array([ 0,  2,  4,  6, 10]))
    
    Passing an IntervalIndex for `bins` results in those categories exactly.
    Notice that values not covered by the IntervalIndex are set to NaN. 0
    is to the left of the first bin (which is closed on the right), and 1.5
    falls between two bins.
    
    >>> bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
    >>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
    [NaN, (0.0, 1.0], NaN, (2.0, 3.0], (4.0, 5.0]]
    Categories (3, interval[int64, right]): [(0, 1] < (2, 3] < (4, 5]]
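
The docstring's motivating use case, converting ages to age ranges, makes a handy concrete example. A minimal sketch (my own illustration):

import pandas as pd

ages = pd.Series([15, 22, 38, 47, 63])
# right=True by default, so the bins are (0, 17], (17, 34], (34, 59], (59, 120]
groups = pd.cut(ages, bins=[0, 17, 34, 59, 120],
                labels=['minor', 'young adult', 'middle-aged', 'senior'])
print(groups.tolist())
# ['minor', 'young adult', 'middle-aged', 'middle-aged', 'senior']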

Function06

date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize: 'bool' = False, name: 'Hashable' = None, closed=None, **kwargs) -> 'DatetimeIndex'

Help on function date_range in module pandas.core.indexes.datetimes:

date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize: 'bool' = False, name: 'Hashable' = None, closed=None, **kwargs) -> 'DatetimeIndex'
    Return a fixed frequency DatetimeIndex.
    
    Returns the range of equally spaced time points (where the difference between any
    two adjacent points is specified by the given frequency) such that they all
    satisfy `start <[=] x <[=] end`, where the first one and the last one are, resp.,
    the first and last time points in that range that fall on the boundary of ``freq``
    (if given as a frequency string) or that are valid for ``freq`` (if given as a
    :class:`pandas.tseries.offsets.DateOffset`). (If exactly one of ``start``,
    ``end``, or ``freq`` is *not* specified, this missing parameter can be computed
    given ``periods``, the number of timesteps in the range. See the note below.)
    
    Parameters
    ----------
    start : str or datetime-like, optional
        Left bound for generating dates.
    end : str or datetime-like, optional
        Right bound for generating dates.
    periods : int, optional
        Number of periods to generate.
    freq : str or DateOffset, default 'D'
        Frequency strings can have multiples, e.g. '5H'. See
        :ref:`here <timeseries.offset_aliases>` for a list of
        frequency aliases.
    tz : str or tzinfo, optional
        Time zone name for returning localized DatetimeIndex, for example
        'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is
        timezone-naive.
    normalize : bool, default False
        Normalize start/end dates to midnight before generating date range.
    name : str, default None
        Name of the resulting DatetimeIndex.
    closed : {None, 'left', 'right'}, optional
        Make the interval closed with respect to the given frequency to
        the 'left', 'right', or both sides (None, the default).
    **kwargs
        For compatibility. Has no effect on the result.
    
    Returns
    -------
    rng : DatetimeIndex
    
    See Also
    --------
    DatetimeIndex : An immutable container for datetimes.
    timedelta_range : Return a fixed frequency TimedeltaIndex.
    period_range : Return a fixed frequency PeriodIndex.
    interval_range : Return a fixed frequency IntervalIndex.
    
    Notes
    -----
    Of the four parameters ``start``, ``end``, ``periods``, and ``freq``,
    exactly three must be specified. If ``freq`` is omitted, the resulting
    ``DatetimeIndex`` will have ``periods`` linearly spaced elements between
    ``start`` and ``end`` (closed on both sides).
    
    To learn more about the frequency strings, please see `this link
    <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
    
    Examples
    --------
    **Specifying the values**
    
    The next four examples generate the same `DatetimeIndex`, but vary
    the combination of `start`, `end` and `periods`.
    
    Specify `start` and `end`, with the default daily frequency.
    
    >>> pd.date_range(start='1/1/2018', end='1/08/2018')
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start` and `periods`, the number of periods (days).
    
    >>> pd.date_range(start='1/1/2018', periods=8)
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `end` and `periods`, the number of periods (days).
    
    >>> pd.date_range(end='1/1/2018', periods=8)
    DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
                   '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start`, `end`, and `periods`; the frequency is generated
    automatically (linearly spaced).
    
    >>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
    DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
                   '2018-04-27 00:00:00'],
                  dtype='datetime64[ns]', freq=None)
    
    **Other Parameters**
    
    Changed the `freq` (frequency) to ``'M'`` (month end frequency).
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='M')
    DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
                   '2018-05-31'],
                  dtype='datetime64[ns]', freq='M')
    
    Multiples are allowed
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    `freq` can also be specified as an Offset object.
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    Specify `tz` to set the timezone.
    
    >>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
    DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
                   '2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
                   '2018-01-05 00:00:00+09:00'],
                  dtype='datetime64[ns, Asia/Tokyo]', freq='D')
    
    `closed` controls whether to include `start` and `end` that are on the
    boundary. The default includes boundary points on either end.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed=None)
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='left'`` to exclude `end` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='left')
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='right'`` to exclude `start` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='right')
    DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')

Function07

eval(expr: 'str | BinOp', parser: 'str' = 'pandas', engine: 'str | None' = None, truediv=<no_default>, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)

Help on function eval in module pandas.core.computation.eval:

eval(expr: 'str | BinOp', parser: 'str' = 'pandas', engine: 'str | None' = None, truediv=<no_default>, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)
    Evaluate a Python expression as a string using various backends.
    
    The following arithmetic operations are supported: ``+``, ``-``, ``*``,
    ``/``, ``**``, ``%``, ``//`` (python engine only) along with the following
    boolean operations: ``|`` (or), ``&`` (and), and ``~`` (not).
    Additionally, the ``'pandas'`` parser allows the use of :keyword:`and`,
    :keyword:`or`, and :keyword:`not` with the same semantics as the
    corresponding bitwise operators.  :class:`~pandas.Series` and
    :class:`~pandas.DataFrame` objects are supported and behave as they would
    with plain ol' Python evaluation.
    
    Parameters
    ----------
    expr : str
        The expression to evaluate. This string cannot contain any Python
        `statements
        <https://docs.python.org/3/reference/simple_stmts.html#simple-statements>`__,
        only Python `expressions
        <https://docs.python.org/3/reference/simple_stmts.html#expression-statements>`__.
    parser : {'pandas', 'python'}, default 'pandas'
        The parser to use to construct the syntax tree from the expression. The
        default of ``'pandas'`` parses code slightly different than standard
        Python. Alternatively, you can parse an expression using the
        ``'python'`` parser to retain strict Python semantics.  See the
        :ref:`enhancing performance <enhancingperf.eval>` documentation for
        more details.
    engine : {'python', 'numexpr'}, default 'numexpr'
    
        The engine used to evaluate the expression. Supported engines are
    
        - None         : tries to use ``numexpr``, falls back to ``python``
        - ``'numexpr'``: This default engine evaluates pandas objects using
                         numexpr for large speed ups in complex expressions
                         with large frames.
        - ``'python'``: Performs operations as if you had ``eval``'d in top
                        level python. This engine is generally not that useful.
    
        More backends may be available in the future.
    
    truediv : bool, optional
        Whether to use true division, like in Python >= 3.
    
        .. deprecated:: 1.0.0
    
    local_dict : dict or None, optional
        A dictionary of local variables, taken from locals() by default.
    global_dict : dict or None, optional
        A dictionary of global variables, taken from globals() by default.
    resolvers : list of dict-like or None, optional
        A list of objects implementing the ``__getitem__`` special method that
        you can use to inject an additional collection of namespaces to use for
        variable lookup. For example, this is used in the
        :meth:`~DataFrame.query` method to inject the
        ``DataFrame.index`` and ``DataFrame.columns``
        variables that refer to their respective :class:`~pandas.DataFrame`
        instance attributes.
    level : int, optional
        The number of prior stack frames to traverse and add to the current
        scope. Most users will **not** need to change this parameter.
    target : object, optional, default None
        This is the target object for assignment. It is used when there is
        variable assignment in the expression. If so, then `target` must
        support item assignment with string keys, and if a copy is being
        returned, it must also support `.copy()`.
    inplace : bool, default False
        If `target` is provided, and the expression mutates `target`, whether
        to modify `target` inplace. Otherwise, return a copy of `target` with
        the mutation.
    
    Returns
    -------
    ndarray, numeric scalar, DataFrame, Series, or None
        The completion value of evaluating the given code or None if ``inplace=True``.
    
    Raises
    ------
    ValueError
        There are many instances where such an error can be raised:
    
        - `target=None`, but the expression is multiline.
        - The expression is multiline, but not all them have item assignment.
          An example of such an arrangement is this:
    
          a = b + 1
          a + 2
    
          Here, there are expressions on different lines, making it multiline,
          but the last line has no variable assigned to the output of `a + 2`.
        - `inplace=True`, but the expression is missing item assignment.
        - Item assignment is provided, but the `target` does not support
          string item assignment.
        - Item assignment is provided and `inplace=False`, but the `target`
          does not support the `.copy()` method
    
    See Also
    --------
    DataFrame.query : Evaluates a boolean expression to query the columns
            of a frame.
    DataFrame.eval : Evaluate a string describing operations on
            DataFrame columns.
    
    Notes
    -----
    The ``dtype`` of any objects involved in an arithmetic ``%`` operation are
    recursively cast to ``float64``.
    
    See the :ref:`enhancing performance <enhancingperf.eval>` documentation for
    more details.
    
    Examples
    --------
    >>> df = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
    >>> df
      animal  age
    0    dog   10
    1    pig   20
    
    We can add a new column using ``pd.eval``:
    
    >>> pd.eval("double_age = df.age * 2", target=df)
      animal  age  double_age
    0    dog   10          20
    1    pig   20          40
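
pd.eval also evaluates expressions over whole frames referenced as local variables, which is where the numexpr engine can pay off on large data. A minimal sketch (my own example):

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [10, 20], 'b': [30, 40]})

# Local variables are visible to the expression by name
print(pd.eval('df1 + df2 * 2'))
#     a   b
# 0  21  63
# 1  42  84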

Function08

factorize(values, sort: 'bool' = False, na_sentinel: 'int | None' = -1, size_hint: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray | Index]'

Help on function factorize in module pandas.core.algorithms:

factorize(values, sort: 'bool' = False, na_sentinel: 'int | None' = -1, size_hint: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray | Index]'
    Encode the object as an enumerated type or categorical variable.
    
    This method is useful for obtaining a numeric representation of an
    array when all that matters is identifying distinct values. `factorize`
    is available as both a top-level function :func:`pandas.factorize`,
    and as a method :meth:`Series.factorize` and :meth:`Index.factorize`.
    
    Parameters
    ----------
    values : sequence
        A 1-D sequence. Sequences that aren't pandas objects are
        coerced to ndarrays before factorization.
    sort : bool, default False
        Sort `uniques` and shuffle `codes` to maintain the
        relationship.
    
    na_sentinel : int or None, default -1
        Value to mark "not found". If None, will not drop the NaN
        from the uniques of the values.
    
        .. versionchanged:: 1.1.2
    size_hint : int, optional
        Hint to the hashtable sizer.
    
    Returns
    -------
    codes : ndarray
        An integer ndarray that's an indexer into `uniques`.
        ``uniques.take(codes)`` will have the same values as `values`.
    uniques : ndarray, Index, or Categorical
        The unique valid values. When `values` is Categorical, `uniques`
        is a Categorical. When `values` is some other pandas object, an
        `Index` is returned. Otherwise, a 1-D ndarray is returned.
    
        .. note ::
    
           Even if there's a missing value in `values`, `uniques` will
           *not* contain an entry for it.
    
    See Also
    --------
    cut : Discretize continuous-valued array.
    unique : Find the unique value in an array.
    
    Examples
    --------
    These examples all show factorize as a top-level method like
    ``pd.factorize(values)``. The results are identical for methods like
    :meth:`Series.factorize`.
    
    >>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
    >>> codes
    array([0, 0, 1, 2, 0]...)
    >>> uniques
    array(['b', 'a', 'c'], dtype=object)
    
    With ``sort=True``, the `uniques` will be sorted, and `codes` will be
    shuffled so that the relationship is the maintained.
    
    >>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
    >>> codes
    array([1, 1, 0, 2, 1]...)
    >>> uniques
    array(['a', 'b', 'c'], dtype=object)
    
    Missing values are indicated in `codes` with `na_sentinel`
    (``-1`` by default). Note that missing values are never
    included in `uniques`.
    
    >>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
    >>> codes
    array([ 0, -1,  1,  2,  0]...)
    >>> uniques
    array(['b', 'a', 'c'], dtype=object)
    
    Thus far, we've only factorized lists (which are internally coerced to
    NumPy arrays). When factorizing pandas objects, the type of `uniques`
    will differ. For Categoricals, a `Categorical` is returned.
    
    >>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
    >>> codes, uniques = pd.factorize(cat)
    >>> codes
    array([0, 0, 1]...)
    >>> uniques
    ['a', 'c']
    Categories (3, object): ['a', 'b', 'c']
    
    Notice that ``'b'`` is in ``uniques.categories``, despite not being
    present in ``cat.values``.
    
    For all other pandas objects, an Index of the appropriate type is
    returned.
    
    >>> cat = pd.Series(['a', 'a', 'c'])
    >>> codes, uniques = pd.factorize(cat)
    >>> codes
    array([0, 0, 1]...)
    >>> uniques
    Index(['a', 'c'], dtype='object')
    
    If NaN is in the values, and we want to include NaN in the uniques of the
    values, it can be achieved by setting ``na_sentinel=None``.
    
    >>> values = np.array([1, 2, 1, np.nan])
    >>> codes, uniques = pd.factorize(values)  # default: na_sentinel=-1
    >>> codes
    array([ 0,  1,  0, -1])
    >>> uniques
    array([1., 2.])
    
    >>> codes, uniques = pd.factorize(values, na_sentinel=None)
    >>> codes
    array([0, 1, 0, 2])
    >>> uniques
    array([ 1.,  2., nan])

Function09

get_dummies(data, prefix=None, prefix_sep='_', dummy_na: 'bool' = False, columns=None, sparse: 'bool' = False, drop_first: 'bool' = False, dtype: 'Dtype | None' = None) -> 'DataFrame'

Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na: 'bool' = False, columns=None, sparse: 'bool' = False, drop_first: 'bool' = False, dtype: 'Dtype | None' = None) -> 'DataFrame'
    Convert categorical variable into dummy/indicator variables.
    
    Parameters
    ----------
    data : array-like, Series, or DataFrame
        Data of which to get dummy indicators.
    prefix : str, list of str, or dict of str, default None
        String to append DataFrame column names.
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : str, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix`.
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy-encoded columns should be backed by
        a :class:`SparseArray` (True) or a regular NumPy array (False).
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by removing the
        first level.
    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.
    
    Returns
    -------
    DataFrame
        Dummy-coded data.
    
    See Also
    --------
    Series.str.get_dummies : Convert Series to dummy codes.
    
    Examples
    --------
    >>> s = pd.Series(list('abca'))
    
    >>> pd.get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    
    >>> s1 = ['a', 'b', np.nan]
    
    >>> pd.get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0
    
    >>> pd.get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1
    
    >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                    'C': [1, 2, 3]})
    
    >>> pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1
    
    >>> pd.get_dummies(pd.Series(list('abcaa')))
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    4  1  0  0
    
    >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0
    
    >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0
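
The columns parameter is documented above but not exemplified. A short sketch (my own addition) that restricts encoding to a single column:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})

# Encode only column 'A'; 'B' keeps its object dtype
print(pd.get_dummies(df, columns=['A']))
#    B  C  A_a  A_b
# 0  b  1    1    0
# 1  a  2    0    1
# 2  c  3    1    0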

To be continued...

Link to the next installment:

https://blog.csdn.net/boysoft2002/article/details/128428096
