Python pandas有几千个库函数,你用过几个?(1)

news2024/11/18 11:37:34

对Python的 pandas 库所有的内置元类、函数、子模块等全部浏览一遍,然后挑选一些重点学习一下。我安装的库版本号为1.3.5,如下:

>>> import pandas as pd
>>> pd.__version__
'1.3.5'
>>> print(pd.__doc__)

pandas - a powerful data analysis and manipulation library for Python
=====================================================================

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations.
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data.
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects.
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets.
  - Intuitive merging and joining data sets.
  - Flexible reshaping and pivoting of data sets.
  - Hierarchical labeling of axes (possible to have multiple labels per tick).
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format.
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, date shifting and lagging.

百度翻译如下: 

panda——一个强大的Python数据分析和操作库
=====================================================================
**pandas**是一个Python包,提供快速、灵活和富有表现力的数据设计用于同时处理“关系”或“标记”数据的结构简单直观。它旨在成为使用Python进行实际的**真实世界**数据分析。此外,它还有成为**最强大、最灵活的开源数据的更广泛目标任何语言的分析/操作工具**。它已经很好了朝着这个目标前进。

主要特点
-------------
下面是pandas做得很好的几件事:

-轻松处理浮点和非浮点丢失的数据点数据。

-大小可变性:可以从DataFrame和高维对象

-自动和显式数据对齐:对象可以显式对齐或者用户可以忽略这些标签,让`“系列”、“DataFrame”等自动为您对齐计算。

-功能强大、灵活的分组功能,可执行拆分应用组合对数据集的操作,用于聚合和转换数据。

-在其他Python中轻松转换不规则的、不同索引的数据将NumPy数据结构转换为DataFrame对象。

-基于标签的智能切片、花式索引和大型数据的子集数据集。

-直观地合并和连接数据集。

-数据集的灵活重塑和数据透视。

-轴的分层标签(每个刻度可能有多个标签)。

-用于从平面文件(CSV和分隔)加载数据的强大IO工具,Excel文件、数据库以及从超快HDF5保存/加载数据总体安排

-时间序列特定功能:日期范围生成和频率转换、移动窗口统计、日期偏移和滞后。

共119个pandas库函数(包含元类、函数、子模块等):

>>> import pandas as pd
>>> funcs = [_ for _ in dir(pd) if not _.startswith('_')]
>>> len(funcs)
119
>>> for i,f in enumerate(funcs,1):
	print(f'{f:18}',end='' if i%5 else '\n')

	
BooleanDtype      Categorical       CategoricalDtype  CategoricalIndex  DataFrame         
DateOffset        DatetimeIndex     DatetimeTZDtype   ExcelFile         ExcelWriter       
Flags             Float32Dtype      Float64Dtype      Float64Index      Grouper           
HDFStore          Index             IndexSlice        Int16Dtype        Int32Dtype        
Int64Dtype        Int64Index        Int8Dtype         Interval          IntervalDtype     
IntervalIndex     MultiIndex        NA                NaT               NamedAgg          
Period            PeriodDtype       PeriodIndex       RangeIndex        Series            
SparseDtype       StringDtype       Timedelta         TimedeltaIndex    Timestamp         
UInt16Dtype       UInt32Dtype       UInt64Dtype       UInt64Index       UInt8Dtype        
api               array             arrays            bdate_range       compat            
concat            core              crosstab          cut               date_range        
describe_option   errors            eval              factorize         get_dummies       
get_option        infer_freq        interval_range    io                isna              
isnull            json_normalize    lreshape          melt              merge             
merge_asof        merge_ordered     notna             notnull           offsets           
option_context    options           pandas            period_range      pivot             
pivot_table       plotting          qcut              read_clipboard    read_csv          
read_excel        read_feather      read_fwf          read_gbq          read_hdf          
read_html         read_json         read_orc          read_parquet      read_pickle       
read_sas          read_spss         read_sql          read_sql_query    read_sql_table    
read_stata        read_table        read_xml          reset_option      set_eng_float_format
set_option        show_versions     test              testing           timedelta_range   
to_datetime       to_numeric        to_pickle         to_timedelta      tseries           
unique            util              value_counts      wide_to_long      

按元类、函数、子模块等,先分个类:

import pandas as pd
funcs = [_ for _ in dir(pd) if not _.startswith('_')]
types = type(pd.DataFrame), type(pd.array), type(pd)
Names = 'Type','Function','Module','Other'
Types = {}

for f in funcs:
    t = type(eval("pd."+f))
    t = Names[-1 if t not in types else types.index(type(eval("pd."+f)))]
    Types[t] = Types.get(t,[])+[f]

for n in Names:
    print(f"\n\n【{n}】:{len(Types[n])}")
    for i,f in enumerate(Types[n],1):
        print(f'{f:18} ',end='' if i%5 or i==len(Types[n]) else '\n')

分类结果: 

【Type】:40
BooleanDtype       CategoricalDtype   CategoricalIndex   DataFrame        DatetimeIndex      
DatetimeTZDtype    ExcelFile          Flags              Float32Dtype       Float64Dtype       
Float64Index       Grouper            HDFStore           Index              Int16Dtype         
Int32Dtype         Int64Dtype         Int64Index         Int8Dtype          Interval           
IntervalDtype      IntervalIndex      MultiIndex         NamedAgg           Period             
PeriodDtype        PeriodIndex        RangeIndex         Series             SparseDtype        
StringDtype        Timedelta          TimedeltaIndex     Timestamp          UInt16Dtype        
UInt32Dtype        UInt64Dtype        UInt64Index        UInt8Dtype         option_context     

【Function】:56
array              bdate_range        concat             crosstab           cut                
date_range         eval               factorize          get_dummies        infer_freq         
interval_range     isna               isnull             json_normalize     lreshape           
melt               merge              merge_asof         merge_ordered      notna              
notnull            period_range       pivot              pivot_table        qcut               
read_clipboard     read_csv           read_excel         read_feather       read_fwf           
read_gbq           read_hdf           read_html          read_json          read_orc           
read_parquet       read_pickle        read_sas           read_spss          read_sql           
read_sql_query     read_sql_table     read_stata         read_table         read_xml           
set_eng_float_format show_versions      test               timedelta_range    to_datetime        
to_numeric         to_pickle          to_timedelta       unique             value_counts       
wide_to_long       

【Module】:12
api                arrays             compat             core               errors             
io                 offsets            pandas             plotting           testing            
tseries            util               

【Other】:11
Categorical        DateOffset         ExcelWriter        IndexSlice         NA                 
NaT                describe_option    get_option         options            reset_option       
set_option   

先给出56个库函数的原版帮助,有252K之多单篇博文放不下,只能以连载方式罗列:

A~G: Function01~09

Types['Function'][:9]
['array', 'bdate_range', 'concat', 'crosstab', 'cut', 'date_range', 'eval', 'factorize', 'get_dummies']

Function01

array(data: 'Sequence[object] | AnyArrayLike', dtype: 'Dtype | None' = None, copy: 'bool' = True) -> 'ExtensionArray'

Help on function array in module pandas.core.construction:

array(data: 'Sequence[object] | AnyArrayLike', dtype: 'Dtype | None' = None, copy: 'bool' = True) -> 'ExtensionArray'
    Create an array.
    
    Parameters
    ----------
    data : Sequence of objects
        The scalars inside `data` should be instances of the
        scalar type for `dtype`. It's expected that `data`
        represents a 1-dimensional array of data.
    
        When `data` is an Index or Series, the underlying array
        will be extracted from `data`.
    
    dtype : str, np.dtype, or ExtensionDtype, optional
        The dtype to use for the array. This may be a NumPy
        dtype or an extension type registered with pandas using
        :meth:`pandas.api.extensions.register_extension_dtype`.
    
        If not specified, there are two possibilities:
    
        1. When `data` is a :class:`Series`, :class:`Index`, or
           :class:`ExtensionArray`, the `dtype` will be taken
           from the data.
        2. Otherwise, pandas will attempt to infer the `dtype`
           from the data.
    
        Note that when `data` is a NumPy array, ``data.dtype`` is
        *not* used for inferring the array type. This is because
        NumPy cannot represent all the types of data that can be
        held in extension arrays.
    
        Currently, pandas will infer an extension dtype for sequences of
    
        ============================== =======================================
        Scalar Type                    Array Type
        ============================== =======================================
        :class:`pandas.Interval`       :class:`pandas.arrays.IntervalArray`
        :class:`pandas.Period`         :class:`pandas.arrays.PeriodArray`
        :class:`datetime.datetime`     :class:`pandas.arrays.DatetimeArray`
        :class:`datetime.timedelta`    :class:`pandas.arrays.TimedeltaArray`
        :class:`int`                   :class:`pandas.arrays.IntegerArray`
        :class:`float`                 :class:`pandas.arrays.FloatingArray`
        :class:`str`                   :class:`pandas.arrays.StringArray` or
                                       :class:`pandas.arrays.ArrowStringArray`
        :class:`bool`                  :class:`pandas.arrays.BooleanArray`
        ============================== =======================================
    
        The ExtensionArray created when the scalar type is :class:`str` is determined by
        ``pd.options.mode.string_storage`` if the dtype is not explicitly given.
    
        For all other cases, NumPy's usual inference rules will be used.
    
        .. versionchanged:: 1.0.0
    
           Pandas infers nullable-integer dtype for integer data,
           string dtype for string data, and nullable-boolean dtype
           for boolean data.
    
        .. versionchanged:: 1.2.0
    
            Pandas now also infers nullable-floating dtype for float-like
            input data
    
    copy : bool, default True
        Whether to copy the data, even if not necessary. Depending
        on the type of `data`, creating the new array may require
        copying data, even if ``copy=False``.
    
    Returns
    -------
    ExtensionArray
        The newly created array.
    
    Raises
    ------
    ValueError
        When `data` is not 1-dimensional.
    
    See Also
    --------
    numpy.array : Construct a NumPy array.
    Series : Construct a pandas Series.
    Index : Construct a pandas Index.
    arrays.PandasArray : ExtensionArray wrapping a NumPy array.
    Series.array : Extract the array stored within a Series.
    
    Notes
    -----
    Omitting the `dtype` argument means pandas will attempt to infer the
    best array type from the values in the data. As new array types are
    added by pandas and 3rd party libraries, the "best" array type may
    change. We recommend specifying `dtype` to ensure that
    
    1. the correct array type for the data is returned
    2. the returned array type doesn't change as new extension types
       are added by pandas and third-party libraries
    
    Additionally, if the underlying memory representation of the returned
    array matters, we recommend specifying the `dtype` as a concrete object
    rather than a string alias or allowing it to be inferred. For example,
    a future version of pandas or a 3rd-party library may include a
    dedicated ExtensionArray for string data. In this event, the following
    would no longer return a :class:`arrays.PandasArray` backed by a NumPy
    array.
    
    >>> pd.array(['a', 'b'], dtype=str)
    <PandasArray>
    ['a', 'b']
    Length: 2, dtype: str32
    
    This would instead return the new ExtensionArray dedicated for string
    data. If you really need the new array to be backed by a  NumPy array,
    specify that in the dtype.
    
    >>> pd.array(['a', 'b'], dtype=np.dtype("<U1"))
    <PandasArray>
    ['a', 'b']
    Length: 2, dtype: str32
    
    Finally, Pandas has arrays that mostly overlap with NumPy
    
      * :class:`arrays.DatetimeArray`
      * :class:`arrays.TimedeltaArray`
    
    When data with a ``datetime64[ns]`` or ``timedelta64[ns]`` dtype is
    passed, pandas will always return a ``DatetimeArray`` or ``TimedeltaArray``
    rather than a ``PandasArray``. This is for symmetry with the case of
    timezone-aware data, which NumPy does not natively support.
    
    >>> pd.array(['2015', '2016'], dtype='datetime64[ns]')
    <DatetimeArray>
    ['2015-01-01 00:00:00', '2016-01-01 00:00:00']
    Length: 2, dtype: datetime64[ns]
    
    >>> pd.array(["1H", "2H"], dtype='timedelta64[ns]')
    <TimedeltaArray>
    ['0 days 01:00:00', '0 days 02:00:00']
    Length: 2, dtype: timedelta64[ns]
    
    Examples
    --------
    If a dtype is not specified, pandas will infer the best dtype from the values.
    See the description of `dtype` for the types pandas infers for.
    
    >>> pd.array([1, 2])
    <IntegerArray>
    [1, 2]
    Length: 2, dtype: Int64
    
    >>> pd.array([1, 2, np.nan])
    <IntegerArray>
    [1, 2, <NA>]
    Length: 3, dtype: Int64
    
    >>> pd.array([1.1, 2.2])
    <FloatingArray>
    [1.1, 2.2]
    Length: 2, dtype: Float64
    
    >>> pd.array(["a", None, "c"])
    <StringArray>
    ['a', <NA>, 'c']
    Length: 3, dtype: string
    
    >>> with pd.option_context("string_storage", "pyarrow"):
    ...     arr = pd.array(["a", None, "c"])
    ...
    >>> arr
    <ArrowStringArray>
    ['a', <NA>, 'c']
    Length: 3, dtype: string
    
    >>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")])
    <PeriodArray>
    ['2000-01-01', '2000-01-01']
    Length: 2, dtype: period[D]
    
    You can use the string alias for `dtype`
    
    >>> pd.array(['a', 'b', 'a'], dtype='category')
    ['a', 'b', 'a']
    Categories (2, object): ['a', 'b']
    
    Or specify the actual dtype
    
    >>> pd.array(['a', 'b', 'a'],
    ...          dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True))
    ['a', 'b', 'a']
    Categories (3, object): ['a' < 'b' < 'c']
    
    If pandas does not infer a dedicated extension type a
    :class:`arrays.PandasArray` is returned.
    
    >>> pd.array([1 + 1j, 3 + 2j])
    <PandasArray>
    [(1+1j), (3+2j)]
    Length: 2, dtype: complex128
    
    As mentioned in the "Notes" section, new extension types may be added
    in the future (by pandas or 3rd party libraries), causing the return
    value to no longer be a :class:`arrays.PandasArray`. Specify the `dtype`
    as a NumPy dtype if you need to ensure there's no future change in
    behavior.
    
    >>> pd.array([1, 2], dtype=np.dtype("int32"))
    <PandasArray>
    [1, 2]
    Length: 2, dtype: int32
    
    `data` must be 1-dimensional. A ValueError is raised when the input
    has the wrong dimensionality.
    
    >>> pd.array(1)
    Traceback (most recent call last):
      ...
    ValueError: Cannot pass scalar '1' to 'pandas.array'.

Function02

bdate_range(start=None, end=None, periods: 'int | None' = None, freq='B', tz=None, normalize: 'bool' = True, name: 'Hashable' = None, weekmask=None, holidays=None, closed=None, **kwargs) -> 'DatetimeIndex'

Help on function bdate_range in module pandas.core.indexes.datetimes:

bdate_range(start=None, end=None, periods: 'int | None' = None, freq='B', tz=None, normalize: 'bool' = True, name: 'Hashable' = None, weekmask=None, holidays=None, closed=None, **kwargs) -> 'DatetimeIndex'
    Return a fixed frequency DatetimeIndex, with business day as the default
    frequency.
    
    Parameters
    ----------
    start : str or datetime-like, default None
        Left bound for generating dates.
    end : str or datetime-like, default None
        Right bound for generating dates.
    periods : int, default None
        Number of periods to generate.
    freq : str or DateOffset, default 'B' (business daily)
        Frequency strings can have multiples, e.g. '5H'.
    tz : str or None
        Time zone name for returning localized DatetimeIndex, for example
        Asia/Beijing.
    normalize : bool, default False
        Normalize start/end dates to midnight before generating date range.
    name : str, default None
        Name of the resulting DatetimeIndex.
    weekmask : str or None, default None
        Weekmask of valid business days, passed to ``numpy.busdaycalendar``,
        only used when custom frequency strings are passed.  The default
        value None is equivalent to 'Mon Tue Wed Thu Fri'.
    holidays : list-like or None, default None
        Dates to exclude from the set of valid business days, passed to
        ``numpy.busdaycalendar``, only used when custom frequency strings
        are passed.
    closed : str, default None
        Make the interval closed with respect to the given frequency to
        the 'left', 'right', or both sides (None).
    **kwargs
        For compatibility. Has no effect on the result.
    
    Returns
    -------
    DatetimeIndex
    
    Notes
    -----
    Of the four parameters: ``start``, ``end``, ``periods``, and ``freq``,
    exactly three must be specified.  Specifying ``freq`` is a requirement
    for ``bdate_range``.  Use ``date_range`` if specifying ``freq`` is not
    desired.
    
    To learn more about the frequency strings, please see `this link
    <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
    
    Examples
    --------
    Note how the two weekend days are skipped in the result.
    
    >>> pd.bdate_range(start='1/1/2018', end='1/08/2018')
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-08'],
              dtype='datetime64[ns]', freq='B')

Function03

concat(objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]', axis=0, join='outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool' = True) -> 'FrameOrSeriesUnion'

Help on function concat in module pandas.core.reshape.concat:

concat(objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]', axis=0, join='outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool' = True) -> 'FrameOrSeriesUnion'
    Concatenate pandas objects along a particular axis with optional set logic
    along the other axes.
    
    Can also add a layer of hierarchical indexing on the concatenation axis,
    which may be useful if the labels are the same (or overlapping) on
    the passed axis number.
    
    Parameters
    ----------
    objs : a sequence or mapping of Series or DataFrame objects
        If a mapping is passed, the sorted keys will be used as the `keys`
        argument, unless it is passed, in which case the values will be
        selected (see below). Any None objects will be dropped silently unless
        they are all None in which case a ValueError will be raised.
    axis : {0/'index', 1/'columns'}, default 0
        The axis to concatenate along.
    join : {'inner', 'outer'}, default 'outer'
        How to handle indexes on other axis (or axes).
    ignore_index : bool, default False
        If True, do not use the index values along the concatenation axis. The
        resulting axis will be labeled 0, ..., n - 1. This is useful if you are
        concatenating objects where the concatenation axis does not have
        meaningful indexing information. Note the index values on the other
        axes are still respected in the join.
    keys : sequence, default None
        If multiple levels passed, should contain tuples. Construct
        hierarchical index using the passed keys as the outermost level.
    levels : list of sequences, default None
        Specific levels (unique values) to use for constructing a
        MultiIndex. Otherwise they will be inferred from the keys.
    names : list, default None
        Names for the levels in the resulting hierarchical index.
    verify_integrity : bool, default False
        Check whether the new concatenated axis contains duplicates. This can
        be very expensive relative to the actual data concatenation.
    sort : bool, default False
        Sort non-concatenation axis if it is not already aligned when `join`
        is 'outer'.
        This has no effect when ``join='inner'``, which already preserves
        the order of the non-concatenation axis.
    
        .. versionchanged:: 1.0.0
    
           Changed to not sort by default.
    
    copy : bool, default True
        If False, do not copy data unnecessarily.
    
    Returns
    -------
    object, type of objs
        When concatenating all ``Series`` along the index (axis=0), a
        ``Series`` is returned. When ``objs`` contains at least one
        ``DataFrame``, a ``DataFrame`` is returned. When concatenating along
        the columns (axis=1), a ``DataFrame`` is returned.
    
    See Also
    --------
    Series.append : Concatenate Series.
    DataFrame.append : Concatenate DataFrames.
    DataFrame.join : Join DataFrames using indexes.
    DataFrame.merge : Merge DataFrames by indexes or columns.
    
    Notes
    -----
    The keys, levels, and names arguments are all optional.
    
    A walkthrough of how this method fits in with other tools for combining
    pandas objects can be found `here
    <https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html>`__.
    
    Examples
    --------
    Combine two ``Series``.
    
    >>> s1 = pd.Series(['a', 'b'])
    >>> s2 = pd.Series(['c', 'd'])
    >>> pd.concat([s1, s2])
    0    a
    1    b
    0    c
    1    d
    dtype: object
    
    Clear the existing index and reset it in the result
    by setting the ``ignore_index`` option to ``True``.
    
    >>> pd.concat([s1, s2], ignore_index=True)
    0    a
    1    b
    2    c
    3    d
    dtype: object
    
    Add a hierarchical index at the outermost level of
    the data with the ``keys`` option.
    
    >>> pd.concat([s1, s2], keys=['s1', 's2'])
    s1  0    a
        1    b
    s2  0    c
        1    d
    dtype: object
    
    Label the index keys you create with the ``names`` option.
    
    >>> pd.concat([s1, s2], keys=['s1', 's2'],
    ...           names=['Series name', 'Row ID'])
    Series name  Row ID
    s1           0         a
                 1         b
    s2           0         c
                 1         d
    dtype: object
    
    Combine two ``DataFrame`` objects with identical columns.
    
    >>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
    ...                    columns=['letter', 'number'])
    >>> df1
      letter  number
    0      a       1
    1      b       2
    >>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
    ...                    columns=['letter', 'number'])
    >>> df2
      letter  number
    0      c       3
    1      d       4
    >>> pd.concat([df1, df2])
      letter  number
    0      a       1
    1      b       2
    0      c       3
    1      d       4
    
    Combine ``DataFrame`` objects with overlapping columns
    and return everything. Columns outside the intersection will
    be filled with ``NaN`` values.
    
    >>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
    ...                    columns=['letter', 'number', 'animal'])
    >>> df3
      letter  number animal
    0      c       3    cat
    1      d       4    dog
    >>> pd.concat([df1, df3], sort=False)
      letter  number animal
    0      a       1    NaN
    1      b       2    NaN
    0      c       3    cat
    1      d       4    dog
    
    Combine ``DataFrame`` objects with overlapping columns
    and return only those that are shared by passing ``inner`` to
    the ``join`` keyword argument.
    
    >>> pd.concat([df1, df3], join="inner")
      letter  number
    0      a       1
    1      b       2
    0      c       3
    1      d       4
    
    Combine ``DataFrame`` objects horizontally along the x axis by
    passing in ``axis=1``.
    
    >>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
    ...                    columns=['animal', 'name'])
    >>> pd.concat([df1, df4], axis=1)
      letter  number  animal    name
    0      a       1    bird   polly
    1      b       2  monkey  george
    
    Prevent the result from including duplicate index values with the
    ``verify_integrity`` option.
    
    >>> df5 = pd.DataFrame([1], index=['a'])
    >>> df5
       0
    a  1
    >>> df6 = pd.DataFrame([2], index=['a'])
    >>> df6
       0
    a  2
    >>> pd.concat([df5, df6], verify_integrity=True)
    Traceback (most recent call last):
        ...
    ValueError: Indexes have overlapping values: ['a']

Function04

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: 'str' = 'All', dropna: 'bool' = True, normalize=False) -> 'DataFrame'

Help on function crosstab in module pandas.core.reshape.pivot:

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: 'str' = 'All', dropna: 'bool' = True, normalize=False) -> 'DataFrame'
    Compute a simple cross tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed.
    
    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows.
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns.
    values : array-like, optional
        Array of values to aggregate according to the factors.
        Requires `aggfunc` be specified.
    rownames : sequence, default None
        If passed, must match number of row arrays passed.
    colnames : sequence, default None
        If passed, must match number of column arrays passed.
    aggfunc : function, optional
        If specified, requires `values` be specified as well.
    margins : bool, default False
        Add row/column margins (subtotals).
    margins_name : str, default 'All'
        Name of the row/column that will contain the totals
        when margins is True.
    dropna : bool, default True
        Do not include columns whose entries are all NaN.
    normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
        Normalize by dividing all values by the sum of values.
    
        - If passed 'all' or `True`, will normalize over all values.
        - If passed 'index' will normalize over each row.
        - If passed 'columns' will normalize over each column.
        - If margins is `True`, will also normalize margin values.
    
    Returns
    -------
    DataFrame
        Cross tabulation of the data.
    
    See Also
    --------
    DataFrame.pivot : Reshape data based on column values.
    pivot_table : Create a pivot table as a DataFrame.
    
    Notes
    -----
    Any Series passed will have their name attributes used unless row or column
    names for the cross-tabulation are specified.
    
    Any input passed containing Categorical data will have **all** of its
    categories included in the cross-tabulation, even if the actual data does
    not contain any instances of a particular category.
    
    In the event that there aren't overlapping indexes an empty DataFrame will
    be returned.
    
    Examples
    --------
    >>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
    ...               "bar", "bar", "foo", "foo", "foo"], dtype=object)
    >>> b = np.array(["one", "one", "one", "two", "one", "one",
    ...               "one", "two", "two", "two", "one"], dtype=object)
    >>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
    ...               "shiny", "dull", "shiny", "shiny", "shiny"],
    ...              dtype=object)
    >>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
    b   one        two
    c   dull shiny dull shiny
    a
    bar    1     2    1     0
    foo    2     2    1     2
    
    Here 'c' and 'f' are not represented in the data and will not be
    shown in the output because dropna is True by default. Set
    dropna=False to preserve categories with no data.
    
    >>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
    >>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
    >>> pd.crosstab(foo, bar)
    col_0  d  e
    row_0
    a      1  0
    b      0  1
    >>> pd.crosstab(foo, bar, dropna=False)
    col_0  d  e  f
    row_0
    a      1  0  0
    b      0  1  0
    c      0  0  0

​​​​​​​Function05

cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise', ordered: bool = True)

Help on function cut in module pandas.core.reshape.tile:

cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise', ordered: bool = True)
    Bin values into discrete intervals.
    
    Use `cut` when you need to segment and sort data values into bins. This
    function is also useful for going from a continuous variable to a
    categorical variable. For example, `cut` could convert ages to groups of
    age ranges. Supports binning into an equal number of bins, or a
    pre-specified array of bins.
    
    Parameters
    ----------
    x : array-like
        The input array to be binned. Must be 1-dimensional.
    bins : int, sequence of scalars, or IntervalIndex
        The criteria to bin by.
    
        * int : Defines the number of equal-width bins in the range of `x`. The
          range of `x` is extended by .1% on each side to include the minimum
          and maximum values of `x`.
        * sequence of scalars : Defines the bin edges allowing for non-uniform
          width. No extension of the range of `x` is done.
        * IntervalIndex : Defines the exact bins to be used. Note that
          IntervalIndex for `bins` must be non-overlapping.
    
    right : bool, default True
        Indicates whether `bins` includes the rightmost edge or not. If
        ``right == True`` (the default), then the `bins` ``[1, 2, 3, 4]``
        indicate (1,2], (2,3], (3,4]. This argument is ignored when
        `bins` is an IntervalIndex.
    labels : array or False, default None
        Specifies the labels for the returned bins. Must be the same length as
        the resulting bins. If False, returns only integer indicators of the
        bins. This affects the type of the output container (see below).
        This argument is ignored when `bins` is an IntervalIndex. If True,
        raises an error. When `ordered=False`, labels must be provided.
    retbins : bool, default False
        Whether to return the bins or not. Useful when bins is provided
        as a scalar.
    precision : int, default 3
        The precision at which to store and display the bins labels.
    include_lowest : bool, default False
        Whether the first interval should be left-inclusive or not.
    duplicates : {default 'raise', 'drop'}, optional
        If bin edges are not unique, raise ValueError or drop non-uniques.
    ordered : bool, default True
        Whether the labels are ordered or not. Applies to returned types
        Categorical and Series (with Categorical dtype). If True,
        the resulting categorical will be ordered. If False, the resulting
        categorical will be unordered (labels must be provided).
    
        .. versionadded:: 1.1.0
    
    Returns
    -------
    out : Categorical, Series, or ndarray
        An array-like object representing the respective bin for each value
        of `x`. The type depends on the value of `labels`.
    
        * True (default) : returns a Series for Series `x` or a
          Categorical for all other inputs. The values stored within
          are Interval dtype.
    
        * sequence of scalars : returns a Series for Series `x` or a
          Categorical for all other inputs. The values stored within
          are whatever the type in the sequence is.
    
        * False : returns an ndarray of integers.
    
    bins : numpy.ndarray or IntervalIndex.
        The computed or specified bins. Only returned when `retbins=True`.
        For scalar or sequence `bins`, this is an ndarray with the computed
        bins. If set `duplicates=drop`, `bins` will drop non-unique bin. For
        an IntervalIndex `bins`, this is equal to `bins`.
    
    See Also
    --------
    qcut : Discretize variable into equal-sized buckets based on rank
        or based on sample quantiles.
    Categorical : Array type for storing data that come from a
        fixed set of values.
    Series : One-dimensional array with axis labels (including time series).
    IntervalIndex : Immutable Index implementing an ordered, sliceable set.
    
    Notes
    -----
    Any NA values will be NA in the result. Out of bounds values will be NA in
    the resulting Series or Categorical object.
    
    Examples
    --------
    Discretize into three equal-sized bins.
    
    >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
    ... # doctest: +ELLIPSIS
    [(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
    Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
    
    >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
    ... # doctest: +ELLIPSIS
    ([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
    Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
    array([0.994, 3.   , 5.   , 7.   ]))
    
    Discovers the same bins, but assign them specific labels. Notice that
    the returned Categorical's categories are `labels` and is ordered.
    
    >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]),
    ...        3, labels=["bad", "medium", "good"])
    ['bad', 'good', 'medium', 'medium', 'good', 'bad']
    Categories (3, object): ['bad' < 'medium' < 'good']
    
    ``ordered=False`` will result in unordered categories when labels are passed.
    This parameter can be used to allow non-unique labels:
    
    >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
    ...        labels=["B", "A", "B"], ordered=False)
    ['B', 'B', 'A', 'A', 'B', 'B']
    Categories (2, object): ['A', 'B']
    
    ``labels=False`` implies you just want the bins back.
    
    >>> pd.cut([0, 1, 1, 2], bins=4, labels=False)
    array([0, 1, 1, 3])
    
    Passing a Series as an input returns a Series with categorical dtype:
    
    >>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
    ...               index=['a', 'b', 'c', 'd', 'e'])
    >>> pd.cut(s, 3)
    ... # doctest: +ELLIPSIS
    a    (1.992, 4.667]
    b    (1.992, 4.667]
    c    (4.667, 7.333]
    d     (7.333, 10.0]
    e     (7.333, 10.0]
    dtype: category
    Categories (3, interval[float64, right]): [(1.992, 4.667] < (4.667, ...
    
    Passing a Series as an input returns a Series with mapping value.
    It is used to map numerically to intervals based on bins.
    
    >>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
    ...               index=['a', 'b', 'c', 'd', 'e'])
    >>> pd.cut(s, [0, 2, 4, 6, 8, 10], labels=False, retbins=True, right=False)
    ... # doctest: +ELLIPSIS
    (a    1.0
     b    2.0
     c    3.0
     d    4.0
     e    NaN
     dtype: float64,
     array([ 0,  2,  4,  6,  8, 10]))
    
    Use `drop` optional when bins is not unique
    
    >>> pd.cut(s, [0, 2, 4, 6, 10, 10], labels=False, retbins=True,
    ...        right=False, duplicates='drop')
    ... # doctest: +ELLIPSIS
    (a    1.0
     b    2.0
     c    3.0
     d    3.0
     e    NaN
     dtype: float64,
     array([ 0,  2,  4,  6, 10]))
    
    Passing an IntervalIndex for `bins` results in those categories exactly.
    Notice that values not covered by the IntervalIndex are set to NaN. 0
    is to the left of the first bin (which is closed on the right), and 1.5
    falls between two bins.
    
    >>> bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
    >>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
    [NaN, (0.0, 1.0], NaN, (2.0, 3.0], (4.0, 5.0]]
    Categories (3, interval[int64, right]): [(0, 1] < (2, 3] < (4, 5]]

​​​​​​​Function06

date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize: 'bool' = False, name: 'Hashable' = None, closed=None, **kwargs) -> 'DatetimeIndex'

Help on function date_range in module pandas.core.indexes.datetimes:

date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize: 'bool' = False, name: 'Hashable' = None, closed=None, **kwargs) -> 'DatetimeIndex'
    Return a fixed frequency DatetimeIndex.
    
    Returns the range of equally spaced time points (where the difference between any
    two adjacent points is specified by the given frequency) such that they all
    satisfy `start <[=] x <[=] end`, where the first one and the last one are, resp.,
    the first and last time points in that range that fall on the boundary of ``freq``
    (if given as a frequency string) or that are valid for ``freq`` (if given as a
    :class:`pandas.tseries.offsets.DateOffset`). (If exactly one of ``start``,
    ``end``, or ``freq`` is *not* specified, this missing parameter can be computed
    given ``periods``, the number of timesteps in the range. See the note below.)
    
    Parameters
    ----------
    start : str or datetime-like, optional
        Left bound for generating dates.
    end : str or datetime-like, optional
        Right bound for generating dates.
    periods : int, optional
        Number of periods to generate.
    freq : str or DateOffset, default 'D'
        Frequency strings can have multiples, e.g. '5H'. See
        :ref:`here <timeseries.offset_aliases>` for a list of
        frequency aliases.
    tz : str or tzinfo, optional
        Time zone name for returning localized DatetimeIndex, for example
        'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is
        timezone-naive.
    normalize : bool, default False
        Normalize start/end dates to midnight before generating date range.
    name : str, default None
        Name of the resulting DatetimeIndex.
    closed : {None, 'left', 'right'}, optional
        Make the interval closed with respect to the given frequency to
        the 'left', 'right', or both sides (None, the default).
    **kwargs
        For compatibility. Has no effect on the result.
    
    Returns
    -------
    rng : DatetimeIndex
    
    See Also
    --------
    DatetimeIndex : An immutable container for datetimes.
    timedelta_range : Return a fixed frequency TimedeltaIndex.
    period_range : Return a fixed frequency PeriodIndex.
    interval_range : Return a fixed frequency IntervalIndex.
    
    Notes
    -----
    Of the four parameters ``start``, ``end``, ``periods``, and ``freq``,
    exactly three must be specified. If ``freq`` is omitted, the resulting
    ``DatetimeIndex`` will have ``periods`` linearly spaced elements between
    ``start`` and ``end`` (closed on both sides).
    
    To learn more about the frequency strings, please see `this link
    <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
    
    Examples
    --------
    **Specifying the values**
    
    The next four examples generate the same `DatetimeIndex`, but vary
    the combination of `start`, `end` and `periods`.
    
    Specify `start` and `end`, with the default daily frequency.
    
    >>> pd.date_range(start='1/1/2018', end='1/08/2018')
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start` and `periods`, the number of periods (days).
    
    >>> pd.date_range(start='1/1/2018', periods=8)
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `end` and `periods`, the number of periods (days).
    
    >>> pd.date_range(end='1/1/2018', periods=8)
    DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
                   '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start`, `end`, and `periods`; the frequency is generated
    automatically (linearly spaced).
    
    >>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
    DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
                   '2018-04-27 00:00:00'],
                  dtype='datetime64[ns]', freq=None)
    
    **Other Parameters**
    
    Changed the `freq` (frequency) to ``'M'`` (month end frequency).
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='M')
    DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
                   '2018-05-31'],
                  dtype='datetime64[ns]', freq='M')
    
    Multiples are allowed
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    `freq` can also be specified as an Offset object.
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    Specify `tz` to set the timezone.
    
    >>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
    DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
                   '2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
                   '2018-01-05 00:00:00+09:00'],
                  dtype='datetime64[ns, Asia/Tokyo]', freq='D')
    
    `closed` controls whether to include `start` and `end` that are on the
    boundary. The default includes boundary points on either end.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed=None)
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='left'`` to exclude `end` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='left')
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='right'`` to exclude `start` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='right')
    DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')

​​​​​​​Function07

eval(expr: 'str | BinOp', parser: 'str' = 'pandas', engine: 'str | None' = None, truediv=<no_default>, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)

Help on function eval in module pandas.core.computation.eval:

eval(expr: 'str | BinOp', parser: 'str' = 'pandas', engine: 'str | None' = None, truediv=<no_default>, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)
    Evaluate a Python expression as a string using various backends.
    
    The following arithmetic operations are supported: ``+``, ``-``, ``*``,
    ``/``, ``**``, ``%``, ``//`` (python engine only) along with the following
    boolean operations: ``|`` (or), ``&`` (and), and ``~`` (not).
    Additionally, the ``'pandas'`` parser allows the use of :keyword:`and`,
    :keyword:`or`, and :keyword:`not` with the same semantics as the
    corresponding bitwise operators.  :class:`~pandas.Series` and
    :class:`~pandas.DataFrame` objects are supported and behave as they would
    with plain ol' Python evaluation.
    
    Parameters
    ----------
    expr : str
        The expression to evaluate. This string cannot contain any Python
        `statements
        <https://docs.python.org/3/reference/simple_stmts.html#simple-statements>`__,
        only Python `expressions
        <https://docs.python.org/3/reference/simple_stmts.html#expression-statements>`__.
    parser : {'pandas', 'python'}, default 'pandas'
        The parser to use to construct the syntax tree from the expression. The
        default of ``'pandas'`` parses code slightly different than standard
        Python. Alternatively, you can parse an expression using the
        ``'python'`` parser to retain strict Python semantics.  See the
        :ref:`enhancing performance <enhancingperf.eval>` documentation for
        more details.
    engine : {'python', 'numexpr'}, default 'numexpr'
    
        The engine used to evaluate the expression. Supported engines are
    
        - None         : tries to use ``numexpr``, falls back to ``python``
        - ``'numexpr'``: This default engine evaluates pandas objects using
                         numexpr for large speed ups in complex expressions
                         with large frames.
        - ``'python'``: Performs operations as if you had ``eval``'d in top
                        level python. This engine is generally not that useful.
    
        More backends may be available in the future.
    
    truediv : bool, optional
        Whether to use true division, like in Python >= 3.
    
        .. deprecated:: 1.0.0
    
    local_dict : dict or None, optional
        A dictionary of local variables, taken from locals() by default.
    global_dict : dict or None, optional
        A dictionary of global variables, taken from globals() by default.
    resolvers : list of dict-like or None, optional
        A list of objects implementing the ``__getitem__`` special method that
        you can use to inject an additional collection of namespaces to use for
        variable lookup. For example, this is used in the
        :meth:`~DataFrame.query` method to inject the
        ``DataFrame.index`` and ``DataFrame.columns``
        variables that refer to their respective :class:`~pandas.DataFrame`
        instance attributes.
    level : int, optional
        The number of prior stack frames to traverse and add to the current
        scope. Most users will **not** need to change this parameter.
    target : object, optional, default None
        This is the target object for assignment. It is used when there is
        variable assignment in the expression. If so, then `target` must
        support item assignment with string keys, and if a copy is being
        returned, it must also support `.copy()`.
    inplace : bool, default False
        If `target` is provided, and the expression mutates `target`, whether
        to modify `target` inplace. Otherwise, return a copy of `target` with
        the mutation.
    
    Returns
    -------
    ndarray, numeric scalar, DataFrame, Series, or None
        The completion value of evaluating the given code or None if ``inplace=True``.
    
    Raises
    ------
    ValueError
        There are many instances where such an error can be raised:
    
        - `target=None`, but the expression is multiline.
        - The expression is multiline, but not all them have item assignment.
          An example of such an arrangement is this:
    
          a = b + 1
          a + 2
    
          Here, there are expressions on different lines, making it multiline,
          but the last line has no variable assigned to the output of `a + 2`.
        - `inplace=True`, but the expression is missing item assignment.
        - Item assignment is provided, but the `target` does not support
          string item assignment.
        - Item assignment is provided and `inplace=False`, but the `target`
          does not support the `.copy()` method
    
    See Also
    --------
    DataFrame.query : Evaluates a boolean expression to query the columns
            of a frame.
    DataFrame.eval : Evaluate a string describing operations on
            DataFrame columns.
    
    Notes
    -----
    The ``dtype`` of any objects involved in an arithmetic ``%`` operation are
    recursively cast to ``float64``.
    
    See the :ref:`enhancing performance <enhancingperf.eval>` documentation for
    more details.
    
    Examples
    --------
    >>> df = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
    >>> df
      animal  age
    0    dog   10
    1    pig   20
    
    We can add a new column using ``pd.eval``:
    
    >>> pd.eval("double_age = df.age * 2", target=df)
      animal  age  double_age
    0    dog   10          20
    1    pig   20          40

​​​​​​​Function08

factorize(values, sort: 'bool' = False, na_sentinel: 'int | None' = -1, size_hint: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray | Index]'

Help on function factorize in module pandas.core.algorithms:

factorize(values, sort: 'bool' = False, na_sentinel: 'int | None' = -1, size_hint: 'int | None' = None) -> 'tuple[np.ndarray, np.ndarray | Index]'
    Encode the object as an enumerated type or categorical variable.
    
    This method is useful for obtaining a numeric representation of an
    array when all that matters is identifying distinct values. `factorize`
    is available as both a top-level function :func:`pandas.factorize`,
    and as a method :meth:`Series.factorize` and :meth:`Index.factorize`.
    
    Parameters
    ----------
    values : sequence
        A 1-D sequence. Sequences that aren't pandas objects are
        coerced to ndarrays before factorization.
    sort : bool, default False
        Sort `uniques` and shuffle `codes` to maintain the
        relationship.
    
    na_sentinel : int or None, default -1
        Value to mark "not found". If None, will not drop the NaN
        from the uniques of the values.
    
        .. versionchanged:: 1.1.2
    size_hint : int, optional
        Hint to the hashtable sizer.
    
    Returns
    -------
    codes : ndarray
        An integer ndarray that's an indexer into `uniques`.
        ``uniques.take(codes)`` will have the same values as `values`.
    uniques : ndarray, Index, or Categorical
        The unique valid values. When `values` is Categorical, `uniques`
        is a Categorical. When `values` is some other pandas object, an
        `Index` is returned. Otherwise, a 1-D ndarray is returned.
    
        .. note ::
    
           Even if there's a missing value in `values`, `uniques` will
           *not* contain an entry for it.
    
    See Also
    --------
    cut : Discretize continuous-valued array.
    unique : Find the unique value in an array.
    
    Examples
    --------
    These examples all show factorize as a top-level method like
    ``pd.factorize(values)``. The results are identical for methods like
    :meth:`Series.factorize`.
    
    >>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
    >>> codes
    array([0, 0, 1, 2, 0]...)
    >>> uniques
    array(['b', 'a', 'c'], dtype=object)
    
    With ``sort=True``, the `uniques` will be sorted, and `codes` will be
    shuffled so that the relationship is the maintained.
    
    >>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
    >>> codes
    array([1, 1, 0, 2, 1]...)
    >>> uniques
    array(['a', 'b', 'c'], dtype=object)
    
    Missing values are indicated in `codes` with `na_sentinel`
    (``-1`` by default). Note that missing values are never
    included in `uniques`.
    
    >>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
    >>> codes
    array([ 0, -1,  1,  2,  0]...)
    >>> uniques
    array(['b', 'a', 'c'], dtype=object)
    
    Thus far, we've only factorized lists (which are internally coerced to
    NumPy arrays). When factorizing pandas objects, the type of `uniques`
    will differ. For Categoricals, a `Categorical` is returned.
    
    >>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
    >>> codes, uniques = pd.factorize(cat)
    >>> codes
    array([0, 0, 1]...)
    >>> uniques
    ['a', 'c']
    Categories (3, object): ['a', 'b', 'c']
    
    Notice that ``'b'`` is in ``uniques.categories``, despite not being
    present in ``cat.values``.
    
    For all other pandas objects, an Index of the appropriate type is
    returned.
    
    >>> cat = pd.Series(['a', 'a', 'c'])
    >>> codes, uniques = pd.factorize(cat)
    >>> codes
    array([0, 0, 1]...)
    >>> uniques
    Index(['a', 'c'], dtype='object')
    
    If NaN is in the values, and we want to include NaN in the uniques of the
    values, it can be achieved by setting ``na_sentinel=None``.
    
    >>> values = np.array([1, 2, 1, np.nan])
    >>> codes, uniques = pd.factorize(values)  # default: na_sentinel=-1
    >>> codes
    array([ 0,  1,  0, -1])
    >>> uniques
    array([1., 2.])
    
    >>> codes, uniques = pd.factorize(values, na_sentinel=None)
    >>> codes
    array([0, 1, 0, 2])
    >>> uniques
    array([ 1.,  2., nan])

​​​​​​​Function09

get_dummies(data, prefix=None, prefix_sep='_', dummy_na: 'bool' = False, columns=None, sparse: 'bool' = False, drop_first: 'bool' = False, dtype: 'Dtype | None' = None) -> 'DataFrame'

Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na: 'bool' = False, columns=None, sparse: 'bool' = False, drop_first: 'bool' = False, dtype: 'Dtype | None' = None) -> 'DataFrame'
    Convert categorical variable into dummy/indicator variables.
    
    Parameters
    ----------
    data : array-like, Series, or DataFrame
        Data of which to get dummy indicators.
    prefix : str, list of str, or dict of str, default None
        String to append DataFrame column names.
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : str, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix`.
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy-encoded columns should be backed by
        a :class:`SparseArray` (True) or a regular NumPy array (False).
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by removing the
        first level.
    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.
    
    Returns
    -------
    DataFrame
        Dummy-coded data.
    
    See Also
    --------
    Series.str.get_dummies : Convert Series to dummy codes.
    
    Examples
    --------
    >>> s = pd.Series(list('abca'))
    
    >>> pd.get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    
    >>> s1 = ['a', 'b', np.nan]
    
    >>> pd.get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0
    
    >>> pd.get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1
    
    >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                    'C': [1, 2, 3]})
    
    >>> pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1
    
    >>> pd.get_dummies(pd.Series(list('abcaa')))
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    4  1  0  0
    
    >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0
    
    >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0

​​​​待续......

下一篇链接:

https://blog.csdn.net/boysoft2002/article/details/128428096

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/112948.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

C++ STL vector list set map容器循环通过迭代器删除元素注意事项

先说说写这篇博客的原因吧&#xff0c;同事转部门了&#xff0c;把他手头的工作交接给了我。他以前维护的一个模块&#xff0c;会将外部输入的数据缓存起来分段处理&#xff0c;处理完了就会清除缓存数据&#xff0c;最近出现了一个bug&#xff0c;缓存数据一直不清除&#xff…

【SpringMVC】非注解的处理器映射器和适配器

项目目录 1.导入的依赖 pom.xml <properties><project.build.sourceEncoding>UTF-8</project.build.sourceEncoding><maven.compiler.source>1.8</maven.compiler.source><maven.compiler.target>1.8</maven.compiler.target><…

【K3s】第2篇 一篇文章学习实践K3s部署安装

目录 1、docker安装 2、docker-compose安装 3、K3s安装 3.1 k3s与install.sh文件准备 3.2 k3s 安装步骤 4、查看k3s部署状态 1、docker安装 方式一 https://fanjufei.blog.csdn.net/article/details/123500511https://fanjufei.blog.csdn.net/article/details/123500511 …

12.24

接口测试 ​ <!DOCTYPE html> <html lang"en"> <head><meta charset"UTF-8"><meta http-equiv"X-UA-Compatible" content"IEedge"><meta name"viewport" content"widthdevice-width…

Unity3D异步加载场景SceneManager.LoadSceneAsync()卡住,并不异步,获取process直接到0.9的问题

问题阐述&#xff1a; 一般来说&#xff0c;在加载场景的时候&#xff0c;会因为所加载资源的大小、复杂度、电脑配置等因素导致加载过程耗费一定的时间。虽然这个加载时间是不可避免的&#xff0c;但是在这一小段卡着的时间里如果就这样卡着的话会大大降低玩家体验。 所以很多…

(matlab编程基础)数组的基本操作

目录 1、数组寻址 2、数组元素删除 3、数组查找和排序 &#xff08;1&#xff09;数组查找 &#xff08;2&#xff09;数组排序 4、数组运算 5、数组操作函数 数组操作主要从以下5部分进行介绍:数组寻址、数组元素的删除、数组查找和排序、数组运算和数组操作函数。 1、…

PS CS6视频剪辑基本技巧(五)添加logo、动画和画中画

系列讲座导读 PS CS6视频剪辑基本技巧&#xff08;一&#xff09;CS6可以实现的视频剪辑功能 PS CS6视频剪辑基本技巧&#xff08;二&#xff09;视频剪接和添加图片 PS CS6视频剪辑基本技巧&#xff08;三&#xff09;添加声音和字幕 PS CS6视频剪辑基本技巧&#xff08;四&am…

RMQ - ST表

RMQ - ST表 1、RMQ 简介 RMQ (Range Minimum / Maximum Query) 问题是指&#xff1a;对于长度为 nnn 的数列 AAA&#xff0c;回答若干询问 (A,i,j)(A, i, j)(A,i,j) (1≤i,j≤n)(1≤i,j≤n)(1≤i,j≤n)&#xff0c;返回数列 A 中区间在 [i,j][i,j][i,j] 中的最小 (大) 值所在…

Vue2.0全面教程

Vue2.0 学习视频地址 文章目录Vue2.01.vue 简介1.1.什么是vue1.2.vue的两个特性1.2.1.数据驱动视图1.2.2.双向数据绑定1.3.MVVM概述1.4.MVVM工作原理2.vue的基本使用2.1.基本使用步骤2.2.基本代码与MVVM的对应关系3.vue的调试工具3.1.安装vue-devtool调试工具3.2.配置Chrome浏…

Axios(三)

目录 1.Axios的默认配置 2.Axios创建Ajax实例对象发送请求 3.Axios拦截器 4.Axios取消请求 5.Axios文件结构说明 6.Axios创建过程 7.Axios对象创建过程模拟实现 8.Axios发送请求过程详解 9.模拟实现Axios发送请求 1.Axios的默认配置 <!doctype html> <html …

QT系列第7节 标准对话框使用

QT编程中对话框作为最常用的窗口经常被使用&#xff0c;本节介绍一些QT常用的标准对话框使用。 目录 1.选择单文件 2.选择多文件 3.选择目录 4.文件存储 5.选择颜色 6.选择字体 7.输入字符换/整数/浮点数/条目 8.消息对话框 9.进度对话框 10.向导对话框 1.选择单文件…

【圣诞节限定】今天教你如何用Html+JS+CSS绘制3D动画圣诞树

一、前言 应CSDN的邀请&#xff0c;这次给大家展示一波&#xff0c;如何用H5技术绘制3D圣诞树。 二、创意名 只采用简单的HtmlJSCSS 技术绘制。 三、效果展示 圣诞树修过如下&#xff1a; 四、编码实现 将源码复制保存到html中打开即可。 <!DOCTYPE html> <html lang…

ChatGPT与BimAnt的1小时对话实录【数字孪生】

本文为BimAnt和ChatGPT对数字孪生相关问题的解答&#xff0c;感觉这个AI真的已经“懂”了很多东西&#xff0c;让人恍惚间忘了是在和bot对话。 BimAnt&#xff1a;hello ChatGPT&#xff1a;Hello! How can I help you today? BimAnt&#xff1a;can you speak chinese&am…

鲸鱼优化算法及其在无线网络资源分配中的应用(Matlab代码实现)

目录 &#x1f4a5;1 概述 &#x1f4da;2 运行结果 &#x1f389;3 参考文献 &#x1f468;‍&#x1f4bb;4 Matlab代码 &#x1f4a5;1 概述 鲸鱼优化算法&#xff08;Whale Optimization Algorithm&#xff09;是一种新兴的智能优化算法&#xff0c;在2016年提出。算法…

JWT渗透与攻防(二)

目录 前言 JWT漏洞演示之CTFhub&#xff08;一&#xff09; JWT漏洞演示之CTFhub&#xff08;二&#xff09; 前言 我们在之前的文章中已经讲解过了JWT漏洞相关的原理和利用&#xff0c;今天我们就通过这篇文章再来了解一下JWT的漏洞。 JWT漏洞演示之CTFhub&#xff08;一&…

Linux-信号

文章目录信号准备知识&#xff1a;信号产生的方式实验验证&#xff1a;9号信号是不可被捕捉&#xff08;自定义的&#xff09;信号处理&#xff1a;信号产生前&#xff1a;信号产生的方式&#xff1a;键盘实验显示&#xff1a;段错误&#xff08;野指针&#xff09;实验验证&am…

SSRF ME XCTF

题目 就是一个验证框和URL框&#xff0c;两个都必须有参数 解法 验证码 做一个粗略的脚本&#xff0c;一般验证码都是数字&#xff0c;所以直接开md5&#xff1a; def cmpcapt(substr):for i in range(1,100001):md5 hashlib.md5(str(i).encode(utf-8))hexmd5 md5.hexd…

机器学习任务功略

目录 机器学习的结构机器学习攻略 训练集loss较大 model bias问题optimization问题 gradient descent的问题解决 如何区分训练集loss大是model bias还是优化器的问题 测试集loss较大 overfitting过拟合 为什么会有overfitting解决过拟合的方法 训练集与测试集的不匹配 如何选择…

Linux内核基础篇——常用调试技巧汇总

文章目录printk动态输出BUG()和BUG_ON()dump_stack()devmemprintk printk共有8个等级&#xff0c;从0-7&#xff0c;等级依次降低。 打印等级可以通过修改/proc/sys/kernel/printk来改变。 查看printk等级&#xff1a; cat /proc/sys/kernel/printk 7 4 1 7打开内核所有打印…

2022圣诞树(C语言摇钱树版本)

逐梦编程&#xff0c;让中华屹立世界之巅。 简单的事情重复做,重复的事情用心做,用心的事情坚持做&#xff1b; 文章目录前言一、个人感悟二、圣诞树由来三、圣诞树发展历史演变四、常见的圣诞树种类五、摇钱圣诞树效果展示六、实现思路七、编码实现总结新壁纸前言 时光飞逝&a…