day03_pandas

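Contents

- Introduction to pandas
- Why use pandas
- DataFrame
- DataFrame attributes
- DataFrame indexes
  - Changing row and column index labels
  - Resetting the index
  - Setting a column as the new index
- MultiIndex
- Series
- Index operations
  - Direct indexing
  - Indexing by label
  - Indexing by position
- Assignment
- Sorting
  - Sorting by values
  - Sorting by index
- DataFrame operations
  - Arithmetic operations
  - Logical operations
    - Logical operators < > | &
    - Logical functions query() and isin()
- Statistical operations
- Custom operations
- Plotting with pandas
- Reading and writing CSV files
- Reading and writing HDF5 files
- Reading and writing JSON files
- Summary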

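Introduction to pandas

The name pandas comes from panel + data + analysis, that is, panel data analysis.
Panel data is an econometrics term for three-dimensional data.
pandas is built on numpy and benefits from numpy's high computational performance.
Its plotting is based on matplotlib, which makes drawing charts simple.
It provides its own distinctive data structures.

Why use pandas

Convenient data-processing capabilities
Easy file reading
Wraps the plotting of matplotlib and the computation of numpy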

DataFrame

## Structure: a two-dimensional array with both a row index and a column index
import pandas as pd
import numpy as np
stock_change = np.random.normal(0, 1, (10, 5))
stock_change

array([[ 0.52652359, -0.42210135,  0.45506419, -0.1319933 , -0.85892243],
       [-2.80978824,  0.68502373, -0.72809275, -1.56716962,  0.24278934],
       [ 0.1423945 , -0.14913827, -0.30118759,  0.80841083,  0.56448585],
       [-1.11053808, -0.91833131, -0.82696531,  0.33592674, -1.81590623],
       [-0.7972349 , -0.38960542, -0.64822525, -1.67732846, -1.1320404 ],
       [-0.83075257, -0.96589613,  1.21458607, -0.54116531,  0.5416992 ],
       [ 0.2346827 ,  0.38728822,  0.5534352 ,  0.49615629,  0.03958449],
       [ 1.32743523,  0.8559906 , -0.35473279, -0.40734067,  0.23585156],
       [ 2.217162  ,  0.43897264,  1.39278121, -0.17076621,  1.25111371],
       [-1.84123059, -1.00666366,  2.07583716,  1.03959872,  1.20092384]])

stock_change1 = pd.DataFrame(stock_change)  ## a default integer index is added
stock_change1

	0	1	2	3	4
0	-0.230423	-0.108677	2.116127	-0.405135	-0.600457
1	1.422377	-1.136674	-0.462335	0.795195	-0.013265
2	0.708261	-0.197826	-0.177992	-1.078743	0.357987
3	-0.325432	0.264337	0.856580	-1.035939	-0.228252
4	0.016734	1.007554	0.454911	0.252380	-0.691905
5	-0.471790	0.557541	-0.703171	0.344268	-0.083205
6	-0.013339	-0.300371	1.424916	0.028338	1.101670
7	0.061438	-0.802730	-0.746614	-0.919655	-1.336464
8	0.369274	0.515427	0.661126	-0.550260	-1.560633
9	-1.087217	-1.164305	-0.408748	1.198835	-0.389584

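# Add a row index
stock_code = ['股票{}'.format(i+1) for i in range(10)]
stock_code
pd.DataFrame(stock_change, index=stock_code)  ## note: the first argument must be the ndarray, not an existing DataFrame; otherwise the rows are realigned to the new labels and the values become NaN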

	0	1	2	3	4
股票1	-1.796149	0.063469	0.922334	-0.338207	2.157024
股票2	-0.064218	0.969453	0.223896	-0.795105	-2.020499
股票3	-0.039286	0.046665	-0.408812	-0.284145	1.852426
股票4	-1.811617	0.588799	-1.020581	-0.421300	-1.068160
股票5	-0.867187	0.070269	0.362412	0.595810	0.005319
股票6	-2.384285	0.185213	-0.094201	0.559706	1.156052
股票7	1.231396	0.226930	-0.284544	1.056286	-0.765503
股票8	1.451832	-0.518495	0.115510	0.578233	0.174324
股票9	1.184461	-0.327693	-1.405433	1.480470	0.049133
股票10	0.891309	0.780864	-0.858295	-1.154474	0.127319
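
The NaN behaviour mentioned in the comment on the constructor call can be reproduced with a small sketch. The names `arr`, `labels`, `from_array` and `from_frame` are illustrative and do not appear in the original code.

```python
import numpy as np
import pandas as pd

arr = np.arange(6, dtype=float).reshape(3, 2)
labels = ['a', 'b', 'c']  # hypothetical row labels

# Passing the ndarray: the labels are simply attached to the rows.
from_array = pd.DataFrame(arr, index=labels)

# Passing an existing DataFrame: pandas realigns it to the new labels.
# Its old labels 0, 1, 2 do not match 'a', 'b', 'c', so every value is NaN.
from_frame = pd.DataFrame(pd.DataFrame(arr), index=labels)

print(from_array)
print(from_frame)
```

If an existing DataFrame needs new labels, assigning to its `index` attribute keeps the values.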

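pd.date_range(start=None, end=None, periods=None, freq='B')

start: the start date
end: the end date
periods: the number of dates to generate
freq: the step between dates; the default is one calendar day ('D'), and 'B' means business days, which skips weekends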

date = pd.date_range(start='20231021', end=None, periods=5, freq='B')
date

DatetimeIndex(['2023-10-23', '2023-10-24', '2023-10-25', '2023-10-26',
               '2023-10-27'],
              dtype='datetime64[ns]', freq='B')
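
To make the effect of freq concrete, the following sketch compares calendar days with business days from the same start date. The variable names are illustrative only.

```python
import pandas as pd

# 2023-10-21 is a Saturday.
calendar_days = pd.date_range(start='2023-10-21', periods=3, freq='D')
business_days = pd.date_range(start='2023-10-21', periods=3, freq='B')

print(calendar_days)  # Saturday, Sunday, Monday
print(business_days)  # Monday, Tuesday, Wednesday
```

With freq='B' the weekend start date is rolled forward to the next business day, which is why the example above begins on 2023-10-23.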

stock_c = pd.DataFrame(stock_change, index=stock_code, columns=date)
stock_c

	2023-10-23	2023-10-24	2023-10-25	2023-10-26	2023-10-27
股票1	-1.796149	0.063469	0.922334	-0.338207	2.157024
股票2	-0.064218	0.969453	0.223896	-0.795105	-2.020499
股票3	-0.039286	0.046665	-0.408812	-0.284145	1.852426
股票4	-1.811617	0.588799	-1.020581	-0.421300	-1.068160
股票5	-0.867187	0.070269	0.362412	0.595810	0.005319
股票6	-2.384285	0.185213	-0.094201	0.559706	1.156052
股票7	1.231396	0.226930	-0.284544	1.056286	-0.765503
股票8	1.451832	-0.518495	0.115510	0.578233	0.174324
股票9	1.184461	-0.327693	-1.405433	1.480470	0.049133
股票10	0.891309	0.780864	-0.858295	-1.154474	0.127319

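DataFrame attributes

obj.shape: the shape as (rows, columns)
obj.index: the row index
obj.columns: the column index
obj.values: the underlying values as an ndarray
obj.T: the transpose (rows and columns swapped)
obj.head(): the first rows, 5 by default
obj.tail(): the last rows, 5 by default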

stock_c.shape

(10, 5)

stock_c.index

Index(['股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9', '股票10'], dtype='object')

stock_c.columns

DatetimeIndex(['2023-10-23', '2023-10-24', '2023-10-25', '2023-10-26',
               '2023-10-27'],
              dtype='datetime64[ns]', freq='B')

stock_c.values

array([[-1.7961491 ,  0.06346948,  0.92233413, -0.33820729,  2.15702396],
       [-0.06421753,  0.96945298,  0.22389647, -0.79510515, -2.02049945],
       [-0.03928641,  0.04666511, -0.40881248, -0.28414454,  1.85242648],
       [-1.81161734,  0.5887991 , -1.02058093, -0.42130023, -1.06816   ],
       [-0.86718681,  0.07026887,  0.36241195,  0.59581008,  0.00531913],
       [-2.38428482,  0.18521273, -0.09420118,  0.55970591,  1.15605167],
       [ 1.23139579,  0.22693018, -0.28454449,  1.05628637, -0.76550258],
       [ 1.45183169, -0.51849484,  0.11550995,  0.57823283,  0.17432416],
       [ 1.18446114, -0.3276933 , -1.40543347,  1.48046993,  0.04913251],
       [ 0.89130874,  0.78086438, -0.85829505, -1.15447368,  0.12731851]])

stock_c.T

	股票1	股票2	股票3	股票4	股票5	股票6	股票7	股票8	股票9	股票10
2023-10-23	-1.796149	-0.064218	-0.039286	-1.811617	-0.867187	-2.384285	1.231396	1.451832	1.184461	0.891309
2023-10-24	0.063469	0.969453	0.046665	0.588799	0.070269	0.185213	0.226930	-0.518495	-0.327693	0.780864
2023-10-25	0.922334	0.223896	-0.408812	-1.020581	0.362412	-0.094201	-0.284544	0.115510	-1.405433	-0.858295
2023-10-26	-0.338207	-0.795105	-0.284145	-0.421300	0.595810	0.559706	1.056286	0.578233	1.480470	-1.154474
2023-10-27	2.157024	-2.020499	1.852426	-1.068160	0.005319	1.156052	-0.765503	0.174324	0.049133	0.127319

stock_c.head()

	2023-10-23	2023-10-24	2023-10-25	2023-10-26	2023-10-27
股票1	-1.796149	0.063469	0.922334	-0.338207	2.157024
股票2	-0.064218	0.969453	0.223896	-0.795105	-2.020499
股票3	-0.039286	0.046665	-0.408812	-0.284145	1.852426
股票4	-1.811617	0.588799	-1.020581	-0.421300	-1.068160
股票5	-0.867187	0.070269	0.362412	0.595810	0.005319

stock_c.tail()

	2023-10-23	2023-10-24	2023-10-25	2023-10-26	2023-10-27
股票6	-2.384285	0.185213	-0.094201	0.559706	1.156052
股票7	1.231396	0.226930	-0.284544	1.056286	-0.765503
股票8	1.451832	-0.518495	0.115510	0.578233	0.174324
股票9	1.184461	-0.327693	-1.405433	1.480470	0.049133
股票10	0.891309	0.780864	-0.858295	-1.154474	0.127319

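DataFrame indexes

Changing row and column index labels

stock_c.index = [f'股票_{i+1}' for i in range(10)]
## A single label cannot be changed in place; assign a whole new index instead
## stock_c.index[2] = '123'  ## raises TypeError: Index does not support mutable operations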
stock_c

	2023-10-23	2023-10-24	2023-10-25	2023-10-26	2023-10-27
股票_1	-1.796149	0.063469	0.922334	-0.338207	2.157024
股票_2	-0.064218	0.969453	0.223896	-0.795105	-2.020499
股票_3	-0.039286	0.046665	-0.408812	-0.284145	1.852426
股票_4	-1.811617	0.588799	-1.020581	-0.421300	-1.068160
股票_5	-0.867187	0.070269	0.362412	0.595810	0.005319
股票_6	-2.384285	0.185213	-0.094201	0.559706	1.156052
股票_7	1.231396	0.226930	-0.284544	1.056286	-0.765503
股票_8	1.451832	-0.518495	0.115510	0.578233	0.174324
股票_9	1.184461	-0.327693	-1.405433	1.480470	0.049133
股票_10	0.891309	0.780864	-0.858295	-1.154474	0.127319

Resetting the index

## reset_index(drop=False): with drop=True the old index is discarded; with drop=False (the default) it is kept as a new column
stock_c.reset_index()

	index	2023-10-23 00:00:00	2023-10-24 00:00:00	2023-10-25 00:00:00	2023-10-26 00:00:00	2023-10-27 00:00:00
0	股票_1	-1.796149	0.063469	0.922334	-0.338207	2.157024
1	股票_2	-0.064218	0.969453	0.223896	-0.795105	-2.020499
2	股票_3	-0.039286	0.046665	-0.408812	-0.284145	1.852426
3	股票_4	-1.811617	0.588799	-1.020581	-0.421300	-1.068160
4	股票_5	-0.867187	0.070269	0.362412	0.595810	0.005319
5	股票_6	-2.384285	0.185213	-0.094201	0.559706	1.156052
6	股票_7	1.231396	0.226930	-0.284544	1.056286	-0.765503
7	股票_8	1.451832	-0.518495	0.115510	0.578233	0.174324
8	股票_9	1.184461	-0.327693	-1.405433	1.480470	0.049133
9	股票_10	0.891309	0.780864	-0.858295	-1.154474	0.127319

stock_c.reset_index(drop=True)

	2023-10-23	2023-10-24	2023-10-25	2023-10-26	2023-10-27
0	-1.796149	0.063469	0.922334	-0.338207	2.157024
1	-0.064218	0.969453	0.223896	-0.795105	-2.020499
2	-0.039286	0.046665	-0.408812	-0.284145	1.852426
3	-1.811617	0.588799	-1.020581	-0.421300	-1.068160
4	-0.867187	0.070269	0.362412	0.595810	0.005319
5	-2.384285	0.185213	-0.094201	0.559706	1.156052
6	1.231396	0.226930	-0.284544	1.056286	-0.765503
7	1.451832	-0.518495	0.115510	0.578233	0.174324
8	1.184461	-0.327693	-1.405433	1.480470	0.049133
9	0.891309	0.780864	-0.858295	-1.154474	0.127319

Setting a column as the new index

df = pd.DataFrame({'year':[2021, 2021, 2023, 2024],
                  'month':[1, 2, 3, 4],
                  'sale':[22, 100, 222, 113]})
df

	year	month	sale
0	2021	1	22
1	2021	2	100
2	2023	3	222
3	2024	4	113

df.index

RangeIndex(start=0, stop=4, step=1)

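## set_index(keys, drop=True): keys is a column name or a list of column names; drop controls whether those columns are removed from the data once they become the index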
df.set_index(keys=['year'])

	month	sale
year
2021	1	22
2021	2	100
2023	3	222
2024	4	113

new_df = df.set_index(keys=['year', 'month'], drop=False)
new_df

		year	month	sale
year	month
2021	1	2021	1	22
2021	2	2021	2	100
2023	3	2023	3	222
2024	4	2024	4	113

new_df.index

MultiIndex([(2021, 1),
            (2021, 2),
            (2023, 3),
            (2024, 4)],
           names=['year', 'month'])
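
As a compact recap of set_index and reset_index, here is a sketch on a small table. The frame `sales` and the result names are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({'year': [2021, 2021, 2023],
                      'month': [1, 2, 3],
                      'sale': [22, 100, 222]})

dropped = sales.set_index('year')            # drop=True: 'year' leaves the columns
kept = sales.set_index('year', drop=False)   # drop=False: 'year' stays as a column too
restored = dropped.reset_index()             # moves the index back into a column

print(list(dropped.columns))
print(list(kept.columns))
print(list(restored.columns))
```

Both methods return a new DataFrame by default, so `sales` itself is unchanged.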

MultiIndex

new_df.index.names

FrozenList(['year', 'month'])

tuples = [('bar', 'one'),
     ('bar', 'two'),
     ('baz', 'one'),
     ('baz', 'two'),
     ('foo', 'one'),
     ('foo', 'two'),
     ('qux', 'one'),
     ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

pd.Series(np.random.randn(8), index=index)

first  second
bar    one      -0.816907
       two       0.660782
baz    one      -1.032361
       two      -0.595878
foo    one      -0.658145
       two      -0.891936
qux    one       0.385722
       two      -0.192622
dtype: float64

arrays = [
np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
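df_arrays = pd.DataFrame(np.random.randn(8, 4), index=arrays)  ## a separate name keeps the earlier df (year/month/sale) available for the later examples
df_arrays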

		0	1	2	3
bar	one	-0.162790	2.799107	1.070652	0.034360
bar	two	-0.283814	-0.551970	-1.270871	-0.813390
baz	one	0.422166	1.380131	0.593804	0.776062
baz	two	1.888835	-0.176970	-0.568067	-1.343601
foo	one	-0.532914	1.206831	-0.367705	0.912403
foo	two	-1.576118	-0.082882	-0.122176	1.521598
qux	one	-0.074543	-0.359237	0.309770	0.895598
qux	two	0.905186	0.670022	-1.549954	-0.539559

pd.DataFrame(np.random.randn(8, 4), index=index)

		0	1	2	3
first	second
bar	one	-1.208274	-0.810972	-1.820593	-0.833156
bar	two	-1.501657	0.683875	0.923321	-0.710930
baz	one	-0.008496	-3.645099	2.125764	1.406796
baz	two	-0.440605	0.645926	-1.640536	1.002207
foo	one	0.264713	0.182264	-1.410930	0.837404
foo	two	0.683733	-0.300426	1.281374	0.440129
qux	one	-0.179653	-0.331090	-0.817277	0.583263
qux	two	-0.305134	-0.934428	-0.479319	-0.179533

MultiIndex.from_arrays(): takes a list of arrays
MultiIndex.from_tuples(): takes a list of tuples
MultiIndex.from_product(): takes several iterables and builds their Cartesian product
MultiIndex.from_frame(): takes a DataFrame
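
The constructors that are not demonstrated above can be sketched as follows. The level names 'first' and 'second' follow the earlier example; everything else is illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays(
    [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']],
    names=['first', 'second'])

# from_product builds every combination of the given iterables.
from_product = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second'])

# from_frame uses the DataFrame's columns as levels and their names as level names.
frame = pd.DataFrame({'first': ['bar', 'bar', 'baz', 'baz'],
                      'second': ['one', 'two', 'one', 'two']})
from_frame = pd.MultiIndex.from_frame(frame)

print(from_arrays.equals(from_product))
print(from_arrays.equals(from_frame))
```

All three calls describe the same four (first, second) pairs, so which constructor to use depends mainly on the shape of the data at hand.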

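Series

obj[flag1][flag2][flag3]: column first, then row
obj.loc[]: row first, then column, selected by label; supports slicing
obj.iloc[]: row first, then column, selected by integer position

new_df['year'][2021][1]  ## direct indexing is always column first, then row

df.loc[0:4, 'sale']   ## row first, then column; a label slice includes its end label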

0     22
1    100
2    222
3    113
Name: sale, dtype: int64

df.iloc[0:3, :5] ## first 3 rows and up to the first 5 columns; row first, then column, by integer position

	year	month	sale
0	2021	1	22
1	2021	2	100
2	2023	3	222

new_df.iloc[0:3, :5]

		year	month	sale
year	month
2021	1	2021	1	22
2021	2	2021	2	100
2023	3	2023	3	222
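
One difference between loc and iloc that is easy to miss is how slices treat the end point. The sketch below uses a hypothetical frame with string labels to make it visible.

```python
import pandas as pd

frame = pd.DataFrame({'sale': [22, 100, 222, 113]},
                     index=['a', 'b', 'c', 'd'])

by_label = frame.loc['a':'c', 'sale']   # label slice: 'c' is included
by_position = frame.iloc[0:2, 0]        # position slice: row 2 is excluded

print(by_label.tolist())
print(by_position.tolist())
```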

sr = pd.Series(np.arange(2,10,2), index=['数值{}'.format(i+1) for i in range(4)])
sr

数值1    2
数值2    4
数值3    6
数值4    8
dtype: int32

sr.values

array([2, 4, 6, 8])

sr.index

Index(['数值1', '数值2', '数值3', '数值4'], dtype='object')

Index operations

import numpy as np
import pandas as pd
mydata = np.random.normal(0, 1, (5, 5))
mydata_index = ['index{}'.format(i+1) for i in range(5)]
mydata_col =  ['col{}'.format(i+1) for i in range(5)]
data = pd.DataFrame(mydata, index=mydata_index, columns=mydata_col)
data

	col1	col2	col3	col4	col5
index1	0.178961	0.849560	-0.077123	-0.550173	-0.821073
index2	-0.479774	-0.986681	-0.934725	0.010318	-0.736170
index3	-0.384807	-0.636485	0.056328	-1.383175	-0.451370
index4	-0.770427	-1.009373	-0.283575	-0.923803	-1.502639
index5	0.068687	-0.361269	1.827731	0.034858	1.239907

Direct indexing

data['col1']['index1']  ## column first, then row

-0.31201088599026405

Indexing by label

data.loc['index1']['col1']

-0.31201088599026405

data.loc['index1', 'col1']

-0.31201088599026405

data.loc[['index1', 'index2'], 'col1']

index1    0.178961
index2   -0.479774
Name: col1, dtype: float64

Indexing by position

data.iloc[1, 0]

-0.2269501796329433

data.iloc[:4, :1]

	col1
index1	0.178961
index2	-0.479774
index3	-0.384807
index4	-0.770427

Assignment

data['col1'] = 0.01
data

	col1	col2	col3	col4	col5
index1	0.01	0.849560	-0.077123	-0.550173	-0.821073
index2	0.01	-0.986681	-0.934725	0.010318	-0.736170
index3	0.01	-0.636485	0.056328	-1.383175	-0.451370
index4	0.01	-1.009373	-0.283575	-0.923803	-1.502639
index5	0.01	-0.361269	1.827731	0.034858	1.239907

data.col1 = 0.02
data

	col1	col2	col3	col4	col5
index1	0.02	0.849560	-0.077123	-0.550173	-0.821073
index2	0.02	-0.986681	-0.934725	0.010318	-0.736170
index3	0.02	-0.636485	0.056328	-1.383175	-0.451370
index4	0.02	-1.009373	-0.283575	-0.923803	-1.502639
index5	0.02	-0.361269	1.827731	0.034858	1.239907

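data.loc['index1', 'col1'] = 0.1  ## chained forms such as data.col1.index1 = 0.1 may write to a temporary copy; .loc updates data itself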
data

	col1	col2	col3	col4	col5
index1	0.10	0.849560	-0.077123	-0.550173	-0.821073
index2	0.02	-0.986681	-0.934725	0.010318	-0.736170
index3	0.02	-0.636485	0.056328	-1.383175	-0.451370
index4	0.02	-1.009373	-0.283575	-0.923803	-1.502639
index5	0.02	-0.361269	1.827731	0.034858	1.239907

data.loc['index2', 'col1'] = 0.3  ## preferred over the chained form data['col1']['index2'] = 0.3
data

	col1	col2	col3	col4	col5
index1	0.10	0.849560	-0.077123	-0.550173	-0.821073
index2	0.30	-0.986681	-0.934725	0.010318	-0.736170
index3	0.02	-0.636485	0.056328	-1.383175	-0.451370
index4	0.02	-1.009373	-0.283575	-0.923803	-1.502639
index5	0.02	-0.361269	1.827731	0.034858	1.239907
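
Assigning to a single cell through two indexing steps (first the column, then the row) can end up modifying a temporary object, depending on the pandas version and settings. A minimal sketch of label-based assignment with .loc, using a made-up two-row frame, looks like this:

```python
import pandas as pd

frame = pd.DataFrame({'col1': [0.02, 0.02], 'col2': [1.0, 2.0]},
                     index=['index1', 'index2'])

# One indexing call selects the row and the column together,
# so the assignment writes into `frame` itself.
frame.loc['index1', 'col1'] = 0.1
frame.loc[:, 'col2'] = 0.0      # assign a whole column

print(frame)
```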

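Sorting

Sorting by values

obj.sort_values(by=, key=, ascending=): sorts by one key or several keys; ascending=True (the default) sorts in ascending order, ascending=False in descending order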

data.sort_values(by=['col1'], ascending=False)

	col1	col2	col3	col4	col5
index2	0.30	-0.986681	-0.934725	0.010318	-0.736170
index1	0.10	0.849560	-0.077123	-0.550173	-0.821073
index3	0.02	-0.636485	0.056328	-1.383175	-0.451370
index4	0.02	-1.009373	-0.283575	-0.923803	-1.502639
index5	0.02	-0.361269	1.827731	0.034858	1.239907

data.sort_values(by=['col1', 'col2'], ascending=False)

	col1	col2	col3	col4	col5
index2	0.30	-0.986681	-0.934725	0.010318	-0.736170
index1	0.10	0.849560	-0.077123	-0.550173	-0.821073
index5	0.02	-0.361269	1.827731	0.034858	1.239907
index3	0.02	-0.636485	0.056328	-1.383175	-0.451370
index4	0.02	-1.009373	-0.283575	-0.923803	-1.502639

sr = data['col1']  ## sorting a Series
sr

index1    0.10
index2    0.30
index3    0.02
index4    0.02
index5    0.02
Name: col1, dtype: float64

sr.sort_values()

index3    0.02
index4    0.02
index5    0.02
index1    0.10
index2    0.30
Name: col1, dtype: float64

Sorting by index

obj.sort_index()

data.sort_index()

	col1	col2	col3	col4	col5
index1	0.10	0.849560	-0.077123	-0.550173	-0.821073
index2	0.30	-0.986681	-0.934725	0.010318	-0.736170
index3	0.02	-0.636485	0.056328	-1.383175	-0.451370
index4	0.02	-1.009373	-0.283575	-0.923803	-1.502639
index5	0.02	-0.361269	1.827731	0.034858	1.239907

sr.sort_index()

index1    0.10
index2    0.30
index3    0.02
index4    0.02
index5    0.02
Name: col1, dtype: float64
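
When sorting by several keys, ascending may also be a list with one flag per key. The sketch below uses a small made-up frame whose index is deliberately out of order.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [0.02, 0.30, 0.02], 'col2': [-0.6, -0.9, -1.0]},
                     index=['index3', 'index2', 'index4'])

# col1 descending; ties in col1 are broken by col2 ascending.
by_values = frame.sort_values(by=['col1', 'col2'], ascending=[False, True])
by_index = frame.sort_index()

print(by_values)
print(by_index)
```

Both methods return a sorted copy and leave `frame` as it was.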

DataFrame operations

Arithmetic operations

data.col1 + 2

index1    2.10
index2    2.30
index3    2.02
index4    2.02
index5    2.02
Name: col1, dtype: float64

data.col1.add(3)

index1    3.10
index2    3.30
index3    3.02
index4    3.02
index5    3.02
Name: col1, dtype: float64

data.sub(10).head(2)  ## data - 10

	col1	col2	col3	col4	col5
index1	-9.9	-9.150440	-10.077123	-10.550173	-10.821073
index2	-9.7	-10.986681	-10.934725	-9.989682	-10.736170

data.col1.sub(data.col2).head(3)

index1   -0.749560
index2    1.286681
index3    0.656485
dtype: float64

Logical operations

Logical operators < > | &

## find the rows where col1 is greater than 0.1
data.col1 > 0.1

index1    False
index2     True
index3    False
index4    False
index5    False
Name: col1, dtype: bool

data[data.col1 > 0.1]

	col1	col2	col3	col4	col5
index2	0.3	-0.986681	-0.934725	0.010318	-0.73617

data[(data.col1 < 0.1) & (data.col2 < 0.1)]

	col1	col2	col3	col4	col5
index3	0.02	-0.636485	0.056328	-1.383175	-0.451370
index4	0.02	-1.009373	-0.283575	-0.923803	-1.502639
index5	0.02	-0.361269	1.827731	0.034858	1.239907

Logical functions query() and isin()

## obj.query(expr): expr is the query string
data.query('col1 < 0.1 & col2 < 0.1')

	col1	col2	col3	col4	col5
index3	0.02	-0.636485	0.056328	-1.383175	-0.451370
index4	0.02	-1.009373	-0.283575	-0.923803	-1.502639
index5	0.02	-0.361269	1.827731	0.034858	1.239907

## obj.isin(values): values is a list; tests whether each element equals any value in the list
data[data.col1.isin([0.02, 0.01])]

	col1	col2	col3	col4	col5
index3	0.02	-0.636485	0.056328	-1.383175	-0.451370
index4	0.02	-1.009373	-0.283575	-0.923803	-1.502639
index5	0.02	-0.361269	1.827731	0.034858	1.239907
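
The three filtering styles can be compared side by side on a small frame with fixed values. The frame and the `threshold` variable are illustrative assumptions; query() can read local variables when they are prefixed with @.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [0.10, 0.30, 0.02], 'col2': [0.8, -0.9, -0.6]},
                     index=['index1', 'index2', 'index3'])

# Boolean mask and query() express the same condition.
mask_result = frame[(frame.col1 < 0.2) & (frame.col2 < 0)]
query_result = frame.query('col1 < 0.2 and col2 < 0')

threshold = 0.2  # hypothetical local variable, referenced with @ inside query()
query_with_var = frame.query('col1 < @threshold and col2 < 0')

in_list = frame[frame.col1.isin([0.02, 0.30])]

print(mask_result)
print(in_list)
```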

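Statistical operations

Statistical functions: count, mean, std, min, max, var, prod, mode, abs, idxmax, idxmin
idxmax and idxmin return the index label of the maximum and minimum value; they are similar to numpy's argmax and argmin, which return the integer position
obj.describe() returns the count, mean, standard deviation, minimum, quartiles and maximum in a single call
The cumulative functions cumsum, cummax, cummin and cumprod compute the running sum, maximum, minimum and product of the first n values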

data.max()

col1    0.300000
col2    0.849560
col3    1.827731
col4    0.034858
col5    1.239907
dtype: float64

data.describe()

	col1	col2	col3	col4	col5
count	5.000000	5.000000	5.000000	5.000000	5.000000
mean	0.092000	-0.428850	0.117727	-0.562395	-0.454269
std	0.121326	0.763248	1.028901	0.610155	1.022660
min	0.020000	-1.009373	-0.934725	-1.383175	-1.502639
25%	0.020000	-0.986681	-0.283575	-0.923803	-0.821073
50%	0.020000	-0.636485	-0.077123	-0.550173	-0.736170
75%	0.100000	-0.361269	0.056328	0.010318	-0.451370
max	0.300000	0.849560	1.827731	0.034858	1.239907

data.col1.cumsum()

index1    0.10
index2    0.40
index3    0.42
index4    0.44
index5    0.46
Name: col1, dtype: float64
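
The label-versus-position distinction and the cumulative functions are easier to see on fixed numbers. This is a minimal sketch with a made-up Series `s`.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5], index=list('abcde'))

label_of_max = s.idxmax()                 # index label of the largest value
label_of_min = s.idxmin()                 # label of the first smallest value
position_of_max = int(np.argmax(s.to_numpy()))  # integer position instead of a label

print(label_of_max, label_of_min, position_of_max)
print(s.cumsum().tolist())
print(s.cummax().tolist())
print(s.cummin().tolist())
print(s.cumprod().tolist())
```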

data.col1.cumsum().plot()

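Custom operations

apply(func, axis=0)

func: the user-defined function
axis=0 (the default) applies func to each column; axis=1 applies it to each row

# Range (maximum minus minimum) of the col1 and col2 columns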
data[['col1', 'col2']].apply(lambda x: x.max() - x.min(), axis=0)

col1    0.280000
col2    1.858932
dtype: float64
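
The example above only uses axis=0. To contrast it with axis=1, here is a sketch with a named function on a made-up 2x2 frame; `value_range` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 5], 'col2': [4, 2]}, index=['index1', 'index2'])

def value_range(values):
    # `values` is one column (axis=0) or one row (axis=1), passed as a Series
    return values.max() - values.min()

per_column = frame.apply(value_range, axis=0)  # one result per column
per_row = frame.apply(value_range, axis=1)     # one result per row

print(per_column)
print(per_row)
```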

Plotting with pandas

pandas.DataFrame.plot(x=None, y=None, kind='line')
- kind: 'line' line chart, 'bar' bar chart, 'barh' horizontal bar chart, 'hist' histogram, 'pie' pie chart, 'scatter' scatter plot
pandas.Series.plot

import pandas as pd
import numpy as np
mydata = np.random.normal(0, 1, (5, 5))
mydata_index = ['index{}'.format(i+1) for i in range(5)]
mydata_col =  ['col{}'.format(i+1) for i in range(5)]
data = pd.DataFrame(mydata, index=mydata_index, columns=mydata_col)
data

	col1	col2	col3	col4	col5
index1	0.706740	1.059931	-0.290975	0.480027	0.869103
index2	-0.461089	2.278285	0.118369	-0.141536	-1.054914
index3	0.871724	-1.184708	-0.729994	-0.291118	0.606099
index4	-0.300855	-0.784571	-1.815973	0.791439	0.861675
index5	1.380419	1.675737	0.400070	0.130281	0.501257

data.plot()

data.plot(x='col1', y='col2', kind='barh')

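Reading and writing CSV files

pandas.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, usecols=None)

DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', index=True, header=True, mode='w', encoding=None)
- path_or_buf: the path or file object to write the CSV to
- sep: the column separator, a comma by default
- na_rep: the text written for missing values, an empty string by default
- index: whether to write the row index, True by default
- header: whether to write the column names, True by default
- mode: the write mode; 'w' (the default) overwrites the file and 'a' appends to it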

read_data = pd.read_csv('E:/Project/PyCharm_Projects/pandas_test/read.csv', encoding='GBK')
read_data

	Name	Age
0	李白	21
1	杜甫	32
2	孟浩然	34

data = {'Name': ['Alice', 'Bob', 'Carol'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
df.to_csv('E:/Project/PyCharm_Projects/pandas_test/output.csv', index=False, mode='a', header=False)
df_read = pd.read_csv('E:/Project/PyCharm_Projects/pandas_test/output.csv')
df_read

	Name	Age
0	Alice	25
1	Bob	30
2	Carol	35
3	Alice	25
4	Bob	30
5	Carol	35
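
The example above depends on a file that already exists on the author's machine. A self-contained sketch of the same write, append and read cycle, using a temporary directory and a hypothetical file name, might look like this:

```python
import os
import tempfile

import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'output.csv')  # hypothetical file name
    # The first write creates the file with a header row.
    people.to_csv(path, index=False)
    # When appending, skip the header so it is not repeated inside the file.
    people.to_csv(path, index=False, mode='a', header=False)
    round_trip = pd.read_csv(path, usecols=['Name', 'Age'])

print(round_trip)
```

Passing index=False keeps the row index out of the file, so reading it back does not produce an extra unnamed column.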

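Reading and writing HDF5 files

pandas.read_hdf(path_or_buf, key=None, **kwargs)
- path_or_buf: the file path
- key: the key to read
- mode: the mode used to open the file
- returns: the selected object
DataFrame.to_hdf(path_or_buf, key, **kwargs)
HDF5 stores data as key-value pairs, and it can also hold three-dimensional data.
It is cross-platform and supports compression, which saves disk space.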

Reading and writing JSON files

pandas.read_json(path_or_buf, orient=None, typ='frame', lines=False)
- converts JSON data into a pandas DataFrame by default
- orient: the layout of the JSON data; 'records' is the usual choice
- lines: whether to read each line as a separate JSON object
DataFrame.to_json(path_or_buf=None, orient=None, lines=False)
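
Since this section has no example, here is a minimal round-trip sketch with orient='records' and lines=True. It works on an in-memory string rather than a file, and the frame `people` is made up for illustration.

```python
import io

import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With path_or_buf omitted, to_json returns the JSON text instead of writing a file.
json_lines = people.to_json(orient='records', lines=True)
print(json_lines)

# Each line of the text is one JSON object, so read it back with lines=True.
restored = pd.read_json(io.StringIO(json_lines), orient='records', lines=True)
print(restored)
```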

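Summary

pandas basics for data processing
Introduction to pandas:
- A panel-data tool with convenient data-processing capabilities
- Builds on NumPy and matplotlib, and makes reading files easy
- Series: one-dimensional data; DataFrame: two-dimensional data
- Series attributes: index, values
- DataFrame attributes: shape, index, columns, values, T
- Common DataFrame methods: head(), tail()
- MultiIndex: a way to store multi-dimensional data
Basic pandas operations:
- Indexing: direct indexing (column first, then row), by label with loc, by position with iloc
- Assignment
- Sorting: sort_values(), sort_index()
pandas operations:
- Arithmetic operations
- Logical operations: logical operators and Boolean indexing, query(), isin()
- Statistical operations: describe(), min, max, std, idxmax, idxmin, cumsum, cummax
- Custom operations: apply()
Plotting with pandas:
- df.plot()
- sr.plot()
pandas I/O:
- csv: pd.read_csv(path, names, usecols), df.to_csv(path, header, mode, index)
- hdf5: pd.read_hdf(path, key), df.to_hdf(path, key)
- json: pd.read_json(path, orient, lines), df.to_json(path, orient, lines)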