Pandas库建立在NumPy之上,并为Python编程语言提供了易于使用的数据结构和数据分析工具。
1.安装及调用
pip install pandas
import pandas as pd
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
一种具有潜在不同类型的列的二维标记数据结构
>>> data = { 'Country' : [ 'Belgium' , 'India' , 'Brazil' ], 'Capital' : [ 'Brussels' , 'New Delhi' , 'Brasília' ], 'Population' : [11190846, 1303171035, 207847528]}>>> df = pd.DataFrame(data, columns=[ 'Country' , 'Capital' , 'Population' ])
2.读写
2.1读取和写到CSV
>>> pd.read_csv( 'file.csv' , header=None, nrows=5)>>> df.to_csv( 'myDataFrame.csv',index=False )
2.1读取和写到Excel
>>> pd.to_excel( 'dir/myDataFrame.xlsx' , sheet_name= 'Sheet1' )>>>>>> xlsx = pd.ExcelFile( 'file.xls' )>>> df = pd.read_excel(xlsx, 'Sheet1' )
3.帮助信息
>>> help(pd.Series.loc)
4.获取元素
>>> s[ 'b' ]-5>>> df[1:]Country Capital Population1 India New Delhi 13031710352 Brazil Brasília 207847528>>> df.iloc[0][0] # 根据索引获取第一行第一列的值'Belgium'>>> df.loc([0], [ 'Country' ]) # 根据标签索引获取行为0,列为“country”的值'Belgium'
>>> from sqlalchemy import create_engine>>> engine = create_engine( 'sqlite:///:memory:' )>>> pd.read_sql( "SELECT * FROM my_table;" , engine)>>> pd.read_sql_table( 'my_table' , engine)>>> pd.read_sql_query( "SELECT * FROM my_table;" , engine)>>> pd.to_sql( 'myDf' , engine)
>>> s.drop([ 'a' , 'c' ])b -5
d 4>>> df.drop( 'Country' , axis=1)
7.排序
>>> df.sort_index()>>> df.sort_values(by= 'Country' )>>> df.rank()
>>> df.shape # 返回行列(3,3)>>> df.index # 返回索引信息 RangeIndex(start=0, stop=3, step=1)>>> df.columns #返回 Index(['Country', 'Capital', 'Population'], dtype='object')>>> df.info() #返回dataframe的基本信息>>> df.count()Country 3
Capital 3
Population 3
9. 信息概要
>>>df.sum()
Country BelgiumIndiaBrazil
Capital BrusselsNew DelhiBrasília
Population 1522209409>>>df.cumsum()
Country Capital Population
0 Belgium Brussels 11190846
1 BelgiumIndia BrusselsNew Delhi 1314361881
2 BelgiumIndiaBrazil BrusselsNew DelhiBrasília 1522209409>>>df.max()
Country India
Capital New Delhi
Population 1303171035>>>df.min()
Country Belgium
Capital Brasília
Population 11190846>>>df.describe()
Population
count 3.000000e+00
mean 5.074031e+08
std 6.961346e+08
min 1.119085e+07
25% 1.095192e+08
50% 2.078475e+08
75% 7.555093e+08
max 1.303171e+09>>>df.mean()
Population 5.074031e+08
>>>df.median()
Population 207847528.0
10.函数应用
>>>f=lambda x:x*2
>>>df.apply(f)
Country Capital Population
0 BelgiumBelgium BrusselsBrussels 22381692
1 IndiaIndia New DelhiNew Delhi 2606342070
2 BrazilBrazil BrasíliaBrasília 415695056
11.数据计算
>>>s3 = pd.Series([7, -2, 3], index=[ 'a' , 'c' , 'd' ])>>>s + s3a 10.0b NaNc 5.0d 7.0>>>s.add(s3, fill_value=0)a 10.0
b -5.0
c 5.0
d 7.0>>> s.sub(s3, fill_value=2)a -4.0
b -7.0
c 9.0
d 1.0>>> s.div(s3, fill_value=4)a 0.428571
b -1.250000
c -3.500000
d 1.333333>>> s.mul(s3, fill_value=3)a 21.0
b -20.0
c -14.0
d 12.0