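pandas offers several ways to iterate over rows, such as iterrows(), itertuples() and apply, and they are fairly convenient.
polars has very powerful column operations, which are covered in detail on its official site. Because the Arrow format underneath polars is columnar, row-wise operations are inefficient, and the official documentation does not recommend processing data row by row. Still, some scenarios do call for row iteration.
So how can rows be iterated in polars? Today I try an approach that does not use apply.
Scenario: polars reads a CSV file of historical stock prices that contains basic quote data. How can the loaded data be iterated row by row quickly? This comes up often when backtesting market-data-driven strategies.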
I. Initial approach
1. Overall plan
(1) csv => DataFrame
(2) DataFrame => into_struct, which yields a StructChunked
(3) StructChunked => iterate over its rows and build the bars
2. The Bar type
There are two options for designing the Bar type:
(1) A Bar that owns its values
#[allow(dead_code)] // the fields are written but never read in this demo
struct Bar {
    code: String,
    date: String,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}
(2) A Bar that holds references
#[allow(dead_code)] // the fields are written but never read in this demo
struct Bar2<'a> {
    code: &'a str,
    date: &'a str,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}
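The trade-off between the two designs can be seen without polars at all. The following is a minimal sketch, not the code from this post: `OwnedBar`, `BorrowedBar`, `parse_owned`, `parse_borrowed` and the sample line are hypothetical names, cut down to three fields. The owning version copies each text field into a new String, while the borrowing version only stores slices that point into the source buffer, so it cannot outlive that buffer.

```rust
// Hypothetical, reduced versions of the two Bar designs.
struct OwnedBar {
    code: String,
    date: String,
    close: f32,
}

struct BorrowedBar<'a> {
    code: &'a str,
    date: &'a str,
    close: f32,
}

// Parses "code,date,close" and copies the text fields into new Strings
// (two heap allocations per record).
fn parse_owned(line: &str) -> Option<OwnedBar> {
    let mut parts = line.split(',');
    let code = parts.next()?;
    let date = parts.next()?;
    let close: f32 = parts.next()?.trim().parse().ok()?;
    Some(OwnedBar {
        code: code.to_string(),
        date: date.to_string(),
        close,
    })
}

// Parses the same record but keeps slices into `line` (no allocation).
// The returned value borrows from `line` and cannot outlive it.
fn parse_borrowed(line: &str) -> Option<BorrowedBar<'_>> {
    let mut parts = line.split(',');
    let code = parts.next()?;
    let date = parts.next()?;
    let close: f32 = parts.next()?.trim().parse().ok()?;
    Some(BorrowedBar { code, date, close })
}

fn main() {
    let line = String::from("600000,2020-01-02,12.5");
    let borrowed = parse_borrowed(&line).expect("well-formed sample line");
    println!("{} {} {}", borrowed.code, borrowed.date, borrowed.close);
    let owned = parse_owned(&line).expect("well-formed sample line");
    // The owning value stays usable after the source buffer is gone.
    drop(line);
    println!("{} {} {}", owned.code, owned.date, owned.close);
}
```

Using `borrowed` after `drop(line)` would be rejected by the borrow checker, which is the price of skipping the copies.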
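II. Cargo.toml
Note that polars is strict about feature flags: every feature the code uses has to be enabled explicitly, otherwise the code will not compile. The polars documentation often does not spell this out, so it is an easy trap to fall into.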
[package]
name = "my_duckdb"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
polars = { version = "*", features = ["lazy","dtype-struct"] }
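Note that the features list must include "dtype-struct"; without it the struct conversion used below will not compile. Also, the version requirement "*" resolves to whatever release is newest, and the code below depends on names such as AnyValue::Utf8 from the polars version it was written against, so it is safer to pin the version you actually tested.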
III. main.rs
Following the design above, the complete code is:
use polars::prelude::*;
use std::time::Instant;
#[allow(dead_code)] // the fields are written but never read in this demo
struct Bar {
    code: String,
    date: String,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}
#[allow(dead_code)] // the fields are written but never read in this demo
struct Bar2<'a> {
    code: &'a str,
    date: &'a str,
    open: f32,
    high: f32,
    close: f32,
    low: f32,
    volume: f32,
    amount: f32,
    is_fq: bool,
}
fn main() {
    let time0 = Instant::now();
    // test2.csv: about 640,000 rows
    let csv = "test2.csv";
    let df = polars_lazy_read_csv(csv);
    println!("read raw csv cost time : {:?} seconds",time0.elapsed().as_secs_f32());
    // Wrap all columns of the DataFrame into one struct column.
    let time1 = Instant::now();
    let rows = df.into_struct("bars");
    println!("dataframe => structs cost time : {:?} seconds",time1.elapsed().as_secs_f32());
    // Row iteration into owning Bars.
    let time2 = Instant::now();
    let bars = get_vec_bars(&rows);
    println!("dataframe => bars cost time : {:?} seconds",time2.elapsed().as_secs_f32());
    // Row iteration into Bar2s that borrow their text fields from `rows`.
    let time3 = Instant::now();
    let bar2s = get_vec_bar2s(&rows);
    println!("dataframe => bar2s cost time : {:?} seconds",time3.elapsed().as_secs_f32());
    println!("bars length :{:?}",bars.len());
    println!("bar2s length:{:?}",bar2s.len());
}
// Builds an owning Bar from one row; the two text fields are copied into new Strings.
fn get_bar(row: &[AnyValue]) -> Bar {
    // Columns 0 and 1 hold text; a cell of any other type leaves the default "".
    let code = row.get(0).unwrap();
    let mut new_code = "";
    if let &AnyValue::Utf8(value) = code {
        new_code = value;
    }
    let mut new_date = "";
    let date = row.get(1).unwrap();
    if let &AnyValue::Utf8(v) = date {
        new_date = v;
    }
    // Columns 2 to 7 are numeric; unwrap() panics if a cell cannot be read as f32.
    let open = row[2].extract::<f32>().unwrap();
    let high = row[3].extract::<f32>().unwrap();
    let close = row[4].extract::<f32>().unwrap();
    let low = row[5].extract::<f32>().unwrap();
    let volume = row[6].extract::<f32>().unwrap();
    let amount = row[7].extract::<f32>().unwrap();
    // Column 8 is the boolean flag; it stays false if the cell is not a Boolean.
    let mut is_fq = false;
    if let &AnyValue::Boolean(b) = row.get(8).unwrap() {
        is_fq = b;
    }
    Bar {
        code: String::from(new_code),
        date: String::from(new_date),
        open,
        high,
        close,
        low,
        volume,
        amount,
        is_fq,
    }
}
// Builds a borrowing Bar2 from one row; the text fields point into the row data, so nothing is copied.
fn get_bar2<'a>(row: &'a [AnyValue]) -> Bar2<'a> {
    // Columns 0 and 1 hold text; a cell of any other type leaves the default "".
    let code = row.get(0).unwrap();
    let mut new_code = "";
    if let &AnyValue::Utf8(value) = code {
        new_code = value;
    }
    let mut new_date = "";
    let date = row.get(1).unwrap();
    if let &AnyValue::Utf8(v) = date {
        new_date = v;
    }
    // Columns 2 to 7 are numeric; unwrap() panics if a cell cannot be read as f32.
    let open = row[2].extract::<f32>().unwrap();
    let high = row[3].extract::<f32>().unwrap();
    let close = row[4].extract::<f32>().unwrap();
    let low = row[5].extract::<f32>().unwrap();
    let volume = row[6].extract::<f32>().unwrap();
    let amount = row[7].extract::<f32>().unwrap();
    // Column 8 is the boolean flag; it stays false if the cell is not a Boolean.
    let mut is_fq = false;
    if let &AnyValue::Boolean(b) = row.get(8).unwrap() {
        is_fq = b;
    }
    Bar2 {
        code: new_code,
        date: new_date,
        open,
        high,
        close,
        low,
        volume,
        amount,
        is_fq,
    }
}
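// Iterates over the struct column row by row and collects owning Bars.
fn get_vec_bars(data: &StructChunked) -> Vec<Bar> {
    let mut bars = Vec::new();
    for row in data {
        let bar = get_bar(row);
        bars.push(bar);
    }
    bars
}
// Same iteration, but the resulting Bar2s borrow their text fields from `data`.
fn get_vec_bar2s(data: &StructChunked) -> Vec<Bar2> {
    let mut bars = Vec::new();
    for row in data {
        let bar = get_bar2(row);
        bars.push(bar);
    }
    bars
}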
// Reads the CSV through the lazy API and collects it into an eager DataFrame.
fn polars_lazy_read_csv(filepath: &str) -> DataFrame {
    let polars_lazy_csv_time = Instant::now();
    let p = LazyCsvReader::new(filepath)
        .has_header(true)
        .finish()
        .unwrap();
    let df = p.collect().expect("failed to collect the LazyFrame into a DataFrame");
    // Prints the (rows, columns) shape, then the elapsed read time in seconds.
    println!("polars lazy 读出csv的行和列数:{:?}",df.shape());
    println!("polars lazy 读csv 花时: {:?} 秒!", polars_lazy_csv_time.elapsed().as_secs_f32());
    df
}
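IV. Output and comparison
For a CSV file with about 640,000 rows and 9 columns, every row has to be visited and converted into an element of a Vec<Bar>.
1. The output: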
polars lazy 读出csv的行和列数:(640710, 9)
polars lazy 读csv 花时: 0.058484446 秒!
read raw csv cost time : 0.058487203 seconds
dataframe => structs cost time : 2.8842e-5 seconds
dataframe => bars cost time : 0.131985 seconds
dataframe => bar2s cost time : 0.10357016 seconds
bars length :640710
bar2s length:640710
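Overall, the DataFrame-to-struct step is cheap (about 29 microseconds); the main cost is going from StructChunked to bars, at roughly 0.10 to 0.13 seconds.
2. Owning Bar vs. borrowing Bar
The output shows that the reference-based Bar2 is faster, by roughly 20% in this run (0.104 s versus 0.132 s), because it avoids allocating two Strings on the heap for every row. This is a single measurement, and the Bar conversion ran first, so the figure should be treated as indicative rather than exact.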
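To make the comparison concrete, the timings printed above can be turned into a relative saving and a per-row cost. This is only arithmetic on the figures from that one run; the helper names `relative_saving` and `nanos_per_row` are made up for this sketch.

```rust
// Fraction of time saved by the faster run, relative to the slower one.
fn relative_saving(slow_secs: f64, fast_secs: f64) -> f64 {
    (slow_secs - fast_secs) / slow_secs
}

// Average cost of one row in nanoseconds.
fn nanos_per_row(total_secs: f64, rows: u64) -> f64 {
    total_secs * 1e9 / rows as f64
}

fn main() {
    // Figures copied from the single run shown above.
    let rows = 640_710;
    let bars_secs = 0.131985;
    let bar2s_secs = 0.10357016;
    println!("saving: {:.1}%", relative_saving(bars_secs, bar2s_secs) * 100.0);
    println!("Bar : {:.0} ns/row", nanos_per_row(bars_secs, rows));
    println!("Bar2: {:.0} ns/row", nanos_per_row(bar2s_secs, rows));
}
```

For this run that works out to a saving of about 21.5%, or roughly 206 ns per row for Bar against 162 ns per row for Bar2.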
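V. Other notes
So far I have not found a row iteration API in polars comparable to the ones in pandas; I will keep watching for one.
In addition, converting a DataFrame into bars is still not very efficient, and I hope to find a faster alternative.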
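One direction that may be worth trying, untested here, is to read each column through its own typed accessor and walk the columns in lockstep, rather than decoding a dynamically typed AnyValue for every cell. The sketch below shows only that zipping pattern in plain Rust: the `Columns` struct, its `Vec` fields, `BarView` and the `rows` helper are hypothetical stand-ins for typed polars columns, and no polars API is used.

```rust
// Hypothetical stand-in for three already-typed columns of equal length.
struct Columns {
    code: Vec<String>,
    close: Vec<f32>,
    volume: Vec<f32>,
}

// One row assembled from the columns; `code` borrows from `Columns`.
#[derive(Debug, PartialEq)]
struct BarView<'a> {
    code: &'a str,
    close: f32,
    volume: f32,
}

// Walks the columns in lockstep and yields one BarView per row.
// `zip` stops at the shortest column, so equal lengths are assumed.
fn rows<'a>(cols: &'a Columns) -> impl Iterator<Item = BarView<'a>> + 'a {
    cols.code
        .iter()
        .zip(cols.close.iter())
        .zip(cols.volume.iter())
        .map(|((code, &close), &volume)| BarView {
            code: code.as_str(),
            close,
            volume,
        })
}

fn main() {
    let cols = Columns {
        code: vec!["600000".to_string(), "600000".to_string()],
        close: vec![12.5, 12.75],
        volume: vec![1000.0, 2000.0],
    };
    for bar in rows(&cols) {
        println!("{} close={} volume={}", bar.code, bar.close, bar.volume);
    }
}
```

Whether this beats the StructChunked route in polars would have to be measured on the same 640,000-row file; the sketch makes no claim about that.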