Background
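When working on text-related tasks, you will inevitably run into data in CSV or TSV format. Often the data only needs to be read and passed on to the next task, without much processing. pandas can read it, but it is somewhat heavyweight for that purpose and pulls in pandas' own data structures. For NLP tasks, torchtext can also handle text data in these formats, but it is geared toward preparing data for model training, so it is not a good fit here either.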
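In fact, Python ships with a csv module in its standard library for reading and writing CSV data, and it is fairly concise to use. If you handle CSV and TSV data often, you can wrap it into your own utility package. The notes below record how to use the csv module. The authoritative reference is, of course, the official documentation: https://docs.python.org/zh-cn/3.8/library/csv.html?highlight=csv#module-csv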
The code below was written in VS Code, and the generated files were viewed with the Excel Viewer extension for VS Code.
Generating CSV files
This section covers two ways to generate a CSV file: csv.writer and csv.DictWriter.
Example: writing row by row
Example of writing CSV data:
import csv

head = ["head" + str(i) for i in range(1, 6)]  # build the CSV header
data_list = [[(i + 1) * j for i in range(5)] for j in range(1, 10)]  # build the data rows

# newline='' is needed; otherwise blank lines may appear between rows (e.g. on Windows)
with open('write_by_line.csv', 'w', encoding='utf8', newline="") as f:
    writer = csv.writer(f)  # create the row writer object
    writer.writerow(head)  # write the header as a single row
    writer.writerows(data_list)  # write all data rows at once
The resulting CSV data is as follows:
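To inspect what the writer produces without opening the file, the same rows can be written to an in-memory buffer. This is a minimal sketch: it swaps the file for io.StringIO (an assumption made for illustration, not part of the original program) and prints the first three lines.

```python
import csv
import io

head = ["head" + str(i) for i in range(1, 6)]
data_list = [[(i + 1) * j for i in range(5)] for j in range(1, 10)]

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(buffer)
writer.writerow(head)
writer.writerows(data_list)

# csv.writer ends each row with "\r\n" by default; splitlines() handles that
lines = buffer.getvalue().splitlines()
for line in lines[:3]:
    print(line)
# head1,head2,head3,head4,head5
# 1,2,3,4,5
# 2,4,6,8,10
```

The buffer holds one header line plus nine data lines, matching the ranges used to build head and data_list.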
Example of writing TSV data. The only difference from the CSV case is that the delimiter parameter is set to \t when the row writer object is created.
import csv

head = ["head" + str(i) for i in range(1, 6)]  # build the header
data_list = [[(i + 1) * j for i in range(5)] for j in range(1, 10)]  # build the data rows

# newline='' is needed; otherwise blank lines may appear between rows (e.g. on Windows)
with open('write_by_line.tsv', 'w', encoding='utf8', newline="") as f:
    writer = csv.writer(f, delimiter='\t')  # create the row writer object with a tab delimiter
    writer.writerow(head)  # write the header as a single row
    writer.writerows(data_list)  # write all data rows at once
The generated data is as follows:
If the file extension is not changed to tsv, Excel Viewer displays the file incorrectly.
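As a side note on the delimiter setting, the csv module also ships a built-in "excel-tab" dialect, which is the default "excel" dialect with a tab as the delimiter. The sketch below compares the two spellings; the dump helper and the sample rows are hypothetical names used only for illustration.

```python
import csv
import io

rows = [["head1", "head2"], [1, 2], [3, 4]]


def dump(**fmt):
    """Hypothetical helper: write `rows` to a string with the given format options."""
    buffer = io.StringIO()
    csv.writer(buffer, **fmt).writerows(rows)
    return buffer.getvalue()


by_delimiter = dump(delimiter="\t")      # explicit delimiter, as in the example above
by_dialect = dump(dialect="excel-tab")   # built-in tab-separated dialect
print(by_delimiter == by_dialect)
```

Either spelling works for this data; passing delimiter explicitly keeps the change from the CSV version visible in one place.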
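Example: writing from dictionaries
Sometimes existing dictionary data needs to be written to a CSV file, and the csv module also supports writing rows keyed by dictionary keys. To make the program more general, the row-by-row writing logic is included as well. The final code is as follows: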
import csv


def write_data(style='csv', write_style='list') -> None:
    head = ["head" + str(i) for i in range(1, 6)]  # build the header
    if write_style == 'list':
        data_list = [[(i + 1) * j for i in range(5)] for j in range(1, 10)]  # rows as lists
    elif write_style == 'dict':
        data_list = [{key: j for key in head} for j in range(1, 10)]  # rows as dicts keyed by header
    else:
        raise ValueError(f'write_style {write_style!r} is not in ["list", "dict"]')
    if style == 'csv':
        delimiter = ','
        file_name = write_style + "_writer_by_list.csv"
    elif style == 'tsv':
        delimiter = '\t'
        file_name = write_style + "_writer_by_list.tsv"
    else:
        raise ValueError(f"style {style!r} is not in ['csv', 'tsv']")
    # newline='' is needed; otherwise blank lines may appear between rows (e.g. on Windows)
    with open(file_name, 'w', encoding='utf8', newline="") as f:
        if write_style == 'list':
            writer = csv.writer(f, delimiter=delimiter)  # create the row writer object
            writer.writerow(head)  # write the header as a single row
        else:
            # pass the delimiter here as well, otherwise a .tsv file would still use commas
            writer = csv.DictWriter(f, fieldnames=head, delimiter=delimiter)
            writer.writeheader()  # write the header row from fieldnames
        writer.writerows(data_list)  # write all data rows at once


if __name__ == '__main__':
    write_data(write_style='dict')
The generated data is as follows:
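In the program above every dictionary has exactly the keys listed in head. When real data is less tidy, csv.DictWriter offers two parameters for that: restval fills in missing keys, and extrasaction decides what happens to keys that are not in fieldnames. A minimal sketch, using an in-memory buffer and made-up rows as assumptions for illustration:

```python
import csv
import io

head = ["head1", "head2", "head3"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=head, restval="NA", extrasaction="ignore")
writer.writeheader()
writer.writerow({"head1": 1, "head3": 3})                          # head2 is missing -> "NA"
writer.writerow({"head1": 4, "head2": 5, "head3": 6, "extra": 7})  # "extra" is silently dropped
lines = buffer.getvalue().splitlines()
print(lines)  # ['head1,head2,head3', '1,NA,3', '4,5,6']

# With the default extrasaction="raise", an unknown key is an error
strict = csv.DictWriter(io.StringIO(), fieldnames=head)
try:
    strict.writerow({"head1": 1, "unknown": 2})
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

Keeping the default "raise" behaviour is the safer choice when a misspelled key should be caught early rather than silently lost.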
Reading CSV data
With a CSV file in hand, there are correspondingly two ways to read the data. An example:
import csv
from typing import Any


def read_data(file_path, style='csv', delimiter=',') -> Any:
    assert delimiter in [',', '\t'], f"delimiter must be ',' or '\\t', got {delimiter!r}"
    data = {}
    with open(file_path, 'r', encoding='utf8', newline='') as f:
        if style == 'csv':
            reader = csv.reader(f, delimiter=delimiter)  # each row is a list of strings
            data["head"] = next(reader)  # the first row is the header
        else:
            reader = csv.DictReader(f, delimiter=delimiter)  # each row is a dict keyed by the header
            data["head"] = reader.fieldnames  # DictReader consumes the header row itself
        data['content'] = list(reader)  # the remaining rows
    return data


if __name__ == '__main__':
    import pprint
    data = read_data("dict_writer_by_list.csv")
    pprint.pprint(data)
The output is as follows:
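One detail worth keeping in mind when reading: both csv.reader and csv.DictReader return every field as a string, so numeric columns have to be converted explicitly. The sketch below reads a small in-memory sample (the text and variable names are assumptions for illustration, not the file generated earlier) with both readers.

```python
import csv
import io

text = "head1,head2,head3\r\n1,2,3\r\n2,4,6\r\n"

# csv.reader: rows come back as lists of strings
rows = list(csv.reader(io.StringIO(text, newline="")))
head, content = rows[0], rows[1:]
print(content[0])  # ['1', '2', '3']
numbers = [[int(value) for value in row] for row in content]

# csv.DictReader: the header row is consumed and exposed as fieldnames
dict_reader = csv.DictReader(io.StringIO(text, newline=""))
print(dict_reader.fieldnames)  # ['head1', 'head2', 'head3']
records = [{key: int(value) for key, value in row.items()} for row in dict_reader]
print(records[1])  # {'head1': 2, 'head2': 4, 'head3': 6}
```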
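Summary
Overall, the built-in csv module makes both writing and reading data concise and convenient, and it is easy to customize for the needs of a real project. The complete code from above is as follows: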
import csv
from typing import Any


def write_data(style='csv', write_style='list') -> None:
    head = ["head" + str(i) for i in range(1, 6)]  # build the header
    if write_style == 'list':
        data_list = [[(i + 1) * j for i in range(5)] for j in range(1, 10)]  # rows as lists
    elif write_style == 'dict':
        data_list = [{key: j for key in head} for j in range(1, 10)]  # rows as dicts keyed by header
    else:
        raise ValueError(f'write_style {write_style!r} is not in ["list", "dict"]')
    if style == 'csv':
        delimiter = ','
        file_name = write_style + "_writer_by_list.csv"
    elif style == 'tsv':
        delimiter = '\t'
        file_name = write_style + "_writer_by_list.tsv"
    else:
        raise ValueError(f"style {style!r} is not in ['csv', 'tsv']")
    # newline='' is needed; otherwise blank lines may appear between rows (e.g. on Windows)
    with open(file_name, 'w', encoding='utf8', newline="") as f:
        if write_style == 'list':
            writer = csv.writer(f, delimiter=delimiter)  # create the row writer object
            writer.writerow(head)  # write the header as a single row
        else:
            # pass the header and the delimiter; without the delimiter a .tsv file would still use commas
            writer = csv.DictWriter(f, fieldnames=head, delimiter=delimiter)
            writer.writeheader()  # write the header row
        writer.writerows(data_list)  # write all data rows at once


def read_data(file_path, style='csv', delimiter=',') -> Any:
    assert delimiter in [',', '\t'], f"delimiter must be ',' or '\\t', got {delimiter!r}"
    data = {}
    with open(file_path, 'r', encoding='utf8', newline='') as f:
        if style == 'csv':
            reader = csv.reader(f, delimiter=delimiter)  # each row is a list of strings
            data["head"] = next(reader)  # the first row is the header
        else:
            reader = csv.DictReader(f, delimiter=delimiter)  # each row is a dict keyed by the header
            data["head"] = reader.fieldnames  # DictReader consumes the header row itself
        data['content'] = list(reader)  # the remaining rows
    return data


if __name__ == '__main__':
    import pprint
    # write_data(write_style='dict')  # uncomment to generate the file first
    data = read_data("dict_writer_by_list.csv")
    pprint.pprint(data)
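For the "read it and pass it on to the next task" scenario from the background, a variation worth considering is to yield rows one at a time instead of collecting them with list(reader), which helps with large files. The following is only a sketch: iter_rows is a hypothetical helper name, and the temporary demo file exists just to make the example self-contained.

```python
import csv
import os
import tempfile
from typing import Dict, Iterator


def iter_rows(file_path: str, delimiter: str = ",") -> Iterator[Dict[str, str]]:
    """Hypothetical helper: yield one row dict at a time instead of building a list."""
    with open(file_path, "r", encoding="utf8", newline="") as f:
        yield from csv.DictReader(f, delimiter=delimiter)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.tsv")
    # write a tiny TSV file to read back
    with open(path, "w", encoding="utf8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["head1", "head2"])
        writer.writerows([[1, 2], [3, 4]])
    # rows are consumed lazily; values are strings and converted on the fly
    total = sum(int(row["head2"]) for row in iter_rows(path, delimiter="\t"))

print(total)  # 6
```

The file stays open only while the generator is being consumed, so the caller should finish iterating (or close the generator) before the file is moved or deleted.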