These are my own study notes. If you spot any mistakes, corrections are welcome~
[python] Converting the SemEval 2014 dataset from XML to CSV and TXT
- Introduction to SemEval2014
- The 4 subtasks
- Data format
- XML to CSV
- XML to TXT
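Introduction to SemEval2014
The ABSA (Aspect Based Sentiment Analysis) task of SemEval2014 targets fine-grained sentiment analysis in NLP: given a sentence, identify the aspects it mentions and the sentiment polarity of each one. The data consists of laptop reviews and restaurant reviews.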
The 4 subtasks
1. Aspect term extraction
2. Aspect term polarity
3. Aspect category detection
4. Aspect category polarity
Both laptop and restaurant data are provided for subtasks 1 and 2, while only restaurant data is provided for subtasks 3 and 4.
Data format
The raw data is in XML format.
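To make the layout concrete, here is a minimal sketch of the structure that the conversion scripts in these notes expect: a root element containing `sentence` elements, each with a `text` child and an optional `aspectCategories` block. The sample sentences and labels are invented for illustration and are not taken from the dataset, and `iter_category_labels` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the sentence texts and labels are made up for illustration.
SAMPLE_XML = """<sentences>
  <sentence>
    <text>The pasta was great but the waiter was rude.</text>
    <aspectCategories>
      <aspectCategory category="food" polarity="positive"/>
      <aspectCategory category="service" polarity="negative"/>
    </aspectCategories>
  </sentence>
  <sentence>
    <text>We went there on a Friday.</text>
  </sentence>
</sentences>"""


def iter_category_labels(xml_string):
    """Yield (text, category, polarity) for every aspectCategory element."""
    root = ET.fromstring(xml_string)
    for sentence in root.findall('sentence'):
        text = sentence.find('text').text
        # The path simply matches nothing when <aspectCategories> is absent
        for cat in sentence.iterfind('aspectCategories/aspectCategory'):
            yield text, cat.get('category'), cat.get('polarity')


rows = list(iter_category_labels(SAMPLE_XML))
for row in rows:
    print(row)
```

One sentence can carry several categories, so it may produce several rows, and a sentence without any category produces none.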
XML to CSV
xml_csv.py
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

import pandas as pd


def xml_csv(listlist):
    # listlist is an index into the two lists below (input XML -> output CSV)
    xml = ['SemEval2014_Restaurants.xml']
    csv_name = ['SemEval2014_Restaurants.csv']
    # Parse the XML file
    tree = ET.parse(xml[listlist])
    root = tree.getroot()
    # Collect all <sentence> elements
    sentences = root.findall('sentence')
    data = []
    for sentence in sentences:
        # Sentence text
        text = sentence.find('text').text
        # A sentence may have no <aspectCategories> child, so check first
        aspect_categories_element = sentence.find('aspectCategories')
        if aspect_categories_element is not None:
            aspect_categories = aspect_categories_element.findall('aspectCategory')
            # Read the category and polarity of each <aspectCategory>
            for aspect_category in aspect_categories:
                category = aspect_category.get('category')
                polarity = aspect_category.get('polarity')
                data.append([text, category, polarity])
    df = pd.DataFrame(data, columns=['text', 'category', 'polarity'])
    # Keep only the three labels mapped below; rows with any other label are dropped
    df = df[df['polarity'].isin(['positive', 'negative', 'neutral'])].copy()
    df['polarity'] = df['polarity'].map(
        {'positive': 1, 'neutral': 0, 'negative': -1})
    df.to_csv(path_or_buf=csv_name[listlist], index=False)


for i in range(1):
    xml_csv(i)
Output file: SemEval2014_Restaurants.csv
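For reference, the same conversion can be sketched with only the standard library, which may be handy when pandas is not installed. This is an illustrative variant rather than the script above: `xml_to_csv_text`, `POLARITY_TO_ID`, and the sample XML are hypothetical, and the result is built in memory instead of being written to disk.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Same label mapping as the pandas script; any other label is dropped
POLARITY_TO_ID = {'positive': 1, 'neutral': 0, 'negative': -1}

# Hypothetical sample: invented sentence and labels, for illustration only.
SAMPLE_XML = """<sentences>
  <sentence>
    <text>Good food, slow service.</text>
    <aspectCategories>
      <aspectCategory category="food" polarity="positive"/>
      <aspectCategory category="service" polarity="negative"/>
      <aspectCategory category="price" polarity="conflict"/>
    </aspectCategories>
  </sentence>
</sentences>"""


def xml_to_csv_text(xml_string):
    """Return CSV text with one row per (sentence, category) pair."""
    root = ET.fromstring(xml_string)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['text', 'category', 'polarity'])
    for sentence in root.findall('sentence'):
        text = sentence.find('text').text
        for cat in sentence.iterfind('aspectCategories/aspectCategory'):
            label = POLARITY_TO_ID.get(cat.get('polarity'))
            if label is None:
                continue  # label outside the mapping, skip the row
            writer.writerow([text, cat.get('category'), label])
    return buffer.getvalue()


csv_text = xml_to_csv_text(SAMPLE_XML)
print(csv_text)
```

The `csv` writer quotes any field containing a comma, so review text with commas stays in a single column.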
XML to TXT
xml_txt.py
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9


def xml_txt(listlist):
    # listlist is an index into the list of input XML files
    xml = ['SemEval2014_Restaurants.xml']
    # Parse the XML file
    tree = ET.parse(xml[listlist])
    root = tree.getroot()
    # Collect all <sentence> elements
    sentences = root.findall('sentence')
    for sentence in sentences:
        # Sentence text
        text = sentence.find('text').text
        # A sentence may have no <aspectCategories> child, so check first
        aspect_categories_element = sentence.find('aspectCategories')
        if aspect_categories_element is not None:
            aspect_categories = aspect_categories_element.findall('aspectCategory')
            # Read the category and polarity of each <aspectCategory>
            for aspect_category in aspect_categories:
                category = aspect_category.get('category')
                polarity = aspect_category.get('polarity')
                if polarity == "negative":
                    polarity = -1
                elif polarity == "positive":
                    polarity = 1
                else:
                    # Every other label (not only "neutral") becomes 0 here,
                    # unlike xml_csv.py, which drops such rows
                    polarity = 0
                # One tab-separated line per label: polarity, category, text
                output_file.write(f"{polarity}\t{category}\t{text}\n")


# Mode 'a' appends, so delete the old .txt before re-running to avoid duplicate lines.
# The with block closes the file once everything has been written.
with open('SemEval2014_Restaurants.txt', 'a', encoding='utf-8') as output_file:
    for i in range(1):
        xml_txt(i)
Output file: SemEval2014_Restaurants.txt
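Since each line of the TXT file holds polarity, category, and text separated by tabs, reading it back could look like the sketch below. `parse_txt_lines` is a hypothetical helper and the sample lines are invented; for the real file, pass an open file object instead of the list.

```python
def parse_txt_lines(lines):
    """Parse 'polarity<TAB>category<TAB>text' lines into tuples."""
    records = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        # maxsplit=2 keeps any tab inside the review text intact
        polarity, category, text = line.split('\t', 2)
        records.append((int(polarity), category, text))
    return records


# Hypothetical sample lines in the format written by xml_txt.py
sample_lines = [
    "1\tfood\tThe pasta was great.\n",
    "-1\tservice\tThe waiter was rude.\n",
]
records = parse_txt_lines(sample_lines)
print(records)
```

With the real output, the call would be something like `parse_txt_lines(open('SemEval2014_Restaurants.txt', encoding='utf-8'))`.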