[Python] Converting the SemEval 2014 dataset from XML to CSV + TXT

news2026/2/15 8:56:56

These are my own study notes; if you spot any mistakes, feel free to point them out~

Introduction to SemEval 2014
The 4 subtasks
Data format
Converting XML to CSV
Converting XML to TXT

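Introduction to SemEval 2014

The ABSA (Aspect Based Sentiment Analysis) task of SemEval 2014 focuses on fine-grained sentiment analysis in NLP: given a sentence, identify the aspects it mentions and the sentiment polarity expressed toward each one. The data consists of laptop reviews and restaurant reviews.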

The 4 subtasks

1. Aspect term extraction
2. Aspect term polarity
3. Aspect category detection
4. Aspect category polarity
For subtasks 1 and 2, both laptop and restaurant data are provided; for subtasks 3 and 4, only restaurant data is provided.

Data format

The raw data is in XML format.
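To make the layout concrete, here is a minimal sketch that parses a small hand-written sample. The sample sentences are made up for illustration. The `sentence`, `text`, `aspectCategories` and `aspectCategory` names match what the scripts in these notes read; the `aspectTerms` block and its attributes (`term`, `from`, `to`) do not appear in those scripts, so treat them as an assumption about the dataset layout.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: it mimics the dataset layout, it is not copied from the real file.
SAMPLE_XML = """<sentences>
  <sentence id="1">
    <text>The staff was friendly but the pasta was bland.</text>
    <aspectTerms>
      <aspectTerm term="staff" polarity="positive" from="4" to="9"/>
      <aspectTerm term="pasta" polarity="negative" from="31" to="36"/>
    </aspectTerms>
    <aspectCategories>
      <aspectCategory category="service" polarity="positive"/>
      <aspectCategory category="food" polarity="negative"/>
    </aspectCategories>
  </sentence>
  <sentence id="2">
    <text>We arrived around noon.</text>
  </sentence>
</sentences>"""

root = ET.fromstring(SAMPLE_XML)
rows = []
for sentence in root.findall("sentence"):
    text = sentence.findtext("text")
    # iter() walks nested elements and simply yields nothing when a block is missing
    terms = [(t.get("term"), t.get("polarity")) for t in sentence.iter("aspectTerm")]
    categories = [(c.get("category"), c.get("polarity")) for c in sentence.iter("aspectCategory")]
    rows.append((text, terms, categories))

for text, terms, categories in rows:
    print(text, terms, categories)
```

The second sample sentence has no annotations at all, which is the case the conversion scripts have to guard against.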

Converting XML to CSV

xml_csv.py

import xml.etree.ElementTree as ET
import pandas as pd

def xml_csv(listlist):
    # Input and output file names; listlist indexes into both lists
    xml = ['SemEval2014_Restaurants.xml']
    csv_name = ['SemEval2014_Restaurants.csv']
    # Parse the XML file
    tree = ET.parse(xml[listlist])
    root = tree.getroot()
    # Collect all <sentence> elements
    sentences = root.findall('sentence')
    # Rows of [text, category, polarity]
    data = []

    # Iterate over every <sentence> element
    for sentence in sentences:
        # Extract the sentence text
        text = sentence.find('text').text

        # Sentences without an <aspectCategories> child are skipped
        aspect_categories_element = sentence.find('aspectCategories')
        if aspect_categories_element is not None:
            # Collect every <aspectCategory> element
            aspect_categories = aspect_categories_element.findall('aspectCategory')

            # Extract the category and polarity of each <aspectCategory>
            for aspect_category in aspect_categories:
                category = aspect_category.get('category')
                polarity = aspect_category.get('polarity')
                data.append([text, category, polarity])

    df = pd.DataFrame(data, columns=['text', 'category', 'polarity'])
    # Keep only the three mapped labels; any other value (e.g. conflict) is dropped.
    # copy() avoids pandas' SettingWithCopyWarning on the assignment below
    df = df[df['polarity'].isin(['positive', 'negative', 'neutral'])].copy()
    df['polarity'] = df['polarity'].map(
        {'positive': 1, 'neutral': 0, 'negative': -1})

    df.to_csv(path_or_buf=csv_name[listlist], index=False)

for i in range(1):
    xml_csv(i)

Output: SemEval2014_Restaurants.csv, with the columns text, category and polarity (1 = positive, 0 = neutral, -1 = negative).
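A quick way to sanity-check the converted CSV is to count how many rows carry each polarity label. The sketch below uses only the standard library and runs on a few made-up rows in the same text, category, polarity layout; `label_counts` and `SAMPLE_CSV` are hypothetical names introduced for this example.

```python
import csv
import io
from collections import Counter

# Made-up rows in the layout xml_csv.py writes (header: text,category,polarity).
SAMPLE_CSV = """text,category,polarity
"The staff was friendly, but slow.",service,1
The pasta was bland.,food,-1
We had dinner there.,anecdotes/miscellaneous,0
Great wine list.,food,1
"""

def label_counts(csv_file):
    """Count rows per polarity label from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["polarity"]) for row in reader)

counts = label_counts(io.StringIO(SAMPLE_CSV))
print(counts)
```

For the real file, the same function should work on a handle opened with `open('SemEval2014_Restaurants.csv', newline='', encoding='utf-8')`.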

Converting XML to TXT

xml_txt.py

import xml.etree.ElementTree as ET

def xml_txt(listlist):
    # Input and output file names; listlist indexes into both lists
    xml = ['SemEval2014_Restaurants.xml']
    txt_name = ['SemEval2014_Restaurants.txt']
    # Parse the XML file
    tree = ET.parse(xml[listlist])
    root = tree.getroot()
    # Collect all <sentence> elements
    sentences = root.findall('sentence')

    # 'w' overwrites on each run, so rerunning the script does not append
    # duplicate lines; the with-block also closes the file when done
    with open(txt_name[listlist], 'w', encoding='utf-8') as output_file:
        # Iterate over every <sentence> element
        for sentence in sentences:
            # Extract the sentence text
            text = sentence.find('text').text

            # Sentences without an <aspectCategories> child are skipped
            aspect_categories_element = sentence.find('aspectCategories')
            if aspect_categories_element is not None:
                # Collect every <aspectCategory> element
                aspect_categories = aspect_categories_element.findall('aspectCategory')

                # Extract the category and polarity of each <aspectCategory>
                for aspect_category in aspect_categories:
                    category = aspect_category.get('category')
                    polarity = aspect_category.get('polarity')
                    if polarity == "negative":
                        polarity = -1
                    elif polarity == "positive":
                        polarity = 1
                    else:
                        # Any other label (neutral, but also e.g. conflict) becomes 0,
                        # unlike xml_csv.py, which drops such rows
                        polarity = 0
                    # One line per aspect category: polarity<TAB>category<TAB>text
                    output_file.write(f"{polarity}\t{category}\t{text}\n")

for i in range(1):
    xml_txt(i)
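Reading the TXT file back is the reverse of the write step: each line holds polarity, category and text separated by tabs. A minimal sketch, using made-up lines in place of the real file; `parse_line` and `SAMPLE_LINES` are hypothetical names for this example.

```python
def parse_line(line):
    """Split one 'polarity<TAB>category<TAB>text' line into typed fields."""
    # maxsplit=2 keeps any tab characters inside the sentence text intact
    polarity, category, text = line.rstrip("\n").split("\t", 2)
    return int(polarity), category, text

# Made-up lines in the format xml_txt.py writes.
SAMPLE_LINES = [
    "1\tservice\tThe staff was friendly.\n",
    "-1\tfood\tThe pasta was bland.\n",
]

records = [parse_line(line) for line in SAMPLE_LINES]
for record in records:
    print(record)
```

With the real output, iterating over a file opened with `encoding='utf-8'` in place of `SAMPLE_LINES` should yield the same kind of tuples.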