Thoughts on Building a Sentiment-Annotated Dataset from Company Forum Data

2025/4/11 5:22:02

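Our company forum has a comment section where colleagues post and chat. Most of what they write is positive and upbeat, but a small share is negative: "like a black dot on a white sheet of paper," as the colleague who liaises with the product team put it. That got me thinking about building a sentiment-annotated dataset so that negative comments can be handled quickly. The company already follows a mature process for this, but I worked through the problem myself as well. From data analysis through to LLMs, common text-processing needs include: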

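1. Entity extraction and entity-relation analysis

2. Text sentiment analysis

3. Text summarization

4. Building word clouds from text

5. Text classification and labeling

And so on (hehe)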

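As I understand it, what the senior folks discussed comes down to either annotating the existing forum data, or building a sentiment-annotated dataset on top of the data HR has already been working with over the past few years, and then using it to analyze future comments. That led to the thoughts below. Feedback is very welcome: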

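Overall Approach

The core workflow for building a sentiment-annotated dataset is: data collection, data cleaning, sentiment annotation, quality control, and dataset splitting. Company forum data usually contains a wide range of user expression, which makes it a good source for a sentiment analysis dataset.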

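Implementation Steps

1. Data collection and initial processing

Step description:

Export the raw data from the company forum API or database
Extract the relevant fields (post content, comments, timestamps, user IDs, and so on)
Remove clearly irrelevant content (ads, forum rules, and so on)

Code example: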

import pandas as pd
import sqlite3

# Export forum posts from a SQLite database
def extract_forum_data(db_path):
    conn = sqlite3.connect(db_path)
    query = """
    SELECT post_id, user_id, content, timestamp, likes 
    FROM forum_posts 
    WHERE is_deleted = 0 AND is_ad = 0
    """
    try:
        df = pd.read_sql(query, conn)
    finally:
        # Close the connection even if the query fails
        conn.close()
    return df

# Example usage
forum_data = extract_forum_data('company_forum.db')
print(forum_data.head())
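
The query above relies on `is_deleted` and `is_ad` flags already existing in the table. Where such flags are missing, a rough keyword filter is one possible fallback for removing ads and rule notices. The sketch below is only an illustration: the keyword list, the `min_length` cutoff, and the function name `is_irrelevant` are all hypothetical and would need tuning against real forum data.

```python
# Hypothetical keyword list; replace with terms seen in your own forum data.
IRRELEVANT_KEYWORDS = ("forum rules", "advertisement", "promo code")

def is_irrelevant(content, keywords=IRRELEVANT_KEYWORDS, min_length=2):
    """Return True for posts that look like ads, rule notices, or empty noise."""
    if not isinstance(content, str):
        return True  # missing content (None/NaN) cannot be labeled
    text = content.strip().lower()
    if len(text) < min_length:
        return True  # too short to carry any sentiment
    return any(keyword in text for keyword in keywords)

posts = [
    "Please read the forum rules before posting",
    "The new canteen menu is great",
    "",
    None,
]
kept = [post for post in posts if not is_irrelevant(post)]
print(kept)
```

With a DataFrame, the same function could be applied as `forum_data[~forum_data['content'].apply(is_irrelevant)]`.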

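2. Data cleaning and preprocessing

Step description:

Remove HTML tags and special characters
Handle abbreviations and spelling errors
Tokenize and tag parts of speech
Remove stop words

Code example: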

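import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load this resource in word_tokenize
nltk.download('stopwords')

# Build the stop-word set once instead of on every call.
# Note: this is NLTK's English list; Chinese posts need their own
# tokenizer and stop-word list.
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove special characters and extra whitespace
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def preprocess_text(text):
    # Treat missing content (None/NaN) as an empty string
    if not isinstance(text, str):
        return ''
    text = clean_text(text)
    # Tokenize
    tokens = word_tokenize(text.lower())
    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)

# Apply preprocessing
forum_data['cleaned_content'] = forum_data['content'].apply(preprocess_text)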
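
The preprocessing above relies on NLTK's English tokenizer and stop-word list, so it only suits English posts. If the forum is mostly Chinese (an assumption; the post does not say which language the comments are in), a lighter cleaning pass that keeps the Chinese characters and leaves word segmentation to a dedicated tool may be a better starting point. The sketch below uses only the standard library; the function name `normalize_comment` and the sample comment are made up for illustration.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r'<[^>]+>')
URL_RE = re.compile(r'https?://\S+')
SPACE_RE = re.compile(r'\s+')

def normalize_comment(text):
    """Light cleaning that keeps Chinese characters intact."""
    text = html.unescape(text)                  # decode entities such as &amp; and &nbsp;
    text = TAG_RE.sub(' ', text)                # drop HTML tags (before URLs, so a closing tag is not swallowed)
    text = URL_RE.sub(' ', text)                # drop links
    text = unicodedata.normalize('NFKC', text)  # fold full-width forms to half-width
    return SPACE_RE.sub(' ', text).strip()      # collapse runs of whitespace

cleaned = normalize_comment('<p>这个功能太差了！！&nbsp;详见 http://example.com/a</p>')
print(cleaned)
```

Punctuation is deliberately kept here, since repeated exclamation marks and similar cues can carry sentiment; whether to strip them is a judgment call for the annotation guidelines.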

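3. Sentiment annotation strategy

Choosing an annotation method:

Manual annotation: the most accurate, but costly
Semi-automatic annotation: combines rules with manual verification
Automatic annotation: uses an existing sentiment lexicon or a pretrained model for a first pass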

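Given the business context, I later learned that colleagues in HR do in fact annotate and handle existing comments (especially the negative ones), so manual annotation is a realistic option. Still, I am listing the semi-automatic approach here as well. I am not sure I have it right, so feedback is welcome.

Code example (semi-automatic annotation):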

from textblob import TextBlob

def auto_sentiment_label(text):
    # Note: TextBlob's default sentiment analyzer is built for English text
    analysis = TextBlob(text)
    # TextBlob returns a polarity score in the range [-1, 1]
    if analysis.sentiment.polarity > 0.1:
        return 'positive'
    elif analysis.sentiment.polarity < -0.1:
        return 'negative'
    else:
        return 'neutral'

# Automatic first-pass labeling
forum_data['auto_label'] = forum_data['cleaned_content'].apply(auto_sentiment_label)

# Draw a 10% sample for manual verification
sample_for_review = forum_data.sample(frac=0.1, random_state=42).copy()
sample_for_review['manual_label'] = None  # to be filled in by a human reviewer
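
The lexicon-based option from the list of annotation methods can be sketched without any third-party library. The example below is a minimal illustration, not a real sentiment dictionary: the two word sets, the `margin` threshold, and the function name `lexicon_label` are all assumptions. Besides a label it returns a flag for comments worth routing to a human, which is one way to combine rules with manual verification.

```python
# Tiny illustrative lexicons (assumptions); a real project would load a curated dictionary.
POSITIVE_WORDS = {"great", "helpful", "love", "thanks"}
NEGATIVE_WORDS = {"terrible", "slow", "hate", "broken"}

def lexicon_label(text, margin=1):
    """Return (label, needs_review) based on counts of lexicon hits."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    score = pos - neg
    if score >= margin:
        label = "positive"
    elif score <= -margin:
        label = "negative"
    else:
        label = "neutral"
    # Mixed signals or no lexicon hits at all are the cases most worth a human look.
    needs_review = (pos > 0 and neg > 0) or (pos + neg == 0)
    return label, needs_review

comments = [
    "love the new portal thanks",
    "login is slow and broken",
    "great idea but terrible rollout",
    "meeting moved to friday",
]
for comment in comments:
    print(comment, "->", lexicon_label(comment))
```

Whitespace splitting only works for space-delimited languages; Chinese comments would need a word segmenter before the lookup.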

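4. Quality control and annotation consistency

Step description:

Compute inter-annotator agreement (for example, Cohen's Kappa)
Resolve annotation disagreements
Establish annotation guidelines

Code example: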

from sklearn.metrics import cohen_kappa_score

# Suppose we have results from three annotators
annotator1 = ['positive', 'negative', 'neutral', 'positive']
annotator2 = ['positive', 'neutral', 'neutral', 'positive']
annotator3 = ['positive', 'negative', 'negative', 'positive']

# Cohen's Kappa is defined for two annotators, so compute it pairwise
print(f"Annotator 1 & 2: {cohen_kappa_score(annotator1, annotator2)}")
print(f"Annotator 1 & 3: {cohen_kappa_score(annotator1, annotator3)}")
print(f"Annotator 2 & 3: {cohen_kappa_score(annotator2, annotator3)}")
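
To make the agreement score less of a black box, and to illustrate one way of resolving disagreements, here is a standard-library sketch. `cohen_kappa` follows the usual definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. `resolve_by_majority` is a hypothetical helper that takes a strict majority vote and leaves ties as `None` for discussion; majority voting is my assumption here, not a process described by the company.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1:
        return float('nan')  # undefined when both annotators use a single identical label
    return (p_o - p_e) / (1 - p_e)

def resolve_by_majority(*annotations):
    """Pick the strict-majority label per item; None marks items needing discussion."""
    resolved = []
    for votes in zip(*annotations):
        label, count = Counter(votes).most_common(1)[0]
        resolved.append(label if count > len(votes) / 2 else None)
    return resolved

annotator1 = ['positive', 'negative', 'neutral', 'positive']
annotator2 = ['positive', 'neutral', 'neutral', 'positive']
annotator3 = ['positive', 'negative', 'negative', 'positive']

print(cohen_kappa(annotator1, annotator2))
print(resolve_by_majority(annotator1, annotator2, annotator3))
```

For annotators 1 and 2, p_o = 3/4 and p_e = 6/16, which gives a kappa of 0.6. Items that come back as `None` are natural candidates for refining the annotation guidelines.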

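5. Dataset splitting and balancing

Step description:

Split the data into training, validation, and test sets by proportion
Handle class imbalance

Code example: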

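from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Assume a DataFrame that already has a final, agreed label per row
labeled_data = forum_data.dropna(subset=['final_label'])

# Hold out 20% of the data as the test set
train_val_df, test_df = train_test_split(
    labeled_data, 
    test_size=0.2, 
    random_state=42,
    stratify=labeled_data['final_label']  # preserve class proportions
)

# Carve a validation set out of the remainder (0.125 * 0.8 = 10% of all data)
train_df, val_df = train_test_split(
    train_val_df,
    test_size=0.125,
    random_state=42,
    stratify=train_val_df['final_label']
)

# Handle class imbalance (optional); oversample the training set only
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(
    train_df[['cleaned_content']], 
    train_df['final_label']
)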
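
Random oversampling, as used above, duplicates minority-class texts verbatim, which can encourage a model to memorize them. Class weights are a common alternative that leaves the data untouched. The sketch below computes inverse-frequency weights with the standard library; the formula n_samples / (n_classes * class_count) is the same heuristic scikit-learn applies for `class_weight='balanced'`. The function name `balanced_class_weights` and the toy label counts are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy distribution resembling a mostly positive forum
labels = ["positive"] * 6 + ["neutral"] * 3 + ["negative"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

The rare negative class ends up with the largest weight, so a mistake on a negative comment costs more during training than a mistake on a positive one.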

6. Saving the dataset and writing documentation

Step description:

Save in a standard format (CSV/JSON)
Write dataset documentation (README)

Code example:

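# Save the (oversampled) training set
final_dataset = pd.DataFrame({
    'text': X_resampled['cleaned_content'],
    'label': y_resampled
})

final_dataset.to_csv('company_forum_sentiment_dataset.csv', index=False)

# Save the test set with the same column names as the training file
test_df[['cleaned_content', 'final_label']].rename(
    columns={'cleaned_content': 'text', 'final_label': 'label'}
).to_csv(
    'company_forum_sentiment_test.csv', 
    index=False
)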
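
The README step has no code above, so here is one possible sketch. It renders a short plain-text description with the label distribution, which is usually the first thing a dataset user wants to know. The function name `build_readme`, its parameters, and the sample values are all hypothetical; only the standard library is used.

```python
from collections import Counter

def build_readme(name, labels, source, notes=""):
    """Render a short plain-text README describing a labeled dataset."""
    counts = Counter(labels)
    total = len(labels)
    lines = [
        name,
        "",
        f"Source: {source}",
        f"Total samples: {total}",
        "",
        "Label distribution:",
    ]
    # Sort labels so the output is stable from run to run
    for label, count in sorted(counts.items()):
        lines.append(f"- {label}: {count} ({count / total:.1%})")
    if notes:
        lines += ["", "Notes:", notes]
    return "\n".join(lines) + "\n"

readme = build_readme(
    "company_forum_sentiment_dataset",
    ["positive", "positive", "neutral", "negative"],
    source="internal company forum export",
    notes="The training split was randomly oversampled; the test split was not.",
)
print(readme)
```

The returned string can be written next to the CSV files, for example with `open('README.txt', 'w', encoding='utf-8')`. Recording whether a split was oversampled matters, because it changes how the label distribution should be read.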