Contents
- Classification
- Text Classification Tasks
- Topic Classification
- Sentiment Analysis
- Native-Language Identification
- Natural Language Inference
- Building a Text Classifier
- Choosing a Classification Algorithm
- Naive Bayes
- Logistic Regression
- Support Vector Machines
- K-Nearest Neighbor
- Decision Tree
- Random Forests
- Neural Networks
- Hyperparameter Tuning
- Confusion Matrix
- Evaluation Metrics
Fundamentals of Classification
Classification
- Input:
  - A document d: often represented as a vector of features
  - A fixed output set of classes C = {c1, c2, …, ck}: categorical, not continuous or ordinal
- Output:
  - A predicted class c ∈ C
Text Classification Tasks
- Some common examples:
  - Topic classification
  - Sentiment analysis
  - Native-language identification
  - Natural language inference
  - Automatic fact-checking
  - Paraphrase detection
- The input may not be a long document
Topic Classification
- Motivation: library science, information retrieval
- Classes: topic categories, e.g. "jobs", "international news"
- Features (see the bag-of-words sketch below):
  - Unigram bag-of-words, with stop words removed
  - Longer n-grams for phrases
- Examples of corpora:
  - Reuters news corpus, e.g. RCV1 or the Reuters corpus in NLTK
  - PubMed abstracts
  - Tweets with hashtags
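As a concrete illustration of these features, here is a minimal sketch using scikit-learn's `CountVectorizer` (the two example documents are made up): a unigram bag-of-words with English stop words removed, and a variant that also keeps bigrams as short phrases.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for a topic-classification corpus (hypothetical).
docs = [
    "new jobs report shows a hiring surge in the tech sector",
    "international news: trade talks resume between the two countries",
]

# Unigram bag-of-words with English stop words removed.
unigram_vec = CountVectorizer(stop_words="english")
X_uni = unigram_vec.fit_transform(docs)
print(unigram_vec.get_feature_names_out())

# Unigrams plus bigrams, to capture short phrases.
bigram_vec = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X_bi = bigram_vec.fit_transform(docs)
print(X_bi.shape)  # (number of documents, vocabulary size)
```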
Sentiment Analysis
- Motivation: opinion mining, business analytics
- Classes: Positive/Negative/(Neutral)
- Features:
  - N-grams
  - Polarity lexicons
- Examples of corpora:
  - Movie review dataset in NLTK
  - SemEval Twitter polarity datasets
Native-Language Identification
- Motivation: forensic linguistics, educational applications
- Classes: first language of the author
- Features:
  - Word N-grams
  - Syntactic patterns (POS, parse trees)
  - Phonological features
- Examples of corpora:
  - TOEFL/IELTS essay corpora
Natural Language Inference
- Also called textual entailment
- Motivation: language understanding
- Classes: entailment, contradiction, neutral
- Features:
  - Word overlap
  - Length difference between the sentences
  - N-grams
- Examples of corpora:
  - SNLI, MNLI
Building a Text Classifier
- Identify a task of interest
- Collect an appropriate corpus
- Carry out annotation
- Select features
- Choose a machine learning algorithm
- Train the model and tune hyperparameters using hold-out development data
- Repeat earlier steps as needed
- Train the final model
- Evaluate the model on hold-out test data (a minimal end-to-end sketch of this workflow follows the list)
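A minimal end-to-end sketch of this workflow, assuming scikit-learn and a toy annotated corpus (the documents, labels, classifier choice, and split sizes below are made up): it holds out development and test data, trains a bag-of-words model, and evaluates once on the test split; in practice the hyperparameters would be tuned against the development split before the final evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy annotated corpus (hypothetical labels: 1 = positive, 0 = negative).
docs = ["great movie", "terrible plot", "loved the acting", "boring and slow",
        "wonderful film", "awful script", "fantastic scenes", "dreadful pacing"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out development and test data.
X_train, X_rest, y_train, y_rest = train_test_split(
    docs, labels, test_size=0.5, random_state=0, stratify=labels)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

# Feature extraction + classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Tune hyperparameters against X_dev/y_dev, retrain, then evaluate once on the test set.
print("dev accuracy: ", accuracy_score(y_dev, model.predict(X_dev)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```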
Algorithms for Classification
Choosing a Classification Algorithm
- Bias vs. variance
  - Bias: assumptions made in the model
  - Variance: sensitivity to the training set
- Underlying assumptions, e.g. feature independence
- Complexity
- Speed
Naive Bayes
- Finds the class with the highest likelihood under Bayes' law (see the sketch after this list):
  - The probability of the class times the probability of the features given the class
- Naively assumes features are independent
- Pros:
  - Fast to train and classify
  - Robust, low-variance -> good for low-data situations
  - Optimal classifier if the independence assumption is correct
  - Extremely simple to implement
- Cons:
  - The independence assumption rarely holds
  - Low accuracy compared to similar methods in most situations
  - Smoothing required for unseen class/feature combinations
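A minimal sketch of the Naive Bayes decision rule on toy counts (the tiny training set below is made up): each class is scored by its log prior plus the sum of log feature likelihoods, with add-one (Laplace) smoothing so that unseen class/feature combinations do not zero out the probability.

```python
import math
from collections import Counter, defaultdict

# Toy training data: (tokens, class) pairs -- purely illustrative.
train = [
    (["good", "fun", "good"], "pos"),
    (["bad", "boring"], "neg"),
    (["good", "great"], "pos"),
    (["bad", "awful", "boring"], "neg"),
]

class_counts = Counter(c for _, c in train)
word_counts = defaultdict(Counter)   # word_counts[c][w] = count of word w in class c
for tokens, c in train:
    word_counts[c].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def predict(tokens):
    n_docs = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c in class_counts:
        # log P(c) + sum of log P(w | c), with add-one smoothing
        score = math.log(class_counts[c] / n_docs)
        total = sum(word_counts[c].values())
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(["good", "boring"]))  # returns the higher-scoring class
```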
Logistic Regression
- A classifier, despite being called "regression"
- A linear model, but uses the softmax function to squash scores into valid probabilities
- Training maximizes the probability of the training data, subject to regularization that encourages low or sparse weights (see the sketch below)
- Pros:
  - Unlike Naive Bayes, it is not confounded by diverse, correlated features, which gives better performance
- Cons:
  - Slow to train
  - Feature scaling needed
  - Requires a lot of data to work well in practice
  - Choosing the regularization strategy is important, since overfitting is a big problem
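A minimal scikit-learn sketch on made-up numeric features: `C` is the inverse regularization strength (smaller = stronger regularization), the penalty chooses between low (L2) and sparse (L1) weights, and the features are scaled first because logistic regression is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and labels (hypothetical).
X = np.array([[1.0, 200.0], [2.0, 180.0], [8.0, 20.0], [9.0, 10.0]])
y = np.array([0, 0, 1, 1])

# Scale features, then fit a regularized logistic regression.
# penalty="l2" encourages low weights; "l1" (with a compatible solver) encourages sparse weights.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, penalty="l2"))
model.fit(X, y)

print(model.predict(X))
print(model.predict_proba(X[:1]))  # valid class probabilities after squashing
```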
Support Vector Machines
- Finds the hyperplane that separates the training data with maximum margin (see the sketch below)
- Pros:
  - Fast and accurate linear classifier
  - Can handle non-linearity with the kernel trick
  - Works well with huge feature sets
- Cons:
  - Multiclass classification is awkward
  - Feature scaling needed
  - Deals poorly with class imbalance
  - Poor interpretability
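A minimal scikit-learn sketch on made-up 2-D data: `LinearSVC` is the fast linear max-margin classifier, `SVC` with an RBF kernel is the kernel-trick variant for non-linear boundaries, and `class_weight="balanced"` is one way to compensate for class imbalance.

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# Toy 2-D feature vectors and labels (hypothetical).
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0], [0.5, 1.5], [1.5, 0.4]])
y = np.array([0, 0, 1, 1, 0, 1])

# Linear maximum-margin classifier; class_weight helps with imbalanced classes.
linear_clf = LinearSVC(C=1.0, class_weight="balanced")
linear_clf.fit(X, y)

# Kernel trick: an RBF kernel gives a non-linear decision boundary.
rbf_clf = SVC(kernel="rbf", C=1.0, gamma="scale")
rbf_clf.fit(X, y)

print(linear_clf.predict(X))
print(rbf_clf.predict(X))
```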
K-Nearest Neighbor
- Classifies based on the majority class of the k nearest training examples in feature space (see the sketch below)
- The definition of "nearest" can vary:
  - Euclidean distance
  - Cosine distance
- Pros:
  - Simple but surprisingly effective
  - No training required
  - Inherently multiclass
  - Optimal classifier with infinite data
- Cons:
  - Have to select k
  - Issues with imbalanced classes
  - Often slow, since finding the nearest neighbors can be expensive
  - Features must be selected carefully
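A minimal scikit-learn sketch on made-up vectors: k is chosen explicitly, and the distance metric can be switched between Euclidean and cosine.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy feature vectors and classes (hypothetical).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]])
y = np.array([0, 0, 1, 1, 0])

# k must be chosen by the user; metric can be "euclidean", "cosine", etc.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, y)  # "training" just stores the examples

print(knn.predict(np.array([[0.2, 0.8]])))  # majority class among the 3 nearest neighbors
```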
Decision Tree
- Constructs a tree whose nodes correspond to tests on individual features
- Leaves are final class decisions
- Based on greedy maximization of mutual information (information gain; see the sketch below)
- Pros:
  - Fast to build and test
  - Feature scaling irrelevant
  - Good for small feature sets
  - Handles non-linearly-separable problems
- Cons:
  - In practice, not very interpretable
  - Highly redundant sub-trees
  - Not competitive for large feature sets
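A minimal scikit-learn sketch on made-up data: `criterion="entropy"` makes each split greedily maximize information gain, matching the mutual-information view above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy binary features and labels (hypothetical).
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

# Internal nodes test single features; splits are chosen greedily by information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)

print(export_text(tree))       # the learned feature tests and leaf decisions
print(tree.predict([[1, 0]]))
```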
Random Forests
- An ensemble classifier
- Consists of decision trees trained on different subsets of the training data and of the feature space
- The final class decision is a majority vote of the sub-classifiers (see the sketch below)
- Pros:
  - Usually more accurate and more robust than decision trees
  - A great classifier for medium-sized feature sets
  - Training is easily parallelized
- Cons:
  - Poor interpretability
  - Slow with large feature sets
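A minimal scikit-learn sketch on made-up data: each of the `n_estimators` trees is trained on a bootstrap sample of the training data and considers a random subset of features at each split (`max_features`), the forest predicts by majority vote, and `n_jobs=-1` parallelizes training across cores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy features and labels (hypothetical).
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# 100 trees, each on a bootstrap sample with a random feature subset per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                n_jobs=-1, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:5]))  # majority vote over the individual trees
print(forest.score(X, y))     # accuracy on the (training) data
```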
Neural Networks
- An interconnected set of nodes, typically arranged in layers
- Input layer (features), output layer (class probabilities), and one or more hidden layers
- Each node performs a linear weighting of its inputs from the previous layer and passes the result through an activation function to the nodes in the next layer (see the forward-pass sketch below)
- Pros:
  - Extremely powerful; the dominant method in NLP and computer vision
  - Little feature engineering needed
- Cons:
  - Not an off-the-shelf classifier
  - Many hyperparameters, difficult to optimize
  - Slow to train
  - Prone to overfitting
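A minimal NumPy sketch of a single forward pass through a one-hidden-layer network (the layer sizes and random weights are placeholders, not a trained model): each layer is a linear weighting of the previous layer's outputs followed by an activation function, and the output layer is squashed with softmax to give class probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 10 input features, 8 hidden units, 3 classes.
W1, b1 = rng.normal(size=(10, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x):
    h = relu(x @ W1 + b1)        # hidden layer: linear weighting + activation
    return softmax(h @ W2 + b2)  # output layer: class probabilities

x = rng.normal(size=10)          # one document's feature vector (made up)
probs = forward(x)
print(probs, probs.argmax())     # predicted class = highest-probability output
```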
Hyperparameter Tuning
- Dataset for tuning:
  - A development set
  - Not the training set or the test set
  - Alternatively, k-fold cross-validation
- The specific hyperparameters are classifier-specific, but many relate to regularization
  - Regularization hyperparameters penalize model complexity
  - Used to prevent overfitting
- For multiple hyperparameters, use grid search (see the sketch below)
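A minimal sketch of grid search with k-fold cross-validation in scikit-learn (the data and the grid over the regularization strength `C` are made up): each setting is scored by cross-validation on the training data, never on the test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy features and labels (hypothetical).
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] > 0.5).astype(int)

# Grid over a regularization hyperparameter, scored with 5-fold cross-validation.
grid = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```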
Evaluation
Confusion Matrix

| Actual class | Classified as A | Classified as B |
|---|---|---|
| A | True Positive | False Negative |
| B | False Positive | True Negative |
Evaluation Metrics
- Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives)
- Precision = True Positives / (True Positives + False Positives)
- Recall = True Positives / (True Positives + False Negatives)
- F1-score = (2 * precision * recall) / (precision + recall)
- Macroaverage: average the F-scores across classes
- Microaverage: calculate the F-score from the summed counts over all classes (see the sketch below)
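A minimal sketch of these metrics computed directly from the counts and cross-checked with scikit-learn (the gold and predicted labels below are made up, with class A treated as positive); `average="macro"` averages the per-class F-scores, while `average="micro"` computes one F-score from the summed counts.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy gold and predicted labels (hypothetical); class "A" is the positive class.
y_true = ["A", "A", "A", "B", "B", "B", "B", "A"]
y_pred = ["A", "A", "B", "B", "B", "A", "B", "A"]

tp = sum(t == "A" and p == "A" for t, p in zip(y_true, y_pred))
fp = sum(t == "B" and p == "A" for t, p in zip(y_true, y_pred))
fn = sum(t == "A" and p == "B" for t, p in zip(y_true, y_pred))
tn = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)

print(confusion_matrix(y_true, y_pred, labels=["A", "B"]))
print(f1_score(y_true, y_pred, pos_label="A", average="binary"))  # matches f1 above
print(f1_score(y_true, y_pred, average="macro"))  # mean of per-class F-scores
print(f1_score(y_true, y_pred, average="micro"))  # F-score from summed counts
```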