Contents
- Classification
- Text Classification Tasks
- Topic Classification
- Sentiment Analysis
- Native-Language Identification
- Natural Language Inference
- Building a Text Classifier
- Choosing a Classification Algorithm
- Naive Bayes
- Logistic Regression
- Support Vector Machines
- K-Nearest Neighbor
- Decision Tree
- Random Forests
- Neural Networks
- Hyperparameter Tuning
- Confusion Matrix
- Evaluation Metrics
Fundamentals of Classification
Classification
- Input:
  - A document d: often represented as a vector of features
  - A fixed output set of classes C = {c1, c2, …, ck}: categorical, not continuous or ordinal
- Output:
  - A predicted class c ∈ C
Text Classification Tasks
- Some common examples:
  - Topic classification
  - Sentiment analysis
  - Native-language identification
  - Natural language inference
  - Automatic fact-checking
  - Paraphrase detection
- The input may not be a long document
Topic Classification
- Motivation: library science, information retrieval
- Classes: topic categories, e.g. "jobs", "international news"
- Features (see the bag-of-words sketch below):
  - Unigram bag-of-words, with stop words removed
  - Longer n-grams for phrases
- Examples of corpora:
  - Reuters news corpus, e.g. RCV1 or the Reuters corpus in NLTK
  - PubMed abstracts
  - Tweets with hashtags
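As a concrete illustration of these features, here is a minimal sketch using scikit-learn's `CountVectorizer` (the two example documents are made up): a unigram bag-of-words with English stop words removed, and a variant that also keeps bigrams as short phrases.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for a topic-classification corpus (hypothetical).
docs = [
    "new jobs report shows a hiring surge in the tech sector",
    "international news: trade talks resume between the two countries",
]

# Unigram bag-of-words with English stop words removed.
unigram_vec = CountVectorizer(stop_words="english")
X_uni = unigram_vec.fit_transform(docs)
print(unigram_vec.get_feature_names_out())

# Unigrams plus bigrams, to capture short phrases.
bigram_vec = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X_bi = bigram_vec.fit_transform(docs)
print(X_bi.shape)  # (number of documents, vocabulary size)
```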
Sentiment Analysis
- Motivation: opinion mining, business analytics
- Classes: Positive/Negative/(Neutral)
- Features:
  - N-grams
  - Polarity lexicons
- Examples of corpora:
  - Movie review dataset in NLTK
  - SemEval Twitter polarity datasets
Native-Language Identification
- Motivation: forensic linguistics, educational applications
- Classes: first language of the author
- Features:
  - Word N-grams
  - Syntactic patterns (POS, parse trees)
  - Phonological features
- Examples of corpora:
  - TOEFL/IELTS essay corpora
Natural Language Inference
- Also called textual entailment
- Motivation: language understanding
- Classes: entailment, contradiction, neutral
- Features:
  - Word overlap
  - Length difference between the sentences
  - N-grams
- Examples of corpora:
  - SNLI, MNLI
Building a Text Classifier
- Identify a task of interest
- Collect an appropriate corpus
- Carry out annotation
- Select features
- Choose a machine learning algorithm
- Train the model and tune hyperparameters using hold-out development data
- Repeat earlier steps as needed
- Train the final model
- Evaluate the model on hold-out test data (a minimal end-to-end sketch of this workflow follows the list)
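A minimal end-to-end sketch of this workflow, assuming scikit-learn and a toy annotated corpus (the documents, labels, classifier choice, and split sizes below are made up): it holds out development and test data, trains a bag-of-words model, and evaluates once on the test split; in practice the hyperparameters would be tuned against the development split before the final evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy annotated corpus (hypothetical labels: 1 = positive, 0 = negative).
docs = ["great movie", "terrible plot", "loved the acting", "boring and slow",
        "wonderful film", "awful script", "fantastic scenes", "dreadful pacing"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out development and test data.
X_train, X_rest, y_train, y_rest = train_test_split(
    docs, labels, test_size=0.5, random_state=0, stratify=labels)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

# Feature extraction + classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Tune hyperparameters against X_dev/y_dev, retrain, then evaluate once on the test set.
print("dev accuracy: ", accuracy_score(y_dev, model.predict(X_dev)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```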
Algorithms for Classification
Choosing a Classification Algorithm
- Bias vs. variance
  - Bias: assumptions made in the model
  - Variance: sensitivity to the training set
- Underlying assumptions, e.g. feature independence
- Complexity
- Speed
Naive Bayes
- Finds the class with the highest likelihood under Bayes' law (see the sketch after this list):
  - The probability of the class times the probability of the features given the class
- Naively assumes features are independent
- Pros:
  - Fast to train and classify
  - Robust, low-variance -> good for low-data situations
  - Optimal classifier if the independence assumption is correct
  - Extremely simple to implement
- Cons:
  - The independence assumption rarely holds
  - Low accuracy compared to similar methods in most situations
  - Smoothing required for unseen class/feature combinations
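A minimal sketch of the Naive Bayes decision rule on toy counts (the tiny training set below is made up): each class is scored by its log prior plus the sum of log feature likelihoods, with add-one (Laplace) smoothing so that unseen class/feature combinations do not zero out the probability.

```python
import math
from collections import Counter, defaultdict

# Toy training data: (tokens, class) pairs -- purely illustrative.
train = [
    (["good", "fun", "good"], "pos"),
    (["bad", "boring"], "neg"),
    (["good", "great"], "pos"),
    (["bad", "awful", "boring"], "neg"),
]

class_counts = Counter(c for _, c in train)
word_counts = defaultdict(Counter)   # word_counts[c][w] = count of word w in class c
for tokens, c in train:
    word_counts[c].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def predict(tokens):
    n_docs = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c in class_counts:
        # log P(c) + sum of log P(w | c), with add-one smoothing
        score = math.log(class_counts[c] / n_docs)
        total = sum(word_counts[c].values())
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(["good", "boring"]))  # returns the higher-scoring class
```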
Logistic Regression
- A classifier, despite being called "regression"
- A linear model, but uses the softmax function to squash scores into valid probabilities
- Training maximizes the probability of the training data, subject to regularization that encourages low or sparse weights (see the sketch below)
- Pros:
  - Unlike Naive Bayes, it is not confounded by diverse, correlated features, which gives better performance
- Cons:
  - Slow to train
  - Feature scaling needed
  - Requires a lot of data to work well in practice
  - Choosing the regularization strategy is important, since overfitting is a big problem
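A minimal scikit-learn sketch on made-up numeric features: `C` is the inverse regularization strength (smaller = stronger regularization), the penalty chooses between low (L2) and sparse (L1) weights, and the features are scaled first because logistic regression is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and labels (hypothetical).
X = np.array([[1.0, 200.0], [2.0, 180.0], [8.0, 20.0], [9.0, 10.0]])
y = np.array([0, 0, 1, 1])

# Scale features, then fit a regularized logistic regression.
# penalty="l2" encourages low weights; "l1" (with a compatible solver) encourages sparse weights.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, penalty="l2"))
model.fit(X, y)

print(model.predict(X))
print(model.predict_proba(X[:1]))  # valid class probabilities after squashing
```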
Support Vector Machines
- Finds the hyperplane that separates the training data with maximum margin (see the sketch below)
- Pros:
  - Fast and accurate linear classifier
  - Can handle non-linearity with the kernel trick
  - Works well with huge feature sets
- Cons:
  - Multiclass classification is awkward
  - Feature scaling needed
  - Deals poorly with class imbalance
  - Poor interpretability
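A minimal scikit-learn sketch on made-up 2-D data: `LinearSVC` is the fast linear max-margin classifier, `SVC` with an RBF kernel is the kernel-trick variant for non-linear boundaries, and `class_weight="balanced"` is one way to compensate for class imbalance.

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# Toy 2-D feature vectors and labels (hypothetical).
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0], [0.5, 1.5], [1.5, 0.4]])
y = np.array([0, 0, 1, 1, 0, 1])

# Linear maximum-margin classifier; class_weight helps with imbalanced classes.
linear_clf = LinearSVC(C=1.0, class_weight="balanced")
linear_clf.fit(X, y)

# Kernel trick: an RBF kernel gives a non-linear decision boundary.
rbf_clf = SVC(kernel="rbf", C=1.0, gamma="scale")
rbf_clf.fit(X, y)

print(linear_clf.predict(X))
print(rbf_clf.predict(X))
```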
K-Nearest Neighbor
- Classifies based on the majority class of the k nearest training examples in feature space (see the sketch below)
- The definition of "nearest" can vary:
  - Euclidean distance
  - Cosine distance
- Pros:
  - Simple but surprisingly effective
  - No training required
  - Inherently multiclass
  - Optimal classifier with infinite data
- Cons:
  - Have to select k
  - Issues with imbalanced classes
  - Often slow, since finding the nearest neighbors can be expensive
  - Features must be selected carefully
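A minimal scikit-learn sketch on made-up vectors: k is chosen explicitly, and the distance metric can be switched between Euclidean and cosine.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy feature vectors and classes (hypothetical).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]])
y = np.array([0, 0, 1, 1, 0])

# k must be chosen by the user; metric can be "euclidean", "cosine", etc.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, y)  # "training" just stores the examples

print(knn.predict(np.array([[0.2, 0.8]])))  # majority class among the 3 nearest neighbors
```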
Decision Tree
- Constructs a tree whose nodes correspond to tests on individual features
- Leaves are final class decisions
- Based on greedy maximization of mutual information (information gain; see the sketch below)
- Pros:
  - Fast to build and test
  - Feature scaling irrelevant
  - Good for small feature sets
  - Handles non-linearly-separable problems
- Cons:
  - In practice, not very interpretable
  - Highly redundant sub-trees
  - Not competitive for large feature sets
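A minimal scikit-learn sketch on made-up data: `criterion="entropy"` makes each split greedily maximize information gain, matching the mutual-information view above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy binary features and labels (hypothetical).
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

# Internal nodes test single features; splits are chosen greedily by information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)

print(export_text(tree))       # the learned feature tests and leaf decisions
print(tree.predict([[1, 0]]))
```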
Random Forests
- An ensemble classifier
- Consists of decision trees trained on different subsets of the training data and of the feature space
- The final class decision is a majority vote of the sub-classifiers (see the sketch below)
- Pros:
  - Usually more accurate and more robust than decision trees
  - A great classifier for medium-sized feature sets
  - Training is easily parallelized
- Cons:
  - Poor interpretability
  - Slow with large feature sets
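A minimal scikit-learn sketch on made-up data: each of the `n_estimators` trees is trained on a bootstrap sample of the training data and considers a random subset of features at each split (`max_features`), the forest predicts by majority vote, and `n_jobs=-1` parallelizes training across cores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy features and labels (hypothetical).
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# 100 trees, each on a bootstrap sample with a random feature subset per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                n_jobs=-1, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:5]))  # majority vote over the individual trees
print(forest.score(X, y))     # accuracy on the (training) data
```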
Neural Networks
- An interconnected set of nodes, typically arranged in layers
- Input layer (features), output layer (class probabilities), and one or more hidden layers
- Each node performs a linear weighting of its inputs from the previous layer and passes the result through an activation function to the nodes in the next layer (see the forward-pass sketch below)
- Pros:
  - Extremely powerful; the dominant method in NLP and computer vision
  - Little feature engineering needed
- Cons:
  - Not an off-the-shelf classifier
  - Many hyperparameters, difficult to optimize
  - Slow to train
  - Prone to overfitting
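A minimal NumPy sketch of a single forward pass through a one-hidden-layer network (the layer sizes and random weights are placeholders, not a trained model): each layer is a linear weighting of the previous layer's outputs followed by an activation function, and the output layer is squashed with softmax to give class probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 10 input features, 8 hidden units, 3 classes.
W1, b1 = rng.normal(size=(10, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x):
    h = relu(x @ W1 + b1)        # hidden layer: linear weighting + activation
    return softmax(h @ W2 + b2)  # output layer: class probabilities

x = rng.normal(size=10)          # one document's feature vector (made up)
probs = forward(x)
print(probs, probs.argmax())     # predicted class = highest-probability output
```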
Hyperparameter Tuning
- Dataset for tuning:
  - A development set
  - Not the training set or the test set
  - Alternatively, k-fold cross-validation
- The specific hyperparameters are classifier-specific, but many relate to regularization
  - Regularization hyperparameters penalize model complexity
  - Used to prevent overfitting
- For multiple hyperparameters, use grid search (see the sketch below)
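A minimal sketch of grid search with k-fold cross-validation in scikit-learn (the data and the grid over the regularization strength `C` are made up): each setting is scored by cross-validation on the training data, never on the test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy features and labels (hypothetical).
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] > 0.5).astype(int)

# Grid over a regularization hyperparameter, scored with 5-fold cross-validation.
grid = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```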
Evaluation
Confusion Matrix

| Actual class | Classified as A | Classified as B |
|---|---|---|
| A | True Positive | False Negative |
| B | False Positive | True Negative |
Evaluation Metrics
- Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives)
- Precision = True Positives / (True Positives + False Positives)
- Recall = True Positives / (True Positives + False Negatives)
- F1-score = (2 * precision * recall) / (precision + recall)
- Macroaverage: average the F-scores across classes
- Microaverage: calculate the F-score from the summed counts over all classes (see the sketch below)
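A minimal sketch of these metrics computed directly from the counts and cross-checked with scikit-learn (the gold and predicted labels below are made up, with class A treated as positive); `average="macro"` averages the per-class F-scores, while `average="micro"` computes one F-score from the summed counts.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy gold and predicted labels (hypothetical); class "A" is the positive class.
y_true = ["A", "A", "A", "B", "B", "B", "B", "A"]
y_pred = ["A", "A", "B", "B", "B", "A", "B", "A"]

tp = sum(t == "A" and p == "A" for t, p in zip(y_true, y_pred))
fp = sum(t == "B" and p == "A" for t, p in zip(y_true, y_pred))
fn = sum(t == "A" and p == "B" for t, p in zip(y_true, y_pred))
tn = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)

print(confusion_matrix(y_true, y_pred, labels=["A", "B"]))
print(f1_score(y_true, y_pred, pos_label="A", average="binary"))  # matches f1 above
print(f1_score(y_true, y_pred, average="macro"))  # mean of per-class F-scores
print(f1_score(y_true, y_pred, average="micro"))  # F-score from summed counts
```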