Data Mining — 2. Classification


3. Classification

Given a collection of records (the training set)
– each record contains a set of attributes
– one of the attributes is the class (label) that should be predicted
Find a model that expresses the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
Application areas: direct marketing, fraud detection
Why do we need it?
Writing a classic hand-coded program is often not feasible because the required knowledge is missing and the task is hard to formalize as an algorithm.

3.1 Definition

Given: a set of labeled records, consisting of
• data fields (a.k.a. attributes or features)
• a class label (e.g., true/false)
Generate: a function which can be used for classifying previously unseen records
• input: a record
• output: a class label

3.2 k Nearest Neighbors

The k nearest neighbors of a record x are the data points that have the k smallest distances to x.
Requirements: the stored training records, a distance metric, and a value of k.
Steps:
1. Compute the distance from the unknown record to each training record.
2. Identify the k nearest neighbors.
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record, either by majority vote or by weighting each vote according to distance.

Choosing k
Rule of thumb: test values of k between 1 and 10.
k too small -> sensitive to noise points
k too large -> the neighborhood may include points from other classes

k-NN is very accurate but slow, as the training data must be searched for every prediction. It can handle decision boundaries that are not parallel to the axes.
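
To make the steps concrete, here is a minimal from-scratch k-NN sketch in Python; the toy data and the value of k are invented for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training record
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example: two classes in 2D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```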

3.3 Nearest Centroid (Rocchio Classifier)

Nearest Centroid is a simple classification algorithm: during training it computes the centroid (the average position of all training points) of each class; during testing it assigns an unlabeled record to the class whose centroid is closest.
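
A minimal sketch of this idea, with toy numeric data assumed for illustration:

```python
import numpy as np

def fit_centroids(X_train, y_train):
    """Compute one centroid (mean vector) per class."""
    return {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}

def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y_train = np.array([0, 0, 1, 1])
centroids = fit_centroids(X_train, y_train)
print(nearest_centroid_predict(centroids, np.array([4.9, 5.1])))  # -> 1
```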

k-NN vs. Nearest Centroid
k-NN

  • slow at classification time (linear in number of data points)
  • requires much memory (storing all data points)
  • robust to outliers

Nearest Centroid

  • fast at classification time (linear in number of classes)
  • requires only little memory (storing only the centroids)
  • robust to label noise
  • robust to class imbalance

Which classifier is better? Strongly depends on the problem at hand!

3.4 Bayes Classifier

P(C|A): conditional probability (How likely is C, given that we observe A)
Conditional Probability and Bayes' Theorem
Bayes' theorem expresses P(C|A) in terms of the inverse conditional probability:
P(C|A) = P(A|C) · P(C) / P(A)
(Figures: two worked examples of applying Bayes' theorem)

Estimating the Prior Probability P(C)
Count the records in the training set that are labeled with class Cj and divide the count by the overall number of records.
Estimating the Conditional Probability P(A | C)
Naïve Bayes assumes that all attributes are statistically independent given the class.
This independence assumption allows the joint probability P(A|C) to be rewritten as the product of the individual probabilities P(Ai|Cj):
P(A1,A2,…An|Cj) = P(A1|Cj) * P(A2|Cj) * … * P(An|Cj)
Estimating the Probabilities P(Ai|Cj)
Count how often an attribute value co-occurs with class Cj and divide by the overall number of instances in class Cj.
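A minimal sketch of this counting, with toy weather-style records invented for illustration (note that an unseen value–class combination yields probability 0, which motivates the smoothing discussed below):

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Estimate P(C) and P(Ai|C) by counting."""
    n = len(labels)
    prior = {c: cnt / n for c, cnt in Counter(labels).items()}
    cond = defaultdict(int)        # cond[(i, value, c)] = count of attribute i
    class_count = Counter(labels)  # taking `value` among records of class c
    for rec, c in zip(records, labels):
        for i, v in enumerate(rec):
            cond[(i, v, c)] += 1
    return prior, cond, class_count

def predict_nb(prior, cond, class_count, rec):
    """Pick the class maximizing P(C) * prod_i P(Ai|C)."""
    best, best_p = None, -1.0
    for c, p in prior.items():
        for i, v in enumerate(rec):
            p *= cond[(i, v, c)] / class_count[c]
        if p > best_p:
            best, best_p = c, p
    return best

records = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "no"]
prior, cond, cc = train_nb(records, labels)
print(predict_nb(prior, cond, cc, ("sunny", "mild")))  # -> 'yes'
```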

Handling Numerical Attributes
1. Discretize numerical attributes before learning the classifier, or
2. assume that numerical attributes follow a normal distribution given the class.
The training data is used to estimate the parameters of the distribution (e.g., mean and standard deviation).
Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai|Cj).
Normal Distribution
With class-conditional mean μij and standard deviation σij estimated for attribute Ai in class Cj:
P(Ai = x | Cj) = 1 / (σij · √(2π)) · exp(−(x − μij)² / (2 · σij²))
(Figures: two worked examples of estimating P(Ai|Cj) from a fitted normal distribution)
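
A small sketch of this estimate; the attribute name and the fitted parameter values are invented for illustration:

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(Ai = x | Cj) under a fitted normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# e.g., attribute "income" in class "yes" with fitted mu=110, sigma=25
print(gaussian_likelihood(120, mu=110, sigma=25))  # ~0.0147
```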

Handling Missing Values
Missing values may occur in training records as well as in unseen records to be classified.
Training: the record is not included in the frequency count for that attribute value–class combination.
Classification: the attribute is omitted from the calculation.

Zero Frequency Problem
If one of the conditional probabilities is zero, the entire product becomes zero.
This happens easily: it is quite possible that some attribute value has never been observed together with a given class in the training data, so its estimated probability is 0.
Solution: Laplace smoothing, e.g. add-one smoothing: P(Ai|Cj) = (count(Ai,Cj) + 1) / (count(Cj) + mi), where mi is the number of possible values of attribute Ai.
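
A sketch of the smoothed estimate, with hypothetical counts:

```python
def laplace_smoothed(count_value_class, count_class, n_values):
    """Add-one (Laplace) estimate of P(Ai = v | Cj)."""
    return (count_value_class + 1) / (count_class + n_values)

# A value never seen with this class keeps a non-zero probability
print(laplace_smoothed(0, count_class=10, n_values=3))  # ~0.077 instead of 0.0
```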

Decision Boundary of the Naive Bayes Classifier
(Figure: decision boundary of a Naive Bayes classifier)

Summary

  • Robust to isolated noise points
  • Handles missing values by ignoring the instance during probability estimate calculations
  • Robust to irrelevant attributes [reasons: the probabilistic framework + conditional independence assumption + probability smoothing techniques]
  • The independence assumption may not hold for some attributes [use other techniques such as Bayesian Belief Networks (BBN)]

Naïve Bayes works surprisingly well even if the independence assumption is clearly violated [reasons: robustness to violations + effective classification + maximum probability assignment + simple and efficient].

Too many redundant attributes will cause problems, though. Solution: select an attribute subset, as Naïve Bayes often works as well or better with just a fraction of all attributes.

Technical advantages:
(1) Learning Naïve Bayes classifiers is computationally cheap (probabilities are estimated in one pass over the training data)
(2) Storing the probabilities does not require a lot of memory

Redundant Variables
Redundant variables violate the independence assumption of Naive Bayes and can, at large scale, skew the result.
They may also skew the distance measures in k-NN, but there the effect is not as drastic (it depends on the distance measure used).

Irrelevant Variables
For Naive Bayes: p(x=v|A) = p(x=v|B) for any value v, since an irrelevant attribute is random noise that does not depend on the class variable; the overall result therefore does not change.

kNN vs. Naïve Bayes

|                        | kNN                        | Naïve Bayes                                  |
|------------------------|----------------------------|----------------------------------------------|
| computation            |                            | faster                                       |
| data                   | less sensitive to outliers | uses all data, less sensitive to label noise |
| redundant attributes   | less problematic           |                                              |
| irrelevant attributes  |                            | less problematic                             |
| pre-selection          | yes                        | yes                                          |

3.5 Lazy vs. Eager Learning

k-NN is a “lazy” method:
it does not build an explicit model; “learning” is only performed on demand for unseen records.
Nearest Centroid and Naive Bayes are simple “eager” methods (as are decision trees, rule sets, …):
they first learn a model and then use it to classify unseen instances.

3.6 Model Evaluation

3.6.1 Metrics for Performance Evaluation

Confusion Matrix
A confusion matrix tabulates predicted vs. actual classes; its cells are the true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN).

Accuracy & Error Rate
Accuracy = (TP + TN) / (TP + TN + FP + FN), Error rate = 1 − Accuracy.

Baseline: naive guessing (always predict the majority class).

3.6.2 Limitation of Accuracy: Unbalanced Data

1. Precision & Recall
Precision = TP / (TP + FP), Recall = TP / (TP + FN).

2. F1-Measure (the larger, the better)
The F1-score combines precision and recall into one measure by taking their harmonic mean, which tends to be closer to the smaller of the two:
F1 = 2 · Precision · Recall / (Precision + Recall)
For the F1-value to be large, both precision and recall must be large; a high F1 indicates that the model balances the two well.
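A small sketch computing these measures from confusion-matrix counts (the counts are invented for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

print(precision_recall_f1(tp=80, fp=20, fn=40))  # (0.8, ~0.667, ~0.727)
```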

Confidence scores express how sure the algorithm is about its prediction.
3. ROC Curves

  1. Sort the classifications by confidence score, from high to low. (For every test record, the model outputs a confidence score; in Naive Bayes, e.g., the predicted probability.)
  2. Evaluate the sorted predictions one by one to draw the curve:
    correct prediction -> draw one step up (the true positive rate increases)
    incorrect prediction -> draw one step to the right (the false positive rate increases)

Interpreting ROC Curves

False Positive Rate: the fraction of all actual negatives that are wrongly classified as positive, FPR = FP / (FP + TN).
True Positive Rate: the fraction of all actual positives that are correctly classified as positive (= recall), TPR = TP / (TP + FN).
The closer the curve gets to the top-left corner, the better the model: higher TPR at lower FPR.
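A minimal sketch of the curve construction, with toy scores and labels assumed:

```python
def roc_points(scores, labels):
    """Build ROC curve points by sorting predictions by confidence."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1   # step up
        else:
            fp += 1   # step right
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
print(roc_points(scores, labels))
```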
4. Cost Matrix
A cost matrix assigns a (possibly different) cost to each cell of the confusion matrix; the overall cost of a classifier is obtained by weighting each cell count by its cost.

3.7 Decision Tree Classifiers


There can be more than one tree that fits the same data!
Decision Boundary
A decision boundary is the border line between two neighboring regions of different classes.
Decision-tree boundaries are parallel to the axes because each test condition involves a single attribute at a time.
Finding an optimal decision tree is NP-hard.
Tree-building algorithms therefore use a greedy, top-down, recursive partitioning strategy, also known as divide and conquer, to induce a reasonable solution. Examples: Hunt's Algorithm, ID3, CHAID, C4.5.

3.7.1 Hunt’s Algorithm

Let Dt be the set of training records that reach a node t.
General procedure:
– If Dt contains only records that belong to the same class yt, then t is a leaf node labeled yt.
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
– Recursively apply the procedure to each subset.
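A compact sketch of the recursive procedure. To keep it short, the split selection here just takes the first attribute test that separates the data; real implementations choose the test with an impurity measure (section 3.7.3). The toy records are invented:

```python
from collections import Counter

def hunt(records, labels):
    """Recursively grow a decision tree (simplified Hunt's algorithm)."""
    if len(set(labels)) == 1:            # Dt is pure -> leaf labeled with that class
        return labels[0]
    split = find_split(records, labels)  # attribute test to split Dt
    if split is None:                    # no test splits the data -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    attr, value = split
    left = [i for i, r in enumerate(records) if r[attr] == value]
    right = [i for i in range(len(records)) if records[i][attr] != value]
    return (attr, value,
            hunt([records[i] for i in left], [labels[i] for i in left]),
            hunt([records[i] for i in right], [labels[i] for i in right]))

def find_split(records, labels):
    """Pick the first (attribute, value) test that actually splits the records."""
    for attr in range(len(records[0])):
        for value in {r[attr] for r in records}:
            n_match = sum(1 for r in records if r[attr] == value)
            if 0 < n_match < len(records):
                return attr, value
    return None

tree = hunt([("yes", "single"), ("no", "married"), ("no", "single")],
            ["cheat", "ok", "ok"])
print(tree)  # nested (attribute, value, left_subtree, right_subtree) tuples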

3.7.2 Split

Splitting Based on Nominal Attributes
– Multi-way split: use as many partitions as there are distinct values.
– Binary split: divide the values into two subsets.
Splitting Based on Ordinal Attributes
– As for nominal attributes, except that a binary split must not violate the order of the values.
Splitting Based on Continuous Attributes
– Either discretize, or use a binary decision of the form (A < v); both are described below.

Discretization to form an ordinal categorical attribute:
– equal-interval binning
– equal-frequency binning
– binning based on user-provided boundaries

Binary decision: (A < v) or (A ≥ v)
– usually sufficient in practice
– consider all possible splits
– find the best cut (i.e., the best v) based on a purity measure
– can be computationally expensive

3.7.3 Common measures of node impurity

How to determine the best split?
Nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity.
1. Gini Index
For a node t with relative class frequencies p(j|t):
GINI(t) = 1 − Σj [p(j|t)]²
– maximum (1 − 1/nc, with nc classes) when the records are equally distributed over all classes
– minimum 0.0 when all records belong to a single class
Splitting based on GINI: when a node p is split into k partitions (children), the quality of the split is the weighted Gini index of the children,
GINIsplit = Σi (ni/n) · GINI(i),
where ni is the number of records in child i and n the number of records at node p.
Binary attributes produce two partitions; for continuous attributes, each candidate cut value v defines a binary split, and the v with the lowest GINIsplit is chosen. Computing this efficiently requires sorting the attribute values and updating the class counts incrementally while scanning the sorted list.

2. Entropy (Information Gain)
For a node t with relative class frequencies p(j|t):
Entropy(t) = − Σj p(j|t) · log2 p(j|t)
– maximum log2(nc) when the records are equally distributed over all classes
– minimum 0.0 when all records belong to a single class
Splitting based on information gain: GAINsplit = Entropy(p) − Σi (ni/n) · Entropy(i).
How to find the best split: compute the impurity of the parent node before splitting, compute the weighted impurity of the children for every candidate split, and choose the candidate with the largest impurity reduction (see the sketch below).
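
A minimal sketch of both measures and the split-quality computation; the class counts are invented for illustration:

```python
import math

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split_impurity(children, measure):
    """Weighted impurity of the child nodes, e.g. GINI_split."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)

parent = [7, 5]              # 7 records of class A, 5 of class B
children = [[6, 1], [1, 4]]  # a candidate binary split
print(gini(parent), split_impurity(children, gini))
print(entropy(parent) - split_impurity(children, entropy))  # information gain
```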

Decision Tree Advantages & Disadvantages
Advantages: inexpensive to construct; fast at classifying unknown records; easy for humans to interpret for small trees; accuracy is comparable to other classification techniques for many simple data sets.
Disadvantages: decisions are based on only a single attribute at a time; can only represent decision boundaries that are parallel to the axes.

Decision Tree vs. k-NN

|                        | k-NN                                  | Decision Tree                          |
|------------------------|---------------------------------------|----------------------------------------|
| decision boundaries    | arbitrary                             | rectangular (axis-parallel)            |
| sensitivity to scales  | needs normalization                   | does not need normalization            |
| runtime & memory       | cheap to train, expensive to classify | expensive to train, cheap to classify  |

3.8 Overfitting

Overfitting: good accuracy on training data, but poor accuracy on test data.
Symptoms: the tree is too deep and has too many branches.
Typical causes of overfitting: too little training data, noise, a poor learning algorithm.
Which tree do you prefer?
Occam's Razor:
if you have two theories that explain a phenomenon equally well, choose the simpler one!

Learning Curve
A learning curve shows how the model's accuracy changes with the size of the training sample.

Holdout Method
The holdout method reserves a certain amount of the labeled data for testing and uses the remainder for training.
Typically: one third for testing, the rest for training.

For unbalanced datasets (few or no instances of some classes), random samples might not be representative. -> Stratified sampling balances the data: make sure that each class is represented with approximately equal proportions in both subsets (a sketch follows below). Other attributes may also be considered for stratification, e.g., gender, age, …
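A sketch of a stratified holdout split using scikit-learn (assuming scikit-learn is available; the synthetic dataset is invented for the example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy data: ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# One third for testing; stratify=y keeps the class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
print(y_train.mean(), y_test.mean())  # roughly equal minority-class fractions
```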

Leave One Out
Iterate over all examples:
– train a model on all examples but the current one
– evaluate on the current one
Each example thus serves as the test set exactly once, with all remaining examples forming the training set, until every example has been tested.
Yields a very accurate estimate but is computationally infeasible in most cases.

Cross-Validation (k-fold cross-validation)
A compromise between Leave One Out and decent runtime.
Cross-validation avoids overlapping test sets.
Steps:
Step 1: The data is split into k subsets of equal size (stratification may be applied).
Step 2: Each subset in turn is used for testing while the remainder is used for training.
The error estimates are averaged to yield an overall error estimate.
Frequently used value for k: 10 (ten folds were long considered the gold standard; see the sketch below).
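
A sketch of 10-fold cross-validation with scikit-learn; the dataset and classifier are chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Each of the 10 folds serves once as the test set; scores are then averaged
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())  # overall estimate of generalization accuracy
```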

How to Address Overfitting?

  1. Pre-Pruning (Early Stopping Rule)
    Stop the algorithm before the tree is fully grown.
    Typical stopping conditions for a node: stop if all instances belong to the same class, or stop if all the attribute values are the same.
    Less restrictive conditions: stop if the number of instances within a node is less than some user-specified threshold, or stop if expanding the current node only slightly improves the impurity measure (user-specified threshold).
  2. Post-Pruning
    Grow the decision tree to its full size -> trim the nodes of the tree in a bottom-up fashion [using a validation data set or an estimate of the generalization error] -> if the generalization error improves after trimming, replace the sub-tree by a leaf node [whose class label is the majority class of the instances in the sub-tree].

Training vs. Generalization Errors
Training error (= resubstitution error = apparent error): errors made on the training data, i.e., misclassified training instances -> can be computed directly.
Generalization error: errors made on unseen data, for which there is no direct evidence -> must be estimated.

Estimating the Generalization Error
(Figure: approaches for estimating the generalization error)

Example of Post-Pruning
(Figure: worked example of post-pruning)

3.9 Alternative Classification Methods

Some cases are not nicely expressible as trees or rule sets. (Example: if at least two of Employed, Owns House, and Balance Account are yes → Get Credit is yes.)

3.9.1 Artificial Neural Networks (ANN)

(Figures: an example ANN, the general structure of an ANN, and the algorithm for learning an ANN)

Decision boundaries of an ANN: arbitrarily shaped objects and fuzzy boundaries.

3.9.2 Support Vector Machines

Idea: find a linear hyperplane (decision boundary) that separates the data.

What is computed?
A separating hyperplane, defined by its support vectors (hence the name).

"Support vectors" are the training data points that are crucial in defining the decision boundary (the hyperplane).
Challenge: computing an optimal separation is expensive and requires good approximations.
Dealing with noisy data: introduce "slack variables" into the margin computation.

3.9.3 Nonlinear Support Vector Machines

  • Transform data into higher dimensional space将输入数据从原始特征空间映射到更高维的空间
  • Transformation in higher dimensional space
    Kernel function
    Different variants: polynomial function, radial basis function, …
  • Finding a hyperplane in higher dimensional space
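
A sketch with scikit-learn's SVC on a dataset that is not linearly separable; the RBF kernel and the parameter values are illustrative choices:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: no linear hyperplane separates them in 2D
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # implicit high-dimensional mapping
clf.fit(X, y)
print(clf.score(X, y))              # near-perfect fit on this toy data
print(len(clf.support_vectors_))    # the training points defining the boundary
```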

3.10 Hyperparameter Selection

A hyperparameter is a parameter that influences the learning process and whose value is set before learning begins. Examples: pruning thresholds for trees and rules; gamma and C for SVMs; learning rate and number of hidden layers for ANNs.
Parameters, in contrast, are learned from the training data. Examples: the weights in an ANN, the probabilities in Naïve Bayes, the splits in a tree.

How to determine good hyperparameters?
(1) Manually play around with different hyperparameter settings.
(2) Have your machine automatically test many different settings (hyperparameter optimization).

Hyperparameter Optimization
Goal: find the combination of hyperparameter values that results in learning the model with the lowest generalization error.
How are the parameter-value combinations to be tested determined? (A grid-search sketch follows the list.)

  • Grid Search: test all combinations in user-defined ranges
  • Random Search: test combinations of random parameter values
  • Evolutionary Search: keep specific parameter values that worked well
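
A grid-search sketch with scikit-learn; the parameter ranges are invented for the example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Test all combinations of C and gamma, scoring each by cross-validation
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```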

3.11 Model Selection

From all learned models M, select the model m_best that is expected to generalize best to unseen records.

Model Selection Using a Validation Set
Split the labeled data three ways: train the candidate models on the training set, pick the best model on the validation set, and estimate its generalization error on a held-out test set.

Model Selection Using Cross-Validation / Model Evaluation Using Nested Cross-Validation
In nested cross-validation, the inner loop performs model selection (e.g., grid search over hyperparameters) and the outer loop provides the error estimate:
grid search for model selection,
cross-validation for model evaluation.
