Data Mining 2 Review Notes 6 - Optimization & Hyperparameter Tuning

2025/1/11 9:56:23

6. Optimization & Hyperparameter Tuning

Why Hyperparameter Tuning?
There are many learning algorithms for classification, regression, … and many of them have hyperparameters: k and the distance function for k-nearest neighbors, splitting and pruning options in decision tree learning, …

But what is their effect?
Hard to tell in general and rules of thumb are rare.

Parameters vs. Hyperparameters
Parameters are learned during training
Typical examples: Coefficients in (linear) regression, Weights in neural networks, …
Training: find a set of parameters so that the objective function is minimized/maximized (on a holdout set)

Hyperparameters are fixed before training
Typical examples: Network layout and learning rate in neural networks, k in kNN, …
Training: find a set of parameters so that the objective function is minimized/maximized (on a holdout set), given a previously fixed set of hyperparameters

Hyperparameter Tuning – a Naive Approach

  1. run classification/regression algorithm
  2. look at the results (e.g., accuracy, RMSE, …)
  3. choose different parameter settings, go to 1

Questions: when to stop? how to select the next parameter setting to test?

Hyperparameter Tuning – Avoid Overfitting!
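Recap overfitting: classifiers may overadapt to the training data. The same holds for hyperparameter settings
Possible danger: finding hyperparameters that work well on the data used for tuning but not on unseen data
Remedy: train / validation / test split, i.e., tune hyperparameters on the validation set and report the final result on the untouched test set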
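The three-way split can be sketched as follows. This is a minimal illustration using only the standard library; the function name `three_way_split` and the 60/20/20 proportions are assumptions, not something the notes prescribe.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the rows and cut them into train, validation, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
# Hyperparameter settings are compared on `val`;
# `test` is used only once, for the final estimate.
print(len(train), len(val), len(test))
```

Keeping the test part out of the tuning loop is what prevents the selected hyperparameters from overadapting to it.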

6.1 Hyperparameter Tuning: Brute Force

Try all parameter combinations that exist → we need a better strategy than brute force!

Hyperparameter tuning is an optimization problem
Finding optimal values for N variables
Properties of the problem:

  • the underlying model is unknown, i.e., we do not know how changing a variable will influence the results
  • we can tell how good a solution is when we see it, i.e., by running a classifier with the given parameter set
  • but looking at each solution is costly

Related problem: feature subset selection
Given n features, brute force requires 2^n evaluations
e.g., for 20 features, that is already about one million (2^20 = 1,048,576) → about ten million with 10-fold cross validation

Knapsack problem
given a maximum weight you can carry and a set of items with different weight and monetary value. Pack those items that maximize the monetary value

The problem is NP-hard, i.e., no polynomial-time algorithm is known, and the known exact algorithms need exponential time in the worst case
Note: feature subset selection for n features requires 2^n evaluations

Many optimization problems are NP hard
Routing problems (Traveling Salesman Problem)
Integer factorization: not known to be NP-hard, but hard enough in practice to be used for cryptography
Resource use optimization. e.g., minimizing cutoff waste
Chip design - minimizing chip sizes

Properties of Brute Force search
guaranteed to find the best parameter setting, too slow in most practical cases
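A brute force search simply enumerates the Cartesian product of all parameter values. The sketch below assumes a made-up search space and a made-up `toy_score` function standing in for "train the learner and measure its performance"; neither comes from the notes.

```python
from itertools import product

# Hypothetical search space: names and values are illustrative only.
space = {
    "k": [1, 3, 5, 7],
    "metric": ["euclidean", "manhattan"],
    "weighted": [False, True],
}

def toy_score(setting):
    """Stand-in for running the learner and measuring validation accuracy."""
    score = 0.6 + 0.05 * (setting["k"] == 5)
    score += 0.03 * (setting["metric"] == "manhattan")
    score += 0.02 * setting["weighted"]
    return score

def brute_force(space, score):
    """Evaluate every combination and return the best one plus the run count."""
    names = list(space)
    best_setting, best_score, runs = None, float("-inf"), 0
    for values in product(*(space[n] for n in names)):
        setting = dict(zip(names, values))
        runs += 1
        s = score(setting)
        if s > best_score:
            best_setting, best_score = setting, s
    return best_setting, best_score, runs

best, best_score, runs = brute_force(space, toy_score)
print(best, runs)
```

The number of runs is the product of the option counts (here 4 × 2 × 2 = 16), which is why the approach stops being practical as soon as more parameters or finer value ranges are added.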

6.1.1 Grid Search

performs a brute force search with equal-width steps on non-discrete numerical attributes
(e.g., 10,20,30,…,100)
Problem: hyperparameters with a wide range (e.g., 0.0001 to 1,000,000)
with ten equal-width steps, the first step beyond the lower bound would already be at about 100,000
but what if the optimum is around 0.1?
logarithmic steps (e.g., 0.0001, 0.001, 0.01, …) may perform better for such parameters
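The difference between equal-width and logarithmic steps is easy to see numerically. The helper names `linear_grid` and `log_grid` below are assumptions made for this sketch.

```python
import math

def linear_grid(low, high, steps):
    """Grid points with equal-width steps between low and high."""
    width = (high - low) / (steps - 1)
    return [low + i * width for i in range(steps)]

def log_grid(low, high, steps):
    """Grid points that are equally spaced on a log10 scale."""
    lo, hi = math.log10(low), math.log10(high)
    width = (hi - lo) / (steps - 1)
    return [10 ** (lo + i * width) for i in range(steps)]

lin = linear_grid(0.0001, 1_000_000, 11)  # ten equal-width steps
log = log_grid(0.0001, 1_000_000, 11)     # ten logarithmic steps
print(lin[:3])
print(log)
```

With the same number of points, the linear grid jumps from the lower bound straight to roughly 100,000, while the logarithmic grid covers every order of magnitude, including the region around 0.1.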

Needed:
solutions that take less time/computation and often find the best parameter setting or find a near-optimal parameter setting

6.2 Hyperparameter Tuning: One After Another

Given n parameters with m degrees of freedom – brute force takes m^n runs of the base classifier

Simple tweak:

  1. start with default settings
  2. try all options for the first parameter
    2a. fix best setting for first parameter
  3. try all options for the second parameter
    3a. fix best setting for second parameter

This reduces the runtime to n*m
i.e., no longer exponential – but we may miss the best solution
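The tweak above can be sketched for two parameters. The accuracy table `ACCURACY` is made up for illustration (it is not the table from the lecture); it has a local optimum of 0.5 and a global optimum of 0.7 so that the risk of missing the best solution becomes visible.

```python
# Hypothetical accuracy table for two parameters p and q (values 0, 1, 2).
# Keys are (p, q); the numbers are invented for this sketch.
ACCURACY = {
    (0, 0): 0.5, (0, 1): 0.4, (0, 2): 0.3,
    (1, 0): 0.4, (1, 1): 0.3, (1, 2): 0.4,
    (2, 0): 0.3, (2, 1): 0.4, (2, 2): 0.7,
}
VALUES = [0, 1, 2]

def one_after_another(start):
    """Tune p with q fixed, then tune q with the best p fixed."""
    p, q = start
    evaluated = set()

    def score(setting):
        evaluated.add(setting)
        return ACCURACY[setting]

    p = max(VALUES, key=lambda v: score((v, q)))  # step 2 and 2a
    q = max(VALUES, key=lambda v: score((p, v)))  # step 3 and 3a
    return (p, q), len(evaluated)

for start in [(0, 0), (1, 1), (2, 2)]:
    print(start, "->", one_after_another(start))
```

Each call needs at most n*m runs (here at most 6 instead of 9), but in this invented table only the starting points with q=2 reach the global optimum at (2, 2); the others stop at the local optimum (0, 0).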

6.2.1 Interaction Effects

Interaction effects make parameter tuning hard, i.e., changing one parameter may change the optimal setting for another one.
Example: two parameters p and q, each with values 0, 1, and 2. The table depicts classification accuracy. If we optimize one parameter after another (first p, then q), we end at p=0, q=0 in six out of nine cases. On average, we investigate 2.3 solutions.
(0.5 is a local optimum, 0.7 is the global optimum)

6.3 Hill climbing with variations

6.3.1 Hill-Climbing Search (greedy local search)

“Like climbing Everest in thick fog with amnesia”: always search in the direction of the steepest ascent.
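A minimal sketch of steepest-ascent hill climbing, assuming a made-up one-dimensional landscape (`LANDSCAPE`) with a local peak at x=2 and the global peak at x=8. The function and variable names are illustrative.

```python
def hill_climb(score, neighbors, start):
    """Move to the best neighbor until no neighbor improves the score."""
    current = start
    while True:
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            return current  # local maximum (or plateau)
        current = best

# Hypothetical 1-D landscape: local peak at x=2, global peak at x=8.
LANDSCAPE = [1, 3, 5, 4, 2, 3, 6, 8, 9, 7]

def score(x):
    return LANDSCAPE[x]

def neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n < len(LANDSCAPE)]

print(hill_climb(score, neighbors, start=0))  # stops at the local peak x=2
print(hill_climb(score, neighbors, start=5))  # reaches the global peak x=8
```

The search only ever sees its direct neighbors and keeps no memory, so the result depends entirely on where it starts.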

6.3.2 Variations of Hill Climbing Search

  • Stochastic hill climbing
    random selection among the uphill moves
    the selection probability can vary with the steepness of the uphill move
  • First-choice hill climbing
    generating successors randomly until a better one is found, then pick that one
  • Random-restart hill climbing
    run hill climbing with different seeds
    tries to avoid getting stuck in local maxima
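Two of these variations can be combined in a short sketch: first-choice hill climbing as the inner search and random restarts around it. The landscape and all names are assumptions; successors are "generated randomly" here by shuffling the two neighbors.

```python
import random

# Hypothetical 1-D landscape: local peak at x=2, global peak at x=8.
LANDSCAPE = [1, 3, 5, 4, 2, 3, 6, 8, 9, 7]

def first_choice_climb(start, rng):
    """Take the first randomly generated successor that is better."""
    current = start
    while True:
        candidates = [n for n in (current - 1, current + 1)
                      if 0 <= n < len(LANDSCAPE)]
        rng.shuffle(candidates)  # successors come in random order
        for candidate in candidates:
            if LANDSCAPE[candidate] > LANDSCAPE[current]:
                current = candidate
                break
        else:
            return current  # no successor is better: local maximum

def random_restart(n_restarts, seed=0):
    """Run the climber from several random starts and keep the best end point."""
    rng = random.Random(seed)
    ends = [first_choice_climb(rng.randrange(len(LANDSCAPE)), rng)
            for _ in range(n_restarts)]
    return max(ends, key=lambda x: LANDSCAPE[x])

print(random_restart(10))
```

Every single run still ends in a local maximum (x=2 or x=8 here); restarting only raises the chance that at least one run starts in the basin of the global one.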

6.4 Beam search

Local Beam Search
Keep track of k states rather than just one
Start with k randomly generated states
At each iteration, all the successors of all k states are generated
Select the k best successors from the complete list and repeat
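The four steps can be sketched on the same kind of toy landscape. One assumption beyond the notes: the current states stay in the candidate pool next to their successors, so the beam can never get worse.

```python
import random

# Hypothetical 1-D landscape: local peak at x=2, global peak at x=8.
LANDSCAPE = [1, 3, 5, 4, 2, 3, 6, 8, 9, 7]

def successors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n < len(LANDSCAPE)]

def local_beam_search(k, iterations, seed=0):
    """Keep the k best states among all successors of the current k states."""
    rng = random.Random(seed)
    states = [rng.randrange(len(LANDSCAPE)) for _ in range(k)]
    for _ in range(iterations):
        # Pool the successors of all k states (plus the states themselves,
        # an assumption of this sketch) and keep the k best distinct ones.
        pool = set(states)
        for s in states:
            pool.update(successors(s))
        states = sorted(pool, key=lambda x: LANDSCAPE[x], reverse=True)[:k]
    return states

print(local_beam_search(k=3, iterations=10))
```

Unlike k independent hill climbers, the k states compete in one shared pool, so the search effort drifts toward the most promising region.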

6.5 Random search

Grid Search vs. Random Search
All the examples discussed so far use fixed grids
Challenges: some hyperparameters are very sensitive (e.g., 0.02 is a good value, but 0 and 0.05 are not), while others have little influence
but it is hard to know upfront which is which
Grid search may easily miss the best values of a sensitive parameter, whereas random search tries a new value of every parameter in each run and therefore often yields better results
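The following sketch compares both strategies with the same budget of nine runs. The objective `toy_score` is made up: only the first parameter matters, with a narrow peak at 0.02, mirroring the example values above.

```python
import random

def toy_score(sensitive, irrelevant):
    """Made-up objective: only `sensitive` matters, narrow peak at 0.02."""
    return max(0.0, 1.0 - abs(sensitive - 0.02) / 0.02)

def grid_search(points_per_axis):
    """Evaluate a fixed grid over [0, 0.1] x [0, 0.1]."""
    axis = [i * 0.1 / (points_per_axis - 1) for i in range(points_per_axis)]
    return max(toy_score(a, b) for a in axis for b in axis)

def random_search(budget, seed=0):
    """Evaluate `budget` settings drawn uniformly from the same range."""
    rng = random.Random(seed)
    return max(toy_score(rng.uniform(0, 0.1), rng.uniform(0, 0.1))
               for _ in range(budget))

print("grid  :", grid_search(3))    # 9 runs, only 3 distinct values of `sensitive`
print("random:", random_search(9))  # 9 runs, 9 distinct values of `sensitive`
```

The 3×3 grid tests the sensitive parameter only at 0, 0.05, and 0.1 and never gets near the peak. Random search is not guaranteed to do better for a particular seed, but it spends no runs on repeating the same value of the parameter that matters.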

6.6 Genetic Programming

Genetic algorithms are inspired by evolution:
use a population of individuals (solutions) -> create new individuals by crossover -> introduce random mutations -> from each generation, keep only the best solutions (“survival of the fittest”)
Standard Genetic Algorithm (SGA)

6.6.1 SGA

Basic ingredients:

  • individuals: the solutions
    hyperparameter tuning: a hyperparameter setting
  • a fitness function
    hyperparameter tuning: performance of a hyperparameter setting (i.e., run learner with those parameters)
  • a crossover method
    hyperparameter tuning: create a new setting from two others
  • a mutation method
    hyperparameter tuning: change one parameter
  • survivor selection
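The ingredients listed above can be put together in a small sketch. The search space `SPACE`, the `fitness` function, and all rates and sizes are invented for illustration; in a real setting, `fitness` would run the learner with the given hyperparameters.

```python
import random

# Hypothetical search space and fitness; both are made up for this sketch.
SPACE = {"k": list(range(1, 16)), "p": [1, 2, 3], "leaf_size": [10, 20, 30, 40]}

def fitness(ind):
    """Stand-in for running the learner with these settings (higher is better)."""
    return -abs(ind["k"] - 7) - abs(ind["p"] - 2) - abs(ind["leaf_size"] - 30) / 10

def random_individual(rng):
    return {name: rng.choice(values) for name, values in SPACE.items()}

def crossover(a, b, rng):
    """Create a new setting by taking each parameter from one of two parents."""
    return {name: rng.choice([a[name], b[name]]) for name in SPACE}

def mutate(ind, rng):
    """Change one randomly chosen parameter to a random value."""
    child = dict(ind)
    name = rng.choice(list(SPACE))
    child[name] = rng.choice(SPACE[name])
    return child

def genetic_search(pop_size=20, generations=30, mutation_rate=0.3, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(population, 2)
            child = crossover(a, b, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        # Survivor selection: keep the best of parents and children.
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
    return population[0]

best = genetic_search()
print(best, fitness(best))
```

Because survivors are chosen from parents and children together, the best fitness in the population never decreases from one generation to the next. Other survivor selection schemes are possible.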

Crossover OR Mutation?
Decades-long debate: which one is better / necessary …
Answer (at least, rather wide agreement): it depends on the problem, but in general, it is good to have both, since each plays a different role
A mutation-only EA is possible; a crossover-only EA would not work

Exploration: Discovering promising areas in the search space, i.e. gaining information on the problem
Exploitation: Optimising within a promising area, i.e. using information

There is co-operation AND competition between them
Crossover is explorative: it makes a big jump to an area somewhere “in between” two (parent) areas
Mutation is exploitative: it creates random small diversions, thereby staying near (in the area of) the parent


Only crossover can combine information from two parents
Remember: sample from entire value range
Only mutation can introduce new information (alleles)
To hit the optimum you often need a ‘lucky’ mutation

6.6.2 Genetic Feature Subset Selection

Feature Subset Selection can also be solved by Genetic Programming
Individuals: feature subsets
Representation: binary (1 = feature is included, 0 = feature is not included)
Fitness: classification performance
Crossover: combine selections of two subsets
Mutation: flip bits
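A sketch of the binary representation with its two operators. One-point crossover and a per-bit flip rate of 0.1 are assumptions chosen for the illustration, and the feature names are hypothetical. The fitness step (training a classifier on the selected features) is left out.

```python
import random

def one_point_crossover(a, b, rng):
    """Combine two subsets: head of parent a, tail of parent b."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def flip_bits(mask, rng, rate=0.1):
    """Mutation: flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in mask]

def selected(mask, feature_names):
    """Translate a bit mask back into the list of included features."""
    return [name for name, bit in zip(feature_names, mask) if bit]

rng = random.Random(0)
features = ["age", "income", "zip", "height", "weight"]  # hypothetical names
parent_a = [1, 1, 0, 0, 1]
parent_b = [0, 1, 1, 1, 0]
child = flip_bits(one_point_crossover(parent_a, parent_b, rng), rng)
print(child, selected(child, features))
```

Each individual costs one full training run to evaluate, but the population only ever contains a tiny fraction of the 2^n possible subsets.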

6.6.3 Selecting a Learner by Meta Learning

So far, we have looked at finding good parameters for a learner – the learner was always fixed
A similar problem is selecting a learner for the task at hand
Again, we could go with search. Another approach is meta learning

Meta Learning i.e., learning about learning
Goal: learn how well a learner will perform on a given dataset
features: dataset characteristics, learning algorithm
prediction target: accuracy, RMSE, …

Closely related to AutoML
Basic idea: train a regression model

  • data points: individual datasets plus ML approach
  • target: expected accuracy/RMSE etc.

Examples for features: number of instances/attributes, fraction of nominal/numerical attributes, min/max/average entropy of attributes, skewness of classes, …
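A few of these dataset characteristics can be computed as sketched below. The function names and the toy data are assumptions; a meta learner would use such a dictionary (plus the ML approach) as one training row and the observed accuracy/RMSE as the target.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def meta_features(rows, labels):
    """Describe a dataset by a few characteristics a meta learner could use."""
    n_attributes = len(rows[0])
    n_numeric = sum(isinstance(v, (int, float)) for v in rows[0])
    return {
        "n_instances": len(rows),
        "n_attributes": n_attributes,
        "frac_numeric": n_numeric / n_attributes,
        "class_entropy": class_entropy(labels),
    }

rows = [(5.1, "red"), (4.9, "blue"), (6.2, "red"), (5.8, "green")]  # toy data
labels = ["a", "a", "b", "b"]
print(meta_features(rows, labels))
```

A class entropy close to its maximum indicates balanced classes, while a low value signals the skewness of classes mentioned above.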


Recap: search heuristics are good for problems where finding an optimal solution is difficult, evaluating a solution candidate is easy, and the search space of possible solutions is large
Possible solution: genetic programming

We have encountered such problems quite frequently
Example: learning an optimal decision tree from data

6.6.4 Genetic Decision Tree Learning

Population: candidate decision trees (initialization: e.g., trained on small subsets of data)
Create new decision trees by means of crossover & mutation
Fitness function: e.g., accuracy
The swap (crossover) can happen at different levels of the trees, chosen at random

6.6.5 Combination of GP with other Learning Methods

Rule Learning (“Learning Classifier Systems”)
Population: set of rule sets (!)
Crossover: combining rules from two sets
Mutation: changing a rule

Artificial Neural Networks
Easiest solution: fixed network layout
The network is then represented as an ordered set (vector) of weights
e.g., [0.8, 0.2, 0.5, 0.1, 0.1, 0.2]
Crossover and mutation are straightforward
Variant: AutoMLP - Searches for best combination of hidden layers and learning rate

6.7 Hyperparameter learning

Hyperparameter tuning as a learning problem: Given a set of hyperparameters H, predict performance p of model. The prediction model is referred to as a surrogate model or oracle
Rationale:
Training and evaluating an actual model is costly
Learning and predicting with the surrogate model is fast

Note:
we want to keep the number of runs of the actual model small, so the surrogate model will have few training points; hence, use a simple surrogate model.
Most well-known: Bayesian optimization
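The surrogate idea can be sketched with a deliberately simple stand-in. Note that this is not Bayesian optimization: it uses a nearest-neighbor average as the surrogate and greedily tests the candidate with the best predicted score, without modelling uncertainty. The objective `expensive_model_score` and all names are invented for the sketch.

```python
import random

def expensive_model_score(h):
    """Stand-in for training and evaluating the real model with hyperparameter h."""
    return 1.0 - (h - 0.3) ** 2

def surrogate_predict(history, h, k=2):
    """Very simple surrogate: average score of the k nearest evaluated points."""
    nearest = sorted(history, key=lambda pair: abs(pair[0] - h))[:k]
    return sum(score for _, score in nearest) / len(nearest)

def surrogate_search(n_initial=3, n_rounds=10, n_candidates=200, seed=0):
    rng = random.Random(seed)
    # A few real runs give the surrogate its first training points.
    history = [(h, expensive_model_score(h))
               for h in (rng.random() for _ in range(n_initial))]
    for _ in range(n_rounds):
        # Cheap step: score many candidates with the surrogate only.
        candidates = [rng.random() for _ in range(n_candidates)]
        promising = max(candidates, key=lambda h: surrogate_predict(history, h))
        # Costly step: one real run for the most promising candidate.
        history.append((promising, expensive_model_score(promising)))
    return max(history, key=lambda pair: pair[1]), len(history)

(best_h, best_score), real_runs = surrogate_search()
print(best_h, best_score, real_runs)
```

Only 13 real runs are spent although 2,000 candidates are screened. Bayesian optimization follows the same loop but additionally uses the surrogate's uncertainty to decide which candidate to test next.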

Summary: Grid Search, Random Search, Learning hyperparameters / Bayesian optimization

Grid search
Inefficient
Fixed grid sizes may miss good parameters (Smaller grid sizes would be even less efficient!)

Random search
Often finds good solutions in less time

Learning hyperparameters / Bayesian optimization
Successively tests hyperparameters close to local optima
Similar to hill climbing
Difference: explicit surrogate model

6.8 Summary

Hyperparameter Tuning: Criticism
