title: The Category-theoretic Perspective of Statistical Learning for Amateurs
author: Congwei Song
description: A presentation at BIMSA


The Category-theoretic Perspective of Statistical Learning for Amateurs

Congwei Song
Email: william@bimsa.cn
Research Interest: Machine Learning, Wavelet Analysis

Abstract Statistical learning is a fascinating field that has long been the mainstream of machine learning/artificial intelligence. It has produced a large number of results that are widely applied to real-world problems, and it continues to raise new research topics. This report summarizes some classic statistical learning models and well-known algorithms, especially for amateurs, and provides a category-theoretic perspective for understanding statistical learning models. The goal is to attract researchers from other fields, including pure mathematics, to participate in research related to statistical learning.

Keywords Statistical Learning, Statistics, Category Theory, Variational Models, Neural Networks, Deep Learning

Abbreviations
distr.: distribution(s)
var.: variable(s)
cat.: category(ies)
rv: random variable(s)

Notations

  • $P(X)$: distr. of the target rv $X$
  • $P(X\mid\theta)$: parametric distr.

Introduction

  • Probability theory/Statistics
  • Category theory
  • Statistical Learning
  • Classical models of statistical learning
  • Advanced models
  • Misc

Probability theory

Definition (Probability Model)
A probability model is a probability measure space, denoted by $(\Omega, \mathcal{A}, P)$; or $(\mathcal{X}, P(X))$, as its pushforward under the rv $X$, where $\mathcal{X}$ is the sample space of $X$.

$X\sim P$: the distr. of $X$ is $P$, i.e. we draw $X$ from $P$.


Statistics

Definition (Statistical Model)
A statistical model is a family/set of probability models, denoted by $(\Omega, \mathcal{A}, \{P_\lambda\})$ (with a common ambient space) or $(\mathcal{X}, \{P_\lambda\})$ (with a common sample space, the range of a target rv $X$), written $P(X)$ for short.

Parameterized version: $(\mathcal{X}, \{P(X\mid\theta)\}, \theta\in\Theta)$, where $\Theta$ is the parameter space, written $P(X\mid\theta)$ for short.

Example
$N(\mu,\sigma^2)$, $\mathrm{Cat}(p)$


Definition (Bayesian Model)
A Bayesian model is a statistical model together with a prior distr. on the parameters, written $(M_\theta, p(\theta))$, where $M_\theta$ is a given statistical model.

Definition (Bayesian Hierarchical Model)
$(M_\theta, p(\theta\mid\alpha), p(\alpha))$, where the hyperparameter $\alpha$ itself carries a prior $p(\alpha)$.
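As a minimal sketch, ancestral sampling from such a hierarchy (the particular Gamma/Normal choices of $p(\alpha)$, $p(\theta\mid\alpha)$, $p(x\mid\theta)$ below are illustrative assumptions, not part of the definition):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ancestral sampling: alpha ~ p(alpha), theta ~ p(theta | alpha), x ~ p(x | theta).
alpha = rng.gamma(shape=2.0, scale=1.0)          # hyperprior p(alpha), assumed Gamma
theta = rng.normal(loc=0.0, scale=alpha)         # prior p(theta | alpha), assumed Normal
x = rng.normal(loc=theta, scale=1.0, size=100)   # likelihood p(x | theta), assumed Normal
print(f"alpha={alpha:.3f}, theta={theta:.3f}, sample mean={x.mean():.3f}")
```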


Category theory of Statistical Models

  • $\mathcal{Prob}$: cat. of all probability models
  • $\mathcal{Prob}_X$: sub-cat. of $\mathcal{Prob}$ with objects of the form $(\mathcal{X}, P(X))$
  • $\mathcal{Prob}_{Y|x}$: sub-cat. of $\mathcal{Prob}$ with objects of the form $(\mathcal{Y}, P(Y\mid x))$, with conditional var. $x$
  • $\mathcal{Stat}$: cat. of statistical models
  • $\mathcal{Stat}_X$: cat. of statistical models of the target rv $X$
  • $\mathcal{Stat}_{Y|x}$: cat. of statistical models of the target rv $Y\mid x$ with conditional var. $x$
  • $\mathcal{Bayes}$: cat. of Bayesian models

$\mathcal{Stat}$ is regarded as a sub-cat. of $\mathcal{Bayes}$, by equipping each statistical model with the flat prior. A Bayesian model gives the joint distr. $P(x,\theta)$; therefore the category of Bayesian models is a sub-cat. of $\mathcal{Prob}$.


Estimator

Definition (Statistical model with an estimator)
A model with an estimator is a pair $(M_\theta, \hat\theta(X))$, where $\hat\theta: \mathcal{X}^N\to \Theta$ and $X$ is a sample of size $N$.

In most cases we use the MLE; the estimator is then implicitly determined by the model.
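For instance, for the Gaussian model the MLE $\hat\theta:\mathcal{X}^N\to\Theta$ has a closed form; a minimal sketch (the synthetic data below are only for illustration):

```python
import numpy as np

def mle_gaussian(sample: np.ndarray) -> tuple[float, float]:
    """The estimator hat{theta}(X): maps a sample of size N to (mu, sigma^2)."""
    mu_hat = sample.mean()
    sigma2_hat = sample.var()   # MLE divides by N, not the unbiased N - 1
    return mu_hat, sigma2_hat

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)   # draw X from N(2, 1.5^2)
print(mle_gaussian(X))                          # roughly (2.0, 2.25)
```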


Statistical Learning

supervised learning (discriminative form):
$(\mathcal{X}, \mathcal{Y}, P(Y\mid X))$, where $X$ is the input (conditional var.) and $Y$ is the output.

Supervised learning based on a sample $X=\{x_i\}$ is identified with the statistical model
$(\mathcal{Y}^N,\ P(Y\mid X)=\prod_i P(y_i\mid x_i))$,
where the sample $X$ is fixed, called the design var. (design matrix, if it forms a matrix), and $Y$ is a sample point (sample of size 1).
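As a sketch of this identification, Gaussian linear regression realizes $P(Y\mid X)$ as $N(w^\top x,\sigma^2)$, and maximizing the conditional likelihood $\prod_i P(y_i\mid x_i)$ reduces to least squares over the fixed design matrix (the data below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # fixed design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=200)   # one draw of Y given X

# With Gaussian noise, maximizing prod_i P(y_i | x_i) == ordinary least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)   # close to w_true
```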

I claim: statistical learning == conditionalized statistics (model)


Facts in statistical learning are also facts in statistics

Bias-Variance decomposition in statistics: $\text{Error} = \text{Bias}^2 + \text{Variance}$ (plus irreducible noise); in statistical learning, the same decomposition holds conditionally on the input variable.
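A Monte Carlo sketch of the conditional version, at a fixed input $x_0$ (the sine target and the deliberately underfit linear model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)        # true regression function
x0, n, trials = 0.5, 30, 2000      # fixed query point, sample size, replications

preds = []
for _ in range(trials):
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(scale=0.3, size=n)
    coef = np.polyfit(x, y, deg=1)            # underfit on purpose: a linear fit
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
mse = ((preds - f(x0)) ** 2).mean()
print(f"bias^2={bias2:.4f}  var={variance:.4f}  bias^2+var={bias2 + variance:.4f}  mse={mse:.4f}")
```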


Classical Models in Statistical Learning (SL)

  • Supervised Learning
    • Regression: $P(Y\mid X)$; Linear Regression, Ridge/LASSO Regression ($X$ is the conditional rv)
    • Classification: $P(Y\mid X)$; Logistic Regression, LDA/QDA, Naive Bayes Classifier
  • Unsupervised Learning
    • Clustering: $P(X,Z)$, where $Z$ is unobservable; K-means, GMM (see the sketch after this list)
    • Dimension Reduction: $P(X,Z)$; PCA/ICA/MNF
    • Latent Variable Models: $P(X,Z)$; mixture models (GMM), pLSA
    • Hidden Markov Model: $P^*(X_{1:T}, Z_{1:T})$, where $(X_t, Z_t)\sim P$
  • Others
    • Time series: $P^*(X_{1:T})$; ARMA
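As a concrete instance of the clustering entry, a minimal K-means (Lloyd's algorithm) sketch on synthetic data; the alternation below mirrors estimating the hidden $Z$ and the parameters of $P(X,Z)$:

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 50, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: the nearest center plays the role of the hidden Z.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Update step: each center is re-estimated as its cluster mean.
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in (-2.0, 0.0, 2.0)])
print(np.bincount(kmeans(X, k=3)))   # roughly three clusters of 100 points
```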

Learner

A learner is an estimator for a statistical model:
$(M_\theta, \hat\theta(X))$,
where $M_\theta$ is a statistical model.

One can define a latent model as $(P(X,Z), \hat\theta(X))$ for unsupervised learning, and $(P(X,Y), \hat\theta(X,Y))$ for supervised learning.


Beginners’ Magic Cube

The classical models are all categories, and we have a diagram relating them. I’d like to call it “the beginners’ magic cube”, since it looks like a cube and beginners in SL should learn these models first.

[Figure: the beginners’ magic cube — a diagram relating the classical models]


Probabilistic Graphical Model

Another way to describe statistical (learning) models.

  • Bayesian Network: directed acyclic graph
  • Markov Network (Random Field): undirected graph

Methods

Methods as Functors:

  • Kernel trick: $X\to \phi(X)$
  • Localization/Smoothing: $\sum_i K(x_0,x_i)\, l(x_i,y_i)$
  • Hierarchical model: $P(X,Z_1,\cdots, Z_n,Y)$, where $X\to Z_1\to \cdots\to Z_n\to Y$ usually forms a Markov chain and $Z_1,\cdots, Z_n$ are hidden
  • Variational trick: $(P(X,Z), Q(Z\mid X))$, where $Q(Z\mid X)$ is the variational distr.
  • Neural network (NN): $f(x) \to \mathrm{Net}(x)$
  • Stochastic/Monte Carlo methods: importance sampling, MCMC (a sketch follows this list)
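A minimal sketch of the last item: importance sampling estimates $E_P[h(X)]$ by drawing from a proposal $Q$ and reweighting by $P/Q$ (the particular $P=N(0,1)$, $Q=N(0,2^2)$ and $h(x)=x^2$ are illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # draw from the proposal Q
w = norm.pdf(x, loc=0, scale=1) / norm.pdf(x, loc=0, scale=2)   # weights P/Q
print((w * x**2).mean())   # approximately E_P[X^2] = 1
```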

Advanced Models I:

Neural Models: models equipped with neural networks, i.e., apply a neural network inside a statistical model.

  • Neural Models: MLP; embed an NN into regression models
  • RNN/LSTM: a conditional HMM implemented by an NN
  • Neural Autoencoder: NLPCA; embed an NN into the autoencoder
  • Probabilistic Neural Autoencoder: Variational Autoencoder (VAE; the stochastic perturbation affects the outputs of the layers)
  • Stochastic NN: Dropout (the stochastic perturbation affects the weights of the layers)
  • Normalizing Flow: reparameterization, as a non-stochastic hierarchical VAE
  • Hierarchical VAE: Diffusion Model/Consistency Model

Beginners’ Star

[Figure: the beginners’ star — a diagram relating the neural models]


Creating an advanced model I

Take VAE as an example:

$$P(X)\sim N(\mu,\sigma^2) \;\to\; P(X\mid Z)\sim N(f(z),\sigma^2),\ P(Z)\sim N(0,1)$$
$$\to\ \big(P(X,Z),\ Q(Z\mid X=x)\sim N(g(x),h(x))\big)$$
$$\to\ \big(P(X,Z),\ Q(Z\mid X=x) = g(x)+\xi h(x)\big),\ \xi\sim N(0,1)$$

Write it in the style of a composition of functors (informally):
$\mathrm{VAE}(f,g,h) = \mathrm{Rep}\circ \mathrm{Var}\circ\mathrm{LVM}(P(X))$, regarding the functions $f,g,h$ as parameters.

The implementation of VAE is the following NN (with a regularizing term):
$y \sim f(g(x)+h(x)\xi)$,
trained by self-supervised learning on data $\{(x_i,x_i)\}$, where $f,g,h$ are all neural layers and $\xi\sim N(0,1)$ is the perturbation variable of the hidden layer $g$. When $\xi\to 0$, $Q$ degenerates, and the VAE degenerates to an ordinary NN $f(g(x))$.
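A minimal PyTorch sketch of this construction, with the KL term as the regularizer; the single-layer encoder/decoder and the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_in: int = 784, d_z: int = 16):
        super().__init__()
        self.g = nn.Linear(d_in, d_z)   # encoder mean  g(x)
        self.h = nn.Linear(d_in, d_z)   # encoder scale h(x), as a log-std here
        self.f = nn.Linear(d_z, d_in)   # decoder       f(z)

    def forward(self, x):
        mu, log_std = self.g(x), self.h(x)
        xi = torch.randn_like(mu)            # xi ~ N(0, 1)
        z = mu + xi * log_std.exp()          # reparameterization: g(x) + h(x) * xi
        return self.f(z), mu, log_std

def loss_fn(x, x_hat, mu, log_std):
    recon = ((x_hat - x) ** 2).sum(dim=1).mean()   # self-supervised target (x_i, x_i)
    # KL(N(mu, sigma^2) || N(0, 1)) summed over latent dims: the regularizing term.
    kl = (0.5 * (mu ** 2 + (2 * log_std).exp() - 1) - log_std).sum(dim=1).mean()
    return recon + kl

x = torch.randn(32, 784)
x_hat, mu, log_std = VAE()(x)
print(loss_fn(x, x_hat, mu, log_std))
```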


Creating an advanced model II

Take RNN as an example

$$P(Y)\to \cdots \to P^*(Y,Z\mid X) \;\to\; y_t=\mathrm{Net}(x_t,z_{t-1}),\ z_t=\mathrm{Net}(x_t,z_{t-1})$$

say $\mathrm{RNN}(w) = \mathrm{NN}\circ \mathrm{TS}\circ \mathrm{Condi}\circ \mathrm{LVM}(P)$.
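A minimal numpy sketch of the recursion $y_t=\mathrm{Net}(x_t,z_{t-1})$, $z_t=\mathrm{Net}(x_t,z_{t-1})$; the dimensions and the tanh nonlinearity are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, d_y, T = 4, 8, 2, 10

# Weights of the two "Net"s: one produces the hidden state, one the output.
Wxz, Wzz = rng.normal(size=(d_z, d_x)), rng.normal(size=(d_z, d_z))
Wxy, Wzy = rng.normal(size=(d_y, d_x)), rng.normal(size=(d_y, d_z))

z = np.zeros(d_z)                          # initial hidden state z_0
for t in range(T):
    x_t = rng.normal(size=d_x)             # input at time t
    y_t = Wxy @ x_t + Wzy @ z              # y_t = Net(x_t, z_{t-1})
    z = np.tanh(Wxz @ x_t + Wzz @ z)       # z_t = Net(x_t, z_{t-1})
    print(t, y_t.round(2))
```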

Homework

  1. What about $Z_k\sim B(p_k)$ (multivariate Bernoulli distr.)?
  2. Create a time-series version of VAE.

Advanced Models II

  • Ensemble Learning
  • Transfer Learning
  • Incremental Learning (Continual Learning, Online Learning)
  • Lifelong Learning

Possible definition of Transfer Learning

  • $(P(X\mid\theta_1), P(X\mid\theta_2))$, $\theta_1,\theta_2\sim P(\theta\mid\alpha)$
  • $(P(X\mid\theta_1,\theta_0), P(X\mid\theta_2,\theta_0))$
  • $(P(\phi(X)\mid\theta), P(\phi(X)\mid\theta))$
    with sample $X_1$ from the source domain and sample $X_2$ from the target domain

Misc.

  • Reinforcement Learning: stochastic learning; evaluation and sampling/estimation alternate:
    $\theta \to v \to \theta \to v\to \cdots$
  • BiLSTM/BiLM/ELMo
  • Transformer/Self-attention
  • Models based on unnormalized distr. (energy-based models)

Inspired by BiLSTM/BiLM: Tied Model
$(P(X\mid\theta_1,\theta_0), P(X\mid\theta_2,\theta_0))$ with the same sample.

Tied likelihood: $P(X\mid\theta_1,\theta_0)\,P(X\mid\theta_2,\theta_0)$
(a sort of pseudo-likelihood; a product of experts without the normalizing coef.)


Future Works


References

  • Peter McCullagh. What is a statistical model? The Annals of Statistics, 2002, 30(5): 1225–1310.
  • Jared Culbertson and Kirk Sturtz. Bayesian Machine Learning via Category Theory, 2013.
  • Categories for AI. https://www.youtube.com/watch?v=4poHENv4kR0
  • Kenneth A. Lloyd, Jr. A Category-Theoretic Approach to Agent-based Modeling and Simulation, 2010.
  • Dan Shiebler, Bruno Gavranovic, Paul Wilson. Category Theory in Machine Learning, 2021.

Link: https://pan.baidu.com/s/1GdPiVGG3GIKVS4nWqlBm-w?pwd=1111 Extraction code: 1111


文章目录 摘要本文方法Pixel-wise Vision-Language AlignmentShape ConstraintSelf-supervised Spectral Decomposition推理 实验结果 摘要 由于大规模视觉语言预处理取得了令人瞩目的进展,最近的识别模型可以以零样本和开放集的方式对任意对象进行分类&#xff0c…