Why do tree-based models still outperform deep learning on tabular data?
- Tabular Data
- Challenges for NNs on tabular data
- How do the inductive biases of the models differ?
- How do the models differ in essence?
- Summary
[CIKM 2019] AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks
- Introduction
- AutoInt: Automatic Feature Interaction Learning
- Experiment
[AAAI 2021] TabNet: Attentive Interpretable Tabular Learning
- Introduction
- TabNet for Tabular Learning
  - Simple DNN building blocks with DT-like output manifold
  - TabNet encoder
  - Tabular self-supervised learning
- Experiments
[NeurIPS 2021] Revisiting deep learning models for tabular data
- Introduction
- Models for tabular data problems
  - Notation
  - MLP
  - ResNet
  - FT-Transformer (Feature Tokenizer + Transformer)
  - Other models
- Experiments
  - Datasets
  - Comparing DL models
  - Comparing DL models and GBDT
References

Why do tree-based models still outperform deep learning on tabular data?

Grinsztajn, Léo, Edouard Oyallon, and Gaël Varoquaux. “Why do tree-based models still outperform deep learning on tabular data?” arXiv preprint arXiv:2207.08815 (2022).

Tabular Data

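The defining characteristic of tabular data is heterogeneity: each column has its own meaning, and data types differ from column to column. Compared with homogeneous data such as images or text, heterogeneous tabular data mixes dense numerical features with sparse categorical features.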

Challenges for NNs on tabular data

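(1) Low-quality data. Tabular data usually comes from real-world statistics, and real-world data is messy: synthetic data all looks alike, while real data comes in every variety. Dirty data (e.g. missing values), outliers, class imbalance, and small sample sizes all show up easily.
(2) Missing or complex spatial correlations. Mainstream NN models typically exploit inductive biases that fit homogeneous data, convolutional neural networks being the classic example. In a tabular dataset the variables often have no spatial correlation at all, or the correlations between features are complex and irregular. With tabular data, the structure and the relationships between features must be learned from scratch. This is also why transfer learning rarely works on tabular data.
(3) Strong dependence on preprocessing. A key advantage of deep learning on homogeneous data is its implicit representation learning step, so very little preprocessing or explicit feature construction is needed. On tabular data, however, the performance of a deep neural network can depend heavily on the chosen preprocessing strategy. Poor preprocessing can lose information and hurt predictive performance, produce a very sparse feature matrix that prevents convergence (e.g. one-hot encoding of categorical features), or introduce a spurious order into previously unordered features (e.g. ordinal encoding).
(4) Feature importance. Changing the class of an image usually requires a coordinated change in many features (e.g. pixels), but the smallest possible change in a single categorical (or binary) feature can completely flip a prediction on tabular data. Unlike deep neural networks, decision tree algorithms handle varying feature importance very well: they select a single feature and an appropriate threshold and "ignore" the rest of the data sample.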

How do the inductive biases of the models differ?

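The paper's benchmark figures show two things. (1) Hyperparameter tuning does not bring neural networks to SOTA: tree-based models are superior for every random search budget, and even after many random search iterations the gap between NN models and tree models remains large. (2) Categorical features are not the main weakness of neural networks: categorical features are often regarded as a major problem when applying neural networks to tabular data, and the results on numerical variables alone do show a narrower gap between tree-based models and neural networks than when categorical variables are included. However, the gap persists when learning on numerical features only.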
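Neural networks are biased toward overly smooth solutions. The study smooths the training-set outputs with Gaussian kernels of different length scales, which prevents a model from learning the irregular patterns of the target function. The Gaussian smoothing kernel is:
$K\left(\mathbf{x}^*, \mathbf{x}\right)=\exp \left(-\frac{1}{2}\left(\mathbf{x}^*-\mathbf{x}\right)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}^*-\mathbf{x}\right)\right)$
The training-set outputs are smoothed as:
$\tilde{Y}\left(X_i\right)=\frac{\sum_{j=1}^N K\left(X_i, X_j\right) Y\left(X_j\right)}{\sum_{j=1}^N K\left(X_i, X_j\right)}$
The paper plots model performance as a function of the length scale of the smoothing kernel. Smoothing the target function markedly lowers the accuracy of tree-based models but barely affects the accuracy of neural networks. This suggests that the target functions in these datasets are not smooth, and that neural networks struggle to fit such irregular functions compared with tree-based models. This is consistent with Rahaman, Nasim, et al. “On the spectral bias of neural networks.” International Conference on Machine Learning. PMLR, 2019, who find that neural networks are biased toward low-frequency functions. Decision-tree-based models learn piecewise constant functions and show no such bias.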
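The smoothing step above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes an isotropic kernel ($\boldsymbol{\Sigma}=\ell^2 I$ with a single `length_scale`), uses plain Python lists, and the toy data is made up.

```python
import math

def gaussian_kernel(x_star, x, length_scale):
    """Isotropic Gaussian kernel, i.e. Sigma = length_scale**2 * I (an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_star, x))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def smooth_targets(X, Y, length_scale):
    """Replace each Y(X_i) by a kernel-weighted average over all training points."""
    smoothed = []
    for x_i in X:
        weights = [gaussian_kernel(x_i, x_j, length_scale) for x_j in X]
        total = sum(weights)
        smoothed.append(sum(w * y for w, y in zip(weights, Y)) / total)
    return smoothed

# Toy data: a step function of one feature (an "irregular" target).
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [0.0, 0.0, 1.0, 1.0]
sharp = smooth_targets(X, Y, length_scale=0.05)  # tiny scale: targets nearly unchanged
flat = smooth_targets(X, Y, length_scale=5.0)    # large scale: targets pulled toward 0.5
print(sharp)
print(flat)
```

With a small length scale the step survives; with a large one it is washed out, which is the regime where a piecewise constant learner loses its advantage.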
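Uninformative features hurt MLP-like NNs more. Tabular datasets contain many uninformative features. For each dataset, the study drops a growing fraction of the features, ranked by importance (random forest feature importance). The paper's figure shows that removing up to half of the features has little effect on the classification accuracy of GBT. Conversely, a GBT trained only on the removed features has very low test accuracy when up to 20% of the features are removed, and still low accuracy up to 50%, which suggests that most of these features carry little information. The rise of that curve (the red line) shows, however, that they are not completely useless.
The paper also shows that removing uninformative features narrows the performance gap between MLPs (Resnet) and the other models (FT-Transformer and tree-based models), while adding uninformative features widens it. This indicates that MLPs are less robust to uninformative features.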
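MLPs are rotationally invariant. Why are MLPs more affected by uninformative features than other models? One answer is that MLPs are rotationally invariant: the procedure of training an MLP on the training set and evaluating it on the test set is unchanged when the same rotation is applied to the features of both sets. In fact, any rotationally invariant learning procedure has a worst-case sample complexity that grows at least linearly in the number of irrelevant features. Intuitively, to remove useless features, a rotationally invariant algorithm must first recover the original orientation of the features and then select the least informative ones, because the information contained in the orientation of the data is lost. Panel (a) of the paper's figure shows the change in test accuracy when the datasets are randomly rotated, confirming that only Resnets are rotationally invariant. Notably, random rotation reverses the performance ranking, which suggests that rotation invariance is not desirable. Indeed, the features of tabular data typically carry individual meaning, such as age or weight. Panel (b) shows that after removing the least important half of the features in each dataset (before rotating), rotation still lowers the performance of all models except Resnets, but by less than when all features are kept.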
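A toy example makes the effect of rotation on axis-aligned learners concrete. The sketch below is an illustration with made-up data, not the paper's experiment: the label depends only on the first feature, so one axis-aligned split (a decision stump) is perfect, but after a 45-degree rotation the signal is spread over both coordinates and no single split separates the classes.

```python
import math

def rotate_2d(X, angle):
    """Rotate every 2-D point by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x0 - s * x1, s * x0 + c * x1] for x0, x1 in X]

def best_stump_accuracy(X, y):
    """Best accuracy of any single axis-aligned split (one feature, one threshold)."""
    best = 0.0
    for j in range(len(X[0])):
        for t in [row[j] for row in X]:
            pred = [1 if row[j] > t else 0 for row in X]
            acc = sum(p == label for p, label in zip(pred, y)) / len(y)
            best = max(best, acc, 1.0 - acc)  # allow either polarity
    return best

# Feature 0 is informative (label = x0 > 0); feature 1 is large-scale noise.
X = [[1.0, 5.0], [1.0, -5.0], [-1.0, 5.0], [-1.0, -5.0]]
y = [1, 1, 0, 0]
X_rot = rotate_2d(X, math.pi / 4)

acc_original = best_stump_accuracy(X, y)
acc_rotated = best_stump_accuracy(X_rot, y)
print(acc_original)  # 1.0
print(acc_rotated)   # 0.75
```

A rotationally invariant learner would score the same on `X` and `X_rot`; the stump does not, because it relies on the original orientation of the features.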

How do the models differ in essence?

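The essence of tree models: piecewise constant functions. A decision tree is essentially a set of nested if-else decision rules. Mathematically it is a piecewise constant function, corresponding to a partition of the space by axis-parallel hyperplanes. Decision rules are a common way for people to handle problems. People distill such rules from experience, whereas a decision tree learns them automatically from training samples. This simple partitioning means the model's decision manifolds can be viewed as hyperplane boundaries, which works well for tabular data.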
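The essence of neural networks: piecewise linear functions. Neural networks are powerful for two reasons: 1) activation functions (such as ReLU) give an otherwise linear network the ability to express "piecewise" behavior; 2) any continuous function can be approximated by "piecewise" linear functions. This power cuts both ways: the excessive fitting capacity of neural networks makes them overfit easily on tabular data, which is usually not large. In a large network, the "high-dimensional features" produced by the hidden layers can even outnumber the original data.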
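The contrast between the two function classes can be shown with two toy functions. Both are hypothetical illustrations (the thresholds, leaf values, and weights are made up): a hand-written tree returns one constant per region, while a tiny ReLU network is linear between its kinks.

```python
def tree_predict(age, income):
    """A decision tree is nested if-else rules; every leaf returns a constant."""
    if age < 30:
        if income < 50_000:
            return 0.1
        return 0.4
    return 0.8

def relu(z):
    return max(0.0, z)

def tiny_relu_net(x):
    """One hidden layer with two ReLU units: slope 1 on [0, 1], slope -1 after 1."""
    return relu(x) - 2.0 * relu(x - 1.0)

# Piecewise constant: every point in the same region gets the same output.
print(tree_predict(25, 40_000), tree_predict(29, 10_000))
# Piecewise linear: the output changes linearly between the kinks at 0 and 1.
print([tiny_relu_net(x) for x in (-1.0, 0.5, 1.0, 1.5)])
```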

Summary

Characteristics of tree models

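(1) Naturally robust: insensitive to outliers and missing values, with no need for normalization or similar preprocessing
(2) The decision manifolds can be viewed as hyperplane boundaries, which works well for tabular data
(3) Greedy, automatic feature selection and feature combination give them stronger nonlinear expressive power than other classical ML models
(4) Highly interpretable: visualizing splits and feature importance helps improve feature engineering, which in turn refines the features and improves model performance
(5) The marginal gain from more data is small, so performance plateaus easily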

Characteristics of NN models

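(1) On dense data with uniform semantics, fully automatic feature engineering, including very strong feature mining and feature combination
(2) Very strong data memorization and extrapolative generalization
(3) Sensitive to outliers; on tabular data, strongly dependent on preprocessing
(4) Not interpretable: the prediction process cannot be displayed as directly as with a tree, and the model cannot be used to reason about or refine the base features
(5) Excessive nonlinearity brings a risk of overfitting and of fitting noise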

[CIKM 2019] AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks

Song, Weiping, et al. “AutoInt: Automatic feature interaction learning via self-attentive neural networks.” Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019.
code: https://github.com/DeepGraphLearning/RecommenderSystems

Introduction

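AutoInt was proposed for click-through rate (CTR) prediction, i.e., predicting the probability that a user clicks on an ad or item. CTR datasets are also tabular. The input features include user features and item features, and they are typically very high-dimensional and sparse (e.g. one-hot vectors), so models overfit easily and need to reduce the dimensionality of the inputs. In addition, effective prediction requires high-order feature interactions. Here a $p$-order combinatorial feature is $g(x_{i_1} , ..., x_{i_p} )$, where $x$ is the input feature vector and $g (\cdot)$ is a non-additive combination function. For example, a recommender could use the third-order combinatorial feature <Gender = Male, Age = 10, ProductCategory = VideoGame>.
AutoInt automatically learns low-dimensional high-order combinatorial features from high-dimensional sparse features. It performs well on CTR tasks while remaining reasonably interpretable.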

AutoInt: Automatic Feature Interaction Learning


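Embedding Layer. Maps both categorical and numerical features into the same low-dimensional space. If a sample belongs to several categories of one categorical field (multi-hot vector), the field's embedding is the mean of the embeddings of all those categories. A numerical feature is embedded by multiplying its scalar value by the corresponding embedding vector, and a value $z > 2$ is first transformed to $\log^2(z)$.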
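The three embedding rules above fit in a short sketch. The function names and the toy embedding table are hypothetical, and real implementations learn the table and the per-field vectors by gradient descent.

```python
import math

def embed_categorical(active_ids, table):
    """Mean of the embeddings of all active categories (covers one-hot and multi-hot)."""
    d = len(table[0])
    return [sum(table[i][k] for i in active_ids) / len(active_ids) for k in range(d)]

def scale_numeric(z):
    """log^2 transform applied only when z > 2, as described in the notes."""
    return math.log(z) ** 2 if z > 2 else z

def embed_numeric(z, v):
    """Numerical field: scalar value times that field's embedding vector v."""
    return [z * v_k for v_k in v]

table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 categories, d = 2
multi_hot = embed_categorical([0, 2], table)    # sample belongs to categories 0 and 2
numeric = embed_numeric(scale_numeric(1.5), [0.5, -1.0])
print(multi_hot, numeric)
```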
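Interacting Layer. The authors use multi-head self-attention to model high-order feature interactions. For the embedding $e_m$ of feature $m$ in attention head $h$, self-attention yields a new combinatorial feature $\tilde e_m^{(h)}\in\R^{d'}$, and the attention weights give this feature a degree of interpretability.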

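Each head has its own projection matrices $W^{(h)}_{Query}, W^{(h)}_{Key}, W^{(h)}_{Value}\in\R^{d'\times d}$, so multiple heads can model several different combinatorial features at once.
The outputs of the heads are concatenated, where $\oplus$ denotes vector concatenation and $H$ is the number of attention heads. To preserve previously learned combinatorial features (including the raw first-order features), residual connections are added,
with a projection $W_{Res}\in\R^{d'H\times d}$. Finally, stacking several interacting layers explicitly models combinatorial features of different orders.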
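The interacting layer can be sketched in plain Python. The notes do not reproduce the equations, so the details below are assumptions that follow the usual key-value attention formulation: inner-product scores between projected queries and keys, a softmax over all fields, concatenation of the heads, and a ReLU after adding the residual projection. All function names are illustrative.

```python
import math

def matvec(W, x):
    return [sum(w * x_k for w, x_k in zip(row, x)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def interacting_layer(E, heads, W_res):
    """E: M field embeddings of length d.
    heads: H tuples (W_query, W_key, W_value), each matrix of shape d' x d.
    W_res: residual projection of shape (d' * H) x d."""
    out, attn = [], []
    for e_m in E:
        concat, attn_m = [], []
        for W_q, W_k, W_v in heads:
            q = matvec(W_q, e_m)
            # Attention of field m over every field k (assumed inner-product score).
            alpha = softmax([dot(q, matvec(W_k, e_k)) for e_k in E])
            values = [matvec(W_v, e_k) for e_k in E]
            head_out = [sum(a * v[i] for a, v in zip(alpha, values))
                        for i in range(len(values[0]))]
            concat.extend(head_out)          # concatenation over heads
            attn_m.append(alpha)
        res = matvec(W_res, e_m)             # residual keeps lower-order features
        out.append([max(0.0, c + r) for c, r in zip(concat, res)])
        attn.append(attn_m)
    return out, attn

I2 = [[1.0, 0.0], [0.0, 1.0]]
E = [[1.0, 0.0], [0.0, 1.0]]                 # M = 2 fields, d = d' = 2, H = 1
out, attn = interacting_layer(E, [(I2, I2, I2)], I2)
print(out)
print(attn)
```

The returned `attn` weights are what an attention-map visualization would plot. Stacking the function on its own output gives higher-order interactions.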
Output Layer
Here $M$ is the number of feature fields.
Training

Experiment

DataSets
Evaluation of Effectiveness
Evaluation of Model Efficiency
Ablation study
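Explainability. The authors use the first-layer attention map to visualize correlations between features. The paper's figures (one for a single sample, one averaged over all samples) show that AutoInt extracts meaningful feature combinations.
Integrating Implicit Interactions. Adding two fully connected layers to introduce implicit feature interactions further improves model performance.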

[AAAI 2021] TabNet: Attentive Interpretable Tabular Learning

Arik, Sercan Ö., and Tomas Pfister. “TabNet: Attentive interpretable tabular learning.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 8. 2021.

Introduction

Why are DTs good for tabular data?

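DL models for tabular data have not yet achieved a decisive breakthrough, and decision trees (DTs) still dominate the vast majority of tabular tasks. This may be due to the following advantages of DTs: (1) their decision manifolds approximate hyperplane boundaries, which suits tabular data well; (2) they are highly interpretable, and many post-hoc explainability methods exist for tree ensembles; (3) they train quickly. Meanwhile, current DNNs have the following drawbacks: (1) they are vastly overparametrized; (2) they lack an appropriate inductive bias.
The paper illustrates this with a simple example of a decision tree manifold.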

Why is deep learning worth exploring for tabular data?

(1) expected performance improvements particularly for large datasets
(2) end-to-end learning which enables (i) efficiently encoding multiple data types like images along with tabular data; (ii) alleviating the need for feature engineering, which is currently a key aspect in tree-based tabular data learning methods; (iii) learning from streaming data and perhaps most importantly (iv) end-to-end models allow representation learning

TabNet

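To address the problems of current DL models, the authors propose TabNet, which has the following properties:

(1) It needs no data preprocessing: raw tabular data can be fed in directly, and the model is trained end to end.
(2) It uses sequential attention to choose which features to use at each decision step (instance-wise soft feature selection with controllable sparsity), which makes it more interpretable (including local interpretability that visualizes the importance of features and how they are combined, and global interpretability which quantifies the contribution of each feature to the trained model).
(3) The authors show that self-supervised pre-training that predicts masked features effectively improves model performance.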

TabNet for Tabular Learning

Simple DNN building blocks with DT-like output manifold

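If we want a DNN to have the advantages of tree models (feature selection, interpretability, ...), the first question to answer is how to build a neural network whose decision manifold resembles that of a tree. The paper gives a simple example built from conventional DNN blocks, in which the decision boundary gets sharper as $C_1$ and $C_2$ get larger.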

TabNet encoder


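TabNet does not apply feature standardization or similar preprocessing. Instead, the numerical features and the categorical features (converted by an Embedding layer) are gathered into $f\in\R^{B\times D}$, batch-normalized with BN, and passed to $N_{steps}$ decision steps.
Instance-wise soft feature selection. Decision step $i$ ( $i\geq1$ ) uses the output $a [i - 1]$ of the previous decision step to select features. Specifically, the attentive transformer produces a learnable mask $M[i]\in\R^{B\times D}$ that selects the salient features, i.e. $M[i]\cdot f$ . The mask is computed as: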
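$M[i]=\text{sparsemax}(P[i-1] \cdot h_i(a[i- 1]))$
Here $h_i$ is FC + BN and the product is element-wise. $\text{sparsemax}(z)=\argmin_{p\in\Delta}\|p-z\|^2$ can be seen as a sparse version of softmax: it projects the vector directly onto the probability simplex $\Delta$, which yields exact zeros while still satisfying $\sum_{j=1}^DM[i]_{b,j}=1$ . $P[i-1]=\prod_{j=1}^{i-1}(\gamma-M[j])$ is the prior scale term, which tracks how much each feature has been used in the previous decision steps. $\gamma$ is a relaxation parameter: when $\gamma=1$ a feature can be used in only one decision step, and the constraint loosens as $\gamma$ grows. $P [0]$ is initialized to the all-ones matrix $1^{B\times D}$ . If some features are not used (e.g. in self-supervised pre-training), the corresponding feature columns of $P [0]$ can be set to 0 to help training.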
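Sparsemax has a simple closed-form solution based on sorting. The sketch below implements that projection for one sample and shows one update of the prior scale term. The logits stand in for the output of $h_1$ and are made up, and the helper name `update_prior` is hypothetical.

```python
def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (sorting-based solution)."""
    z_sorted = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        if 1.0 + k * z_k > cumsum:       # the k largest entries stay in the support
            tau = (cumsum - 1.0) / k     # threshold for the current support size
    return [max(z_i - tau, 0.0) for z_i in z]

def update_prior(prior, mask, gamma):
    """One factor of the product P[i] = prod_j (gamma - M[j]), applied element-wise."""
    return [p * (gamma - m) for p, m in zip(prior, mask)]

gamma = 1.0
prior = [1.0, 1.0, 1.0]                  # P[0]: every feature is available
logits_step1 = [1.0, 0.8, 0.0]           # hypothetical h_1(a[0]) for one sample
mask1 = sparsemax([p * h for p, h in zip(prior, logits_step1)])
prior = update_prior(prior, mask1, gamma)
print(mask1)   # sparse: the third feature gets exactly zero weight
print(prior)   # features already used are down-weighted for the next step
```

Unlike softmax, the weakest feature receives exactly zero weight, which is what makes the selection sparse.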
In addition, to further encourage sparse feature selection and reduce the effect of redundant features on the model, the authors introduce a sparsity regularization that adds the entropy of the per-sample masks to the loss function:
$L_{sparse}=\lambda_{sparse}\sum_{i=1}^{N_{\text{steps}}} \sum_{b=1}^B \sum_{j=1}^D \frac{-\mathbf{M}_{\mathbf{b}, \mathbf{j}}[\mathbf{i}] \log \left(\mathbf{M}_{\mathbf{b}, \mathbf{j}}[\mathbf{i}]+\epsilon\right)}{N_{\text{steps}} \cdot B}$
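The entropy term is easy to check numerically. This sketch computes the sum without the $\lambda_{sparse}$ factor, and the value of `eps` is an assumed small constant.

```python
import math

def sparsity_loss(masks, eps=1e-15):
    """masks[i][b][j] is the mask of step i, sample b, feature j.
    Returns the mask entropy averaged over steps and samples (lambda_sparse not applied)."""
    n_steps, batch = len(masks), len(masks[0])
    total = sum(-m * math.log(m + eps) for step in masks for row in step for m in row)
    return total / (n_steps * batch)

one_hot = [[[1.0, 0.0, 0.0, 0.0]], [[0.0, 1.0, 0.0, 0.0]]]   # 2 steps, 1 sample
uniform = [[[0.25, 0.25, 0.25, 0.25]], [[0.25, 0.25, 0.25, 0.25]]]
print(sparsity_loss(one_hot))   # close to 0: fully sparse masks are not penalized
print(sparsity_loss(uniform))   # close to log(4): dense masks pay the maximum entropy
```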
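Feature processing. Because every decision step processes the input features, the feature transformer contains both layers shared by all decision steps and layers specific to each decision step. A GLU (gated linear unit) adds a gate on top of an FC layer and is computed as $h(X)=(W_1 \cdot X+b_1) \otimes \sigma(W_2 \cdot X+b_2)$ . The blocks are joined by normalized residual connections: before the two branches of the residual structure are added, both outputs are multiplied by $\sqrt{0.5}$ , which keeps the output variance roughly constant and makes training more stable.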
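A minimal sketch of the GLU formula and the normalized residual connection follows. It omits batch normalization for brevity, and the weights are arbitrary illustrative values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(W, b, x):
    return [sum(w * x_k for w, x_k in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def glu(x, W1, b1, W2, b2):
    """h(X) = (W1 X + b1) * sigmoid(W2 X + b2), element-wise."""
    values = linear(W1, b1, x)
    gates = linear(W2, b2, x)
    return [v * sigmoid(g) for v, g in zip(values, gates)]

def normalized_residual(x, block_out):
    """Scale both branches by sqrt(0.5) before adding them."""
    s = math.sqrt(0.5)
    return [s * x_i + s * h_i for x_i, h_i in zip(x, block_out)]

I2 = [[1.0, 0.0], [0.0, 1.0]]
Z2 = [[0.0, 0.0], [0.0, 0.0]]
zeros = [0.0, 0.0]
h = glu([2.0, -4.0], I2, zeros, Z2, zeros)   # gate is sigmoid(0) = 0.5 everywhere
print(h)
print(normalized_residual([1.0, 1.0], [1.0, 1.0]))
```

If the two branches are independent with equal variance, the scaled sum has the same variance as each branch, which is the motivation for the $\sqrt{0.5}$ factor.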
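After feature extraction, TabNet uses a Split layer to divide the result into the output $d [i]$ used for the decision and the output $a [i]$ passed to the next decision step. The final decision embedding is
$\mathrm{d}_{\text {out }}=\sum_{i=1}^{N_{\text {steps }}} \operatorname{ReLU}(\mathbf{d}[\mathbf{i}])$
Passing $\mathrm{d}_{\text {out }}$ through an FC layer gives the final output.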
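Interpretability. The importance of each feature to the model can be quantified as follows. The output vector of one step (after ReLU, as in $\mathrm{d}_{\text {out }}$ ) is summed into a scalar that reflects how much this step contributes to the final result. Multiplying this scalar by the step's mask gives the importance of each feature within the step, and adding up the results of all steps gives the global importance of each feature.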
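The aggregation described above can be sketched for a single sample. The final normalization so that the importances sum to 1 is an assumption of this sketch, and the inputs are made-up values.

```python
def aggregate_importance(decisions, masks):
    """decisions[i]: decision output d[i] of one sample at step i.
    masks[i]: that sample's mask row M[i] over the D features."""
    n_features = len(masks[0])
    agg = [0.0] * n_features
    for d_i, m_i in zip(decisions, masks):
        eta = sum(max(0.0, v) for v in d_i)   # contribution of this step
        for j in range(n_features):
            agg[j] += eta * m_i[j]
    total = sum(agg)
    # Normalize so the importances sum to 1 (an assumption of this sketch).
    return [a / total for a in agg] if total > 0 else agg

decisions = [[1.0, -2.0], [3.0, 0.0]]         # two steps, 2-dimensional d[i]
masks = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]    # three features
importance = aggregate_importance(decisions, masks)
print(importance)  # [0.25, 0.375, 0.375]
```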

Tabular self-supervised learning

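Self-supervised pre-training predicts masked features. At the start of each training iteration, a binary mask $S\in\{0,1\}^{B\times D}$ is sampled from a Bernoulli distribution with parameter $p_s$ to indicate which features are masked.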
decoder
reconstruction loss
$\sum_{b=1}^B \sum_{j=1}^D\left|\frac{\left(\hat{\mathbf{f}}_{\mathbf{b}, \mathbf{j}}-\mathbf{f}_{\mathbf{b}, \mathbf{j}}\right) \cdot \mathbf{S}_{\mathbf{b}, \mathbf{j}}}{\sqrt{\sum_{b=1}^B\left(\mathbf{f}_{\mathbf{b}, \mathbf{j}}-1 / B \sum_{b=1}^B \mathbf{f}_{\mathbf{b}, \mathbf{j}}\right)^2}}\right|^2$
where $f$ is the ground-truth features, $\hat f$ is the reconstructed features, and the denominator normalizes each feature by the population standard deviation of its ground-truth values.
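The reconstruction loss can be computed directly from the formula. This sketch follows the denominator exactly as written above (the square root of the summed squared deviations per feature) and assumes no feature column is constant, since that would make the denominator zero.

```python
import math

def reconstruction_loss(f_hat, f, S):
    """f_hat, f, S are B x D nested lists; S[b][j] = 1 marks a masked entry to reconstruct."""
    B, D = len(f), len(f[0])
    loss = 0.0
    for j in range(D):
        mean_j = sum(f[b][j] for b in range(B)) / B
        norm_j = math.sqrt(sum((f[b][j] - mean_j) ** 2 for b in range(B)))
        for b in range(B):
            loss += ((f_hat[b][j] - f[b][j]) * S[b][j] / norm_j) ** 2
    return loss

f = [[0.0, 1.0], [2.0, 3.0]]        # ground-truth features, B = 2, D = 2
f_hat = [[1.0, 1.0], [2.0, 3.0]]    # reconstruction is off by 1 in entry (0, 0)
S = [[1, 0], [0, 0]]                # only entry (0, 0) was masked
loss = reconstruction_loss(f_hat, f, S)
print(loss)
```

Only masked entries contribute, and dividing by the per-feature spread keeps features with large ranges from dominating the loss.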

Experiments

Instance-wise Feature Selection

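The first experiment tests whether TabNet can select features on a per-sample basis. It uses six synthetic datasets, Syn1-6, in which most features are useless and only a small subset of key features is related to the label. For Syn1-3 the key features are the same for every sample in the dataset, so a global feature selection method is enough to reach the optimum. Syn4-6 are harder: the key features differ between samples and depend on an additional indicator feature. For example, in Syn4, $X_{11}$ is the indicator, and its value determines whether $\{X_1,X_2\}$ or $\{X_3,X_4,X_5,X_6\}$ is the set of key features. For such datasets, simple global feature selection is clearly suboptimal.
The paper's table reports the mean and standard deviation of test AUC for TabNet and several baselines. TabNet performs well, and on Syn4-6 it improves over global feature selection (Global).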

Performance on Real-World Datasets

Forest Cover Type
Poker Hand
Sarcos
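Higgs Boson: a physics dataset in which the task is to distinguish signal processes that produce Higgs bosons from background processes. Because the dataset is very large, DNNs outperform tree models on it.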
Rossmann Store Sales

Interpretability

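In the paper's figure, $M_{agg}$ is the global feature importance and $M_i$ is the feature importance at step $i$. For Syn2, TabNet selects one feature at each step, and the global importance ends up concentrated on $\{X_3,X_4,X_5,X_6\}$, which matches the design of Syn2, where the key features are the same for all samples. For Syn6, the features selected within a step differ from sample to sample, and the global importance is concentrated on $\{X_1,X_2\}$, $\{X_3,X_4,X_5,X_6\}$, and $X_{11}$, which matches the instance-wise design of Syn6. These visualizations show that TabNet accurately captures the relationship between the indicator feature and the key features.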

Self-Supervised Learning


[NeurIPS 2021] Revisiting deep learning models for tabular data

Gorishniy, Yury, et al. “Revisiting deep learning models for tabular data.” Advances in Neural Information Processing Systems 34 (2021): 18932-18943.
code: https://github.com/yandex-research/rtdl

Introduction

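The authors argue that DL models for tabular data lack a common benchmark and simple, effective baselines, which makes it hard to compare the performance of different models fairly.
To address this, the authors review the mainstream DL models for tabular data and propose two simple and effective baselines: (1) ResNet-like architecture, (2) FT-Transformer, a simple adaptation of the Transformer architecture for tabular data. By comparing these baselines with existing models, the authors find that no DL model consistently outperforms the ResNet-like model, while FT-Transformer achieves the best results on most tasks. In addition, from the comparison with GBDT, the authors conclude that there is still no “silver bullet” for tabular data.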

Models for tabular data problems

Notation

Supervised Learning Problems. $D=\{(x_i, y_i)\}_{i=1}^n$ , where $x_i=(x_i^{(num)},x_i^{(cat)})\in\mathbb X$ represents numerical $x_i^{(num)}$ and categorical $x_i^{(cat)}$ features. The total number of features is denoted as $k$ .

MLP


ResNet


FT-Transformer (Feature Tokenizer + Transformer)

The FT-Transformer architecture


Feature Tokenizer

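The Feature Tokenizer module transforms the input features $x$ to embeddings $T\in\R^{k\times d}$ . The embedding of a numerical feature is obtained by element-wise multiplication of its value with a learned vector, and the embedding of a categorical feature by a lookup table.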
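The tokenizer can be sketched as follows. The per-feature bias vectors are an assumption of this sketch (the notes only mention the multiplication and the lookup), and all parameter names and values are illustrative rather than learned.

```python
def feature_tokenizer(x_num, x_cat, W_num, b_num, tables, b_cat):
    """Returns k = len(x_num) + len(x_cat) tokens, each of dimension d.
    W_num[j], b_num[j]: weight and bias vectors of numerical feature j.
    tables[j][c], b_cat[j]: embedding of category c and bias of categorical feature j."""
    tokens = []
    for x_j, w_j, b_j in zip(x_num, W_num, b_num):
        tokens.append([x_j * w + b for w, b in zip(w_j, b_j)])      # scalar times vector
    for c_j, table_j, b_j in zip(x_cat, tables, b_cat):
        tokens.append([e + b for e, b in zip(table_j[c_j], b_j)])   # lookup table
    return tokens

T = feature_tokenizer(
    x_num=[2.0],
    x_cat=[1],
    W_num=[[1.0, -1.0]],
    b_num=[[0.0, 0.5]],
    tables=[[[0.0, 0.0], [3.0, 4.0]]],   # one categorical feature with 2 categories
    b_cat=[[1.0, 1.0]],
)
print(T)  # [[2.0, -1.5], [4.0, 5.0]]
```

Every feature, numerical or categorical, becomes one $d$-dimensional token, so the Transformer layers can treat the $k$ features as a sequence.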

Transformer

We use the PreNorm variant for easier optimization. In the PreNorm setting, we also found it to be necessary to remove the first normalization from the first Transformer layer to achieve good performance.

Prediction


Other models

SNN (Klambauer et al., 2017). An MLP-like architecture with the SELU activation that enables training deeper models.
NODE (Popov et al., 2020). A differentiable ensemble of oblivious decision trees.
TabNet (Arik and Pfister, 2020). A recurrent architecture that alternates dynamical reweighting of features and conventional feed-forward modules.
GrowNet (Badirli et al., 2020). Gradient boosted weak MLPs. The official implementation supports only classification and regression problems.
DCN V2 (Wang et al., 2020). Consists of an MLP-like module and the feature crossing module (a combination of linear layers and multiplications).
AutoInt (Song et al., 2019). Transforms features to embeddings and applies a series of attention-based transformations to the embeddings.
XGBoost (Chen and Guestrin, 2016). One of the most popular GBDT implementations.
CatBoost (Prokhorenkova et al., 2018). GBDT implementation that uses oblivious decision trees (Lou and Obukhov, 2017) as weak learners.

Experiments

Datasets


Comparing DL models


The main takeaways:

MLP is still a good sanity check.
ResNet turns out to be an effective baseline that none of the competitors can consistently outperform.
FT-Transformer performs best on most tasks and becomes a new powerful solution for the field.
Tuning makes simple models such as MLP and ResNet competitive, so we recommend tuning baselines when possible. Luckily, today, it is more approachable with libraries such as Optuna (Akiba et al., 2019).

Comparing DL models and GBDT


Note that the paper's table compares ensembles of each model, and the XGBoost and CatBoost ensembles are far more lightweight than the DL ensembles.

The main takeaways:

FT-Transformer allows building powerful ensembles out of the box.
FT-Transformer shows the most convincing improvements over ResNet exactly on those datasets where GBDT outperforms ResNet.
Admittedly, GBDT remains an unsuitable solution to multiclass problems with a large number of classes. Depending on the number of classes, GBDT can demonstrate unsatisfactory performance (Helena) or even be untunable due to extremely slow training (ALOI).
There is still no universal solution among DL models and GBDT.
DL research efforts aimed at surpassing GBDT should focus on datasets where GBDT outperforms state-of-the-art DL solutions. Note that including “DL-friendly” problems is still important to avoid degradation on such problems.