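Abstract

In this paper, we explore meta-learning for few-shot text classification. Meta-learning has shown strong performance in computer vision, where low-level patterns transfer across learning tasks. Applying the approach directly to text is challenging, however: lexical features that are highly informative for one task may be insignificant for another. **Rather than learning solely from words, our model also leverages their distributional signatures**, which encode pertinent word occurrence patterns.
The model is trained within a meta-learning framework to map these signatures into attention scores, which are then used to weight the lexical representations of words. We find that the model outperforms prototypical networks learned on lexical knowledge by a significant margin, in both few-shot text classification and relation classification, across six benchmark datasets (19.96% on average in 1-shot classification).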

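Introduction

In computer vision, meta-learning has emerged as a promising approach for learning in a low-resource regime.
Specifically, the goal is to enable an algorithm to expand to new classes for which only a few training examples are available.

These models learn to generalize in low-resource scenarios by recreating such training episodes from the available data.
Even in the most extreme low-resource scenario, a single training example per class, this approach yields 99.6% accuracy on the character recognition task.

Given this strong empirical performance, we are interested in employing meta-learning frameworks in NLP.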

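Challenges
- The main challenge is the degree of transferability of the underlying representation learned across different classes. In computer vision, low-level patterns (such as edges) and their corresponding representations can be shared across tasks. The situation is different for language data, where most tasks operate at the lexical level: words that are highly informative for one task may not be relevant for another.
- Consider, for example, the corpus of HuffPost headlines, categorized into 41 classes.
- Figure 1 shows that words highly salient for one class do not play a significant role in classifying others.
  Not surprisingly, when meta-learning is applied directly on lexical inputs, its performance drops below that of a simple nearest-neighbor classifier.
- The inability of a traditional meta-learner to zoom in on important features is further illustrated in Figure 2.
- When considering the target class fifty (lifestyle for middle-aged), the standard prototypical network attends to uninformative words like "date" while downplaying highly predictive words such as "grandma".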

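Figure 1: words that are salient for one class are not salient for others.

Figure 2: attention of a standard prototypical network on the target class fifty.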

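Building on these ideas, we would like to leverage distributional signatures for transfer across classes. In addition to word frequency, we assess word importance with respect to a specific class.
The latter relation cannot be reliably estimated for the target classes because labeled data is scarce.
However, we can obtain a noisy estimate of this indicator by utilizing the few provided training examples for the target class, and then further refine the approximation within the meta-learning framework.
We note that the representational power of distributional signatures is weaker than that of their lexical counterparts. Meta-knowledge built on distributional signatures, however, generalizes better.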

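Our model consists of two components:

The first is an attention generator.
- It translates distributional signatures into attention scores that reflect word importance for classification.
The second is a ridge regressor.
- It quickly learns to make predictions after seeing only a few training examples. The attention generator is shared across all episodes.
  We evaluate our model on five standard text classification datasets and one relation classification dataset. Experimental results show that our model delivers significant performance gains over all baselines.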

BACKGROUND

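In this section, we first summarize the standard meta-learning framework for few-shot classification and describe the terminology. We then introduce our extension to this framework.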

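Problem statement

Suppose we are given labeled examples from a set of classes $y^{train}$.
Our goal is to develop a model that acquires knowledge from these training data, so that we can make predictions over new (but related) classes for which we have only a few annotations. These new classes belong to a set of classes $y^{test}$ that is disjoint from $y^{train}$.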

Meta-training

During meta-training, we mimic the scenario above so that our model learns to learn quickly from a few annotations. To create a single training episode, we proceed as follows:

- We first sample $N$ classes from $y^{train}$.
- For each of these $N$ classes, we sample $K$ examples as our training data and $L$ examples as our testing data.
- We update our model based on the loss over these testing data.
- We repeat this procedure to increase the number of training episodes, each of which is constructed over its own set of $N$ classes.

The training data of one episode is commonly denoted as the support set, and the corresponding testing data is known as the query set.
We refer to the task of making predictions over the query set as $N$-way $K$-shot classification.
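The sampling procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `data_by_class` dictionary and the function name `sample_episode` are hypothetical, and the sketch assumes every class has at least $K + L$ examples.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, l_query, rng=random):
    """Sample one N-way K-shot episode.

    data_by_class is a hypothetical dict mapping a class label to a list
    of examples. Returns (support, query) as lists of (example, label).
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw K + L distinct examples so support and query never overlap.
        picked = rng.sample(data_by_class[label], k_shot + l_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy data: 4 classes with 5 documents each.
toy = {c: [f"{c}-doc{i}" for i in range(5)] for c in ["a", "b", "c", "d"]}
support, query = sample_episode(toy, n_way=3, k_shot=1, l_query=2)
```

With $N = 3$, $K = 1$, $L = 2$ this yields 3 support examples and 6 query examples; the model would be updated on the loss over the query part.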
  
  Figure: an episode with $N = 3$, $K = 1$, and $L = 2$.

Rectangles denote input examples, and the text inside corresponds to their labels.

An episode contains a support set, a query set, and a source pool.

Meta-testing

After meta-training is finished, we apply the same episode-based mechanism to test whether our model can quickly adapt to new classes.
To create a testing episode:
- We first sample $N$ new classes from $y^{test}$.
- We then sample the support set and the query set from these $N$ classes.
- We evaluate the average performance on the query set across all testing episodes.

Our extension

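All examples from $y^{train}$ are accessible throughout meta-training.

The standard meta-learning framework (Vinyals et al., 2016) learns only from small subsets of these data per training episode.
In contrast, our model leverages distributional statistics over all training examples, which makes it more robust.
To accommodate this adjustment, we augment each episode with a source pool.
During meta-training, the source pool includes all examples from the training classes that were not selected for the episode.
During meta-testing, the source pool includes all training examples.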
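A small sketch may help make the source pool concrete. It assumes that, during meta-training, the pool excludes the classes drawn for the current episode, and that during meta-testing it is simply the full training set. The function name and the dictionary layout are hypothetical.

```python
def build_source_pool(train_by_class, episode_classes, meta_training):
    """Collect the source-pool examples for one episode.

    train_by_class: hypothetical dict, training class label -> examples.
    episode_classes: labels sampled for the current episode.
    meta_training: True during meta-training, False during meta-testing.
    """
    # Assumption: episode classes are held out of the pool only while
    # meta-training; test classes are disjoint from training classes anyway.
    excluded = set(episode_classes) if meta_training else set()
    return [x
            for label, docs in train_by_class.items()
            if label not in excluded
            for x in docs]


train = {"a": [1, 2], "b": [3], "c": [4, 5]}
pool_train = build_source_pool(train, ["a"], meta_training=True)
pool_test = build_source_pool(train, ["x"], meta_training=False)
```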

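Key contribution

The model maps these signatures into attention scores, which are then used to weight the lexical representations of words.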

Key terms

a single training example per class
the character recognition task
word-substitution
one-shot text classification
one-shot relation classification
high-quality attention
a metric space over input features
developing a prior over the optimization procedure
exploiting the relations between classes
lexical representations
the source pool
the target classification task
distributional statistics

Ideas worth borrowing from the model

utilize distributional signatures
assess word importance with respect to a specific class
obtain a noisy estimate of this indicator by utilizing the few provided training examples for the target class, then further refine this approximation within the meta-learning framework

Challenges

Given this strong empirical performance, we are interested in employing meta-learning frameworks
in NLP. The challenge, however, is the degree of transferability of the underlying representation
learned across different classes. In computer vision, low-level patterns (such as edges) and their
corresponding representations can be shared across tasks. However, the situation is different for
language data where most tasks operate at the lexical level. Words that are highly informative
for one task may not be relevant for other tasks. Consider, for example, the corpus of HuffPost
headlines, categorized into 41 classes. Figure 1 shows that words highly salient for one class do not
play a significant role in classifying others. Not surprisingly, when meta-learning is applied directly
on lexical inputs, its performance drops below a simple nearest neighbor classifier. The inability of
a traditional meta-learner to zoom-in on important features is further illustrated in Figure 2: when
considering the target class fifty (lifestyle for middle-aged), the standard prototypical network (Snell
et al., 2017) attends to uninformative words like “date,” while downplaying highly predictive words
such as “grandma.”

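Method

Our goal is to improve few-shot classification performance by learning high-quality attention from the distributional signatures of the inputs:

Given a particular episode, we extract relevant statistics from the source pool and the support set.
These statistics only roughly approximate word importance for classification.
We therefore use an attention generator to translate them into high-quality attention that operates over words. The generated attention then guides the downstream predictor, a ridge regressor, so that it can quickly learn from a few labeled examples.
The attention generator is optimized over all training episodes, while the ridge regressor is trained from scratch for each episode.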

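Attention generator

This module generates class-specific attention by combining the distributional statistics of the source pool and the support set. The generated attention provides the ridge regressor with an inductive bias on word importance. We train this module based on feedback from the ridge regressor.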

Ridge regressor

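For each episode, this module constructs lexical representations using the attention derived from distributional signatures.
The goal of this module is to make predictions on the query set after learning from the support set.
Its prediction loss is end-to-end differentiable with respect to the attention generator, which enables efficient training.
In our theoretical analysis, we show that the output of the attention generator is invariant to word-substitution perturbations.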

ATTENTION GENERATOR

The goal of the attention generator is to assess word importance from the distributional signatures of each input example.

There are many choices of distributional signatures. Among them, we focus on functions of unigram statistics, which are provably robust to word-substitution perturbations.
We utilize the large source pool to inform the model of general word importance, and leverage the small support set to estimate class-specific word importance.
The generated attention is then used to construct the input representation for downstream classification.

It is well documented in the literature that frequently occurring words are unlikely to be informative.
We would therefore like to downweight frequent words and upweight rare words.

To measure general word importance, we use

$s(x_i) = \frac{\varepsilon}{\varepsilon + P(x_i)}$

where $\varepsilon = 10^{-3}$, $x_i$ is the $i^{th}$ word of input example $x$, and $P(x_i)$ is the unigram likelihood of $x_i$ over the source pool.
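As a rough illustration of $s(\cdot)$, the sketch below estimates unigram likelihoods from a toy source pool and applies the formula with $\varepsilon = 10^{-3}$. The helper names are hypothetical, and treating words unseen in the pool as having likelihood 0 is an assumption of this sketch.

```python
from collections import Counter

EPS = 1e-3  # epsilon from the text


def unigram_likelihood(source_pool):
    """Estimate P(word) from a source pool given as a list of token lists."""
    counts = Counter(tok for doc in source_pool for tok in doc)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def general_importance(tokens, p_unigram, eps=EPS):
    """s(x_i) = eps / (eps + P(x_i)) for every token of one example."""
    # Assumption: words unseen in the pool get P = 0, hence s = 1.
    return [eps / (eps + p_unigram.get(tok, 0.0)) for tok in tokens]


pool = [["the", "cat"], ["the", "dog"], ["the", "grandma"]]
p = unigram_likelihood(pool)
s = general_importance(["the", "grandma"], p)
```

In this toy pool the frequent word "the" receives a lower score than the rarer "grandma", which is the intended downweighting effect.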
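On the other hand, words that are discriminative in the support set are likely to be discriminative in the query set as well.
We therefore define the following statistic to reflect class-specific word importance:

$t(x_i) = H(P(y|x_i))^{-1}$

where the conditional likelihood $P(y|x_i)$ is estimated over the support set using a regularized linear classifier and $H(\cdot)$ is the entropy operator.
Since $t(\cdot)$ is the inverse of the uncertainty of the class label $y$ given the word, words that exhibit a skewed class distribution are weighted more highly.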
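The inverse-entropy statistic can be sketched as follows. The sketch takes the conditional distributions $P(y|x_i)$ as given (how they are estimated is left out), and the small `floor` that keeps the inverse finite for one-hot rows is an assumption added here, not part of the formula.

```python
import math


def class_specific_importance(cond_probs, floor=1e-12):
    """t(x_i) = 1 / H(P(y | x_i)) for each row of class probabilities."""
    scores = []
    for probs in cond_probs:
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        # A one-hot row has zero entropy; the floor keeps the inverse finite
        # (this guard is an assumption of the sketch).
        scores.append(1.0 / max(entropy, floor))
    return scores


# First word: uniform over 3 classes. Second word: strongly skewed.
t = class_specific_importance([[1 / 3, 1 / 3, 1 / 3], [0.9, 0.05, 0.05]])
```

The skewed word ends up with the larger score, matching the statement that skewed distributions are weighted more highly.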

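Applying these statistics directly may not yield good performance, for two reasons:

- The two statistics contain complementary information, and it is unclear how to combine them.
- The statistics are only noisy approximations of word importance for classification.

To bridge this gap, we concatenate the signatures and employ a bidirectional LSTM to fuse the information across the input:
$h = \mathrm{biLSTM}([s(x); t(x)])$

Finally, we use dot-product attention to predict the attention score $\alpha_i$ of word $x_i$:
$\alpha_i := \frac{\exp(v^T h_i)}{\sum_j \exp(v^T h_j)}$

where $h_i$ is the output of the biLSTM at position $i$ and $v$ is a learnable vector.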
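The dot-product attention step is a softmax over $v^T h_i$. The sketch below shows only that step in plain Python; the `hidden` vectors are made-up stand-ins for biLSTM outputs, and subtracting the maximum logit is a standard numerical-stability trick rather than something stated in the text.

```python
import math


def attention_scores(hidden, v):
    """alpha_i = softmax_i(v . h_i) over the positions of one example."""
    logits = [sum(vk * hk for vk, hk in zip(v, h)) for h in hidden]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up hidden states for a 3-word input and a 2-dimensional vector v.
hidden = [[0.1, 0.3], [2.0, -1.0], [0.0, 0.0]]
alpha = attention_scores(hidden, v=[1.0, 0.5])
```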

RIDGE REGRESSOR

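Informed by the attention generator, the ridge regressor quickly learns to make predictions after seeing a few examples.
First, for each example in a given episode, we construct a lexical representation that focuses on important words, as indicated by the attention scores.
Next, given these lexical representations, we train the ridge regressor on the support set from scratch.
Finally, we make predictions on the query set and use the loss to update the attention generator.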

Constructing representations

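Different words exhibit varying levels of importance for classification, so we construct lexical representations that favor the pertinent words. Specifically, we define the representation of example $x$ as
$\phi(x) = \sum_i \alpha_i \cdot f_{ebd}(x_i)$

where $f_{ebd}(\cdot)$ is a pre-trained embedding function that maps a word into $R^E$.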
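The representation is an attention-weighted sum of word embeddings. A minimal sketch, with hypothetical toy embeddings of dimension $E = 2$:

```python
def represent(alpha, embeddings):
    """phi(x) = sum_i alpha_i * f_ebd(x_i), with embeddings[i] = f_ebd(x_i)."""
    dim = len(embeddings[0])
    return [sum(a * e[d] for a, e in zip(alpha, embeddings))
            for d in range(dim)]


# Two words with attention 0.25 and 0.75.
phi = represent([0.25, 0.75], [[1.0, 0.0], [0.0, 2.0]])
```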

Training from the support set

Given an $N$-way $K$-shot classification task, let $\Phi_S \in R^{NK \times E}$ be the representation of the support set, obtained from $\phi(\cdot)$, and let $Y_S \in R^{NK \times N}$ be the one-hot labels.

We train the ridge regressor to fit the labeled support set:

- Ridge regression admits **a closed-form solution** that enables end-to-end differentiation through the model.
- With proper regularization, ridge regression reduces over-fitting on the small support set.

Specifically, we minimize a regularized squared loss. In its closed-form solution, $I$ denotes the identity matrix.
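For reference, the regularized squared loss and its closed-form minimizer can be written out as follows. This assumes the standard ridge regression form with a Frobenius-norm penalty weighted by $\lambda > 0$ and a weight matrix $W \in R^{E \times N}$; the symbol $W$ is introduced here for illustration.

$L^{RR}(W) = \lVert \Phi_S W - Y_S \rVert_F^2 + \lambda \lVert W \rVert_F^2$

$W = \Phi_S^T (\Phi_S \Phi_S^T + \lambda I)^{-1} Y_S$

Under these assumptions $I$ is the $NK \times NK$ identity matrix, so the matrix being inverted has the size of the support set rather than the embedding dimension, which keeps the solve cheap in the few-shot regime.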

Inference on the query set

Let $\Phi_Q$ denote the representation of the query set.
Although we optimize a regression objective in Equation 1, the learned transformation has been shown to work well in few-shot classification after a calibration step parameterized by $a \in R^{+}$ and $b \in R$.

Both are meta-parameters learned through meta-training.
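The fit-then-calibrate procedure can be sketched with NumPy. Several things here are assumptions rather than details taken from the text: the closed form is the standard ridge solution, the calibration is taken to be the affine map $a \Phi_Q W + b$, and a row-wise softmax is applied to turn scores into class probabilities.

```python
import numpy as np


def fit_ridge(phi_s, y_s, lam):
    """Closed-form ridge weights W = Phi_S^T (Phi_S Phi_S^T + lam I)^-1 Y_S."""
    gram = phi_s @ phi_s.T + lam * np.eye(phi_s.shape[0])
    return phi_s.T @ np.linalg.solve(gram, y_s)


def predict(phi_q, w, a, b):
    """Calibrated scores a * Phi_Q W + b, followed by a row-wise softmax."""
    logits = a * (phi_q @ w) + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


# Toy 2-way 1-shot episode with embedding dimension E = 2.
phi_s = np.array([[1.0, 0.0], [0.0, 1.0]])
y_s = np.eye(2)
w = fit_ridge(phi_s, y_s, lam=1.0)
probs = predict(np.array([[0.9, 0.1]]), w, a=10.0, b=0.0)
```

Using `np.linalg.solve` instead of an explicit inverse is a common choice for numerical stability; in a differentiable framework the same expression would let gradients flow back to whatever produced `phi_s`.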


THEORETICAL ANALYSIS

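Working with distributional signatures brings certified robustness against input perturbations.
Let $(P, S, Q)$ be a single episode, where $P$ is the source pool, $S$ is the support set, and $Q$ is the query set.

For any input text $x \in S \cup Q$, the attention generator produces word-level importance conditioned on the support set and the source pool:

$\alpha = AttGen(x | S, P)$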

EXPERIMENTAL SETUP

Datasets

We evaluate our approach on five text classification datasets and one relation classification dataset.

20 Newsgroups

RCV1

Reuters-21578

Amazon product data

HuffPost headlines

FewRel

BASELINES

Representations

- AVG represents each example as the mean of its word embeddings.
- IDF represents each example as the weighted average of its word embeddings, with weights given by inverse document frequency over all training examples.
- CNN applies 1D convolution over the input words and obtains the representation by max-over-time pooling.
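The AVG and IDF baselines differ only in the per-word weights. The sketch below uses the common $\log(N / df)$ definition of inverse document frequency, which is an assumption (the text does not say which variant is used), and the toy embeddings and helper names are hypothetical.

```python
import math


def idf_weights(documents):
    """idf(w) = log(N / df(w)) over a list of token lists."""
    n_docs = len(documents)
    df = {}
    for doc in documents:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n_docs / count) for tok, count in df.items()}


def weighted_average(tokens, embed, weights=None):
    """Average token embeddings; weights=None gives the AVG baseline."""
    ws = [1.0 if weights is None else weights.get(tok, 0.0) for tok in tokens]
    total = sum(ws) or 1.0  # avoid dividing by zero when every weight is 0
    dim = len(embed[tokens[0]])
    return [sum(w * embed[t][d] for w, t in zip(ws, tokens)) / total
            for d in range(dim)]


docs = [["the", "cat"], ["the", "dog"]]
embed = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "dog": [0.0, -1.0]}
avg_repr = weighted_average(["the", "cat"], embed)
idf_repr = weighted_average(["the", "cat"], embed, idf_weights(docs))
```

Because "the" occurs in every toy document, its IDF weight is zero and the IDF representation collapses onto the embedding of "cat".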

Learning algorithms

In addition to the ridge regressor (RR), we evaluate two standard supervised learning algorithms and two meta-learning algorithms:
- NN is a 1-nearest-neighbor classifier under Euclidean distance.
- FT pre-trains a classifier over all training examples, then fine-tunes the network using the support set.
- MAML meta-learns a prior over model parameters.
- Prototypical network (PROTO) meta-learns a metric space for few-shot classification.


IMPLEMENTATION DETAILS

We use pre-trained fastText embeddings for our model and all baselines. For sentence-level datasets, we also experiment with pre-trained BERT embeddings using HuggingFace's codebase.
For relation classification (FewRel), we augment the input of our attention generator with positional embeddings.
In the attention generator, we use a biLSTM with 50 hidden units and apply dropout of 0.1 on the output.
In the ridge regressor, we optimize the meta-parameters $\lambda$ and $a$ in log space to maintain the positivity constraint.
All parameters are optimized using Adam with a learning rate of 0.001.
During meta-training, we sample 100 training episodes per epoch.
We apply early stopping when the validation loss fails to improve for 20 epochs.
We evaluate test performance based on 1000 testing episodes and report the average accuracy over 5 different random seeds.
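The early-stopping rule can be sketched as a patience counter. This is an illustrative loop only: `run_epoch` and `validate` are hypothetical callbacks standing in for the real training and validation code, and `max_epochs` is an added safety cap that the text does not mention.

```python
def train_with_early_stopping(run_epoch, validate, patience=20, max_epochs=1000):
    """Stop once validation loss has not improved for `patience` epochs."""
    best, stale, epoch = float("inf"), 0, 0
    while epoch < max_epochs and stale < patience:
        run_epoch()        # e.g. one pass over 100 sampled training episodes
        loss = validate()
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        epoch += 1
    return best, epoch


# Toy run with a fixed sequence of validation losses and patience 3.
losses = iter([1.0, 0.8, 0.9, 0.85, 0.95])
best, epochs = train_with_early_stopping(
    lambda: None, lambda: next(losses), patience=3)
```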

Results

We evaluated our model in both 5-way 1-shot and 5-way 5-shot settings.


Our model is able to generalize beyond the vocabulary of previously seen classes.

Ablation study

We perform ablation studies on the attention generator.
We observe that both statistics $s(\cdot)$ and $t(\cdot)$ contribute to the performance, though the former has a larger impact.
We also note that, instead of computing word importance independently for each word, **fusing information** across the input with a biLSTM improves performance slightly.

Contextualized representations

For sentence-level datasets, we also experiment with contextualized representations given by BERT.
BERT significantly improves classification performance on FewRel, but not on news classification.
We postulate that this discrepancy arises because relation classification is highly contextual, while news classification is mostly keyword-based.

Analysis

We visualize our attention-weighted representation $\phi(x)$ alongside the general word importance $s(x)$ and the class-specific word importance $t(x)$.

Figure 9 visualizes the model’s input and output on the same query example in two testing episodes.
The example belongs to the class jobs in the Reuters dataset. First, we observe that our model
generates meaningful attention from noisy distributional signatures. Furthermore, the generated
attention is task-specific: in the depicted example, if the episode contains other economics-related
classes, the word “statistical” is downweighed by our model. Conversely, “statistical” is upweighted when we compare jobs to other distant classes.