【LLM 评估】GLUE benchmark：NLU 的多任务 benchmark

论文：GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

⭐⭐⭐⭐

arXiv:1804.07461, ICLR 2019

Site: https://gluebenchmark.com/

文章目录

- 一、论文速读
- 二、GLUE 任务列表
- - 2.1 CoLA（Corpus of Linguistic Acceptability）
  - 2.2 SST-2（The Stanford Sentiment Treebank）
  - 2.3 MRPC（The Microsoft Research Paraphrase Corpus）
  - 2.4 STSB（The Semantic Textual Similarity Benchmark）
  - 2.5 QQP（The Quora Question Pairs）
  - 2.6 MNLI（The Multi-Genre Natural Language Inference Corpus）
  - 2.7 QNLI（Qusetion-answering NLI）
  - 2.8 RTE（The Recognizing Textual Entailment datasets）
  - 2.9 WNLI（Winograd NLI）

一、论文速读

GLUE benchmark 包含 9 个 NLU 任务来评估 NLP 模型的语义理解能力。这些任务均为 sentence or sentence-pair NLU tasks，语言均为英语。

二、GLUE 任务列表

下图是各个任务的一个统计：

在这里插入图片描述

2.1 CoLA（Corpus of Linguistic Acceptability）

单句子分类任务。每个 sentence 被标注为是否合乎语法的单词序列，是一个二分类任务。

样本个数：训练集 8551 个，开发集 1043 个，测试集 1063 个。

label = 1（合乎语法）的 examples：

She is proud.
she is the mother.
Will John not go to school?

label = 0（不合乎语法）的 examples：

Mary wonders for Bill to come.
Yes, she used.
Mary sent.

注意到，这里面的句子看起来不是很长，有些错误是性别不符，有些是缺词、少词，有些是加s不加s的情况，各种语法错误。但我也注意到，有一些看起来错误并没有那么严重，甚至在某些情况还是可以说的通的。

2.2 SST-2（The Stanford Sentiment Treebank）

单句子分类任务：给定一个 sentence（电影评论中的句子），预测其情感是 positive 还是 negative，是一个二分类任务。

样本个数：训练集 67350 个，开发集 873 个，测试集 1821 个。

label = 1（positive）的 examples：

two central performances
against shimmering cinematography that lends the setting the ethereal beauty of an asian landscape
a better movie

label = 0（negative）的 examples：

so pat it makes your teeth hurt
eastwood 's dirty harry period .
faced with the possibility that her life is meaningless , vapid and devoid of substance , in a movie that is definitely meaningless , vapid and devoid of substance

注意到，由于句子来源于电影评论，又有它们情感的人类注释，不同于CoLA的整体偏短，有些句子很长，有些句子很短，长短并不整齐。

2.3 MRPC（The Microsoft Research Paraphrase Corpus）

相似性和释义任务：给定两个 sentence（来自于在线新闻），判断两个句子在语义上是否等效。

样本个数：训练集 3668 个，开发集 408 个，测试集 1725 个。

label = 1（正样本，两个 sentence 语义相同）的 examples：

Example 1:
The largest gains were seen in prices, new orders, inventories and exports.
Sub-indexes measuring prices, new orders, inventories and exports increased.

Example 2:
Trading in Loral was halted yesterday; the shares closed on Monday at $ 3.01.
The New York Stock Exchange suspended trading yesterday in Loral, which closed at $ 3.01 Friday.

label = 2（负样本，两个 sentence 语义不同）的 examples：

Example 1：
Earnings per share from recurring operations will be 13 cents to 14 cents.
That beat the company 's April earnings forecast of 8 to 9 cents a share.

Example 2：
He beat testicular cancer that had spread to his lungs and brain.
Armstrong, 31, battled testicular cancer that spread to his brain.

本任务的数据集，包含两句话，每个样本的句子长度都非常长，且数据不均衡，正样本占比 68%，负样本仅占 32%。

2.4 STSB（The Semantic Textual Similarity Benchmark）

相似性和释义任务。预测两个 sentence 的相似性得分，评分为 0~5 的一个 float。

样本个数：训练集 5749 个，开发集 1379 个，测试集 1377 个。

Example 1：
A plane is taking off.
An air plane is taking off.
score：5.000

Example 2：
A man is playing a large flute.
A man is playing a flute.
score：3.800

整体句子长度适中偏短，且均衡。

2.5 QQP（The Quora Question Pairs）

相似性和释义任务。预测两个 question 在语义上是否等效，是二分类任务。

样本个数：训练集 363,870 个，开发集 40,431 个，测试集 390,965 个。

label = 1（positive，等效）的 Examples：

Example 1：
How can I improve my communication and verbal skills?
What should we do to improve communication skills?

Example 2:
What has Hillary Clinton done that makes her trustworthy?
Why do Democrats consider Hillary Clinton trustworthy?

label = 0（negative，不等效）：

Example 1：
Why are you so sexy?
How sexy are you?

Example 2：
Which programming languages are common to develop in the area of gamification?
Who is the worst Director in the history of MNIT/MREC?

任务类似于 MRPC，这个任务的正负样本也不均衡，负样本占 63%，正样本是 37%，而且这个训练集、测试集都非常大，这里的测试集比其他训练集都要多好几倍。

2.6 MNLI（The Multi-Genre Natural Language Inference Corpus）

自然语言推断任务。给定 premise 和 hypothesis 两个 sentence，预测两者关系：entailment or condradiction or neutral。

样本个数：训练集392, 702个，开发集dev-matched 9, 815个，开发集dev-mismatched9, 832个，测试集test-matched 9, 796个，测试集test-dismatched9, 847个。因为MNLI是集合了许多不同领域风格的文本，所以又分为了matched和mismatched两个版本的数据集，matched指的是训练集和测试集的数据来源一致，mismached指的是训练集和测试集来源不一致。

Example 1：
premise：The man is playing a guitar.
hypothesis：The man is singing while playing the guitar.
label：neutral
前提描述了一个男人正在弹吉他，而假设则进一步提出这个男人在弹吉他的同时还在唱歌。由于前提没有提及唱歌这一行为，所以我们不能从前提直接推断出假设是正确的（非蕴含），同时也不能断定它是错误的（非矛盾）。因此，这个文本对的关系被标记为中立。

总体训练集很充足，GLUE 论文作者使用并推荐 SNLI 数据集作为辅助训练数据。

2.7 QNLI（Qusetion-answering NLI）

自然语言推断任务。给定一个 question 和来自 Wikipedia 的 sentence，判断两者关系：蕴含 or 不蕴含。

数据是从 SQuAD 1.0（The Stanford Question Answering Dataset）中转换而来。

样本个数：训练集104, 743个，开发集5, 463个，测试集5, 461个。

Example:

Which collection of minor poems are sometimes attributed to Virgil?
A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.
label: 1（蕴含）

总体就是问答句子组成的问答对，一个是问题，一个是句子信息，后者包含前者的答案就是蕴含，不包含就是不蕴含，是一个二分类。

2.8 RTE（The Recognizing Textual Entailment datasets）

自然语言推断任务。判断两个 sentence 是否互为蕴含，二分类任务。

数据来源于一系列的年度文本蕴含挑战赛。

样本个数：训练集2, 491个，开发集277个，测试集3, 000个。

Example:

Herceptin was already approved to treat the sickest breast cancer patients, and the company said, Monday, it will discuss with federal regulators the possibility of prescribing the drug for more breast cancer patients.
Herceptin can be used to treat breast cancer.
label: 1（蕴含）