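Preface
In traditional dense retrieval, the common retrieval units are documents, passages, or sentences. The choice of unit, however, affects both retrieval performance and the quality of downstream tasks. A passage may carry extra details that are irrelevant to the question, while a sentence may be overly complex or lack necessary context. To address these problems, the paper proposes using "propositions" as a new retrieval unit.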
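I. Definition of a Proposition
The paper defines propositions as atomic expressions within text. Each proposition encapsulates one distinct factoid and is presented in a concise, self-contained natural-language format. For example, for a question about the tilt angle of the Leaning Tower of Pisa, the following proposition can be extracted: "The Leaning Tower of Pisa now leans at about 3.99 degrees." This proposition answers the question concisely and still carries the context needed to understand it.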
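Principles for defining a proposition:
- Distinct meaning: each proposition should correspond to a distinct piece of meaning in the text. Taken together, all propositions should represent the semantics of the entire text.
- Minimal: a proposition should be atomic, that is, it cannot be further split into smaller propositions.
- Contextualized and self-contained: a proposition should include all the context from the text that is needed to interpret it on its own (for example, resolved coreferences).
In short: each proposition corresponds to one independent unit of meaning in the text, and the full set of propositions represents the semantics of the whole text.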
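II. Proposition Decomposition Method
As shown in part A of the figure:
Data processing pipeline
- Dataset: FACTOIDWIKI, with figures, tables, and lists removed and the text organized into passages.
- Segmentation granularity:
  - 100-word passages: the text is greedily split into passage chunks of at most 100 words, breaking only at sentence ends so that every chunk contains complete sentences.
  - Sentences: each passage is further split into sentences with the en_core_web_lg model of the Python SpaCy library.
  - Propositions: each passage is decomposed into propositions by a text generation model called the "Propositionizer".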
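The greedy 100-word passage segmentation described above can be sketched as follows. This is a minimal illustration, not the paper's code: the regex-based `split_sentences` helper is a hypothetical stand-in for the SpaCy sentence splitter, and the choice to keep a single over-long sentence as its own passage is an assumption.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption: stands in for SpaCy here)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def greedy_passages(text, max_words=100):
    """Pack whole sentences into passages of at most max_words words.

    A passage is closed only at a sentence boundary. A single sentence
    longer than max_words becomes its own passage (an assumption).
    """
    passages, current, count = [], [], 0
    for sentence in split_sentences(text):
        n_words = len(sentence.split())
        # Close the current passage if this sentence would exceed the budget.
        if current and count + n_words > max_words:
            passages.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        passages.append(" ".join(current))
    return passages
```

Calling `greedy_passages(section_text)` returns a list of passages, each made of complete sentences. Sentence-level units can then be obtained by splitting each passage again.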
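Propositionizer model
- Training the Propositionizer: the model is trained with a two-step distillation process. First, GPT-4 is prompted to generate a seed set of passage-to-proposition pairs. The seed set is then used to fine-tune a FlanT5-large model.
- Evaluating the Propositionizer: the quality of the generated propositions is measured with an F1 score based on the similarity between two sets of propositions, using BertScore with the roberta-large configuration as the similarity metric.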
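To make the set-level F1 idea concrete, here is a small sketch. The exact matching rule is not spelled out in the text above, so this version assumes a soft precision and recall, each computed as the mean of the best similarity per proposition. `token_jaccard` is a hypothetical stand-in for BertScore, used only so the example runs without a model download.

```python
def token_jaccard(a, b):
    """Toy similarity in [0, 1] (assumption: stands in for BertScore)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def soft_f1(predicted, reference, sim=token_jaccard):
    """F1 between two proposition sets under a pluggable similarity."""
    if not predicted or not reference:
        return 0.0
    # Precision: how well each predicted proposition is covered by the reference.
    precision = sum(max(sim(p, r) for r in reference) for p in predicted) / len(predicted)
    # Recall: how well each reference proposition is covered by the prediction.
    recall = sum(max(sim(r, p) for p in predicted) for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Swapping `sim` for a learned metric changes the scores but not the structure: precision penalizes spurious propositions, and recall penalizes missing ones.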
Passage-to-proposition prompt
Prompt text (English):
Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of
context.
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input
whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this
information into its own distinct proposition.
3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences
and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the
entities they refer to.
4. Present the results as a list of strings, formatted in JSON.
Input: Title: Ēostre. Section: Theories and interpretations, Connection to Easter Hares. Content:
The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in
1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in
other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were
frequently seen in gardens in spring, and thus may have served as a convenient explanation for the
origin of the colored eggs hidden there for children. Alternatively, there is a European tradition
that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and
both occur on grassland and are first seen in the spring. In the nineteenth century the influence
of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.
German immigrants then exported the custom to Britain and America where it evolved into the
Easter Bunny."
Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in
1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of
medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until
the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about
the possible explanation for the connection between hares and the tradition during Easter", "Hares
were frequently seen in gardens in spring.", "Hares may have served as a convenient explanation
for the origin of the colored eggs hidden in gardens for children.", "There is a European tradition
that hares laid eggs.", "A hare’s scratch or form and a lapwing’s nest look very similar.", "Both
hares and lapwing’s nests occur on grassland and are first seen in the spring.", "In the nineteenth
century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular
throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to
Britain and America.", "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in
Britain and America." ]
Input: <a new passage>
Output:
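The prompt fixes both an input layout (title, section, content) and an output format (a JSON list of strings). A minimal sketch of the surrounding glue code might look like this. The function names are hypothetical, and the actual call to a language model is deliberately left out.

```python
import json


def format_input(title, section, content):
    """Build the final 'Input: ... Output:' block in the prompt's layout."""
    return f"Input: Title: {title}. Section: {section}. Content: {content}\nOutput:"


def parse_propositions(raw_output):
    """Parse the model output, which the prompt asks to be a JSON list of strings."""
    data = json.loads(raw_output)
    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        raise ValueError("expected a JSON list of strings")
    # Drop empty entries and surrounding whitespace.
    return [x.strip() for x in data if x.strip()]
```

In use, the instruction and the one-shot example would be prepended to `format_input(...)`, and the model's reply would be passed to `parse_propositions`. Validating the type up front keeps malformed generations out of the retrieval index.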
III. Experimental Results
References
Dense X Retrieval: What Retrieval Granularity Should We Use? https://arxiv.org/pdf/2312.06648