[Paper Reading, ICTIR 2022] Revisiting Open Domain Query Facet Extraction and Generation

news2025/1/16 17:23:52

Table of Contents

  • Revisiting Open Domain Query Facet Extraction and Generation
    • Motivation
    • Contributions
    • Method
      • Facet Extraction and Generation
      • Facet Extraction as Sequence Labeling
      • Autoregressive Facet Generation
      • Facet Generation as Extreme Multi-Label Classification
      • Facet Generation by Prompting Large Language Models
      • Unsupervised Facet Extraction from SERP
      • Facet Lists Aggregation
    • Data

Revisiting Open Domain Query Facet Extraction and Generation

https://dl.acm.org/doi/abs/10.1145/3539813.3545138

Motivation

The paper revisits the task of query facet extraction and generation and studies several formulations of this task.

  • It also explores aggregation approaches based on relevance and diversity to combine the facet sets produced by the different formulations.

Contributions

  • Novel formulations of the facet extraction and generation task, motivated by recent advances in text understanding and generation
  • Offline evaluation showing that the models studied in the paper significantly outperform state-of-the-art baselines, and that combining them improves recall
  • An open-source toolkit, named Faspect, that includes implementations of the facet extraction and generation methods in the paper

Method

Facet Extraction and Generation

We focus on the extraction and generation of facets from the search engine result page (SERP) for a given query

  • training set: $T = \{(q_i, D_i, F_i)\}_{i=1}^{n}$, a collection of $n$ triplets

    • $q_i$ is an open-domain search query
    • $D_i = [d_{i1}, d_{i2}, \cdots, d_{ik}]$ denotes the top $k$ documents returned by a retrieval model in response to the query
    • $F_i = \{f_{i1}, f_{i2}, \cdots, f_{im}\}$ is a set of $m$ ground truth facets associated with query $q_i$

The task is to train a model to return an accurate list of facets.
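As a minimal sketch of the data layout described above, one training triplet $(q_i, D_i, F_i)$ could be held in a small container like the one below. The class name `FacetExample`, its field names, and the sample values are hypothetical and chosen only for illustration; they are not taken from the paper or from the Faspect toolkit.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class FacetExample:
    """One training triplet (q_i, D_i, F_i). Names are illustrative only."""
    query: str            # q_i: an open-domain search query
    documents: List[str]  # D_i: top-k documents (e.g. SERP snippets) for the query
    facets: Set[str]      # F_i: ground truth facets for the query


# Made-up example values, not real MIMICS data.
example = FacetExample(
    query="cars",
    documents=["Find new cars and used cars for sale", "Electric cars reviews"],
    facets={"new cars", "used cars", "electric cars"},
)
```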

Facet Extraction as Sequence Labeling

We can cast the facet extraction problem as a sequence labeling task.

Each token $w_x \in \mathrm{tokenize}(d_{ij})$ receives one of three labels: B (beginning of a facet), I (inside a facet), or O (outside any facet).

The extraction model $M_{\theta_{ext}}$ classifies each document token as B, I, or O. It uses RoBERTa and applies an MLP with an output dimensionality of three to each token representation.

  • input: [CLS] query tokens [SEP] doc tokens [SEP]

  • objective: minimize the negative log-likelihood (cross-entropy) of the gold B/I/O label of every document token

    • where the label probability $p$ can be computed by applying a softmax operator to the model's output for the $x^{th}$ token.

  • inference: run the model on all the documents in $D_i$ and sort the extracted facets by frequency
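The labeling and inference steps can be sketched without any model: the snippet below builds B/I/O targets from known facets, decodes a B/I/O sequence back into facet strings, and ranks facets by how often they are extracted across documents. Whitespace tokenization, exact-match labeling, and all function names are assumptions made for this illustration; in the paper the labels are predicted by RoBERTa over subword tokens.

```python
from collections import Counter
from typing import List


def bio_labels(doc_tokens: List[str], facets: List[List[str]]) -> List[str]:
    """Tag document tokens with B/I/O by exact match against tokenized facets."""
    labels = ["O"] * len(doc_tokens)
    for facet in facets:
        n = len(facet)
        if n == 0:
            continue
        for start in range(len(doc_tokens) - n + 1):
            span_free = all(lab == "O" for lab in labels[start:start + n])
            if doc_tokens[start:start + n] == facet and span_free:
                labels[start] = "B"
                for pos in range(start + 1, start + n):
                    labels[pos] = "I"
    return labels


def decode_spans(doc_tokens: List[str], labels: List[str]) -> List[str]:
    """Turn a B/I/O label sequence back into facet strings."""
    spans, current = [], []
    for token, label in zip(doc_tokens, labels):
        if label == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:  # "O", or an "I" that does not continue a span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def rank_by_frequency(per_doc_spans: List[List[str]]) -> List[str]:
    """Sort facets extracted from all documents of a query by frequency."""
    counts = Counter(span for spans in per_doc_spans for span in spans)
    return [facet for facet, _ in counts.most_common()]


tokens = "buy used cars and new cars online".split()
labels = bio_labels(tokens, [["used", "cars"], ["new", "cars"]])
spans = decode_spans(tokens, labels)
```

Here `labels` marks "used cars" and "new cars" as B/I spans and everything else as O, and `decode_spans` recovers the two facets from those labels.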

Autoregressive Facet Generation

We perform facet generation using an autoregressive text generation model.

For every query $q_i$, we concatenate the facets in $F_i$ using a separator token to form the target sequence $y_i$.

The model is BART (a Transformer-based encoder-decoder model for text generation).

  • variations:

    • one takes only the query tokens as input and generates the facets

    • the other takes the query tokens and the document tokens of all documents in the SERP (separated by [SEP]) as input and generates the facet tokens one by one.

  • objective: maximize the likelihood of each token of $y_i$ given the preceding target tokens and $v$

    • $v$ is the BART encoder's output
  • inference: perform autoregressive text generation with beam search and sampling, conditioning the probability of the next token on the previously generated tokens
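The input and target construction for the two BART variations can be sketched as plain string handling. The separator string `"[SEP]"` for the facet list, the helper names, and the de-duplication in `parse_generated` are assumptions for illustration; the actual separator token depends on the tokenizer used.

```python
from typing import List, Optional

SEP = "[SEP]"  # assumed separator string; the real token depends on the tokenizer


def build_source(query: str, documents: Optional[List[str]] = None) -> str:
    """Variation 1: query only. Variation 2: query plus all SERP documents."""
    if not documents:
        return query
    return f" {SEP} ".join([query] + documents)


def build_target(facets: List[str]) -> str:
    """Concatenate the ground truth facets into the target sequence y_i."""
    return f" {SEP} ".join(facets)


def parse_generated(text: str) -> List[str]:
    """Split generated text back into a facet list, dropping empties and repeats."""
    seen, facets = set(), []
    for part in text.split(SEP):
        facet = part.strip()
        if facet and facet not in seen:
            seen.add(facet)
            facets.append(facet)
    return facets
```

For example, `build_target(["new cars", "used cars"])` yields `"new cars [SEP] used cars"`, and `parse_generated` inverts that mapping on the model's output.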

Facet Generation as Extreme Multi-Label Classification

We treat the facet generation task as an extreme multi-label text classification problem.

  • The intuition behind this approach is that some facets tend to appear very frequently across different queries

The model is RoBERTa, denoted $M_{\theta_{mcl}}$.

  • get the probability of every facet by applying a linear transformation to the representation of the [CLS] token followed by sigmoid activation

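  • objective (binary cross-entropy over all facet classes):

    $$\mathcal{L}_i = -\sum_{j} \left[ y_{i,j} \log y'_{i,j} + (1 - y_{i,j}) \log (1 - y'_{i,j}) \right]$$

    • where $y_{i,j} \in \{0, 1\}$ indicates whether facet $f_j$ is a ground truth facet of $q_i$, and $y'_{i,j}$ is the probability of relevance of the facet $f_j$ given the query $q_i$ and the list of documents $D_i$

      • $y'_{i,j}$ can be computed by applying a sigmoid operator to the model's output for the $j^{th}$ facet class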
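A small numeric sketch of this formulation: per-facet logits are passed through a sigmoid, scored with binary cross-entropy against 0/1 targets, and the highest-scoring facet classes are returned at inference time. The function names, the averaging over classes, and the toy vocabulary are assumptions for illustration; a real implementation would take the logits from the linear layer on top of the [CLS] representation.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(logits: List[float], targets: List[int]) -> float:
    """Binary cross-entropy averaged over facet classes (averaging is an assumption)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(logits)


def top_facets(logits: List[float], vocabulary: List[str], k: int) -> List[str]:
    """Return the k facet classes with the highest scores."""
    ranked = sorted(zip(vocabulary, logits), key=lambda pair: pair[1], reverse=True)
    return [facet for facet, _ in ranked[:k]]


# Toy facet vocabulary and logits, made up for the example.
vocab = ["reviews", "price", "history"]
picked = top_facets([2.0, -1.0, 0.5], vocab, k=2)
```

Because the sigmoid is monotonic, ranking by logits and ranking by probabilities give the same order.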

Facet Generation by Prompting Large Language Models

We investigate the few-shot effectiveness of large-scale pre-trained autoregressive language models.

model: GPT-3

  • generate facets using a prompt: a task description followed by a small number of examples

    • Every example output in the prompt begins with the number of facets it contains, which gives control over the number of facets GPT-3 generates.
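A possible way to assemble such a few-shot prompt is sketched below. The `Query:` / `Output:` layout, the "N facets:" wording, and the idea of ending the prompt with the requested count are assumptions made for illustration; the exact prompt text used in the paper is not reproduced in these notes.

```python
from typing import List, Tuple


def format_output(facets: List[str]) -> str:
    """Example output that starts with the number of facets it lists."""
    return f"{len(facets)} facets: " + ", ".join(facets)


def build_prompt(
    task_description: str,
    examples: List[Tuple[str, List[str]]],
    query: str,
    num_facets: int,
) -> str:
    """Task description, few-shot examples, then the new query with a facet count."""
    lines = [task_description, ""]
    for example_query, example_facets in examples:
        lines.append(f"Query: {example_query}")
        lines.append(f"Output: {format_output(example_facets)}")
        lines.append("")
    lines.append(f"Query: {query}")
    # Ending with the desired count nudges the model to produce that many facets.
    lines.append(f"Output: {num_facets} facets:")
    return "\n".join(lines)


prompt = build_prompt(
    "Generate facets for the query.",
    [("cars", ["new cars", "used cars"])],
    "jaguar",
    3,
)
```

The resulting string would be sent to the language model as-is, and the completion parsed as a comma-separated facet list.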

Unsupervised Facet Extraction from SERP

Use some rules to extract facets from SERP and re-rank them.

Facet Lists Aggregation

We explore three aggregation methods: facet relevance ranking (learning to rank), MMR diversification, and round robin diversification.

  • Facet Relevance Ranking:

    • use a bi-encoder model to assign a score to each candidate facet for each query and re-rank them based on their score in descending order

      • score: use the dot product of the query and facet representations: $\mathrm{sim}(q_i, f_i) = E(q_i) \cdot E(f_i)$.

      • $E$: use the average token embedding of BERT pre-trained on multiple text similarity tasks. To find the optimal parameters, minimize the cross-entropy loss for every positive query-facet pair $(q_i, f_i^+)$ in the MIMICS dataset:

        $$\mathcal{L}(q_i, f_i^+) = -\log \frac{e^{\mathrm{sim}(q_i, f_i^+)}}{e^{\mathrm{sim}(q_i, f_i^+)} + \sum_{j=1}^{B-1} e^{\mathrm{sim}(q_i, f_{i,j}^-)}}$$

        • $B$ is the training batch size
        • $\{f_{i,j}^-\}_{j=1}^{B-1}$ is the set of in-batch negative examples
  • MMR diversification:

    • use a popular diversification approach, named Maximal Marginal Relevance (MMR).

      • The intuition is that different models may generate redundant facets

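      • score function: each step selects the facet from $R \setminus S$ that best trades off relevance to the query against similarity to the facets already in $S$

        • $R$ is the list of extracted facets for a given query
        • $S$ is the set of already selected facets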
  • Round Robin Diversification:

    • iterate over the four lists of facets generated by different models, and select in turn the highest-scoring remaining facet from each list until the desired number of facets is reached.
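The three aggregation steps can be sketched in plain Python. Everything below is illustrative rather than the paper's implementation: embeddings are passed in as ready-made vectors instead of coming from BERT, the trade-off weight `lam`, the token-overlap `jaccard` similarity, and the de-duplication across lists in `round_robin` are assumptions, and all function names are hypothetical.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """sim(q, f) = E(q) . E(f) for already computed embeddings."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_loss(query_vecs: List[Vector], facet_vecs: List[Vector]) -> float:
    """Mean cross-entropy: facet_vecs[i] is the positive for query_vecs[i],
    the other B-1 facets in the batch act as negatives."""
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [dot(q, f) for f in facet_vecs]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


def jaccard(a: str, b: str) -> float:
    """Toy facet-to-facet similarity based on token overlap (assumption)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def mmr_select(
    candidates: List[str],
    relevance: Dict[str, float],
    similarity: Callable[[str, str], float],
    k: int,
    lam: float = 0.5,
) -> List[str]:
    """Greedy MMR: pick from R \\ S the facet maximizing
    lam * relevance - (1 - lam) * max similarity to the selected set S."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(facet: str) -> float:
            redundancy = max((similarity(facet, s) for s in selected), default=0.0)
            return lam * relevance[facet] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


def round_robin(ranked_lists: List[List[str]], k: int) -> List[str]:
    """Take the best unused facet from each list in turn until k are collected."""
    merged: List[str] = []
    seen = set()
    positions = [0] * len(ranked_lists)
    while len(merged) < k:
        progressed = False
        for idx, facets in enumerate(ranked_lists):
            # Skip facets that another list already contributed.
            while positions[idx] < len(facets) and facets[positions[idx]] in seen:
                positions[idx] += 1
            if positions[idx] < len(facets):
                merged.append(facets[positions[idx]])
                seen.add(facets[positions[idx]])
                positions[idx] += 1
                progressed = True
                if len(merged) == k:
                    break
        if not progressed:  # every list is exhausted
            break
    return merged


relevance = {"used cars": 0.9, "used cars for sale": 0.85, "electric cars": 0.8}
diverse = mmr_select(list(relevance), relevance, jaccard, k=2, lam=0.5)
```

With `lam=0.5` the near-duplicate "used cars for sale" is skipped in favor of "electric cars", while `lam=1.0` reduces MMR to plain relevance ranking.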

Data

MIMICS: contains web search queries sampled from the Bing query logs, and for each query, it provides up to 5 facets and the returned result snippets.

  • train: MIMICS-Click
  • evaluation: MIMICS-Manual
