[Paper Reading, CIKM 2011] Finding Dimensions for Queries

news2024/11/17 15:54:36

Table of Contents

    • Foreword
    • Abs
    • Method
      • List Extraction
      • List Weighting
      • List Clustering
      • Dimension and Item Ranking

Foreword

  • This paper is from CIKM 2011, so we only consider the method, not the results.
  • Many related papers have not been covered in these notes yet. More papers can be found in: ShiyuNee/Awesome-Conversation-Clarifying-Questions-for-Information-Retrieval: Papers about Conversation and Clarifying Questions (github.com)

Abs

We address the problem of finding multiple groups of words or phrases that explain the underlying query facets, which we refer to as query dimensions. We assume that the important aspects of a query are usually presented and repeated in the query’s top retrieved documents in the style of lists, and query dimensions can be mined out by aggregating these significant lists.

We propose aggregating frequent lists within the top search results to mine query dimensions, and we implement a system called QDMiner.

Method

QDMiner discovers query dimensions by aggregating frequent lists within the top results.

  • Important information is usually organized in list formats by websites
    • Listing is a graceful way to show parallel knowledge or items
  • Important lists are commonly supported by relevant websites and hence repeat in the top search results, whereas unimportant lists appear only infrequently.

Query dimensions are mined by the following four steps:

  • List Extraction: Several types of lists are extracted from each document
  • List Weighting: All extracted lists are weighted, so that unimportant or noisy lists are assigned low weights
  • List Clustering: Similar lists are grouped together to compose a dimension.
  • Dimension and Item Ranking: Dimensions (compared with each other) and their items (within a dimension) are evaluated and ranked based on their importance.

List Extraction

For each document $d$, we extract a set of lists from the HTML content of $d$ based on three different types of patterns:

  • Free text patterns:

    • pattern: item{, item}*(and|or) {other} item

      Example 1 We shop for gorgeous watches from Seiko, Bulova, Lucien Piccard, Citizen, Cartier or Invicta

    • a further pattern, {^item (: |-) .+$}+, is used to extract lists from some semi-structured paragraphs

      Example 2 … are highly important for following reasons: Consistency - every fact table is filtered consistently res… Integration - queries are able to drill different processes … Reduced development time to market - the common dimensions are available without recreating the wheel over again.

  • HTML tag patterns:

    • style of HTML tags
      • SELECT: extract all text from their child tags(OPTION) to create a list
      • UL / OL: extract text within their child tags(LI)
      • TABLE: extract one list from each column or each row


  • Repeat region patterns:


    • First detect repeat regions in webpages based on vision-based DOM trees

    • Then extract all leaf HTML nodes within each block, and group them by their tag names(name, rating, etc) and display styles.

    • Last, for each group, extract all text from its nodes as a list

    Note: we do post-processing for each extracted list
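The free text pattern and the UL / OL / SELECT tag patterns above can be sketched with Python's standard library. This is a minimal illustration, not QDMiner's actual extractor: treating an "item" as one to three capitalized words is an assumption made for this example, and the names `FREE_TEXT`, `extract_free_text_lists`, `ListTagParser`, and `extract_tag_lists` are hypothetical. TABLE columns and repeat regions are left out.

```python
import re
from html.parser import HTMLParser

# Assumption: an "item" is 1-3 capitalized words (the notes do not define it).
ITEM = r"[A-Z][\w'-]*(?: [A-Z][\w'-]*){0,2}"
# Approximation of: item{, item}*(and|or) {other} item
FREE_TEXT = re.compile(rf"{ITEM}(?:, {ITEM})*,? (?:and|or) (?:other )?{ITEM}")
SEPARATOR = re.compile(r",? (?:and|or) (?:other )?|, ")


def extract_free_text_lists(text):
    """Return one list of items per match of the free text pattern."""
    results = []
    for match in FREE_TEXT.finditer(text):
        items = SEPARATOR.split(match.group(0))
        results.append([item.strip() for item in items if item.strip()])
    return results


class ListTagParser(HTMLParser):
    """Collect the text of LI children of UL/OL and OPTION children of SELECT."""

    CONTAINERS = {"ul": "li", "ol": "li", "select": "option"}

    def __init__(self):
        super().__init__()
        self.lists = []        # finished lists, in document order of closing
        self._stack = []       # open containers as (tag, items)
        self._in_item = False  # True while inside an LI / OPTION

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self._stack.append((tag, []))
            self._in_item = False
        elif self._stack and tag == self.CONTAINERS[self._stack[-1][0]]:
            self._stack[-1][1].append("")
            self._in_item = True

    def handle_endtag(self, tag):
        if self._stack and tag == self._stack[-1][0]:
            _, items = self._stack.pop()
            items = [item.strip() for item in items if item.strip()]
            if items:
                self.lists.append(items)
            self._in_item = False
        elif tag in ("li", "option"):
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and self._stack:
            self._stack[-1][1][-1] += data


def extract_tag_lists(html):
    parser = ListTagParser()
    parser.feed(html)
    return parser.lists


sentence = ("We shop for gorgeous watches from Seiko, Bulova, "
            "Lucien Piccard, Citizen, Cartier or Invicta")
print(extract_free_text_lists(sentence))
print(extract_tag_lists("<ul><li>Seiko</li><li>Citizen</li></ul>"))
```

On the sentence from Example 1, the regular expression starts matching at "Seiko" because the preceding words are not followed by a comma or by "and"/"or". A production extractor would need a looser notion of item than capitalization.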

List Weighting


Some extracted lists are noisy and useless for finding dimensions, so they should be penalized.

We propose to aggregate all lists of a query and evaluate the importance of each unique list $l$ by the following components:

  • document matching weight: $S_{\mathrm{DOC}}=\sum_{d \in R}\left(s_d^m \cdot s_d^r\right)$

    • $s_d^m$ is the percentage of items of $l$ contained in $d$
      • $s_d^m=\frac{|d \cap l|}{|l|}$
    • $s_d^r$ measures the importance of document $d$
      • $s_d^r=1 / \sqrt{\mathrm{rank}_d}$
      • The higher $d$ is ranked, the larger its score $s_d^r$ is (a higher-ranked $d$ is more relevant to the query).
  • average inverse document frequency (IDF) of items, $S_{\mathrm{IDF}}$:

    • A list composed of items that are common in a corpus (we use ClueWeb09) is not informative for the query.

The importance of a list $l$: $S_l = S_{\mathrm{DOC}} \cdot S_{\mathrm{IDF}}$
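The two components can be combined as in the following sketch. It rests on simplifying assumptions: each document is represented as the set of items it contains, `ranked_docs` is ordered by search rank starting at 1, and per-item IDF values are supplied by the caller in a hypothetical `idf` dictionary (the notes do not give the exact IDF formula). All function names are illustrative.

```python
import math


def doc_matching_weight(lst, ranked_docs):
    """S_DOC: sum over documents of s_d^m * s_d^r."""
    items = set(lst)
    score = 0.0
    for rank, doc_items in enumerate(ranked_docs, start=1):
        s_m = len(items & doc_items) / len(items)  # share of list items found in d
        s_r = 1 / math.sqrt(rank)                  # higher-ranked documents count more
        score += s_m * s_r
    return score


def average_idf(lst, idf):
    """S_IDF: mean IDF of the list's items (idf values are caller-supplied)."""
    items = set(lst)
    return sum(idf[item] for item in items) / len(items)


def list_weight(lst, ranked_docs, idf):
    """S_l = S_DOC * S_IDF."""
    return doc_matching_weight(lst, ranked_docs) * average_idf(lst, idf)


docs = [{"seiko", "citizen", "watch"}, {"seiko", "strap"}]
idf = {"seiko": 2.0, "citizen": 4.0}  # made-up values for illustration
print(list_weight(["seiko", "citizen"], docs, idf))
```

In this toy input the list is fully contained in the first document and half contained in the second, so $S_{\mathrm{DOC}} = 1 + 0.5/\sqrt{2}$, which is then multiplied by the average IDF of 3.0.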

List Clustering

Two lists can be grouped together if they share enough items

  • distance between two lists: $d_l\left(l_1, l_2\right)=1-\frac{\left|l_1 \cap l_2\right|}{\min \left\{\left|l_1\right|,\left|l_2\right|\right\}}$
  • distance between two clusters (complete linkage): $d_c\left(c_1, c_2\right)=\max _{l_1 \in c_1, l_2 \in c_2} d_l\left(l_1, l_2\right)$

The original QT (Quality Threshold) clustering algorithm assumes that all data points are equally important, so a modified version is used to group similar lists.

We modify the original QT algorithm to group highly weighted lists first. We refer to the resulting algorithm as WQT (Quality Threshold with Weighted data points).

Individual weighted lists are not used as query dimensions on their own.
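The distances and the "highly weighted lists first" idea can be sketched as follows. This is a simplified greedy variant written for illustration, not the paper's exact WQT: the `max_diameter` threshold and its default value are assumptions, every leftover list becomes its own cluster, and any minimum-quality filter on clusters is omitted. Lists are assumed to be non-empty.

```python
def list_distance(l1, l2):
    """d_l: 1 minus the overlap relative to the shorter list."""
    a, b = set(l1), set(l2)
    return 1 - len(a & b) / min(len(a), len(b))


def cluster_distance(c1, c2):
    """d_c: complete linkage, i.e. the largest pairwise list distance."""
    return max(list_distance(l1, l2) for l1 in c1 for l2 in c2)


def wqt_sketch(lists, weights, max_diameter=0.6):
    """Greedy clustering seeded by the highest-weighted unclustered list."""
    remaining = sorted(range(len(lists)), key=lambda i: weights[i], reverse=True)
    clusters = []
    while remaining:
        cluster = [remaining.pop(0)]  # seed: most important list still available
        while True:
            best, best_d = None, None
            members = [lists[j] for j in cluster]
            for i in remaining:
                d = cluster_distance([lists[i]], members)
                # Only accept lists that keep the cluster diameter within bounds.
                if d <= max_diameter and (best_d is None or d < best_d):
                    best, best_d = i, d
            if best is None:
                break
            cluster.append(best)
            remaining.remove(best)
        clusters.append([lists[i] for i in cluster])
    return clusters


lists = [
    ["seiko", "citizen", "casio"],
    ["seiko", "citizen"],
    ["men", "women"],
    ["men", "women", "kids"],
]
weights = [3.0, 2.0, 1.0, 1.5]
for cluster in wqt_sketch(lists, weights):
    print(cluster)
```

With these inputs the brand lists end up in one cluster and the audience lists in another, because the distance between any brand list and any audience list is 1.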

Dimension and Item Ranking

A good dimension should appear frequently in the top results. A dimension $c$ is more important if:

  • (1) the lists in $c$ are extracted from more unique websites
  • (2) the lists in $c$ are more important, i.e., they have higher weights.


  • $S_l$ is the weight of a list $l$
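One way to turn the two criteria into a score is sketched below. It assumes that each website contributes the highest weight $S_l$ among the dimension's lists extracted from it, and that these per-site contributions are summed. This is one plausible reading of the criteria rather than a formula confirmed by these notes, and `dimension_score` is a hypothetical name.

```python
def dimension_score(cluster):
    """Score a dimension from (site, list_weight) pairs, one pair per list.

    Assumption: each unique site counts once, with its best list weight.
    """
    best_per_site = {}
    for site, weight in cluster:
        best_per_site[site] = max(weight, best_per_site.get(site, 0.0))
    return sum(best_per_site.values())


cluster = [("a.com", 2.0), ("a.com", 3.0), ("b.com", 1.5)]
print(dimension_score(cluster))  # 3.0 from a.com plus 1.5 from b.com
```

Counting each site once keeps a single website that repeats the same list many times from dominating the ranking, which matches criterion (1).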

In a dimension, the importance of an item depends on how many lists contain the item and its ranks in the lists.


  • $e$ is an item
  • $w(c,e,s)$ is the weight contributed by a website $s$
  • $AvgRank_{c,e,s}$ is the average rank of $e$ within all lists extracted from website $s$.

We only output qualified items by default in QDMiner.

  • qualified items: $S_{e|c} > 1$ and $S_{e|c} > \frac{|Sites(c)|}{10}$
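The item score and the qualification rule can be sketched as follows. The sketch assumes that $S_{e|c}$ is the sum of $w(c,e,s)$ over the websites of the dimension and that $w(c,e,s) = 1/\sqrt{AvgRank_{c,e,s}}$; the second assumption is not stated in the text above, so treat it as illustrative. Ranks start at 1, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def item_scores(cluster):
    """cluster: list of (site, items) pairs, items in their on-page order."""
    ranks = defaultdict(lambda: defaultdict(list))  # item -> site -> ranks
    for site, items in cluster:
        for rank, item in enumerate(items, start=1):
            ranks[item][site].append(rank)
    scores = {}
    for item, by_site in ranks.items():
        # Assumed site weight: 1 / sqrt(average rank of the item on that site).
        scores[item] = sum(
            1 / math.sqrt(sum(r) / len(r)) for r in by_site.values()
        )
    return scores


def qualified_items(cluster):
    """Keep items with S_e|c > 1 and S_e|c > |Sites(c)| / 10, best first."""
    n_sites = len({site for site, _ in cluster})
    scores = item_scores(cluster)
    keep = {e: s for e, s in scores.items() if s > 1 and s > n_sites / 10}
    return sorted(keep, key=keep.get, reverse=True)


cluster = [
    ("a.com", ["seiko", "citizen"]),
    ("b.com", ["seiko", "casio"]),
    ("a.com", ["citizen", "seiko"]),
]
print(item_scores(cluster))
print(qualified_items(cluster))
```

Here "seiko" appears on two sites and is the only item whose score exceeds 1, so it is the only qualified item. Under the assumed weight, an item seen on a single site can score at most 1 and is therefore never qualified.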
