[Paper Reading, CIKM 2011] Finding Dimensions for Queries

2025/4/19 0:34:57

Table of Contents

- Foreword
- Abs
- Method
  - List Extraction
  - List Weighting
  - List Clustering
  - Dimension and Item Ranking

Foreword

This paper is from CIKM 2011, so we only consider the method, not the results.
There are many papers that have not been shared. More papers can be found in: ShiyuNee/Awesome-Conversation-Clarifying-Questions-for-Information-Retrieval: Papers about Conversation and Clarifying Questions (github.com)

Abs

We address the problem of finding multiple groups of words or phrases that explain the underlying query facets, which we refer to as query dimensions. We assume that the important aspects of a query are usually presented and repeated in the query’s top retrieved documents in the style of lists, and query dimensions can be mined out by aggregating these significant lists.

We propose aggregating frequent lists within the top search results to mine query dimensions, and implement a system called QDMiner.

Method

QDMiner discovers query dimensions by aggregating frequent lists within the top results.

Important information is usually organized in list formats by websites
- Listing is a graceful way to show parallel knowledge or items
Important lists are commonly supported by relevant websites and hence repeat in the top search results, whereas unimportant lists just infrequently appear in results.

Query dimensions are mined by the following four steps:

List Extraction: Several types of lists are extracted from each document.
List Weighting: All extracted lists are weighted, so that unimportant or noisy lists can be assigned low weights.
List Clustering: Similar lists are grouped together to compose a dimension.
Dimension and Item Ranking: Dimensions (compared with one another) and their items (within a dimension) are evaluated and ranked based on their importance.

List Extraction

For each document $d$, we extract a set of lists from the HTML content of $d$ based on three different types of patterns:

Free text patterns:
- pattern: item{, item}*(and|or) {other} item
  
  Example 1 We shop for gorgeous watches from Seiko, Bulova, Lucien Piccard, Citizen, Cartier or Invicta
- a further pattern, {^item (: |-) .+$}+, is used to extract lists from some semi-structured paragraphs
  
  Example 2 … are highly important for following reasons: Consistency - every fact table is filtered consistently res… Integration - queries are able to drill different processes … Reduced development time to market - the common dimensions are available without recreating the wheel over again.
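The two free text patterns can be sketched with regular expressions. This is only an illustration: the notes do not say how an "item" is delimited, so the sketch assumes an item is a run of capitalized words, and the names `ITEM`, `INLINE_LIST`, `LABELLED_LINE` and both functions are hypothetical.

```python
import re

# Assumption: an item is a run of capitalized words. The paper does not
# define "item" this narrowly; it just keeps the demo regex readable.
ITEM = r"[A-Z]\w*(?: [A-Z]\w*)*"

# item{, item}* (and|or) {other} item
INLINE_LIST = re.compile(
    rf"\b({ITEM}(?:, {ITEM})*),? (?:and|or) (?:other )?({ITEM})"
)

# One line of the second pattern: a label, then ":" or "-", then free text.
LABELLED_LINE = re.compile(r"^(.+?)\s?[:-] .+$")


def extract_inline_lists(text):
    """Return one list of items per match of the inline pattern."""
    return [m.group(1).split(", ") + [m.group(2)]
            for m in INLINE_LIST.finditer(text)]


def extract_labelled_lists(text):
    """Group consecutive 'item: ...' or 'item - ...' lines into lists."""
    lists, current = [], []
    for line in text.splitlines():
        m = LABELLED_LINE.match(line.strip())
        if m:
            current.append(m.group(1))
            continue
        if len(current) > 1:      # a single labelled line is not a list
            lists.append(current)
        current = []
    if len(current) > 1:
        lists.append(current)
    return lists


sentence = ("We shop for gorgeous watches from Seiko, Bulova, "
            "Lucien Piccard, Citizen, Cartier or Invicta")
print(extract_inline_lists(sentence))
```

On the sentence from Example 1 this yields the six brand names as one list. Real pages need a looser notion of item, which is one reason the later weighting step has to cope with noisy lists.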
HTML tag patterns:
- style of HTML tags
  - SELECT: extract all text from their child tags(OPTION) to create a list
  - UL / OL: extract text within their child tags(LI)
  - TABLE: extract one list from each column or each row
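A rough sketch of the SELECT and UL / OL cases using Python's standard `html.parser`. The class and function names are made up for illustration, and TABLE handling (one list per row or column) is left out to keep the example short.

```python
from html.parser import HTMLParser

# Container tag -> child tag whose text becomes a list item.
CHILD_OF = {"select": "option", "ul": "li", "ol": "li"}


class TagListExtractor(HTMLParser):
    """Collect one list per SELECT/UL/OL from its OPTION/LI children."""

    def __init__(self):
        super().__init__()
        self.lists = []        # finished lists, in closing order
        self._open = []        # stack of [container_tag, items]
        self._in_item = False  # True while inside an OPTION/LI

    def handle_starttag(self, tag, attrs):
        if tag in CHILD_OF:
            self._open.append([tag, []])
            self._in_item = False
        elif self._open and tag == CHILD_OF[self._open[-1][0]]:
            self._open[-1][1].append("")
            self._in_item = True

    def handle_data(self, data):
        if self._open and self._in_item:
            self._open[-1][1][-1] += data

    def handle_endtag(self, tag):
        if not self._open:
            return
        container, items = self._open[-1]
        if tag == CHILD_OF[container]:
            self._in_item = False
        elif tag == container:
            self._open.pop()
            # Collapse whitespace and drop empty items.
            cleaned = [" ".join(text.split()) for text in items]
            cleaned = [text for text in cleaned if text]
            if cleaned:
                self.lists.append(cleaned)


def extract_tag_lists(html):
    parser = TagListExtractor()
    parser.feed(html)
    parser.close()
    return parser.lists
```

For example, `extract_tag_lists("<ul><li>Seiko</li><li>Casio</li></ul>")` gives a single two-item list. Nested lists are returned as separate lists, inner list first.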


Repeat region patterns:
- First detect repeat regions in webpages based on vision-based DOM trees
- Then extract all leaf HTML nodes within each block, and group them by their tag names(name, rating, etc) and display styles.
- Last, for each group, extract all text from its nodes as a list
Note: we do post-processing for each extracted list

List Weighting

[Image missing in source]

Lists of the kind shown in the image above are useless for finding dimensions, so they should be penalized.

We propose to aggregate all lists of a query, and evaluate the importance of each unique list $l$ by the following components:

document matching weight: $S_{\mathrm{DOC}}=\sum_{d \in R}\left(s_d^m * s_d^r\right)$
- $s_d^m$ is the percentage of items of $l$ contained in $d$
  - $s_d^m=\frac{|d \cap l|}{|l|}$
- $s_d^r$ measures the importance of document $d$
  - $s_d^r=1 / \sqrt{\mathrm{rank}_d}$
  - The higher $d$ is ranked, the larger its score $s_d^r$ (a higher-ranked $d$ is more relevant to the query)
average inverse document frequency (IDF) of items, $S_{\mathrm{IDF}}$:
- A list composed of items that are common in a corpus (we use ClueWeb09) is not informative to the query.

The importance of a list $l$: $S_l = S_{\mathrm{DOC}} * S_{\mathrm{IDF}}$
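A minimal sketch of the list weight. The document matching part follows the formulas above directly. The IDF part is an assumption: the notes only say "average IDF of items", so a common smoothed form, log((N + 1) / (df + 1)), is used here. The function names and the `doc_freq` / `n_docs` inputs are hypothetical.

```python
import math


def doc_matching_weight(list_items, ranked_docs):
    """S_DOC = sum over d in R of s_d^m * s_d^r.

    ranked_docs: one set of items per document, best-ranked first.
    """
    l = set(list_items)
    if not l:
        return 0.0
    score = 0.0
    for rank, doc_items in enumerate(ranked_docs, start=1):
        s_m = len(l & doc_items) / len(l)  # share of the list's items in d
        s_r = 1.0 / math.sqrt(rank)        # higher-ranked docs count more
        score += s_m * s_r
    return score


def avg_idf(list_items, doc_freq, n_docs):
    """S_IDF as the mean IDF of the list's items.

    Assumption: idf(e) = log((N + 1) / (df_e + 1)); the exact IDF
    variant is not given in these notes.
    """
    l = set(list_items)
    if not l:
        return 0.0
    return sum(
        math.log((n_docs + 1) / (doc_freq.get(e, 0) + 1)) for e in l
    ) / len(l)


def list_weight(list_items, ranked_docs, doc_freq, n_docs):
    """S_l = S_DOC * S_IDF."""
    return (doc_matching_weight(list_items, ranked_docs)
            * avg_idf(list_items, doc_freq, n_docs))
```

A list whose items appear in many top-ranked documents but are rare in the background corpus gets the highest weight; navigation-style lists made of very common words are pushed down by the IDF factor.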

List Clustering

Two lists can be grouped together if they share enough items

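The distance between two lists is one minus their overlap relative to the shorter list:

$d_l\left(l_1, l_2\right)=1-\frac{\left|l_1 \cap l_2\right|}{\min \left\{\left|l_1\right|,\left|l_2\right|\right\}}$

The distance between two clusters is the largest distance between any pair of their lists (complete linkage):

$d_c\left(c_1, c_2\right)=\max _{l_1 \in c_1, l_2 \in c_2} d_l\left(l_1, l_2\right)$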
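In code, the list distance is one minus the overlap relative to the shorter list, and the cluster distance is the maximum list distance over all cross-cluster pairs. A small sketch (function names are my own, and lists are assumed to be non-empty):

```python
def list_distance(l1, l2):
    """d_l: 1 - |l1 ∩ l2| / min(|l1|, |l2|), on sets of items."""
    a, b = set(l1), set(l2)
    return 1.0 - len(a & b) / min(len(a), len(b))


def cluster_distance(c1, c2):
    """d_c: the largest list distance between any pair (complete linkage)."""
    return max(list_distance(l1, l2) for l1 in c1 for l2 in c2)


print(list_distance(["seiko", "casio", "citizen"], ["casio", "citizen"]))
```

Because the denominator is the size of the smaller list, a short list that is fully contained in a longer one has distance 0 to it.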

Similar lists are grouped with a modified QT (Quality Threshold) clustering algorithm. The original QT algorithm assumes that all data points are equally important.

We modify the original QT algorithm so that highly weighted lists are grouped first, and refer to the result as WQT (Quality Threshold with Weighted data points).

Individual weighted lists are not used directly as query dimensions; they are clustered first.
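The notes do not spell out the WQT steps, so the following is only one plausible reading: seed a cluster from the heaviest unclustered list, keep adding the closest remaining list while the cluster diameter stays under a threshold, then repeat. The parameters `max_diameter` and `min_weight` and their default values are assumptions for illustration.

```python
def list_distance(l1, l2):
    a, b = set(l1), set(l2)
    return 1.0 - len(a & b) / min(len(a), len(b))


def wqt_cluster(lists, weights, max_diameter=0.6, min_weight=0.0):
    """Greedy QT-style clustering that seeds from the heaviest list first.

    lists:   list of item lists
    weights: weights[i] is the weight S_l of lists[i]
    Returns clusters as lists of indices into `lists`.
    """
    remaining = sorted(range(len(lists)), key=lambda i: -weights[i])
    clusters = []
    while remaining:
        seed = remaining.pop(0)          # heaviest unclustered list
        cluster = [seed]
        while True:
            # Find the remaining list closest to the cluster, using the
            # complete-linkage distance (max over cluster members).
            best, best_d = None, None
            for i in remaining:
                d = max(list_distance(lists[i], lists[j]) for j in cluster)
                if best_d is None or d < best_d:
                    best, best_d = i, d
            if best is None or best_d > max_diameter:
                break
            cluster.append(best)
            remaining.remove(best)
        # Assumption: clusters with too little total weight are discarded.
        if sum(weights[i] for i in cluster) >= min_weight:
            clusters.append(cluster)
    return clusters


lists = [["seiko", "casio", "citizen"], ["men", "women"],
         ["casio", "citizen"], ["men", "women", "kids"]]
print(wqt_cluster(lists, weights=[3.0, 1.0, 2.0, 0.5]))
```

With these toy inputs the brand lists end up in one cluster and the gender lists in another. Seeding from heavy lists means noisy low-weight lists can only join a cluster that a trusted list already anchors.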

Dimension and Item Ranking

A good dimension should appear frequently in the top results. A dimension $c$ is more important if:

(1) The lists in $c$ are extracted from more unique websites.
(2) The lists in $c$ are more important, i.e., they have higher weights.

[Image missing in source: formula for the importance of a dimension $c$]

$S_l$ is the weight of a list $l$

In a dimension, the importance of an item depends on how many lists contain the item and its ranks in the lists.

[Image missing in source: formula for the importance of an item $e$ within a dimension $c$]

$e$ is an item
$w(c, e, s)$ is the weight contributed by a website $s$
$AvgRank_{c,e,s}$ is the average rank of $e$ within all lists extracted from website $s$.
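The two ranking formulas were images in the original post and are not reproduced in the text, so the sketch below rests on stated assumptions rather than on the exact equations. It assumes each website contributes the weight of its best list to the dimension score, and that $w(c, e, s) = 1 / \sqrt{AvgRank_{c,e,s}}$, mirroring the document-rank weight used in list weighting. The data layout and function names are hypothetical.

```python
import math
from collections import defaultdict


def dimension_score(cluster):
    """cluster: list of (site, weight, items) tuples, one per list.

    Assumption: each site contributes only its best list weight, so more
    unique sites and heavier lists both raise the score.
    """
    best = {}
    for site, weight, _ in cluster:
        if site not in best or weight > best[site]:
            best[site] = weight
    return sum(best.values())


def item_scores(cluster):
    """Score items by summing a per-site weight over supporting sites.

    Assumption: w(c, e, s) = 1 / sqrt(AvgRank), where AvgRank is the mean
    1-based position of the item in that site's lists.
    """
    ranks = defaultdict(lambda: defaultdict(list))  # item -> site -> ranks
    for site, _, items in cluster:
        for pos, item in enumerate(items, start=1):
            ranks[item][site].append(pos)
    scores = {
        item: sum(1.0 / math.sqrt(sum(r) / len(r)) for r in by_site.values())
        for item, by_site in ranks.items()
    }
    # Highest-scoring items first.
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))


cluster = [("a.com", 2.0, ["seiko", "casio"]),
           ("a.com", 1.0, ["casio", "seiko"]),
           ("b.com", 0.5, ["seiko"])]
print(dimension_score(cluster))
print(list(item_scores(cluster)))
```

Counting each site once keeps a single website that repeats the same list many times from dominating a dimension.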