Table of Contents
- Foreword
- Abs
- Method
- List Extraction
- List Weighting
- List Clustering
- Dimension and Item Ranking
Foreword
- This paper is from CIKM 2011, so we only consider the method, not the results.
- There are many papers that have not been shared. More papers can be found in: ShiyuNee/Awesome-Conversation-Clarifying-Questions-for-Information-Retrieval: Papers about Conversation and Clarifying Questions (github.com)
Abs
We address the problem of finding multiple groups of words or phrases that explain the underlying query facets, which we refer to as query dimensions. We assume that the important aspects of a query are usually presented and repeated in the query’s top retrieved documents in the style of lists, and query dimensions can be mined out by aggregating these significant lists.
We propose aggregating frequent lists within the top search results to mine query dimensions, and implement a system called QDMiner.
Method
QDMiner discovers query dimensions by aggregating frequent lists within the top results.
- Important information is usually organized in list formats by websites
- Listing is a graceful way to show parallel knowledge or items
- Important lists are commonly supported by relevant websites and hence repeat in the top search results, whereas unimportant lists appear only infrequently in the results.
Query dimensions are mined by the following four steps:
- List Extraction: Several types of lists are extracted from each document
- List Weighting: All extracted lists are weighted, so that unimportant or noisy lists can be assigned low weights
- List Clustering: Similar lists are grouped together to compose a dimension.
- Dimension and Item Ranking: Dimensions (relative to one another) and their items (within a dimension) are evaluated and ranked based on their importance.
List Extraction
For each document $d$, we extract a set of lists from the HTML content of $d$ based on three different types of patterns:
- Free text patterns:
  - pattern: item{, item}*(and|or) {other} item
    - Example 1: We shop for gorgeous watches from Seiko, Bulova, Lucien Piccard, Citizen, Cartier or Invicta
  - a further pattern, {^item (: |-) .+$}+, is used to extract lists from some semi-structured paragraphs
    - Example 2: … are highly important for following reasons: Consistency - every fact table is filtered consistently res… Integration - queries are able to drill different processes … Reduced development time to market - the common dimensions are available without recreating the wheel over again.
- HTML tag patterns:
  - style of HTML tags
    - SELECT: extract all text from their child tags (OPTION) to create a list
    - UL / OL: extract text within their child tags (LI)
    - TABLE: extract one list from each column or each row
- Repeat region patterns:
  - First, detect repeat regions in webpages based on vision-based DOM trees
  - Then, extract all leaf HTML nodes within each block, and group them by their tag names (name, rating, etc.) and display styles
  - Last, for each group, extract all text from its nodes as a list
Note: we do post-processing for each extracted list.
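
The free text and HTML tag patterns above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the regular expressions assume that items are capitalized phrases (free text) or short line-leading labels (semi-structured text), and the names `extract_free_text_list`, `extract_labelled_items` and `ListTagParser` are hypothetical, not part of QDMiner. TABLE handling, repeat regions and post-processing are left out.

```python
import re
from html.parser import HTMLParser

# Assumption: an item is one or more capitalized words, e.g. "Lucien Piccard".
ITEM = r"[A-Z][\w&'-]*(?: [A-Z][\w&'-]*)*"
# item{, item}* (and|or) {other} item
FREE_TEXT = re.compile(rf"\b({ITEM}(?:, {ITEM})*),? (?:and|or) (?:other )?({ITEM})")
# One line of the form "item: text" or "item - text".
LABELLED = re.compile(r"^([^:\n-]+?)[ \t]*[:-][ \t].+$", re.MULTILINE)


def extract_free_text_list(sentence):
    """Return the items of the first 'A, B, C or D' style enumeration."""
    m = FREE_TEXT.search(sentence)
    if not m:
        return []
    return m.group(1).split(", ") + [m.group(2)]


def extract_labelled_items(paragraph):
    """Return the leading labels of lines shaped like 'label - description'."""
    return [label.strip() for label in LABELLED.findall(paragraph)]


class ListTagParser(HTMLParser):
    """Collect one list per UL / OL / SELECT element from LI / OPTION text."""

    def __init__(self):
        super().__init__()
        self.lists = []        # finished lists, in closing order
        self._stack = []       # open containers, innermost last
        self._in_item = False  # True while inside an LI / OPTION

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol", "select"):
            self._stack.append([])
        elif tag in ("li", "option") and self._stack:
            self._stack[-1].append("")
            self._in_item = True

    def handle_endtag(self, tag):
        if tag in ("li", "option"):
            self._in_item = False
        elif tag in ("ul", "ol", "select") and self._stack:
            items = [t.strip() for t in self._stack.pop() if t.strip()]
            if items:
                self.lists.append(items)

    def handle_data(self, data):
        if self._in_item and self._stack and self._stack[-1]:
            self._stack[-1][-1] += data


def extract_tag_lists(html):
    parser = ListTagParser()
    parser.feed(html)
    return parser.lists
```

Applied to Example 1, `extract_free_text_list` yields the six watch brands as one list. Real pages would need a much more tolerant notion of what counts as an item.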
List Weighting
Some of the extracted lists are unimportant or noisy. Such lists are useless for finding dimensions and should be penalized.
We propose to aggregate all lists of a query, and evaluate the importance of each unique list $l$ by the following components:
- document matching weight: $S_{\mathrm{DOC}}=\sum_{d \in R}\left(s_d^m * s_d^r\right)$
  - $s_d^m$ is the percentage of items of $l$ contained in $d$
    - $s_d^m=\frac{|d \cap l|}{|l|}$
  - $s_d^r$ measures the importance of document $d$
    - $s_d^r=1 / \sqrt{\mathrm{rank}_d}$
    - The higher $d$ is ranked, the larger its score $s_d^r$ is ($d$ is more relevant to the query)
- average inverse document frequency (IDF) of items, $S_{\mathrm{IDF}}$:
  - A list comprised of common items in a corpus (we use ClueWeb09) is not informative to the query.
The importance of a list $l$: $S_l = S_{\mathrm{DOC}} * S_{\mathrm{IDF}}$
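
To make the weighting concrete, here is a minimal Python sketch of $S_{\mathrm{DOC}}$ and $S_l$. It assumes that each retrieved document is already represented as the set of items it contains, and that documents are passed in rank order starting at rank 1. The IDF component is taken as a precomputed number because its exact formula is not given in these notes. The function names `s_doc` and `list_weight` are hypothetical.

```python
import math


def s_doc(list_items, ranked_docs):
    """Document matching weight of one list.

    list_items:  the items of list l
    ranked_docs: one set of items per document, ordered by rank (best first)
    """
    items = set(list_items)
    total = 0.0
    for rank, doc_items in enumerate(ranked_docs, start=1):
        s_m = len(items & doc_items) / len(items)  # share of l found in d
        s_r = 1.0 / math.sqrt(rank)                # higher rank, larger score
        total += s_m * s_r
    return total


def list_weight(list_items, ranked_docs, s_idf):
    """S_l = S_DOC * S_IDF, with S_IDF supplied by the caller (assumption)."""
    return s_doc(list_items, ranked_docs) * s_idf
```

For a two-item list that is fully contained in the top document and half contained in the second one, `s_doc` gives $1 + 0.5/\sqrt{2}$.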
List Clustering
Two lists can be grouped together if they share enough items
- distance between two lists: $d_l\left(l_1, l_2\right) = 1-\frac{\left|l_1 \cap l_2\right|}{\min \left\{\left|l_1\right|,\left|l_2\right|\right\}}$
- distance between two clusters (complete linkage): $d_c\left(c_1, c_2\right)=\max _{l_1 \in c_1, l_2 \in c_2} d_l\left(l_1, l_2\right)$
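
The two distances translate directly into code. The sketch below assumes that lists are non-empty and that duplicate items within a list are ignored (each list is treated as a set). The function names are hypothetical.

```python
def list_distance(l1, l2):
    """d_l: 1 minus the overlap relative to the shorter list."""
    a, b = set(l1), set(l2)
    return 1.0 - len(a & b) / min(len(a), len(b))


def cluster_distance(c1, c2):
    """d_c: complete linkage, the largest distance between any pair of lists."""
    return max(list_distance(l1, l2) for l1 in c1 for l2 in c2)
```

Because the denominator is the size of the shorter list, a list that is a subset of another has distance 0 to it.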
A modified QT (Quality Threshold) clustering algorithm is used to group similar lists. The original QT algorithm assumes that all data points are equally important.
The original QT algorithm is modified to group highly weighted lists first. The resulting algorithm is referred to as WQT (Quality Threshold with Weighted data points).
Individual weighted lists are not used as query dimensions.
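
One possible reading of "group highly weighted lists first" is sketched below. It is not the WQT algorithm as published: the notes above do not give its details, so the greedy growth step, the diameter threshold `dia_max`, and the function name `wqt_sketch` are all assumptions made for illustration.

```python
def list_distance(l1, l2):
    """d_l from the section above; lists are treated as non-empty sets."""
    a, b = set(l1), set(l2)
    return 1.0 - len(a & b) / min(len(a), len(b))


def wqt_sketch(lists, weights, dia_max):
    """Greedy clustering seeded by the highest-weight unclustered list.

    lists:   list of item lists
    weights: S_l for each list, in the same order
    dia_max: assumed maximum cluster diameter (hypothetical parameter)
    Returns clusters as lists of indices into `lists`.
    """
    remaining = sorted(range(len(lists)), key=lambda i: weights[i], reverse=True)
    clusters = []
    while remaining:
        cluster = [remaining[0]]  # seed: most important list left
        candidates = remaining[1:]

        def dist_to_cluster(i):
            # Complete linkage: farthest member decides the distance.
            return max(list_distance(lists[i], lists[j]) for j in cluster)

        while candidates:
            best = min(candidates, key=dist_to_cluster)
            if dist_to_cluster(best) > dia_max:
                break  # adding any further list would exceed the diameter
            cluster.append(best)
            candidates.remove(best)
        clusters.append(cluster)
        remaining = [i for i in remaining if i not in cluster]
    return clusters
```

Seeding from the heaviest list means that clusters form around the lists most likely to matter, instead of around whichever point happens to attract the most neighbours.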
Dimension and Item Ranking
A good dimension should frequently appear in the top results. A dimension $c$ is more important if:
- (1) the lists in $c$ are extracted from more unique websites
- (2) the lists in $c$ are more important, i.e., they have higher weights.
- $S_l$ is the weight of a list $l$
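
A score that satisfies both criteria can be sketched as follows. The exact scoring formula is not reproduced in these notes, so this sketch assumes that each website contributes the weight of its single best list in $c$ and that the contributions are summed. The function name `dimension_score` and the input layout are hypothetical.

```python
from collections import defaultdict


def dimension_score(cluster):
    """Score a dimension from (site, list_weight) pairs.

    Assumption: one vote per website, equal to the weight S_l of its best
    list, so repeated lists from the same site do not inflate the score.
    Weights are assumed to be non-negative.
    """
    best_per_site = defaultdict(float)
    for site, weight in cluster:
        best_per_site[site] = max(best_per_site[site], weight)
    return sum(best_per_site.values())
```

Under this assumption the score grows both with the number of unique websites and with the weights of their lists.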
In a dimension, the importance of an item depends on how many lists contain the item and its ranks in the lists.
- $e$ is an item
- $w(c,e,s)$ is the weight contributed by a website $s$
- $AvgRank_{c,e,s}$ is the average rank of $e$ within all lists extracted from website $s$.
We only output qualified items by default in QDMiner.
- qualified items: $S_{e|c} > 1$ and $S_{e|c} > \frac{|Sites(c)|}{10}$
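
The item ranking and the qualification filter can be sketched together. The two thresholds come from the rule above. The form of the per-site weight is an assumption, since it is not spelled out in these notes: the sketch takes $w(c,e,s) = 1/\sqrt{AvgRank_{c,e,s}}$ and sums it over the websites of $c$ to obtain $S_{e|c}$. The function name `rank_items` is hypothetical.

```python
import math
from collections import defaultdict


def rank_items(cluster):
    """Rank the items of one dimension and keep only qualified ones.

    cluster: list of (site, items) pairs, items in their original list order.
    Returns (item, score) pairs sorted by descending score.
    """
    ranks = defaultdict(lambda: defaultdict(list))  # item -> site -> ranks
    sites = set()
    for site, items in cluster:
        sites.add(site)
        for position, item in enumerate(items, start=1):
            ranks[item][site].append(position)

    scores = {}
    for item, by_site in ranks.items():
        # Assumed weight per site: 1 / sqrt(average rank of the item there).
        scores[item] = sum(
            1.0 / math.sqrt(sum(r) / len(r)) for r in by_site.values()
        )

    # Qualified: S_{e|c} > 1 and S_{e|c} > |Sites(c)| / 10.
    threshold = max(1.0, len(sites) / 10)
    qualified = [(e, s) for e, s in scores.items() if s > threshold]
    return sorted(qualified, key=lambda pair: pair[1], reverse=True)
```

With three websites, an item that appears on only one of them can score at most 1 under this weight, so it is filtered out.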