COMP 6714 Info Retrieval and Web Search: Week 1 Notes

news2026/2/18 7:33:12

I could cry. This was the only course I could actually follow this week.

What is IR (Information Retrieval)?
Basic assumptions of IR
Unstructured (text) vs. structured
Documents vs. Database Records
Comparing Text
Dimensions of IR
IR Tasks
Big Issues in IR
- Relevance
- Evaluation
Unranked retrieval evaluation

What is IR (Information Retrieval)?

IR is not the same thing as search, and it is not database querying.
The field of computer science that is most involved with R&D (research and development) for search is information retrieval (IR).

Finding: finding material (documents)
Unstructured: of an unstructured nature
Large collection: satisfying an information need from within a large collection

Basic assumptions of IR

Collection: a set of documents, assumed to be static (a static collection for the moment).
Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task.

Unstructured (text) vs. structured

Note: market cap = total market value.
In the mid-1990s most data was already unstructured, yet most of the industry's money was in structured databases such as Oracle, Microsoft SQL Server, and IBM DB2.

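By 2019 there was even more unstructured data, and more money was being spent on unstructured data than on structured data (ChatGPT, released later in 2022, is a more recent example of this trend).
This makes information retrieval more important than it used to be.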

Documents vs. Database Records

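A database record (or a tuple in a relational database) usually consists of well-defined fields (or attributes). Querying a database is easy because its fields have well-defined semantics; querying text (documents) is harder.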
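
A small sketch of that contrast, using made-up data. The `records` list, its field names, the `docs` strings, and the `keyword_match` helper are all hypothetical and only here for illustration. The record query is an exact comparison on fields whose meaning is known, while the text query can only guess from which words happen to appear.

```python
# Hypothetical structured records: each field has a well-defined meaning.
records = [
    {"name": "Alice", "dept": "Sales", "salary": 72000},
    {"name": "Bob", "dept": "Engineering", "salary": 95000},
    {"name": "Carol", "dept": "Engineering", "salary": 88000},
]

# Structured query: the condition is unambiguous and the answer is exact.
high_paid_engineers = [
    r["name"] for r in records
    if r["dept"] == "Engineering" and r["salary"] > 90000
]

# Hypothetical unstructured documents: no fields, just text.
docs = [
    "Bob from engineering got a raise and now earns well above 90k.",
    "The engineering salary survey is due next week.",
    "Carol moved from sales to engineering last year.",
]

def keyword_match(query, documents):
    """Return indices of documents containing every query word (naive substring test)."""
    words = query.lower().split()
    return [i for i, d in enumerate(documents) if all(w in d.lower() for w in words)]

text_hits = keyword_match("engineering salary", docs)

print(high_paid_engineers)
print(text_hits)
```

The keyword match returns only the survey document and misses the first one, which is the one that actually talks about a well-paid engineer but never uses the word "salary". That gap between words and meaning is what makes text harder than records.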

Comparing Text

Comparing the query text with the document text and deciding what counts as a good match is the core issue of information retrieval.
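
To make "comparing text" concrete, here is a minimal sketch of one naive matching score, the Jaccard overlap between word sets. This is an assumption chosen for illustration, not the method the course prescribes; the `tokenize` and `jaccard` helpers and the sample strings are made up.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return set(text.lower().split())

def jaccard(query, document):
    """Jaccard overlap between the word sets of a query and a document."""
    q, d = tokenize(query), tokenize(document)
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)

query = "cheap flights to sydney"
docs = [
    "cheap flights to sydney and melbourne",
    "low cost airfare to sydney",
    "history of the sydney opera house",
]

scores = [jaccard(query, d) for d in docs]
best = max(range(len(docs)), key=lambda i: scores[i])

for d, s in zip(docs, scores):
    print(f"{s:.3f}  {d}")
```

The first document wins, as expected. The second one means almost the same thing as the query but scores low because it uses different words ("low cost airfare"), which hints at why deciding on a good match is hard.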

Dimensions of IR

IR is about more than text and web search (although those are the core of this course).

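IR Tasks

Ad-hoc search: find relevant documents for an arbitrary text query.
Filtering (aka information dissemination): identify the relevant user profiles for a new document (for example, if you tell your social media app that you like anime, it will probably push anime content to you later).
Classification: identify relevant labels for documents.
Question answering: give a specific answer to a question.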

Big Issues in IR

Evaluation

Experimental procedures and measures for comparing system output with user expectations.
Recall and precision are two examples of effectiveness measures.

Unranked retrieval evaluation:

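Accuracy is not really an information retrieval term. It is misleading here, because almost every document in a collection is non-relevant to any given query, so a system that retrieves nothing is still "right" almost all the time. We evaluate IR with Precision and Recall instead.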
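
A quick back-of-the-envelope check of why accuracy misleads, with made-up numbers (the collection size and the count of relevant documents are assumptions, not figures from the lecture):

```python
# Hypothetical collection: one million documents, only 100 relevant to the query.
total_docs = 1_000_000
relevant_docs = 100

# A useless "system" that retrieves nothing: it is right about every
# non-relevant document and wrong about every relevant one.
correct_decisions = total_docs - relevant_docs
accuracy = correct_decisions / total_docs

# It finds none of the relevant documents.
relevant_found = 0
recall = relevant_found / relevant_docs

print(f"accuracy = {accuracy:.4%}, recall = {recall:.0%}")
```

The do-nothing system scores nearly perfect accuracy while finding zero relevant documents, so accuracy says nothing useful about retrieval quality.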

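Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
Of the documents you retrieved, how many are actually relevant?
Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
Of all the relevant documents, how many did you retrieve?
So one is measured against the retrieved set, and the other against the relevant documents in the collection.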
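- tp: true positive (relevant, and retrieved)
- fp: false positive (not relevant, but retrieved)
- fn: false negative (relevant, but not retrieved)
- tn: true negative (not relevant, and not retrieved)
  All the "true" cases are the good stuff; all the "false" cases are the ones you don't want.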
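
In terms of these counts, precision = tp / (tp + fp) and recall = tp / (tp + fn). Below is a minimal sketch that computes both from sets of document IDs. The `precision_recall` helper and the IDs are hypothetical, and returning 0.0 when a set is empty is just a convention assumed here.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from sets of retrieved and relevant doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # not relevant but retrieved
    fn = len(relevant - retrieved)   # relevant but not retrieved
    # Assumed convention: define the measure as 0.0 when its denominator is 0.
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Made-up example: 4 documents retrieved, 5 documents actually relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5", "d6", "d7"}

p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were retrieved (recall 0.4). Note that tn never appears in either formula, which is why the huge number of non-relevant documents cannot inflate these measures the way it inflates accuracy.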