1️⃣Setup: given a query $Q=\{q_1,q_2,\ldots,q_n\}$ and a passage collection $\mathscr{P}=\left\{P^{(1)},P^{(2)},\ldots,P^{(N)}\right\}$, where each passage is $P^{(\alpha)}=\left\{p^{(\alpha)}_1,p^{(\alpha)}_2,\ldots,p^{(\alpha)}_m\right\}$
2️⃣Inverse document frequency: $\displaystyle\text{IDF}(q_i)=\log\left(\cfrac{N-\text{DF}(q_i)+0.5}{\text{DF}(q_i)+0.5}+1\right)$
- Passage frequency: $\text{DF}(q_i)$ is the document frequency, i.e. the number of passages in $\mathscr{P}=\left\{P^{(1)},P^{(2)},\ldots,P^{(N)}\right\}$ that contain the term $q_i$
- Meaning: measures how rare the term $q_i$ is in the passage collection $\mathscr{P}$; the higher the $\text{IDF}$ value, the rarer the term
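
The IDF formula above can be sketched in a few lines of Python. This is only a minimal illustration: it assumes each passage is already tokenized into a list of terms, and the function names `document_frequency` and `idf` as well as the toy passages are hypothetical, not taken from any particular library.

```python
import math

def document_frequency(term, passages):
    """DF(q_i): number of passages that contain the term at least once."""
    return sum(1 for passage in passages if term in passage)

def idf(term, passages):
    """IDF(q_i) = log((N - DF + 0.5) / (DF + 0.5) + 1)."""
    n = len(passages)
    df = document_frequency(term, passages)
    return math.log((n - df + 0.5) / (df + 0.5) + 1)

# Toy collection (assumed for illustration): three pre-tokenized passages.
passages = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

for term in ["the", "cat", "dog"]:
    print(term, document_frequency(term, passages), idf(term, passages))
```

In this toy collection "dog" appears in one passage, "cat" in two, and "the" in all three, so the IDF values decrease in that order, which matches the reading that a higher IDF means a rarer term.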
1️⃣ $\text{BM25}$ score: $\displaystyle\text{Score}\left(Q,P^{(\alpha)}\right)=\sum_{i=1}^{n}\text{IDF}(q_i)\times\cfrac{(k_1+1)\,\text{TF}\left(q_i,P^{(\alpha)}\right)}{\text{TF}\left(q_i,P^{(\alpha)}\right)+k_1\left(1-b+\cfrac{b\,|P^{(\alpha)}|}{\text{avgdl}}\right)}$
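- Term frequency: $\text{TF}\left(q_i,P^{(\alpha)}\right)$ is the total number of times $q_i$ occurs in $P^{(\alpha)}$
- Normalization: the factor $1-b+\cfrac{b\,|P^{(\alpha)}|}{\text{avgdl}}$, where $|P^{(\alpha)}|$ is the length of the passage and $\text{avgdl}$ is the average passage length; it offsets the built-in scoring advantage of long passages
- Tuning: $k_1$ (how quickly the term-frequency contribution saturates) and $b$ (how strongly length is normalized; $b=0$ turns normalization off) can be adjusted to suit different retrieval tasks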
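
Putting the pieces together, the score can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: passages are assumed to be pre-tokenized lists, passage length is counted in tokens, the function name `bm25_scores` is hypothetical, and the defaults `k1=1.5` and `b=0.75` are just commonly used starting values, not values given in the text above.

```python
import math
from collections import Counter

def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Return one BM25 score per passage for a tokenized query.

    k1 and b defaults are assumed, commonly used starting points.
    """
    n = len(passages)
    avgdl = sum(len(p) for p in passages) / n  # average passage length
    # DF(q_i): number of passages containing each distinct query term
    df = {q: sum(1 for p in passages if q in p) for q in set(query)}

    scores = []
    for passage in passages:
        tf = Counter(passage)  # TF(q_i, P): raw occurrence counts
        norm = 1 - b + b * len(passage) / avgdl  # length normalization
        score = 0.0
        for q in query:  # sum over q_1 .. q_n, as in the formula
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1)
            score += idf * (k1 + 1) * tf[q] / (tf[q] + k1 * norm)
        scores.append(score)
    return scores

# Toy collection (assumed for illustration)
passages = [
    ["cat", "cat", "sat"],
    ["dog", "barked", "loudly"],
    ["the", "cat", "and", "the", "dog", "sat", "down", "together"],
]
print(bm25_scores(["cat"], passages))
```

For the query `["cat"]`, the second passage scores zero because it never mentions the term, and the short first passage with two occurrences outranks the long third passage with one, which reflects both the term-frequency and the length-normalization parts of the formula.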