Contents
- Probabilities: Joint to Conditional
- The Markov Assumption
- Maximum Likelihood Estimation
- Book-ending Sequences
- Problems with N-gram models
- Smoothing
- In Practice
- Generation
- How to select the next word
Language Models
- One application of NLP is explaining language: why some sentences are more fluent than others
- E.g. in speech recognition: "recognize speech" is more fluent than "wreck a nice beach"
- Measures "goodness" using probabilities estimated by language models
- Language models can also be used for generation
- Language models are useful for:
  - Query completion
  - Optical character recognition
  - And other generation tasks:
    - Machine translation
    - Summarization
    - Dialogue systems
- Nowadays, pretrained language models are the backbone of modern NLP systems
N-gram Language Model
Probabilities: Joint to Conditional
- The goal of a language model is to assign a probability to an arbitrary sequence of m words
- The first step is to apply the chain rule to convert the joint probability into conditional probabilities (written out below)
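Written out, the chain rule gives:

```latex
P(w_1, w_2, \ldots, w_m)
  = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1})
  = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})
```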
The Markov Assumption
- Conditioning on the full history is still intractable, so we make a simplifying assumption: only the previous n - 1 words matter, for some small n (see the formulas after this list)
- When n = 1, it is a unigram model
- When n = 2, it is a bigram model
- When n = 3, it is a trigram model
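In formulas, using the standard n-gram definitions:

```latex
\text{Markov assumption: } P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})

\text{Unigram } (n=1): \quad P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i)

\text{Bigram } (n=2): \quad P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1})

\text{Trigram } (n=3): \quad P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1})
```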
Maximum Likelihood Estimation
- Estimate the probabilities based on counts in the corpus (see the formulas below):
  - For unigram models
  - For bigram models
  - For n-gram models generally
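The standard count-based estimates, where M is the total number of word tokens in the corpus and C(·) is a corpus count:

```latex
\text{Unigram: } P(w_i) = \frac{C(w_i)}{M}

\text{Bigram: } P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}

\text{N-gram: } P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}
```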
Book-ending Sequences
- Special tags are used to denote the start and end of a sequence (a small counting sketch follows):
  - <s> = sentence start
  - </s> = sentence end
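A minimal sketch of counting padded bigrams and computing MLE bigram probabilities; the toy corpus and names here are illustrative, not from the lecture:

```python
from collections import Counter

corpus = ["a cow eats grass", "a dog eats meat"]  # toy corpus

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]  # pad with boundary tags
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "a"))      # 1.0 - both sentences start with "a"
print(bigram_prob("eats", "grass")) # 0.5 - "eats" is followed by "grass" once out of twice
```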
Problems with N-gram models
- Language has long-distance effects, so a large n is required
  - E.g. "The lecture/s that took place last week was/were on preprocessing"
  - The verb "was/were" must agree with "lecture/s", which occurs six words earlier, so the model needs a context of at least six previous words
- The resulting probabilities are often very small
  - Possible solution: use log probabilities to avoid numerical underflow (see the sketch at the end of this list)
- Unseen words
  - Represent them with a special symbol, e.g. <UNK>
- Unseen n-grams: because the probabilities are multiplied together, a single zero term makes the whole sequence probability zero
  - Need to smooth the n-gram language model (next section)
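A minimal sketch of scoring a sequence in log space; the bigram probabilities here are made-up illustrative values, not from the lecture:

```python
import math

# Illustrative bigram probabilities P(w_i | w_{i-1}); values are assumptions
bigram_prob = {("<s>", "a"): 0.4, ("a", "cow"): 0.01, ("cow", "eats"): 0.05,
               ("eats", "grass"): 0.02, ("grass", "</s>"): 0.1}

tokens = ["<s>", "a", "cow", "eats", "grass", "</s>"]

# Multiplying the raw probabilities quickly underflows for long sequences;
# summing log probabilities is numerically safe
log_prob = sum(math.log(bigram_prob[pair]) for pair in zip(tokens, tokens[1:]))
print(log_prob)            # sum of log probabilities (a negative number)
print(math.exp(log_prob))  # the (very small) raw sequence probability
```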
Smoothing
- Basic idea: give events you have never seen before some probability
- It must still be the case that the probabilities sum to one
- Many different kinds of smoothing:
  - Laplacian (add-one) smoothing
  - Add-k smoothing
  - Absolute discounting
  - Katz Backoff
  - Kneser-Ney smoothing
  - Interpolation
  - Interpolated Kneser-Ney smoothing
Laplacian (add-one) smoothing
- Simple idea: pretend we have seen each n-gram once more than we actually did (see the formulas below)
  - For unigram models
  - For bigram models
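The add-one estimates in standard form, where M is the number of tokens and |V| is the vocabulary size:

```latex
\text{Unigram: } P_{\text{add-1}}(w_i) = \frac{C(w_i) + 1}{M + |V|}

\text{Bigram: } P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + |V|}
```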
Add-k smoothing
- Adding one is often too much; instead, add a fractional count k (see the formula below)
- Also called Lidstone smoothing
- Have to choose the value of k
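For the bigram case, the standard form is:

```latex
P_{\text{add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k\,|V|}
```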
Absolute Discounting
- Borrows a fixed probability mass from observed n-gram counts
- Redistributes it to unseen n-grams (one common formulation is sketched below)
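One common way to write this for bigrams, with discount D; the notation is assumed rather than taken from the original slides, and the equal split over unseen words matches the description above:

```latex
P_{\text{abs}}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-1}, w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1}, w_i) > 0 \\[2ex]
\dfrac{D \times |\{w : C(w_{i-1}, w) > 0\}|}{C(w_{i-1})} \times \dfrac{1}{|\{w : C(w_{i-1}, w) = 0\}|} & \text{otherwise}
\end{cases}
```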
Katz Backoff
- Absolute discounting redistributes the probability mass equally across all unseen n-grams
- Katz Backoff: redistributes the mass based on a lower-order model, e.g. the unigram model (a formula sketch follows this list)
- Problem: it prefers high-frequency words over genuinely related words
  - E.g. "I can't see without my reading ___"
  - C(reading, glasses) = C(reading, Francisco) = 0
  - C(Francisco) > C(glasses)
  - Katz Backoff will give a higher probability to "Francisco"
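A sketch of the bigram form with absolute discounting, where α(w_{i-1}) is the probability mass removed from the observed bigrams; the notation is assumed and may differ from the original slides:

```latex
P_{\text{katz}}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-1}, w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1}, w_i) > 0 \\[2ex]
\alpha(w_{i-1}) \times \dfrac{P(w_i)}{\sum_{w_j :\, C(w_{i-1}, w_j) = 0} P(w_j)} & \text{otherwise}
\end{cases}
```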
Kneser-Ney Smoothing
- Redistributes probability mass based on the versatility of the lower-order n-gram
  - Also called the continuation probability
- Versatility:
  - High versatility: co-occurs with many unique words
    - E.g. glasses: men's glasses, black glasses, buy glasses
  - Low versatility: co-occurs with few unique words
    - E.g. Francisco: San Francisco
- Intuitively, the numerator of P_cont counts the number of unique w_{i-1} that co-occur with w_i (see the formula below)
- High continuation counts for "glasses" and low continuation counts for "Francisco"
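The continuation probability in its standard form:

```latex
P_{\text{cont}}(w_i) = \frac{\left|\{ w_{i-1} : C(w_{i-1}, w_i) > 0 \}\right|}{\sum_{w_i} \left|\{ w_{i-1} : C(w_{i-1}, w_i) > 0 \}\right|}
```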
Interpolation
- A better way to combine n-gram models of different orders
- Weighted sum of probabilities across progressively shorter contexts
- E.g. an interpolated trigram model:
  P_IN(w_i | w_{i-2}, w_{i-1}) = λ3 P3(w_i | w_{i-2}, w_{i-1}) + λ2 P2(w_i | w_{i-1}) + λ1 P1(w_i), where λ3 + λ2 + λ1 = 1
Interpolated Kneser-Ney Smoothing
- Uses interpolation instead of back-off (a formula sketch follows)
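A sketch of the bigram form, where β(w_{i-1}) is the normalising constant that carries the discounted mass; the notation is assumed and may differ from the original slides:

```latex
P_{\text{IKN}}(w_i \mid w_{i-1}) = \frac{\max\big(C(w_{i-1}, w_i) - D,\, 0\big)}{C(w_{i-1})} + \beta(w_{i-1})\, P_{\text{cont}}(w_i)
```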
In Practice
- Commonly used Kneser-Ney language models use 5-grams as the maximum order
- Different discount values are used for each n-gram order
Generating Language
Generation
- Given an initial word, draw the next word according to the probability distribution produced by the language model
- Include n-1 <s> tokens for an n-gram model, to provide the context needed to generate the first word
- Never generate <s>
- Generating </s> terminates the sequence
- E.g. see the sketch below
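A minimal sketch of sampling a sentence from a bigram model; the toy distribution and function names are illustrative assumptions, not from the lecture:

```python
import random

# Illustrative next-word distributions P(w_i | w_{i-1}); values are assumptions
bigram_dist = {
    "<s>":   {"a": 0.6, "the": 0.4},
    "a":     {"cow": 0.5, "dog": 0.5},
    "the":   {"cow": 0.7, "dog": 0.3},
    "cow":   {"eats": 1.0},
    "dog":   {"eats": 1.0},
    "eats":  {"grass": 0.6, "meat": 0.4},
    "grass": {"</s>": 1.0},
    "meat":  {"</s>": 1.0},
}

def generate(max_len=20):
    """Start from <s> and sample the next word until </s> is generated."""
    tokens = ["<s>"]  # n-1 = 1 start tag for a bigram model
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        dist = bigram_dist[tokens[-1]]
        words, probs = zip(*dist.items())
        # <s> never appears as a candidate, so it is never generated
        tokens.append(random.choices(words, weights=probs)[0])
    return " ".join(t for t in tokens if t not in ("<s>", "</s>"))

print(generate())  # e.g. "a cow eats grass"
```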
How to select the next word
- Argmax: takes the highest-probability word at each turn
  - Also known as greedy search
- Beam search decoding:
  - Keeps track of the top-N highest-probability words at each turn
  - Selects the sequence of words that produces the best sentence probability
- Sampling: randomly samples the next word from the distribution (a small comparison sketch follows)
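A minimal sketch contrasting argmax (greedy) selection with random sampling from a single next-word distribution; the distribution values are illustrative assumptions, and beam search (which tracks multiple partial hypotheses) is omitted for brevity:

```python
import random

# Illustrative next-word distribution P(w | "reading"); values are assumptions
next_word_dist = {"glasses": 0.5, "list": 0.3, "Francisco": 0.2}

# Argmax / greedy: always pick the single highest-probability word
greedy_choice = max(next_word_dist, key=next_word_dist.get)
print(greedy_choice)  # "glasses" every time

# Sampling: draw the next word in proportion to its probability
words, probs = zip(*next_word_dist.items())
sampled_choice = random.choices(words, weights=probs)[0]
print(sampled_choice)  # "glasses" most often, sometimes "list" or "Francisco"
```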