Contents
- Probabilities: Joint to Conditional
- The Markov Assumption
- Maximum Likelihood Estimation
- Book-ending Sequences
- Problems with N-gram models
- Smoothing
- In Practice
- Generation
- How to select the next word
Language Models
- One application of NLP is explaining language: why some sentences are more fluent than others
- E.g. in speech recognition: "recognize speech" is more fluent than "wreck a nice beach"
- Measures "goodness" using probabilities estimated by language models
- Language models can also be used for generation
- Language models are useful for:
  - Query completion
  - Optical character recognition
  - And other generation tasks:
    - Machine translation
    - Summarization
    - Dialogue systems
- Nowadays, pretrained language models are the backbone of modern NLP systems
N-gram Language Model
Probabilities: Joint to Conditional
- The goal of a language model is to assign a probability to an arbitrary sequence of m words
- The first step is to apply the chain rule to convert the joint probability into conditional probabilities (written out below)
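Written out, the chain rule gives:

```latex
P(w_1, w_2, \ldots, w_m)
  = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1})
  = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})
```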
The Markov Assumption
- Conditioning on the full history is still intractable, so we make a simplifying assumption: only the previous n - 1 words matter, for some small n (see the formulas after this list)
- When n = 1, it is a unigram model
- When n = 2, it is a bigram model
- When n = 3, it is a trigram model
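In formulas, using the standard n-gram definitions:

```latex
\text{Markov assumption: } P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})

\text{Unigram } (n=1): \quad P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i)

\text{Bigram } (n=2): \quad P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1})

\text{Trigram } (n=3): \quad P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1})
```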
Maximum Likelihood Estimation
- Estimate the probabilities based on counts in the corpus (see the formulas below):
  - For unigram models
  - For bigram models
  - For n-gram models generally
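The standard count-based estimates, where M is the total number of word tokens in the corpus and C(·) is a corpus count:

```latex
\text{Unigram: } P(w_i) = \frac{C(w_i)}{M}

\text{Bigram: } P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}

\text{N-gram: } P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}
```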
Book-ending Sequences
- Special tags are used to denote the start and end of a sequence (a small counting sketch follows):
  - <s> = sentence start
  - </s> = sentence end
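A minimal sketch of counting padded bigrams and computing MLE bigram probabilities; the toy corpus and names here are illustrative, not from the lecture:

```python
from collections import Counter

corpus = ["a cow eats grass", "a dog eats meat"]  # toy corpus

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]  # pad with boundary tags
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "a"))      # 1.0 - both sentences start with "a"
print(bigram_prob("eats", "grass")) # 0.5 - "eats" is followed by "grass" once out of twice
```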
Problems with N-gram models
- Language has long-distance effects, so a large n is required
  - E.g. "The lecture/s that took place last week was/were on preprocessing"
  - The verb "was/were" must agree with "lecture/s", which occurs six words earlier, so the model needs a context of at least six previous words
- The resulting probabilities are often very small
  - Possible solution: use log probabilities to avoid numerical underflow (see the sketch at the end of this list)
- Unseen words
  - Represent them with a special symbol, e.g. <UNK>
- Unseen n-grams: because the probabilities are multiplied together, a single zero term makes the whole sequence probability zero
  - Need to smooth the n-gram language model (next section)
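A minimal sketch of scoring a sequence in log space; the bigram probabilities here are made-up illustrative values, not from the lecture:

```python
import math

# Illustrative bigram probabilities P(w_i | w_{i-1}); values are assumptions
bigram_prob = {("<s>", "a"): 0.4, ("a", "cow"): 0.01, ("cow", "eats"): 0.05,
               ("eats", "grass"): 0.02, ("grass", "</s>"): 0.1}

tokens = ["<s>", "a", "cow", "eats", "grass", "</s>"]

# Multiplying the raw probabilities quickly underflows for long sequences;
# summing log probabilities is numerically safe
log_prob = sum(math.log(bigram_prob[pair]) for pair in zip(tokens, tokens[1:]))
print(log_prob)            # sum of log probabilities (a negative number)
print(math.exp(log_prob))  # the (very small) raw sequence probability
```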
Smoothing
- Basic idea: give events you have never seen before some probability
- It must still be the case that the probabilities sum to one
- Many different kinds of smoothing:
  - Laplacian (add-one) smoothing
  - Add-k smoothing
  - Absolute discounting
  - Katz Backoff
  - Kneser-Ney smoothing
  - Interpolation
  - Interpolated Kneser-Ney smoothing
Laplacian (add-one) smoothing
- Simple idea: pretend we have seen each n-gram once more than we actually did (see the formulas below)
  - For unigram models
  - For bigram models
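The add-one estimates in standard form, where M is the number of tokens and |V| is the vocabulary size:

```latex
\text{Unigram: } P_{\text{add-1}}(w_i) = \frac{C(w_i) + 1}{M + |V|}

\text{Bigram: } P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + |V|}
```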
Add-k smoothing
- Adding one is often too much; instead, add a fractional count k (see the formula below)
- Also called Lidstone smoothing
- Have to choose the value of k
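For the bigram case, the standard form is:

```latex
P_{\text{add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k\,|V|}
```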
Absolute Discounting
- Borrows a fixed probability mass from observed n-gram counts
- Redistributes it to unseen n-grams (one common formulation is sketched below)
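One common way to write this for bigrams, with discount D; the notation is assumed rather than taken from the original slides, and the equal split over unseen words matches the description above:

```latex
P_{\text{abs}}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-1}, w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1}, w_i) > 0 \\[2ex]
\dfrac{D \times |\{w : C(w_{i-1}, w) > 0\}|}{C(w_{i-1})} \times \dfrac{1}{|\{w : C(w_{i-1}, w) = 0\}|} & \text{otherwise}
\end{cases}
```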
Katz Backoff
- Absolute discounting redistributes the probability mass equally across all unseen n-grams
- Katz Backoff: redistributes the mass based on a lower-order model, e.g. the unigram model (a formula sketch follows this list)
- Problem: it prefers high-frequency words over genuinely related words
  - E.g. "I can't see without my reading ___"
  - C(reading, glasses) = C(reading, Francisco) = 0
  - C(Francisco) > C(glasses)
  - Katz Backoff will give a higher probability to "Francisco"
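A sketch of the bigram form with absolute discounting, where α(w_{i-1}) is the probability mass removed from the observed bigrams; the notation is assumed and may differ from the original slides:

```latex
P_{\text{katz}}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-1}, w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1}, w_i) > 0 \\[2ex]
\alpha(w_{i-1}) \times \dfrac{P(w_i)}{\sum_{w_j :\, C(w_{i-1}, w_j) = 0} P(w_j)} & \text{otherwise}
\end{cases}
```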
Kneser-Ney Smoothing
- Redistributes probability mass based on the versatility of the lower-order n-gram
  - Also called the continuation probability
- Versatility:
  - High versatility: co-occurs with many unique words
    - E.g. glasses: men's glasses, black glasses, buy glasses
  - Low versatility: co-occurs with few unique words
    - E.g. Francisco: San Francisco
- Intuitively, the numerator of P_cont counts the number of unique w_{i-1} that co-occur with w_i (see the formula below)
- High continuation counts for "glasses" and low continuation counts for "Francisco"
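The continuation probability in its standard form:

```latex
P_{\text{cont}}(w_i) = \frac{\left|\{ w_{i-1} : C(w_{i-1}, w_i) > 0 \}\right|}{\sum_{w_i} \left|\{ w_{i-1} : C(w_{i-1}, w_i) > 0 \}\right|}
```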
Interpolation
- A better way to combine n-gram models of different orders
- Weighted sum of probabilities across progressively shorter contexts
- E.g. an interpolated trigram model:
  P_IN(w_i | w_{i-2}, w_{i-1}) = λ3 P3(w_i | w_{i-2}, w_{i-1}) + λ2 P2(w_i | w_{i-1}) + λ1 P1(w_i), where λ3 + λ2 + λ1 = 1
Interpolated Kneser-Ney Smoothing
- Uses interpolation instead of back-off (a formula sketch follows)
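A sketch of the bigram form, where β(w_{i-1}) is the normalising constant that carries the discounted mass; the notation is assumed and may differ from the original slides:

```latex
P_{\text{IKN}}(w_i \mid w_{i-1}) = \frac{\max\big(C(w_{i-1}, w_i) - D,\, 0\big)}{C(w_{i-1})} + \beta(w_{i-1})\, P_{\text{cont}}(w_i)
```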
In Practice
- Commonly used Kneser-Ney language models use 5-grams as the maximum order
- Different discount values are used for each n-gram order
Generating Language
Generation
- Given an initial word, draw the next word according to the probability distribution produced by the language model
- Include n-1 <s> tokens for an n-gram model, to provide the context needed to generate the first word
- Never generate <s>
- Generating </s> terminates the sequence
- E.g. see the sketch below
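A minimal sketch of sampling a sentence from a bigram model; the toy distribution and function names are illustrative assumptions, not from the lecture:

```python
import random

# Illustrative next-word distributions P(w_i | w_{i-1}); values are assumptions
bigram_dist = {
    "<s>":   {"a": 0.6, "the": 0.4},
    "a":     {"cow": 0.5, "dog": 0.5},
    "the":   {"cow": 0.7, "dog": 0.3},
    "cow":   {"eats": 1.0},
    "dog":   {"eats": 1.0},
    "eats":  {"grass": 0.6, "meat": 0.4},
    "grass": {"</s>": 1.0},
    "meat":  {"</s>": 1.0},
}

def generate(max_len=20):
    """Start from <s> and sample the next word until </s> is generated."""
    tokens = ["<s>"]  # n-1 = 1 start tag for a bigram model
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        dist = bigram_dist[tokens[-1]]
        words, probs = zip(*dist.items())
        # <s> never appears as a candidate, so it is never generated
        tokens.append(random.choices(words, weights=probs)[0])
    return " ".join(t for t in tokens if t not in ("<s>", "</s>"))

print(generate())  # e.g. "a cow eats grass"
```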
How to select the next word
- Argmax: takes the highest-probability word at each turn
  - Also known as greedy search
- Beam search decoding:
  - Keeps track of the top-N highest-probability words at each turn
  - Selects the sequence of words that produces the best sentence probability
- Sampling: randomly samples the next word from the distribution (a small comparison sketch follows)
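A minimal sketch contrasting argmax (greedy) selection with random sampling from a single next-word distribution; the distribution values are illustrative assumptions, and beam search (which tracks multiple partial hypotheses) is omitted for brevity:

```python
import random

# Illustrative next-word distribution P(w | "reading"); values are assumptions
next_word_dist = {"glasses": 0.5, "list": 0.3, "Francisco": 0.2}

# Argmax / greedy: always pick the single highest-probability word
greedy_choice = max(next_word_dist, key=next_word_dist.get)
print(greedy_choice)  # "glasses" every time

# Sampling: draw the next word in proportion to its probability
words, probs = zip(*next_word_dist.items())
sampled_choice = random.choices(words, weights=probs)[0]
print(sampled_choice)  # "glasses" most often, sometimes "list" or "Francisco"
```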