08.C2W3.Auto-complete and Language Models

news2025/1/22 15:40:56


Contents

  • N-Grams: Overview
  • N-grams and Probabilities
    • N-grams
    • Sequence notation
    • Unigram probability
    • Bigram probability
    • Trigram Probability
    • N -gram probability
    • Quiz
  • Sequence Probabilities
    • Probability of a sequence
    • Sequence probability shortcomings
    • Approximation by N gram probabilities
    • Quiz
  • Starting and Ending Sentences
    • Start of sentence token <s>
    • End of sentence token </s> -motivation
    • End of sentence token </s> -solution
    • Example-bigram
    • Quiz
  • The N-gram Language Model
    • Count matrix
    • Probability matrix
    • Language model
    • Log probability
    • Generative Language model
  • Language Model Evaluation
    • Test data
    • Perplexity
    • Perplexity for bigram models
    • Log perplexity
    • Example
  • Out of Vocabulary Words
    • Out of vocabulary words
    • Using <UNK> in corpus
    • How to create vocabulary V
    • Quiz
  • Smoothing
    • Missing N-grams in training corpus
  • Smoothing
  • Backoff
    • Interpolation


N-Grams: Overview

● Create language model (LM) from text corpus to
○ Estimate probability of word sequences
○ Estimate probability of a word following a sequence of words
● Apply this concept to autocomplete a sentence with most likely suggestions
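Language models are widely used in natural language processing (NLP) and artificial intelligence. Typical applications include:

Speech Recognition:
A speech recognition system converts human speech into written text. The language model scores candidate word sequences in context, so the system can prefer the most probable transcription, which improves speech-to-text accuracy.

Spelling Correction:
A language model can detect and correct spelling errors. Because it knows which word sequences are common in a language, it can flag words that do not fit and suggest or automatically apply a correction. For example, if a user types "recieve", the model can recognize that this is not a common word and suggest "receive".

Augmentative and Alternative Communication (AAC):
AAC devices and systems help people with speech impairments or communication difficulties express themselves. A language model integrated into such a system can offer personalized predictions of the next word or phrase, so users can build sentences and express ideas faster.

Main objectives:
● Process text corpus to N-gram language model
● Out of vocabulary words
● Smoothing for previously unseen N-grams
● Language model evaluation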

N-grams and Probabilities

N-grams

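An N-gram is a sequence of N consecutive words. Each word in the sequence is called a "gram", and the sequence captures local context in the text. An N-gram language model is a statistical model built from the counts of such sequences.

Examples for different values of N:
Unigram (1-gram): N=1, a single word. For example, "cat" is a unigram.
Bigram (2-gram): N=2, two consecutive words. For example, "cat sat" is a bigram.
Trigram (3-gram): N=3, three consecutive words. For example, "cat sat on" is a trigram.
N-gram models matter for language modeling because they can estimate the probability of the next word in a sequence. For example, given the first word, a bigram model estimates the probability of the second word. Such models are important for applications like spell checking, grammatical analysis, machine translation, and speech recognition.

N-gram models also have limitations. When N is large they suffer from data sparsity, because many word sequences appear only rarely, or never, in the training data. They also ignore context beyond local word order, such as syntax and semantics.

The key point is that N-grams give a simple but effective way to capture local dependencies in text.
Another example:
Corpus: I am happy because I am learning
Unigrams: {I, am, happy, because, learning}
Bigrams: {I am, am happy, happy because, …}. "I happy" is not a bigram, because the two words must be consecutive. "I am" appears twice in the corpus but is recorded only once in the set.
Trigrams: {I am happy, am happy because, …}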
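The extraction described above can be sketched in a few lines of Python. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration, not part of the original notes.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = "I am happy because I am learning".split()

# Sets keep each distinct N-gram once, as in the example above.
unigrams = set(ngrams(corpus, 1))
bigrams = set(ngrams(corpus, 2))
trigrams = set(ngrams(corpus, 3))

print(sorted(bigrams))
```

Because the slices are contiguous, a pair such as ("I", "happy") can never appear in `bigrams`.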

Sequence notation

Suppose the corpus contains 500 words, so $m = 500$.
A word sequence is then written as:
$w_1^m = w_1 w_2 \cdots w_m$
For example, the sequence from the first to the third word:
$w_1^3 = w_1 w_2 w_3$
The last three words of the corpus are:
$w_{m-2}^m = w_{m-2} w_{m-1} w_m$

Unigram probability

Suppose the corpus is: I am happy because I am learning
Corpus size: $m = 7$
For the word "I": $P(I) = \cfrac{2}{7}$
For the word "happy": $P(happy) = \cfrac{1}{7}$
The unigram probability formula is:
$P(w) = \cfrac{C(w)}{m}$
where $C(w)$ is the number of times $w$ occurs in the corpus.

Bigram probability

Suppose the corpus is: I am happy because I am learning
The probability that "am" follows "I": $P(am|I) = \cfrac{C(I\space am)}{C(I)} = \cfrac{2}{2} = 1$
The probability that "happy" follows "I": $P(happy|I) = \cfrac{C(I\space happy)}{C(I)} = \cfrac{0}{2} = 0$
The probability that "learning" follows "am": $P(learning|am) = \cfrac{C(am\space learning)}{C(am)} = \cfrac{1}{2}$
The bigram probability formula is:
$P(y|x) = \cfrac{C(x\space y)}{\sum_w C(x\space w)} = \cfrac{C(x\space y)}{C(x)}$
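As a minimal sketch, the bigram formula can be checked on the example corpus with `collections.Counter`. The function name `bigram_prob` is an illustrative choice, and exact fractions are used so the results match the hand calculation.

```python
from collections import Counter
from fractions import Fraction

corpus = "I am happy because I am learning".split()

unigram_counts = Counter(corpus)                    # C(x)
bigram_counts = Counter(zip(corpus, corpus[1:]))    # C(x y)

def bigram_prob(prev, word):
    """P(word | prev) = C(prev word) / C(prev)."""
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])

print(bigram_prob("I", "am"), bigram_prob("I", "happy"), bigram_prob("am", "learning"))
```

A `Counter` returns 0 for unseen bigrams, which is why P(happy|I) comes out as 0 without special handling.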

Trigram Probability

Suppose the corpus is: I am happy because I am learning
The probability that "happy" follows "I am": $P(happy|I\space am) = \cfrac{C(I\space am\space happy)}{C(I\space am)} = \cfrac{1}{2}$
The trigram probability formula is:
$P(w_3|w_1^2) = \cfrac{C(w_1^2 w_3)}{C(w_1^2)}$

N -gram probability

The general formula is:
$P(w_N|w_1^{N-1}) = \cfrac{C(w_1^{N-1} w_N)}{C(w_1^{N-1})}$
The numerator is simply the count of the full N-gram: $C(w_1^{N-1} w_N) = C(w_1^N)$
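The general formula can be sketched for any context length. The helpers `count_ngrams` and `ngram_prob` are hypothetical names. The sketch recounts the corpus on every call for brevity, and it raises `ZeroDivisionError` when the context never occurs in the corpus.

```python
from collections import Counter
from fractions import Fraction

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_prob(tokens, context, word):
    """P(word | context) = C(context + word) / C(context)."""
    context = tuple(context)
    numerator = count_ngrams(tokens, len(context) + 1)[context + (word,)]
    denominator = count_ngrams(tokens, len(context))[context]
    return Fraction(numerator, denominator)

corpus = "I am happy because I am learning".split()
p = ngram_prob(corpus, ["I", "am"], "happy")
print(p)
```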

Quiz

Corpus:
"In every place of great resort the monster was the fashion. They sang of it in the cafes, ridiculed it in the papers, and represented it on the stage." (Jules Verne, Twenty Thousand Leagues under the Sea)
In the context of our corpus, what is the probability of the word "papers" following the phrase "it in the"?
Answer: 1/2
Explanation: "it in the" occurs 2 times in total, and it is followed by "papers" 1 time.

Sequence Probabilities

Probability of a sequence

How do we compute the probability of a given sentence?
By the chain rule:
$P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)$
which follows from the definition of conditional probability:
$P(B|A) = \cfrac{P(A,B)}{P(A)} \Rightarrow P(A,B) = P(A)P(B|A)$
So the probability of a sentence is:
P(the teacher drinks tea)=P(the)P(teacher|the)P(drinks|the teacher)P(tea|the teacher drinks)

Sequence probability shortcomings

The main problem: the corpus almost never contains the exact sentence we're interested in, or even its longer subsequences!
For example, take the last factor in the example above:
$P(tea|the\space teacher\space drinks) = \cfrac{C(the\space teacher\space drinks\space tea)}{C(the\space teacher\space drinks)}$
Both the numerator and the denominator counts are likely to be 0 in the corpus, so this factor is 0 or undefined, and the product for P(the teacher drinks tea) that depends on it cannot be estimated either.

Approximation by N gram probabilities

To avoid this, restrict the condition to the previous word only:
$P(tea|the\space teacher\space drinks) \approx P(tea|drinks)$
$P(the\space teacher\space drinks\space tea) = P(the)P(teacher|the)P(drinks|the\space teacher)P(tea|the\space teacher\space drinks)$
$\approx P(the)P(teacher|the)P(drinks|teacher)P(tea|drinks)$
This is an instance of the Markov assumption: only last N words matter
Bigram approximation for one word:
$P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-1})$
N-gram approximation for one word:
$P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-N+1}^{n-1})$
Bigram approximation for a whole sentence:
$P(w_1^n) \approx P(w_1)P(w_2|w_1) \cdots P(w_n|w_{n-1})$

Quiz

Given these conditional probabilities
P(Mary)=0.1;
P(likes)=0.2;
P(cats)=0.3
P(Mary|likes) =0.2;
P(likes|Mary) =0.3;
P(cats|likes)=0.1;
P(likes|cats)=0.4
Approximate the probability of the following sentence with bigrams: “Mary likes cats”
Answer: 0.003
Explanation: P(Mary likes cats)=P(Mary)P(likes|Mary)P(cats|likes)=0.1×0.3×0.1=0.003
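The quiz computation can be reproduced with a small sketch. The dictionary layout, keyed as (previous word, word), and the function name `sentence_prob` are assumptions made for illustration.

```python
# Probabilities given in the quiz.
unigram = {"Mary": 0.1, "likes": 0.2, "cats": 0.3}
# Key is (previous word, word), so P(likes|Mary) is stored under ("Mary", "likes").
bigram = {
    ("likes", "Mary"): 0.2,
    ("Mary", "likes"): 0.3,
    ("likes", "cats"): 0.1,
    ("cats", "likes"): 0.4,
}

def sentence_prob(words):
    """Bigram approximation: P(w1) * P(w2|w1) * ... * P(wn|wn-1)."""
    p = unigram[words[0]]
    for prev, word in zip(words, words[1:]):
        p *= bigram[(prev, word)]
    return p

print(sentence_prob(["Mary", "likes", "cats"]))
```

Note that P(Mary|likes) and P(likes|cats) are distractors: they never appear in the product for this word order.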

Starting and Ending Sentences

Start of sentence token <s>

$P(the\space teacher\space drinks\space tea) \approx P(the)P(teacher|the)P(drinks|teacher)P(tea|drinks)$
The first word has no preceding word, so its factor cannot be written as a bigram conditional probability. We therefore prepend a special start token so that every factor on the right-hand side becomes a bigram. "the teacher drinks tea" becomes "<s> the teacher drinks tea", and the probability becomes:
$P(<s>\space the\space teacher\space drinks\space tea) \approx P(the|<s>)P(teacher|the)P(drinks|teacher)P(tea|drinks)$

For trigrams:
$P(the\space teacher\space drinks\space tea) \approx P(the)P(teacher|the)P(drinks|the\space teacher)P(tea|teacher\space drinks)$
two <s> tokens are needed, giving: <s> <s> the teacher drinks tea. The first two factors then become $P(the|<s>\space <s>)$ and $P(teacher|<s>\space the)$.

In general, an N-gram model needs N-1 <s> tokens.

End of sentence token </s> -motivation

First motivation:
Consider the formula:
$P(y|x) = \cfrac{C(x,y)}{\sum_w C(x,w)} = \cfrac{C(x,y)}{C(x)}$
When $x$ is the last word of a sentence, the two denominators are not necessarily equal, that is: $\sum_w C(x,w) \neq C(x)$
For example, take the corpus:
<s> Lyn drinks chocolate
<s> John drinks
The number of times "drinks" is followed by another word is 1:
$\sum_w C(drinks,w) = 1$
The number of times "drinks" occurs on its own is 2:
$C(drinks) = 2$
Second motivation:
Suppose the corpus is:
<s> yes no
<s> yes yes
<s> no no
First enumerate all possible sentences of length 2:
<s> yes yes
<s> yes no
<s> no no
<s> no yes
Taking the first one, <s> yes yes, its probability is:
$P(<s>\space yes\space yes) = P(yes|<s>) \times P(yes|yes)$
$= \cfrac{C(<s>,yes)}{\sum_w C(<s>,w)} \times \cfrac{C(yes,yes)}{\sum_w C(yes,w)}$
$= \cfrac{2}{3} \times \cfrac{1}{2} = \cfrac{1}{3}$
Similarly, <s> yes no has probability 1/3, <s> no no has probability 1/3, and <s> no yes has probability 0.
So the probabilities of all sentences of length 2 sum to: $\sum_{2\space word} P(\cdots) = 1/3 + 1/3 + 1/3 + 0 = 1$
The same computation for sentences of length 3 also gives a total of 1.
This is not what we want. The probabilities of all sentences of all lengths together should sum to 1, rather than the sentences of each individual length summing to 1:
$\sum_{2\space word} P(\cdots) + \sum_{3\space word} P(\cdots) + \cdots = 1$
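The second motivation can be checked numerically. The sketch below enumerates every yes/no sentence of a given length and sums the bigram probabilities, using the row sum $\sum_w C(x,w)$ as the denominator since there is no end token yet. The function names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

sentences = [["<s>", "yes", "no"], ["<s>", "yes", "yes"], ["<s>", "no", "no"]]

bigram_counts = Counter()
for s in sentences:
    bigram_counts.update(zip(s, s[1:]))

# row_sums[x] = sum over w of C(x, w)
row_sums = Counter()
for (prev, _), count in bigram_counts.items():
    row_sums[prev] += count

def prob(prev, word):
    return Fraction(bigram_counts[(prev, word)], row_sums[prev])

def total_mass(length):
    """Sum of P(sentence) over all yes/no sentences with `length` words."""
    total = Fraction(0)
    for words in product(["yes", "no"], repeat=length):
        p = Fraction(1)
        for prev, word in zip(("<s>",) + words, words):
            p *= prob(prev, word)
        total += p
    return total

print(total_mass(2), total_mass(3))
```

Each length gets a total mass of 1 on its own, which is exactly the problem: there is nothing left to share between lengths.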

End of sentence token </s> -solution

The solution is to append </s> to the end of each sentence. For example, <s> the teacher drinks tea </s> has probability:
$P(the|<s>)P(teacher|the)P(drinks|teacher)P(tea|drinks)P(</s>|tea)$
Note: unlike the start token, only one </s> is needed regardless of N. For a trigram model, for example:
the teacher drinks tea => <s> <s> the teacher drinks tea </s>
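The padding rule (N-1 start tokens, one end token) is short enough to sketch directly. `pad_sentence` is a hypothetical helper name.

```python
def pad_sentence(tokens, n):
    """Pad a tokenized sentence for an n-gram model: n-1 <s> tokens, one </s>."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"]

sentence = "the teacher drinks tea".split()
print(pad_sentence(sentence, 2))
print(pad_sentence(sentence, 3))
```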

Revisiting the first motivation:
<s> Lyn drinks chocolate </s>
<s> John drinks </s>
The number of times "drinks" is followed by another token is now 2:
$\sum_w C(drinks,w) = 2$
The number of times "drinks" occurs on its own is 2:
$C(drinks) = 2$

Example-bigram

Suppose the corpus is:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
Some of the resulting bigram probabilities:
$P(John|<s>) = \cfrac{1}{3} \quad P(</s>|tea) = \cfrac{1}{1}$
$P(chocolate|eats) = \cfrac{1}{1} \quad P(Lyn|<s>) = \cfrac{2}{3}$
The probability of the first sentence is:
$P(sentence) = \cfrac{2}{3} \times \cfrac{1}{2} \times \cfrac{1}{2} \times \cfrac{2}{2} = \cfrac{1}{6}$
This is lower than the 1/3 you would get from the sentence's relative frequency among the 3 corpus sentences. The remaining probability mass is spread over other sentences that can be generated from the corpus bigrams, which is how the model generalizes.
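The worked example can be reproduced with exact fractions. This is a minimal sketch; the names `bigram_prob` and `sentence_prob` are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> Lyn drinks chocolate </s>",
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]
sentences = [line.split() for line in corpus]

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))

def bigram_prob(prev, word):
    """P(word | prev) = C(prev, word) / C(prev)."""
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])

def sentence_prob(tokens):
    """Product of the bigram probabilities along the sentence."""
    p = Fraction(1)
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

p_first = sentence_prob(sentences[0])
print(p_first)
```

Counting bigrams per sentence, rather than over one concatenated token list, avoids creating a spurious (</s>, <s>) bigram at sentence boundaries.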

Quiz

Question:
Given these conditional probabilities
P(Mary)=0.1;
P(likes)=0.2;
P(cats)=0.3
P(Mary|<s>)=0.2;
P(</s>|cats)=0.6
P(likes|Mary) =0.3;
P(cats|likes)=0.1
Approximate the probability of the following sentence with bigrams: “<s> Mary likes cats </s>”
Answer: 0.0036
Explanation: P(Mary|<s>)P(likes|Mary)P(cats|likes)P(</s>|cats)=0.2×0.3×0.1×0.6=0.0036

The N-gram Language Model

Count matrix

In the N-gram formula:
$P(w_n|w_{n-N+1}^{n-1}) = \cfrac{C(w_{n-N+1}^{n-1}, w_n)}{C(w_{n-N+1}^{n-1})}$
the numerator is: $C(w_{n-N+1}^{n-1}, w_n)$
The count matrix records how often each N-gram occurs in the corpus.
Its rows are the unique contexts in the corpus (the previous word, for a bigram model)
Its columns are the unique words in the corpus (the current word)
Bigram count matrix example:
Corpus: <s> I study I learn </s>
For instance, the bigram "study I" occurs once in the corpus, so the cell in row "study", column "I" is 1.
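A minimal sketch of building the bigram count matrix for this corpus with plain lists follows. The row and column order in `vocab` is an arbitrary choice made for display.

```python
from collections import Counter

tokens = "<s> I study I learn </s>".split()
vocab = ["<s>", "</s>", "I", "study", "learn"]  # display order is an assumption

bigram_counts = Counter(zip(tokens, tokens[1:]))

# count_matrix[i][j] = C(vocab[i], vocab[j]); row = previous word, column = current word
count_matrix = [[bigram_counts[(prev, word)] for word in vocab] for prev in vocab]

for prev, row in zip(vocab, count_matrix):
    print(f"{prev:>6}", row)
```

The row for </s> is all zeros because nothing follows the end token in this corpus.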

Probability matrix

The count matrix gives the numerators. Dividing by the denominators yields the probability matrix:
Divide each cell by its row sum
$sum(row) = \sum_{w \in V} C(w_{n-N+1}^{n-1}, w) = C(w_{n-N+1}^{n-1})$
First compute the sum of each row of the count matrix, then divide every cell by its row sum to obtain the probability matrix.
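Row normalization can be sketched as follows for the corpus <s> I study I learn </s>. The lookup helper `P` and the decision to leave the all-zero </s> row as zeros are assumptions of this sketch.

```python
from collections import Counter
from fractions import Fraction

tokens = "<s> I study I learn </s>".split()
vocab = ["<s>", "</s>", "I", "study", "learn"]
bigram_counts = Counter(zip(tokens, tokens[1:]))
count_matrix = [[bigram_counts[(prev, word)] for word in vocab] for prev in vocab]

prob_matrix = []
for row in count_matrix:
    total = sum(row)
    # The </s> row has no successors; keep zeros rather than divide by 0.
    prob_matrix.append([Fraction(c, total) if total else Fraction(0) for c in row])

def P(word, prev):
    """Look up P(word | prev) in the probability matrix."""
    return prob_matrix[vocab.index(prev)][vocab.index(word)]

print(P("learn", "I"), P("study", "I"))
```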

Language model

With the probability matrix, the language model can compute:
○ Sentence probability
○ Next word prediction
For example, using the probability matrix from the previous section, the probability of the sentence <s> I learn </s> is:
$P(sentence) = P(I|<s>)P(learn|I)P(</s>|learn) = 1 \times 0.5 \times 1 = 0.5$

Log probability

Here again many probabilities are multiplied together, so we work with logarithms to prevent numerical underflow.
$P(w_1^n) \approx \prod_{i=1}^{n} P(w_i|w_{i-1})$
Taking the logarithm:
$\log(P(w_1^n)) \approx \sum_{i=1}^{n} \log(P(w_i|w_{i-1}))$
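The underflow problem is easy to see in floating point. The sketch below compares a direct product with a sum of logs. The list `many` is a made-up sequence of 100 small probabilities used only to trigger underflow.

```python
import math

def log_sentence_prob(bigram_probs):
    """Sum of log P(w_i | w_{i-1}), i.e. the log of the product."""
    return sum(math.log(p) for p in bigram_probs)

# Factors of the <s> I learn </s> example: 1 * 0.5 * 1
log_p = log_sentence_prob([1.0, 0.5, 1.0])

# Hypothetical long sequence: the product underflows to 0.0,
# while the sum of logs stays a perfectly ordinary float.
many = [1e-5] * 100
product = math.prod(many)
log_sum = log_sentence_prob(many)

print(math.exp(log_p), product, log_sum)
```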

Generative Language model

The generation algorithm proceeds roughly as follows:

  1. Choose sentence start
  2. Choose next bigram starting with previous word
  3. Continue until </s> is picked
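The three steps above can be sketched as a sampling loop. This is an illustration under stated assumptions: the small Lyn/John corpus is reused as training data, `generate` is a hypothetical function name, and `max_len` is a safety cap that is not part of the original algorithm.

```python
import random
from collections import Counter, defaultdict

corpus = [
    "<s> Lyn drinks chocolate </s>",
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]

# successors[prev] counts the words observed right after prev.
successors = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev][word] += 1

def generate(rng, max_len=20):
    """Sample from P(word | prev) until </s> is drawn or max_len is reached."""
    sentence = ["<s>"]
    while sentence[-1] != "</s>" and len(sentence) < max_len:
        options = successors[sentence[-1]]
        words = list(options)
        weights = list(options.values())  # counts are proportional to P(word | prev)
        sentence.append(rng.choices(words, weights=weights)[0])
    return sentence

print(" ".join(generate(random.Random(0))))
```

Sampling proportionally to the bigram counts is equivalent to sampling from the rows of the probability matrix, so the model can produce sentences such as "John drinks chocolate" that never occur in the corpus.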

Language Model Evaluation

Test data

Split        Smaller corpora    Large corpora (typical for text)
Train        80%                98%
Validation   10%                1%
Test         10%                1%

● Split method
The split can be made on continuous text or on random short sequences.

Perplexity

Perplexity measures how well a language model predicts text and is a standard metric for evaluating language models. It is defined as:

$PP(W) = P(w_1, w_2, ..., w_m)^{-\frac{1}{m}}$
where:
$P(w_1, w_2, ..., w_m)$ is the probability the language model assigns to the observed word sequence $W$
$m$ is the total number of words in the sequence.
By the chain rule, $P(w_1, w_2, ..., w_m)$ expands to:

$P(w_1, w_2, ..., w_m) = \prod_{i=1}^{m} P(w_i | w_1, w_2, ..., w_{i-1})$
where:
$w_i$ is the $i$-th word of the sequence.
$P(w_i | w_1, w_2, ..., w_{i-1})$ is the probability of the $i$-th word given the previous $i-1$ words.
The exponent $-\frac{1}{m}$ makes perplexity the reciprocal of the geometric mean of the per-word probabilities. The geometric mean is the $m$-th root of the product of all the probabilities, and taking the reciprocal means that a higher average probability gives a lower perplexity.
The lower the perplexity, the better the model predicts the data, that is, the less "perplexed" it is by the word sequence. In practice, a model with low perplexity predicts the next word better and tends to generate more natural, coherent sentences.
Smaller perplexity = better model
Character level models PP < word based models PP

Perplexity for bigram models

$PP(W) = \sqrt[m]{\prod_{i=1}^{m} \prod_{j=1}^{|s_i|} \cfrac{1}{P(w_j^{(i)} | w_{j-1}^{(i)})}}$
$w_j^{(i)}$ is the $j$-th word of the $i$-th sentence

Concatenate all sentences in W.
The perplexity of the bigram model is then the product of the bigram probabilities over the whole test set, raised to the power $-1/m$:
$PP(W) = \sqrt[m]{\prod_{i=1}^{m} \cfrac{1}{P(w_i | w_{i-1})}}$
$w_i$ is the $i$-th word of the test set

Log perplexity

As before, taking the logarithm turns the product into a sum. The exponent $-\frac{1}{m}$ becomes a factor in front of the sum, so the sign is negative:
$\log_2 PP(W) = -\cfrac{1}{m}\sum_{i=1}^{m} \log_2(P(w_i | w_{i-1}))$
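A minimal sketch of bigram perplexity computed in log space follows. The probabilities in `probs` are hypothetical values for a 4-word test sequence, chosen so the result is easy to verify by hand.

```python
import math

def perplexity(bigram_probs):
    """PP(W) = (product of P(w_i | w_{i-1})) ** (-1/m), computed via log2."""
    m = len(bigram_probs)
    log_pp = -sum(math.log2(p) for p in bigram_probs) / m
    return 2 ** log_pp

# Hypothetical bigram probabilities of a 4-word test sequence.
probs = [0.5, 0.25, 0.5, 0.25]
pp = perplexity(probs)
print(pp)
```

Here the product is 1/64, so PP = 64^(1/4) = 2^1.5. A model that assigns probability 0.1 to every word has perplexity 10, which matches the reading of perplexity as the reciprocal of the average per-word probability.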

Example

Training 38 million words, test 1.5 million words, WSJ corpus
Perplexity Unigram: 962 Bigram: 170 Trigram: 109
The WSJ corpus (Wall Street Journal Corpus) is a widely used text corpus built from Wall Street Journal articles. It is well known in NLP, particularly for training and evaluating language models.

Out of Vocabulary Words

Out of vocabulary words

Closed vs. Open vocabularies
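A closed vocabulary simplifies text processing but may sacrifice the ability to handle new words. An open vocabulary is more flexible and adapts better to varied language use, but may increase model complexity and computational cost.
Closed Vocabularies:
In a closed-vocabulary system, a fixed vocabulary is defined before training. It contains every word or token the model will use during training and prediction.
Any word that is not in the vocabulary is usually ignored or replaced by a special unknown token (such as <UNK>).
A closed vocabulary reduces model complexity, because it limits the number of words the model has to learn and predict.
Its drawback is that it handles new or rare words outside the vocabulary poorly, which can hurt the model's understanding of new text.

Open Vocabularies:
An open-vocabulary system does not limit the words the model can handle. The model can process any word it encounters, whether or not that word appeared in the training data.
In this setting, models usually rely on subword segmentation techniques such as Byte Pair Encoding (BPE) or WordPiece to handle words that are not in the training set.
An open vocabulary copes better with varied text, including technical terms and newly coined words, because these are not simply replaced by an unknown token.
However, it may increase model complexity, because the model has to learn more units and the relations between them.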

Unknown word = Out of vocabulary word (OOV)
special tag <UNK> in corpus and in input

Using <UNK> in corpus

Steps:
● Create vocabulary V
● Replace any word in the corpus that is not in V by <UNK>
● Count the probabilities with <UNK> as with any other word
Example:
Corpus
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
Require a word to occur at least twice to enter the vocabulary: Min frequency f=2
<s> Lyn drinks chocolate </s>
<s> <UNK> drinks <UNK> </s>
<s> Lyn <UNK> chocolate </s>
The resulting vocabulary is:
Vocabulary
Lyn, drinks, chocolate
At query time, any input word that is not in the vocabulary is also replaced by <UNK>:
<s> Adam drinks chocolate </s>
<s> <UNK> drinks chocolate </s>
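The preprocessing steps above can be sketched as follows. `build_vocab`, `replace_oov` and the `SPECIAL` set are hypothetical names. The sketch assumes that the sentence markers are never counted as vocabulary words and are never replaced.

```python
from collections import Counter

SPECIAL = {"<s>", "</s>"}  # sentence markers are kept as-is

def build_vocab(sentences, min_freq):
    """Keep the words that occur at least min_freq times."""
    counts = Counter(w for s in sentences for w in s if w not in SPECIAL)
    return {w for w, c in counts.items() if c >= min_freq}

def replace_oov(tokens, vocab):
    """Replace every out-of-vocabulary word by <UNK>."""
    return [w if w in vocab or w in SPECIAL else "<UNK>" for w in tokens]

corpus = [
    "<s> Lyn drinks chocolate </s>",
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]
sentences = [line.split() for line in corpus]

vocab = build_vocab(sentences, min_freq=2)
processed = [replace_oov(s, vocab) for s in sentences]
query = replace_oov("<s> Adam drinks chocolate </s>".split(), vocab)

print(sorted(vocab))
print(" ".join(query))
```

The same `replace_oov` call is applied to the training corpus and to the query, so both are expressed over the same vocabulary.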

How to create vocabulary V

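Two criteria:

  1. Minimum word frequency f: words that occur at least f times enter the vocabulary, and all other words are replaced by <UNK>
  2. Maximum vocabulary size $|V|$: sort the words by frequency, keep the $|V|$ most frequent ones, and replace the rest by <UNK>

<UNK> lowers perplexity, but it should be used sparingly: with too many <UNK> replacements, the generated sentences will contain many <UNK> tokens.
When comparing perplexities, only compare LMs with the same V.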

Quiz

Given the training corpus and minimum word frequency=2, what would the vocabulary for the corpus
preprocessed with <UNK> look like?
“<s> I am happy I am learning </s> <s> I am happy I can study </s>”
Answer:
V = (I,am,happy)

Smoothing

Missing N-grams in training corpus

Problem: N-grams made of known words still might be missing in the training corpus
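That is, how do we assign a probability to an N-gram whose words all appear in the corpus, but which itself never occurs?
For example, the corpus contains "John" and "eats" but not "John eats". The count of "John eats" is then 0, its bigram probability is 0, and the probability of any sentence containing it is also 0.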

Smoothing

Add-one smoothing (Laplacian smoothing)

$P(w_n|w_{n-1}) = \cfrac{C(w_{n-1}, w_n) + 1}{\sum_{w \in V}(C(w_{n-1}, w) + 1)} = \cfrac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + |V|}$
where $|V|$ is the vocabulary size. Add-one smoothing should only be used when the corpus is large enough that the added counts do not swamp the real ones. Otherwise the probability given to missing N-grams is too high.
With a very large corpus, add-k smoothing can be used instead (it also applies to higher-order models such as 3-grams and 4-grams):
$P(w_n|w_{n-1}) = \cfrac{C(w_{n-1}, w_n) + k}{\sum_{w \in V}(C(w_{n-1}, w) + k)} = \cfrac{C(w_{n-1}, w_n) + k}{C(w_{n-1}) + k \times |V|}$
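Add-k smoothing can be sketched on the small Lyn/John corpus used earlier. Two assumptions of this sketch: the sentence markers are counted as vocabulary entries, so $|V| = 8$, and `add_k_prob` is an illustrative name. With `k=1` it reduces to add-one smoothing.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> Lyn drinks chocolate </s>",
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]
sentences = [line.split() for line in corpus]
unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))
vocab = set(unigram_counts)  # assumption: <s> and </s> count as vocabulary words

def add_k_prob(prev, word, k=1):
    """(C(prev, word) + k) / (C(prev) + k * |V|)."""
    k = Fraction(k)
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * len(vocab))

print(add_k_prob("John", "eats"))        # unseen bigram, no longer 0
print(add_k_prob("Lyn", "drinks"))
```

The unseen bigram "John eats" now gets 1/9, and the probabilities of all words following "John" still sum to 1.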
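Advanced methods:
Kneser-Ney Smoothing:
Kneser-Ney smoothing was proposed by Reinhard Kneser and Hermann Ney as a smoothing technique for conditional probability distributions.
It subtracts a fixed discount from the counts of seen N-grams and redistributes the freed probability mass to low-frequency and unseen N-grams, which improves the model's ability to generalize.
It takes context into account: the lower-order distribution that receives this mass is based on how many distinct contexts a word appears in, not just on its raw frequency.
It is particularly suitable for large corpora, because it makes effective use of the statistics in the corpus.

Good-Turing Smoothing:
Good-Turing smoothing was proposed by I. J. Good to estimate the probability of items that never appear in the corpus.
It relies on the frequency of frequencies: the total probability of unseen items is estimated from the number of items that occur exactly once.
It moves probability mass from observed items to rarer ones, and in particular to those that never occur in the training corpus.

Good-Turing is simple and computationally efficient, but less flexible than Kneser-Ney, because it does not distinguish between words in different contexts.
Both methods have strengths and limitations. Kneser-Ney smoothing usually performs better in practice because it takes context into account, at a higher computational cost. Good-Turing smoothing is sometimes preferred for its simplicity and efficiency, especially when resources are limited.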

Backoff

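If an N-gram is missing, use the (N-1)-gram instead, and so on down to lower orders. There are two ways to back off:
The first uses probability discounting, e.g. Katz backoff: the lower-order estimate replaces the missing one, using probability mass discounted from the seen N-grams.
The second is "Stupid" backoff: the lower-order estimate is multiplied by a constant (0.4 works well) and used in place of the missing one.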
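"Stupid" backoff can be sketched recursively. The function names and the choice to count N-grams over the concatenated corpus are assumptions of this sketch. The returned value is a score rather than a normalized probability.

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Counts of all n-grams for n = 1 .. max_n, keyed by tuple."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def stupid_backoff(word, context, counts, total_words, alpha=0.4):
    """Relative frequency if the n-gram was seen, else alpha times the shorter-context score."""
    context = tuple(context)
    if not context:
        return counts[(word,)] / total_words
    if counts[context + (word,)] > 0:
        return counts[context + (word,)] / counts[context]
    return alpha * stupid_backoff(word, context[1:], counts, total_words, alpha)

tokens = ("<s> Lyn drinks chocolate </s> <s> John drinks tea </s> "
          "<s> Lyn eats chocolate </s>").split()
counts = ngram_counts(tokens, 3)

# The trigram "John drinks chocolate" is unseen, so the bigram "drinks chocolate" is used.
score = stupid_backoff("chocolate", ["John", "drinks"], counts, len(tokens))
print(score)
```

Here the bigram estimate is 1/2, so the backed-off score is 0.4 × 0.5 = 0.2.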

Interpolation

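Interpolation is another technique for smoothing language models. It combines the outputs of several models or distributions, which reduces overfitting to sparse higher-order counts and improves generalization. The most common form is linear interpolation, which takes a weighted average of the probabilities given by the different models.
The weights $\lambda$ sum to 1 and can be determined by training.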
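For a trigram model, linear interpolation is commonly written as a weighted sum of the trigram, bigram and unigram estimates. The sketch below assumes that form. The default weights and the input probabilities are made-up values for illustration, not tuned ones.

```python
def interpolate(p_trigram, p_bigram, p_unigram, lambdas=(0.7, 0.2, 0.1)):
    """Weighted average of trigram, bigram and unigram estimates."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

# Hypothetical estimates for one word whose trigram was never seen.
p = interpolate(p_trigram=0.0, p_bigram=0.5, p_unigram=0.1)
print(p)
```

Even though the trigram estimate is 0, the interpolated probability stays positive because the lower-order models contribute.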
