Lecture: C2W2. Part-of-Speech (POS) Tagging and Hidden Markov Models
Table of Contents
- 0 Data Sources
- 1 POS Tagging
- 1.1 Training
- Transition counts
- Emission counts
- Tag counts
- Exercise 01
- 1.2 Testing
- Exercise 02
This assignment develops part-of-speech (POS) tagging: the process of assigning a part-of-speech tag (noun, verb, adjective, ...) to each word in an input text. Tagging is difficult because some words can take different tags in different contexts, for example:
- The whole team played well. [adverb]
- You are doing well for yourself. [adjective]
- Well, this assignment took me forever to complete. [interjection]
- The well is dry. [noun]
- Tears were beginning to well in her eyes. [verb]
The POS tagging task helps with understanding the meaning of a sentence. It is essential for search queries: identifying proper nouns, organizations, stock tickers, and the like greatly improves everything from speech recognition to search. This assignment covers:
- Understanding how part-of-speech tagging works
- Computing the transition matrix A of a Hidden Markov Model
- Computing the emission matrix B of a Hidden Markov Model
- Implementing the Viterbi algorithm
- Computing the accuracy of the model
First, import the packages:
# Importing packages and loading in the data set
from utils_pos import get_word_tag, preprocess
import pandas as pd
from collections import defaultdict
import math
import numpy as np
import w2_unittest
0 Data Sources
Two tagged datasets from the Wall Street Journal (WSJ) will be used; the meaning of each POS tag can be found here.
- One dataset (WSJ_02-21.pos) is used for training.
- The other dataset (WSJ_24.pos) is used for testing.
- The tagged training data was preprocessed to form a vocabulary (hmm_vocab.txt; see the bundled download for details).
- The words in the vocabulary are those used two or more times in the training set (a sketch of how such a vocabulary could be built follows this list).
- A set of 'unknown word tokens' was also added to the vocabulary, described in more detail below.
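As a minimal sketch of how such a vocabulary could be assembled (the actual hmm_vocab.txt ships with the assignment, and this assumes training_corpus has already been loaded as shown further below, so it is illustrative only):
from collections import Counter

# Count every word in the tagged training corpus (lines look like 'In\tIN\n')
word_counts = Counter(line.split('\t')[0] for line in training_corpus if line.split())

# Keep only the words seen two or more times
vocab_words = sorted(w for w, c in word_counts.items() if c >= 2)

# Append the special unknown-word tokens (the full set appears in the vocabulary printout below)
vocab_words += ["--unk--", "--unk_adj--", "--unk_adv--", "--unk_digit--",
                "--unk_noun--", "--unk_punct--", "--unk_upper--", "--unk_verb--"]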
The training set is used to create the emission, transition, and tag counts.
The test set (WSJ_24.pos) is used to create y:
- It contains the test text along with the true labels.
- The test set was also preprocessed to remove the tags, producing test.words (downloadable).
- After reading in that text, it is processed further with functions provided in utils_pos.py to identify the ends of sentences and to handle words that are not in the vocabulary.
- The result is the list prep, the preprocessed text used to test the POS tagger.
A POS tagger will encounter words that are not in its dataset.
- To improve accuracy, such words are analyzed further during preprocessing to extract any available hints about their appropriate tag.
- For example, the suffix 'ize' suggests that a word is a verb, as in 'final-ize' or 'character-ize'.
- Custom unknown tokens such as '--unk_verb--' or '--unk_noun--' replace the unknown words in both the training and test corpora; a simplified sketch of the idea follows this list.
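The real replacement logic lives in the helpers imported from utils_pos.py; the sketch below only conveys the idea, and its suffix lists are invented for illustration, not the actual rules:
def assign_unk_sketch(word):
    """Map an out-of-vocabulary word to a coarse unknown token (illustrative only)."""
    if any(ch.isdigit() for ch in word):
        return "--unk_digit--"
    if any(word.endswith(suf) for suf in ("ize", "ate", "ify", "ing", "ed")):
        return "--unk_verb--"
    if any(word.endswith(suf) for suf in ("tion", "ment", "ness", "ity")):
        return "--unk_noun--"
    if word[0].isupper():
        return "--unk_upper--"
    return "--unk--"

print(assign_unk_sketch("characterize"))   # --unk_verb--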
Load the training corpus:
# load in the training corpus
with open("./data/WSJ_02-21.pos", 'r') as f:
    training_corpus = f.readlines()
print(f"A few items of the training corpus list")
print(training_corpus[0:5])
Output:
A few items of the training corpus list
['In\tIN\n', 'an\tDT\n', 'Oct.\tNNP\n', '19\tCD\n', 'review\tNN\n']
# read the vocabulary data, split by each line of text, and save the list
with open("./data/hmm_vocab.txt", 'r') as f:
    voc_l = f.read().split('\n')
print("A few items of the vocabulary list")
print(voc_l[0:50])
print()
print("A few items at the end of the vocabulary list")
print(voc_l[-50:])
Output:
A few items of the vocabulary list
['!', '#', '$', '%', '&', "'", "''", "'40s", "'60s", "'70s", "'80s", "'86", "'90s", "'N", "'S", "'d", "'em", "'ll", "'m", "'n'", "'re", "'s", "'til", "'ve", '(', ')', ',', '-', '--', '--n--', '--unk--', '--unk_adj--', '--unk_adv--', '--unk_digit--', '--unk_noun--', '--unk_punct--', '--unk_upper--', '--unk_verb--', '.', '...', '0.01', '0.0108', '0.02', '0.03', '0.05', '0.1', '0.10', '0.12', '0.13', '0.15']
A few items at the end of the vocabulary list
['yards', 'yardstick', 'year', 'year-ago', 'year-before', 'year-earlier', 'year-end', 'year-on-year', 'year-round', 'year-to-date', 'year-to-year', 'yearlong', 'yearly', 'years', 'yeast', 'yelled', 'yelling', 'yellow', 'yen', 'yes', 'yesterday', 'yet', 'yield', 'yielded', 'yielding', 'yields', 'you', 'young', 'younger', 'youngest', 'youngsters', 'your', 'yourself', 'youth', 'youthful', 'yuppie', 'yuppies', 'zero', 'zero-coupon', 'zeroing', 'zeros', 'zinc', 'zip', 'zombie', 'zone', 'zones', 'zoning', '{', '}', '']
Create a dictionary where the keys are words and the values are unique integer indices:
# vocab: dictionary that has the index of the corresponding words
vocab = {}

# Get the index of the corresponding words.
for i, word in enumerate(sorted(voc_l)):
    vocab[word] = i

print("Vocabulary dictionary, key is the word, value is a unique integer")
cnt = 0
for k, v in vocab.items():
    print(f"{k}:{v}")
    cnt += 1
    if cnt > 20:
        break
Output:
Vocabulary dictionary, key is the word, value is a unique integer
:0
!:1
#:2
$:3
%:4
&:5
':6
'':7
'40s:8
'60s:9
'70s:10
'80s:11
'86:12
'90s:13
'N:14
'S:15
'd:16
'em:17
'll:18
'm:19
'n':20
Load the test corpus:
# load in the test corpus
with open("./data/WSJ_24.pos", 'r') as f:
    y = f.readlines()
print("A sample of the test corpus")
print(y[0:10])
Output:
A sample of the test corpus
['The\tDT\n', 'economy\tNN\n', "'s\tPOS\n", 'temperature\tNN\n', 'will\tMD\n', 'be\tVB\n', 'taken\tVBN\n', 'from\tIN\n', 'several\tJJ\n', 'vantage\tNN\n']
As you can see, the test corpus carries POS tags. Now strip the tags so they can be predicted:
#corpus without tags, preprocessed
_, prep = preprocess(vocab, "./data/test.words")
print('The length of the preprocessed test corpus: ', len(prep))
print('This is a sample of the test_corpus: ')
print(prep[0:10])
Output:
The length of the preprocessed test corpus: 34199
This is a sample of the test_corpus:
['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken', 'from', 'several', '--unk--']
1 POS Tagging
1.1 Training
Start with the simple case: words that do not have multiple POS tags.
- For example, 'is' is a verb; it has no other POS tag.
- In the WSJ corpus, 86% of the POS tokens are unambiguous (i.e., they have only one tag).
- About 14% are ambiguous (i.e., they have more than one tag).
Before predicting any POS tags, build the following three dictionaries.
Transition counts
The transition_counts dictionary counts the number of times each POS tag appears next to another tag.

$$P(t_i \mid t_{i-1}) \tag{1}$$

This is the probability of the tag at position $i$ given the tag at position $i-1$.
To compute formula (1), create a transition_counts dictionary where
- the keys are (prev_tag, tag)
- the values are the number of times those two tags appear in that order (an illustrative lookup follows this list).
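As a quick illustration (not part of the graded exercise), once transition_counts and tag_counts have been filled in by Exercise 01 below, an unsmoothed estimate of formula (1) is just the pair count divided by the count of the previous tag; the tag pair here is an arbitrary example:
# Unsmoothed maximum-likelihood estimate of P(tag | prev_tag)
prev_tag, tag = 'DT', 'NN'   # arbitrary example pair
p = transition_counts[(prev_tag, tag)] / tag_counts[prev_tag]
print(f"P({tag} | {prev_tag}) = {p:.4f}")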
Emission counts
The emission_counts dictionary counts how often a word occurs given a particular POS tag:

$$P(w_i \mid t_i) \tag{2}$$

In this dictionary
- the keys are (tag, word)
- the values are the number of times that word-tag pair appears in the training set.
Tag counts
The last dictionary is the tag counts:
- the keys are the tags
- the values are the number of times each tag appears.
Exercise 01
Write the function create_dictionaries, which takes in training_corpus and returns the three dictionaries mentioned above: transition_counts, emission_counts, and tag_counts.
The function uses defaultdict, a subclass of dict.
- A standard Python dictionary throws a KeyError if you try to access a key that is not currently in the dictionary.
- In contrast, defaultdict creates an item of the type passed as its argument; in this function, that is an int with a default value of 0 (see the short demo below).
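A short self-contained demo of the difference (illustrative, not part of the assignment):
from collections import defaultdict

counts = defaultdict(int)        # missing keys default to int() == 0
counts[('NN', 'dog')] += 1       # the key is created with value 0, then incremented
print(counts[('NN', 'dog')])     # 1
print(counts[('VB', 'jump')])    # 0 -- no KeyError, unlike a plain dict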
# UNQ_C1 GRADED FUNCTION: create_dictionaries
def create_dictionaries(training_corpus, vocab, verbose=True):
    """
    Input:
        training_corpus: a corpus where each line has a word followed by its tag.
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output:
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
        tag_counts: a dictionary where the keys are the tags and the values are the counts
    """
    # initialize the dictionaries using defaultdict
    emission_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    tag_counts = defaultdict(int)

    # Initialize "prev_tag" (previous tag) with the start state, denoted by '--s--'
    prev_tag = '--s--'

    # use 'i' to track the line number in the corpus
    i = 0

    # Each item in the training corpus contains a word and its POS tag
    # Go through each word and its tag in the training corpus
    for word_tag in training_corpus:
        # Increment the word_tag count
        i += 1
        # Every 50,000 words, print the word count
        if i % 50000 == 0 and verbose:
            print(f"word count = {i}")

        ### START CODE HERE ###
        # get the word and tag using the get_word_tag helper function (imported from utils_pos.py)
        # the function is defined as: get_word_tag(line, vocab)
        word, tag = get_word_tag(word_tag, vocab)

        # Increment the transition count for the (previous tag, current tag) pair
        transition_counts[(prev_tag, tag)] += 1

        # Increment the emission count for the (tag, word) pair
        emission_counts[(tag, word)] += 1

        # Increment the tag count
        tag_counts[tag] += 1

        # Set the previous tag to this tag (for the next iteration of the loop)
        prev_tag = tag
        ### END CODE HERE ###

    return emission_counts, transition_counts, tag_counts
Run it:
emission_counts, transition_counts, tag_counts = create_dictionaries(training_corpus, vocab)
Output:
word count = 50000
word count = 100000
word count = 150000
word count = 200000
word count = 250000
word count = 300000
word count = 350000
word count = 400000
word count = 450000
word count = 500000
word count = 550000
word count = 600000
word count = 650000
word count = 700000
word count = 750000
word count = 800000
word count = 850000
word count = 900000
word count = 950000
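As an optional sanity check (not required by the assignment), every line of the training corpus increments exactly one tag count, so the tag counts should sum to the number of lines processed:
# Each corpus line contributes exactly one tag count
print(sum(tag_counts.values()) == len(training_corpus))   # expected: True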
Print all the POS tags to take a look:
# get all the POS states
states = sorted(tag_counts.keys())
print(f"Number of POS tags (number of 'states'): {len(states)}")
print("View these POS tags (states)")
print(states)
Output:
Number of POS tags (number of 'states'): 46
View these POS tags (states)
['#', '$', "''", '(', ')', ',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']
The POS tags above were extracted from the training set. They include auxiliary tags such as '--s--', which marks the start of a sentence.
Print a few transition examples, emission examples, and an example of a word with multiple POS tags:
print("transition examples: ")
for ex in list(transition_counts.items())[:3]:
print(ex)
print()
print("emission examples: ")
for ex in list(emission_counts.items())[200:203]:
print (ex)
print()
print("ambiguous word example: ")
for tup,cnt in emission_counts.items():
if tup[1] == 'back': print (tup, cnt)
Output:
transition examples:
(('--s--', 'IN'), 5050)
(('IN', 'DT'), 32364)
(('DT', 'NNP'), 9044)
emission examples:
(('DT', 'any'), 721)
(('NN', 'decrease'), 7)
(('NN', 'insider-trading'), 5)
ambiguous word example:
('RB', 'back') 304
('VB', 'back') 20
('RP', 'back') 84
('JJ', 'back') 25
('NN', 'back') 29
('VBP', 'back') 4
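The output above shows that 'back' occurs with six different tags. A baseline tagger can simply pick the most frequent one; the short sketch below previews the core idea behind Exercise 02:
# Pick the most frequently observed tag for an ambiguous word
word = 'back'
candidates = {tag: cnt for (tag, w), cnt in emission_counts.items() if w == word}
best_tag = max(candidates, key=candidates.get)
print(best_tag, candidates[best_tag])   # RB 304, matching the counts above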
1.2 Testing
Use the emission_counts dictionary to test the accuracy of the POS tagger.
- Given the preprocessed test corpus prep, assign a POS tag to every word in that corpus.
- Then, using the original tagged test corpus y, compute the percentage of tags you got correct.
Exercise 02
Implement the function predict_pos, which computes the accuracy of the model.
- To assign a POS tag to a word, pick the POS that occurs most frequently with that word in the training set.
- Then evaluate how well this approach works: each time you predict the most frequent POS for a given word, check whether the word's actual POS is the same. If it is, the prediction was correct!
- Compute the accuracy as the number of correct predictions divided by the total number of words for which you predicted a POS tag.
The purpose of predict_pos is to walk through a preprocessed word sequence, predict each word's POS from the emission counts, and compare the predictions against the actual labels to compute the accuracy. Step by step:
Input parameters:
- prep: the preprocessed word list; each element is a word.
- y: the original corpus, a list of (word, POS) tuples.
- emission_counts: a dictionary whose keys are (tag, word) tuples and whose values are counts of how often the word appears under that tag.
- vocab: the vocabulary, a dictionary whose keys are words and whose values are indices.
- states: a sorted list of all possible POS tags.
Output:
- accuracy: the fraction of words whose predicted POS matches the true label.
Function logic:
- Initialize the number of correct predictions num_correct to 0.
- Collect all the (tag, word) keys from emission_counts.keys() into the set all_words (this set is not used directly afterwards).
- Compute total, the number of (word, POS) pairs in the corpus y.
- Iterate over each word in prep zipped with the corresponding tuple in y.
- Check that each tuple contains both a word and a POS; skip it otherwise.
- For each word in prep:
  - Check whether the word is in the vocabulary vocab.
  - If it is, loop over all possible POS tags in states.
  - For each tag, build a (tag, word) key and check whether it exists in the emission_counts dictionary.
  - If it exists, fetch that key's count.
  - If the count is larger than the current maximum count_final, update count_final and the predicted tag pos_final.
- If the predicted tag pos_final matches the actual label true_label, increment num_correct.
- Compute the accuracy as the number of correct predictions divided by the total number of words.
- Return the accuracy.
Question to ponder: why is count_final needed?¹
# UNQ_C2 GRADED FUNCTION: predict_pos
def predict_pos(prep, y, emission_counts, vocab, states):
    '''
    Input:
        prep: a preprocessed version of 'y'. A list with the 'word' component of the tuples (untagged, preprocessed words).
        y: a corpus composed of a list of tuples where each tuple consists of (word, POS)
        emission_counts: a dictionary where the keys are (tag, word) tuples and the value is the count
        vocab: a dictionary where keys are words in vocabulary and value is an index
        states: a sorted list of all possible tags for this assignment
    Output:
        accuracy: fraction of words you classified correctly
    '''
    # Initialize the number of correct predictions to zero
    num_correct = 0

    # Get the (tag, word) tuples, stored as a set
    all_words = set(emission_counts.keys())

    # Get the number of (word, POS) tuples in the corpus 'y'
    total = len(y)

    for word, y_tup in zip(prep, y):
        # Split the (word, POS) string into a list of two items
        y_tup_l = y_tup.split()

        # Verify that y_tup contains both word and POS
        if len(y_tup_l) == 2:
            # Set the true POS label for this word
            true_label = y_tup_l[1]
        else:
            # If y_tup didn't contain both word and POS, go to the next word
            continue

        count_final = 0
        pos_final = ''

        # If the word is in the vocabulary...
        if word in vocab:
            for pos in states:
                ### START CODE HERE ###
                # define the key as the tuple containing the POS and word
                key = (pos, word)

                # check if the (pos, word) key exists in the emission_counts dictionary
                if key in emission_counts:
                    # get the emission count of the (pos, word) tuple
                    count = emission_counts[key]

                    # keep track of the POS with the largest count
                    if count > count_final:
                        # update the final count (largest count)
                        count_final = count
                        # update the final POS
                        pos_final = pos

            # If the final POS (with the largest count) matches the true POS:
            if pos_final == true_label:
                # Update the number of correct predictions
                num_correct += 1
            ### END CODE HERE ###

    accuracy = num_correct / total

    return accuracy
Test it:
accuracy_predict_pos = predict_pos(prep, y, emission_counts, vocab, states)
print(f"Accuracy of prediction using predict_pos is {accuracy_predict_pos:.4f}")
Output:
Accuracy of prediction using predict_pos is 0.8889
Not bad! Next, a Hidden Markov Model will be used to push the accuracy above 95%.
¹ Because a word can have more than one POS tag, the prediction is made with the tag observed most often for that word.