Week 12: Machine Learning

Abstract

I. Unsupervised learning

II. Word embedding

III. Transformer

1. Applications

2. Encoder

3. Decoder

IV. Types of attention

1. The most common types

2. Other types

3. Recap

Summary

Abstract

This week continued the machine learning course. It first introduced the concepts of supervised and unsupervised learning: the classification and regression tasks done previously are both supervised learning, while this week focuses on unsupervised learning and puts it into practice with word embedding. It then takes a closer look at the structure of the transformer, covering how the encoder and decoder work and listing several machine learning tasks that a seq2seq model can be applied to. Finally, it summarizes several basic and combined types of attention.

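I. Unsupervised learning

1. Clustering

Clustering: partition a dataset into groups according to some criterion, so that objects in the same group are highly similar while objects in different groups differ substantially. Neither the groups themselves nor the criterion for forming them is known before training; the machine has to learn both on its own.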
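As a minimal sketch of the idea, the snippet below runs plain k-means on a handful of 1-D points. k-means is only one possible clustering algorithm and is chosen here as an assumption for illustration; the function name `kmeans_1d`, the sample points, and `k=2` are all hypothetical. Note that no labels are given: the groups emerge from the distances alone.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Cluster 1-D points into k groups with plain k-means (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k of the points as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

The only "criterion" supplied is the distance function; which points end up together is discovered by the algorithm.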

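2. Anomaly detection

3. Dimensionality reduction

II. Word embedding

Word embedding: a dense vector of real numbers, with one vector for each word in the vocabulary.

Problem (with the simpler one-hot representation): the vectors take up a lot of space, and they carry no relationship between words.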
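The problem can be seen in a few lines. This is a minimal sketch assuming the sparse representation in question is one-hot encoding; the four-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration.

```python
def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocab = ["hello", "world", "mathematician", "physicist"]
word_to_ix = {w: i for i, w in enumerate(vocab)}
vectors = {w: one_hot(i, len(vocab)) for w, i in word_to_ix.items()}

# Space: every word needs a vector as long as the whole vocabulary.
print(len(vectors["hello"]))  # 4

# No relationships: any two different words have dot product 0,
# so related words look exactly as unrelated as any other pair.
print(dot(vectors["mathematician"], vectors["physicist"]))  # 0.0
print(dot(vectors["hello"], vectors["hello"]))              # 1.0
```

With a realistic vocabulary of tens of thousands of words, each vector would have tens of thousands of entries, almost all of them zero.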

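Solution: the distributional hypothesis, which makes it possible to measure similarity between sentences with similar meanings.

Distributional hypothesis: a basic linguistic assumption that the meaning of a word is shaped by the words around it. Words that appear in similar contexts are also semantically related.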
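A rough way to see the hypothesis at work is to describe each word by counts of its neighbouring words and compare those count vectors. The sketch below is not how learned embeddings are trained; it is a simple co-occurrence count under assumed settings (a toy corpus invented for this example, a context window of 1, and the hypothetical helpers `context_counts` and `cosine`).

```python
from collections import Counter
import math

# Toy corpus, invented for illustration.
sentences = [
    "the mathematician ran to the store",
    "the physicist ran to the store",
    "the mathematician solved the open problem",
    "the physicist solved the open problem",
    "the cat sat on the mat",
]

def context_counts(sentences, window=1):
    """For each word, count the words that appear within `window` positions of it."""
    counts = {}
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            ctx = counts.setdefault(w, Counter())
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    ctx[words[j]] += 1
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

counts = context_counts(sentences)
sim_related = cosine(counts["mathematician"], counts["physicist"])
sim_unrelated = cosine(counts["mathematician"], counts["cat"])
print(sim_related > sim_unrelated)  # True
```

"mathematician" and "physicist" never occur together here, yet they come out as similar because they share the same neighbours.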

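Problem: a word has many possible semantic attributes, any of which may be relevant to similarity. How should the values of these attributes be set?

Solution: a neural network can discover latent semantic attributes on its own.

Code practice:

step1 Import the required libraries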

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

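step2 Build the embeds lookup table

# embeds is a lookup table that maps integer indices to dense vectors.
# First argument: vocabulary size (number of rows); second argument: dimensions per word.
embeds = nn.Embedding(2, 5)
print(embeds)
word_to_ix = {'hello': 0, 'world': 1}
# turn the indices stored in word_to_ix into a LongTensor
lookup_tensor = torch.tensor([[word_to_ix['hello']], [word_to_ix['hello']]], dtype=torch.long)
embeds(lookup_tensor)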

step3 Example corpus

test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
trigrams=[((test_sentence[i],test_sentence[i+1]),test_sentence[i+2])for i in range(len(test_sentence)-2)]

step4 Prepare the input format and test

vocab = set(test_sentence)  # set of all distinct words
# print(vocab)
word_to_ix = {word: i for i, word in enumerate(vocab)}  # map each word to a unique index
# print(word_to_ix)
word_to_ix['forty']
print(trigrams[:3])

The test output is as follows:

step5 Define the NGram model

class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        # vocab_size: number of distinct words; embedding_dim: size of each word vector
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # input: context_size words * embedding_dim values each; output: a hidden layer of size 128 (chosen freely)
        # context_size=2 means the two preceding words are used to predict the next word
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        # input: the hidden layer; output: one score per vocabulary word (vocab_size, the number of distinct words)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))  # flatten the context embeddings into a single row
        out = F.relu(self.linear1(embeds))  # non-linear activation
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs  # log-probabilities over the whole vocabulary for the next word

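step6 Define the loss function and optimizer

loss_function = nn.NLLLoss()  # negative log-likelihood loss; expects log-probabilities as input
model = NGramLanguageModeler(len(vocab), 10, 2)  # embedding_dim=10, context_size=2
optimizer = optim.SGD(model.parameters(), lr=0.001)  # optimizer: model parameters + learning rate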
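The model returns `log_softmax` outputs because `nn.NLLLoss` expects log-probabilities; together the two amount to the usual cross-entropy loss. The sketch below spells out that arithmetic in plain Python for a single prediction. The helper names and the three example scores are made up, and it mirrors the math only, not PyTorch's implementation.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target index."""
    return -log_probs[target]

scores = [2.0, 0.5, -1.0]  # hypothetical raw outputs for a 3-word vocabulary
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

loss_good = nll_loss(log_probs, 0)  # target is the highest-scoring word
loss_bad = nll_loss(log_probs, 2)   # target is the lowest-scoring word
print(loss_good < loss_bad)  # True
```

The loss is small when the model puts high probability on the true next word and large otherwise, which is what training then pushes down.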

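step7 Train and print the loss

for epoch in range(5):
    epoch_loss = 0
    for tri in trigrams:
        model.zero_grad()  # gradients accumulate (are summed) across backward calls, so reset them first

        # convert the two context words to a tensor of indices
        context_idxs = torch.tensor([word_to_ix[w] for w in tri[0]], dtype=torch.long)
        # convert the target word to a tensor as well
        target = torch.tensor([word_to_ix[tri[1]]], dtype=torch.long)
        log_probs = model(context_idxs)  # log-probabilities for the next word
        loss = loss_function(log_probs, target)  # compute the loss
        # backpropagate, then update the parameters
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()  # item() converts the loss tensor to a Python number
    print('Epoch:%d loss:%.4f' % (epoch + 1, epoch_loss))

The results are as follows: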

Categories:

N-Gram

Word2Vec: the skip-gram model and the CBOW model (next week)

Reference articles:

1. Word2Vec: Word2Vec ——gensim实战教程 (简书)

2. The mathematics behind Word2Vec: word2vec 中的数学原理详解（四）基于 Hierarchical Softmax 的模型 (CSDN博客)

3. Word embeddings: 词嵌入：编码形式的词汇语义 (PyTorch官方教程中文版)

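III. Transformer

The transformer is in fact a seq2seq model; an earlier weekly report (Week 8, on CSDN) described three forms of this model.

seq2seq model: consists of two parts, an encoder and a decoder. The input and output lengths generally differ, and the output length is not known in advance; the model has to decide it by itself. seq2seq is widely used in natural language processing.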

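1. Applications

Speech synthesis, TTS (text-to-speech): a transformer model converts text into speech, taking text as input and producing a speech signal as output. Taking Hokkien (Minnan) as an example, the written Chinese text is first converted into its Hokkien pronunciation, which is then converted into an audio signal.

Syntactic parsing: the input is a piece of text and the output is a parse tree. From the tree we can tell, for example, that "deep" and "learning" together form a noun phrase and that "very" and "powerful" together form an adjective phrase.

Chatbot: a typical seq2seq example. The input is the user's text message and the output is the machine's reply. The model is trained on the correspondence between inputs and outputs so that it can generate reasonable replies.

Multi-label classification: the input is an article and the output is its set of classes, with the machine deciding how many classes to output. A task like this, which does not look like sequence-to-sequence at first, can still be handled with a seq2seq model. It differs from multi-class classification in that one article may belong to several classes at once.

Object detection: the input is an image and the output is the set of detected objects. The target object classes are defined in advance. An input image may contain several target objects, and the prediction draws a box around each one and gives the probability that it belongs to its class.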

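2. Encoder

Task: take a sequence of vectors as input and output a sequence of vectors.

The input vectors pass through several blocks, and each block contains several layers (a block performs self-attention followed by a fully connected network).

Flow inside each block:

step1 The input goes into self-attention

step2 Residual connection: the self-attention output a is added to the self-attention input b

step3 Layer normalization

step4 Fully connected network

step5 Residual connection + layer normalization again, which gives the block output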
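The five steps can be sketched in NumPy as below. This is a simplified illustration under several assumptions, not a full transformer encoder: it uses a single attention head, random weights, no bias terms, and a layer normalization without learnable scale and shift. The names `encoder_block`, `params`, `d`, and `d_ff` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each vector (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product scores
    return softmax(scores) @ v

def encoder_block(x, params):
    a = self_attention(x, params["Wq"], params["Wk"], params["Wv"])  # step 1
    h = layer_norm(x + a)                                # steps 2-3: residual, then norm
    ff = np.maximum(0, h @ params["W1"]) @ params["W2"]  # step 4: fully connected (ReLU)
    return layer_norm(h + ff)                            # step 5: residual, then norm

rng = np.random.default_rng(0)
seq_len, d, d_ff = 4, 8, 16
params = {
    "Wq": rng.normal(size=(d, d)),
    "Wk": rng.normal(size=(d, d)),
    "Wv": rng.normal(size=(d, d)),
    "W1": rng.normal(size=(d, d_ff)),
    "W2": rng.normal(size=(d_ff, d)),
}
x = rng.normal(size=(seq_len, d))
out = encoder_block(x, params)
print(out.shape)  # (4, 8)
```

The output has the same shape as the input, which is what allows several such blocks to be stacked one after another.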

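Residual connection: a way of making deep neural networks easier to optimize by adding direct connections that skip layers. The input signal bypasses some layers and is added to their output. The benefits are easier gradient propagation, faster information flow, and less risk of vanishing gradients in very deep networks.

Layer normalization: a normalization technique commonly used in neural networks to speed up convergence, stabilize training, and improve generalization.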
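For reference, the two operations can be written out as follows. The notation is an assumption added here, not taken from the lecture: $x$ is the input to a sublayer $F$ (self-attention or the fully connected network), $d$ is the dimension of the vector, $\epsilon$ is a small constant for numerical stability, and $\gamma$, $\beta$ are learnable scale and shift parameters.

$$
y = x + F(x)
$$

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} y_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (y_i - \mu)^2, \qquad
\mathrm{LayerNorm}(y)_i = \gamma_i \, \frac{y_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
$$

Unlike batch normalization, the mean and variance are computed over the $d$ features of a single vector, so the result does not depend on the other examples in the batch.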

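3. Decoder

(1) Autoregressive

Tokens enter the decoder one at a time: given "机" it predicts "器", given "器" it predicts "学", and so on (the characters spell 机器学习, "machine learning").

The decoder's internal structure is largely the same as the encoder's. What is unique to the decoder is masked multi-head attention, plus one extra multi-head attention layer and layer normalization.

Masked self-attention: the tokens fed into the decoder are generated one at a time, so when processing a position the model can attend only to earlier inputs, not later ones.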
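A common way to implement this restriction is a causal mask that sets the attention scores for future positions to negative infinity before the softmax. The NumPy sketch below assumes a single head and random weights; the function name and shapes are illustrative.

```python
import numpy as np

def masked_self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention where position i only attends to positions <= i."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    n = x.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # block attention to later positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                            # exp(-inf) = 0 for masked entries
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d = 4, 8
x = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = masked_self_attention(x, Wq, Wk, Wv)
print(np.round(weights, 2))  # lower-triangular: zeros above the diagonal
```

Because of the mask, changing a later token leaves the outputs at all earlier positions unchanged, which matches how the tokens are produced one at a time.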

A stop token "END" is needed so that decoding can terminate, just as a start token "BEGIN" marks where it begins.
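The role of the two special tokens shows up clearly in a decoding loop. In the sketch below, `NEXT_TOKEN` and `toy_decoder_step` are hypothetical stand-ins for a trained decoder (a real one would score the whole vocabulary at every step), and `max_len` is an assumed safety limit in case "END" is never produced.

```python
BEGIN, END = "BEGIN", "END"

# Hypothetical stand-in for a trained decoder: maps the last token to the next one.
NEXT_TOKEN = {BEGIN: "机", "机": "器", "器": "学", "学": "习", "习": END}

def toy_decoder_step(tokens_so_far):
    """Return the next token given everything generated so far."""
    return NEXT_TOKEN[tokens_so_far[-1]]

def greedy_decode(step_fn, max_len=20):
    tokens = [BEGIN]                 # decoding starts from the BEGIN token
    for _ in range(max_len):
        nxt = step_fn(tokens)        # each output is fed back in as the next input
        if nxt == END:               # the END token stops generation
            break
        tokens.append(nxt)
    return tokens[1:]                # drop BEGIN from the result

result = greedy_decode(toy_decoder_step)
print("".join(result))  # 机器学习
```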

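(2) Non-autoregressive

All tokens are fed in at once, so the decoder can take the state of the whole input into account. (next lecture)

Autoregressive vs. non-autoregressive

An AT decoder must process tokens one by one, while a NAT decoder can process them in parallel. The advantages of a NAT decoder are that it works on all positions simultaneously and that the output length can be controlled. However, NAT usually performs worse than AT (next lecture).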

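(3) Cross attention

Cross attention: the bridge between the encoder and the decoder. Its keys and values are computed from the encoder output, while its query comes from the decoder.

The cross attention flow is as follows:

On the left, the input speech passes through self-attention in the encoder to give the vectors $a^1$, $a^2$, $a^3$, from which the vectors $k^i$ and $v^i$ are computed. On the right, the decoder's masked self-attention produces a vector $q$. Finally, $q$ is compared with every $k^i$ to obtain attention scores, and these scores weight the $v^i$ in a sum.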
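The computation described above can be sketched in NumPy as follows. It assumes a single head and random weights, with three encoder outputs standing in for $a^1$, $a^2$, $a^3$ and two decoder positions; the function and variable names are illustrative.

```python
import numpy as np

def cross_attention(decoder_states, encoder_outputs, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values come from the encoder."""
    q = decoder_states @ Wq        # one q per decoder position
    k = encoder_outputs @ Wk       # k^i computed from a^i
    v = encoder_outputs @ Wv       # v^i computed from a^i
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights    # weighted sum of the v^i

rng = np.random.default_rng(0)
d = 8
encoder_outputs = rng.normal(size=(3, d))  # a^1, a^2, a^3
decoder_states = rng.normal(size=(2, d))   # outputs of masked self-attention
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = cross_attention(decoder_states, encoder_outputs, Wq, Wk, Wv)
print(out.shape, weights.shape)  # (2, 8) (2, 3)
```

Each row of `weights` shows how much one decoder position draws on each of the three encoder outputs, and no mask is needed because the entire encoder output is already available.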