[Transformers Basics, Part 4] Core Components: Model

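Contents

1. Introduction to Model
- 1.1 Transformer
- 1.2 Attention mechanism
- 1.3 Model types
2. Model Head
- 2.1 What is a Model Head
- 2.2 Model Heads in Transformers
3. Basic Model usage
- 3.0 Downloading a model (via the browser)
- 3.1 Loading and saving a model
- 3.2 Configuration parameters of the loaded model
- 3.3 Loading the config file
- 3.4 Calling the model
  - 3.4.1 Preparing the input and calling the model
  - 3.4.2 Calling without a Model Head
  - 3.4.3 Calling with a Model Head
4. Fine-tuning code example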

These are study notes for the video series at: https://space.bilibili.com/21060026/channel/collectiondetail?sid=1357748

Project repository: https://github.com/zyds/transformers-code

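1. Introduction to Model

1.1 Transformer

As the package name Transformers suggests, most of the models it provides are based on the Transformer architecture.

The original Transformer is an encoder-decoder model.
The Encoder takes the input and builds a complete feature representation of it. The Decoder uses the Encoder's output, together with other inputs, to generate the target sequence.
Both the encoder and the decoder are stacks of multiple Transformer blocks.
Each Transformer block consists of an attention mechanism (Attention) and a feed-forward network (FFN).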

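1.2 Attention mechanism

Attention is a core feature of the Transformer. When computing the feature representation of the current token, the attention mechanism lets the model selectively decide which parts of the context to use.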
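
To make "selectively using the context" concrete, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python so it runs without any dependencies. The vectors are made-up toy numbers and the function names are illustrative; real models do the same computation on tensors, with learned projections and multiple heads.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity between the query and every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy context of three tokens with 2-dimensional vectors (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
print(weights)  # one weight per context token
print(output)   # context vector for the current token
```

Tokens whose keys are more similar to the query receive larger weights, so they contribute more to the resulting representation.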

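1.3 Model types

Encoder models: auto-encoding models. They use the Encoder and have bidirectional attention, so the representation of each token is computed with the full context visible.
Decoder models: autoregressive models. They use the Decoder and have unidirectional attention, so the representation of each token can only see the preceding tokens, never the following ones.
Encoder-decoder models: sequence-to-sequence models. They use Encoder + Decoder. The Encoder has bidirectional attention and the Decoder uses unidirectional attention.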
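
The difference between bidirectional and unidirectional attention can be sketched as a visibility matrix. The helper below is a hypothetical illustration (the name `attention_mask` and the 0/1 convention are assumptions made for this example, not the library's internal representation).

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; mask[i][j] == 1 means token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

bidirectional = attention_mask(4, causal=False)  # encoder-style: full context
unidirectional = attention_mask(4, causal=True)  # decoder-style: preceding tokens only

for row in bidirectional:
    print(row)
print()
for row in unidirectional:
    print(row)
```

In the encoder-style matrix every entry is 1, while the decoder-style matrix is lower triangular: row i only has ones up to column i.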

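2. Model Head

2.1 What is a Model Head

A Model Head is a set of layers attached after the model, usually one or more fully connected layers.
The Model Head maps the model's encoded representation to the output needed for a particular type of task.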

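2.2 Model Heads in Transformers

In the class names below, * stands for a model prefix such as Bert or Auto.

*Model: the bare model, which returns only the encoded representation
*ForCausalLM: decoder-only models (causal language modeling)
*ForMaskedLM: BERT-style models, where some tokens in the sentence are randomly masked and the model predicts what they are
*ForSeq2SeqLM: sequence-to-sequence models
*ForMultipleChoice: multiple-choice tasks
*ForQuestionAnswering: extractive question answering
*ForSequenceClassification: sequence (text) classification
*ForTokenClassification: token classification, for example NER
... and others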

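When BERT was first proposed, it came with the following solutions for different tasks:

Sentence-pair classification: classify using the [CLS] representation
Single-sentence classification: classify using the [CLS] representation
Question answering: feed the Question and the Paragraph in together, then apply two fully connected layers to the Paragraph tokens to predict the start and end positions of the answer.
NER (named entity recognition): classify every token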
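
For the question answering case, the head produces one "start" score and one "end" score per token, and the answer is read off as a span. The sketch below shows one common way to pick that span. The tokens and scores are made up, and `best_span` is a hypothetical helper for illustration, not a function from the Transformers library.

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    """Pick (start, end) with start <= end that maximises start score + end score."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Hypothetical paragraph tokens with made-up per-token scores.
tokens = ["the", "hotel", "is", "near", "the", "train", "station", "."]
start_logits = [0.1, 0.2, 0.0, 0.3, 1.0, 4.0, 0.2, 0.1]
end_logits = [0.0, 0.1, 0.2, 0.1, 0.3, 1.5, 3.5, 0.4]

start, end = best_span(start_logits, end_logits)
answer = tokens[start:end + 1]
print(start, end, answer)
```

Sequence classification and NER are simpler to decode: the former takes an argmax over a single vector of class scores, the latter takes an argmax per token.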

3. Basic Model usage

3.0 Downloading a model (via the browser)

https://huggingface.co/hfl/rbt3/tree/main
Download the files yourself and put them in the hfl/rbt3 folder.
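You can also download the model with commands, after picking it on the Hugging Face model page:

!git clone "https://huggingface.co/hfl/rbt3"  # downloads all model files, including the TF and Flax weights
!git lfs clone "https://huggingface.co/hfl/rbt3" --include="*.bin" # downloads only the PyTorch weights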

3.1 Loading and saving a model

We use a small model from Hugging Face: hfl/rbt3

from transformers import AutoConfig, AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("../../models/hfl/rbt3")

3.2 Configuration parameters of the loaded model

model.config

3.3 Loading the config file

# Load the config file
config = AutoConfig.from_pretrained("../../models/hfl/rbt3")
config

The output is a BertConfig, so we can look into BertConfig to see the individual parameters.
from transformers import BertConfig
The BertConfig source:

class BertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`BertModel`] or a [`TFBertModel`]. It is used to
    instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the BERT
    [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 30522):
            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`BertModel`] or [`TFBertModel`].
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (`int`, *optional*, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (`int`, *optional*, defaults to 2):
            The vocabulary size of the `token_type_ids` passed when calling [`BertModel`] or [`TFBertModel`].
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
            Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
            positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
            For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
        is_decoder (`bool`, *optional*, defaults to `False`):
            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        classifier_dropout (`float`, *optional*):
            The dropout ratio for the classification head.

    Examples:

    ```python
    >>> from transformers import BertConfig, BertModel

    >>> # Initializing a BERT google-bert/bert-base-uncased style configuration
    >>> configuration = BertConfig()

    >>> # Initializing a model (with random weights) from the google-bert/bert-base-uncased style configuration
    >>> model = BertModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

3.4 Calling the model

3.4.1 Preparing the input and calling the model

sen = "弱小的我也有大梦想！"
tokenizer = AutoTokenizer.from_pretrained("../../models/hfl/rbt3")
inputs = tokenizer(sen, return_tensors="pt")
output = model(**inputs)

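3.4.2 Calling without a Model Head

model = AutoModel.from_pretrained("../../models/hfl/rbt3", output_attentions=True)
output = model(**inputs)
output.last_hidden_state.size() # shape of the encoded output: (batch size, sequence length, hidden size)
len(inputs["input_ids"][0]) # number of input token IDs

The 12 in the output is the sequence length, i.e. the number of input tokens.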
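
One way to account for the 12 is to count the characters of the sentence and add the two special tokens that a BERT-style tokenizer places around the input. The sketch below assumes that every character of this sentence becomes exactly one token, which is how BERT-style Chinese tokenizers usually behave. It imitates the tokenizer in plain Python rather than calling it.

```python
sen = "弱小的我也有大梦想！"

# Assumption: each character becomes exactly one token, and the tokenizer
# wraps the sequence in [CLS] ... [SEP].
tokens = ["[CLS]"] + list(sen) + ["[SEP]"]

print(len(sen))     # number of characters in the sentence
print(len(tokens))  # characters plus the two special tokens
print(tokens)
```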

3.4.3 Calling with a Model Head

from transformers import AutoModelForSequenceClassification, BertForSequenceClassification
clz_model = AutoModelForSequenceClassification.from_pretrained("../../models/hfl/rbt3")
clz_model(**inputs)

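The output shows that this is a text classification model: it returns loss and logits (loss is None here because no labels were passed).
At this point the logits have only two entries, because the model defaults to binary classification. We can change this to 10 classes: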

clz_model = AutoModelForSequenceClassification.from_pretrained("../../models/hfl/rbt3", num_labels=10)
clz_model(**inputs)

So where does the num_labels parameter come from?
Looking at the implementation of the BertForSequenceClassification class, we can see that num_labels is read from config:

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config

        self.bert = BertModel(config) 
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()
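
The effect of num_labels on the output can be reproduced with a small stand-alone head. This is a sketch under stated assumptions: `TinyClassificationHead` is a hypothetical class that only mirrors the dropout and linear layers in the excerpt above, and a random tensor stands in for the encoder's pooled output, so no pretrained weights are needed.

```python
import torch
from torch import nn

class TinyClassificationHead(nn.Module):
    """Hypothetical stand-in for the dropout + linear layers shown above."""

    def __init__(self, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output):
        # pooled_output: (batch_size, hidden_size), one vector per sentence
        return self.classifier(self.dropout(pooled_output))

hidden_size = 768                     # BertConfig default listed earlier
pooled = torch.randn(4, hidden_size)  # random stand-in for the encoder output

for num_labels in (2, 10):
    head = TinyClassificationHead(hidden_size, num_labels).eval()
    logits = head(pooled)
    print(num_labels, tuple(logits.shape))
```

The last dimension of the logits always equals num_labels, which is why passing num_labels=10 to from_pretrained changes the output from 2 scores per sentence to 10.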

4. Fine-tuning code example

Task: text classification
Model: hfl/rbt3
Dataset: https://github.com/SophonPlus/ChineseNlpCorpus

Step 1: Import the required packages

from transformers import AutoTokenizer, AutoModelForSequenceClassification, BertForSequenceClassification

Step 2: Load the data

import pandas as pd
data = pd.read_csv("./ChnSentiCorp_htl_all.csv")
data = data.dropna() # drop rows with missing values

Step 3: Create the Dataset

from torch.utils.data import Dataset

class MyDataset(Dataset):

    def __init__(self) -> None:
        super().__init__()
        self.data = pd.read_csv("./ChnSentiCorp_htl_all.csv")
        self.data = self.data.dropna()

    def __getitem__(self, index):
        return self.data.iloc[index]["review"], self.data.iloc[index]["label"]
    
    def __len__(self):
        return len(self.data)

dataset = MyDataset()
for i in range(5):
    print(dataset[i])


Step 4: Split the dataset

from torch.utils.data import random_split
# fractional lengths require PyTorch 1.13 or later
trainset, validset = random_split(dataset, lengths=[0.9, 0.1])
len(trainset), len(validset)
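
When fractions are passed, they have to be converted into integer subset sizes. The sketch below shows one way to do this that, as far as I understand, matches what random_split does: round each share down, then hand out the leftover items one at a time. `split_lengths` is a hypothetical helper and the dataset sizes are made up.

```python
import math

def split_lengths(total, fractions):
    """Turn fractions that sum to 1 into integer subset sizes that sum to `total`."""
    if not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must sum to 1")
    lengths = [math.floor(total * f) for f in fractions]
    # Hand out the items lost to rounding down, one at a time.
    for i in range(total - sum(lengths)):
        lengths[i % len(lengths)] += 1
    return lengths

sizes_even = split_lengths(1000, [0.9, 0.1])
sizes_odd = split_lengths(1005, [0.9, 0.1])
print(sizes_even)
print(sizes_odd)
```

Either way the sizes add up to the full dataset, so no sample is dropped by the split.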

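Step 5: Create the DataLoaders

Here we write a processing function and pass it to the DataLoader as collate_fn, so that each batch of raw samples is tokenized together.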

import torch

tokenizer = AutoTokenizer.from_pretrained("../../models/hfl/rbt3")

def collate_func(batch):
    texts, labels = [], []
    for item in batch:
        texts.append(item[0])
        labels.append(item[1])
    # truncation=True truncates texts longer than max_length
    inputs = tokenizer(texts, max_length=128, padding="max_length", truncation=True, return_tensors="pt")
    inputs["labels"] = torch.tensor(labels)
    return inputs

from torch.utils.data import DataLoader

trainloader = DataLoader(trainset, batch_size=32, shuffle=True, collate_fn=collate_func)
validloader = DataLoader(validset, batch_size=64, shuffle=False, collate_fn=collate_func)

# Inspect one batch from the loader
next(enumerate(validloader))[1]
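
What padding="max_length" and truncation=True do to a batch can be imitated in plain Python. This is a simplified sketch: `pad_batch` is a hypothetical helper, the token IDs are made up, and unlike the real tokenizer it cuts long sequences naively instead of keeping the closing special token.

```python
def pad_batch(batch_ids, max_length, pad_id=0):
    """Truncate or pad every ID list to max_length and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]                    # truncation
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)  # padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs for three "sentences" of different lengths.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]], max_length=5)
for key, rows in batch.items():
    print(key, rows)
```

Because every row ends up with the same length, the batch can be stacked into a single tensor. The attention mask tells the model which positions hold real tokens and which are padding.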

Step 6: Create the model and optimizer

from torch.optim import Adam

model = AutoModelForSequenceClassification.from_pretrained("../../models/hfl/rbt3")

if torch.cuda.is_available():
    model = model.cuda()
optimizer = Adam(model.parameters(), lr=2e-5)

Step 7: Training and validation

def evaluate():
    model.eval()
    acc_num = 0
    with torch.inference_mode():
        for batch in validloader:
            if torch.cuda.is_available():
                batch = {k: v.cuda() for k, v in batch.items()}
            output = model(**batch)
            pred = torch.argmax(output.logits, dim=-1)
            acc_num += (pred.long() == batch["labels"].long()).float().sum()
    return acc_num / len(validset)

def train(epoch=3, log_step=100):
    global_step = 0
    for ep in range(epoch):
        model.train()
        for batch in trainloader:
            if torch.cuda.is_available():
                batch = {k: v.cuda() for k, v in batch.items()}
            optimizer.zero_grad()
            output = model(**batch)
            output.loss.backward()
            optimizer.step()
            if global_step % log_step == 0:
                print(f"ep: {ep}, global_step: {global_step}, loss: {output.loss.item()}")
            global_step += 1
        acc = evaluate()
        print(f"ep: {ep}, acc: {acc}")

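Step 8: Train the model and predict

train()
# sen = "我觉得这家酒店不错，饭很好吃！"  # "I think this hotel is nice, the food is delicious!"
sen = "我觉得这家酒店太差了，饭很难吃！"  # "I think this hotel is terrible, the food is awful!"
id2_label = {0: "negative review", 1: "positive review"}
model.eval()
with torch.inference_mode():
    inputs = tokenizer(sen, return_tensors="pt")
    # move the inputs to the GPU only when one is available, matching train()
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}
    logits = model(**inputs).logits
    pred = torch.argmax(logits, dim=-1)
    print(f"Input: {sen}\nPrediction: {id2_label.get(pred.item())}")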
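
The prediction above only reports the winning class. If a confidence value is also wanted, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python with made-up logits, so it illustrates the idea rather than the output of the trained model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2_label = {0: "negative review", 1: "positive review"}
logits = [2.0, -1.0]  # made-up logits for one sentence

probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # same role as torch.argmax
print(id2_label[pred], round(probs[pred], 4))
```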

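Trying the pipeline

from transformers import pipeline

model.config.id2label = id2_label
# device=0 selects the first GPU; fall back to the CPU (-1) when CUDA is unavailable
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)
pipe(sen)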