A Detailed Guide to the transformers Framework, Using bert-base-chinese

This guide uses the bert-base-chinese model as its example. The model directory is model_name = "C:/Users/Administrator.DESKTOP-TPJL4TC/.cache/modelscope/hub/tiansz/bert-base-chinese"

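The bert-base-chinese checkpoint is only a little over 400 MB, with roughly 100 million parameters, far below today's models that weigh tens of gigabytes and have billions or tens of billions of parameters. BERT is therefore mainly a language-understanding model: its bidirectional encoding comes from the encoder self-attention described in the Transformer paper, where every token attends to both its left and right context. Because it can capture semantics, it can support a number of derived capabilities.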

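1. Comparing the similarity of two sentences.

2. Simple extractive QA: give it a passage as the context and ask a question about it; the model locates where the answer sits in the context and pulls it out. It is not a generative model, and at this parameter count it could not generate anyway; it simply extracts a span of the context as the best-matching answer.

3. Named entity recognition (NER).

4. Many other downstream NLP tasks that you define yourself, provided you implement the output-layer logic.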

The three core components of transformers (configuration, tokenizer, and model) can all be instantiated through the same from_pretrained() method.

Transformers provides three main components.

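Configuration class. Stores the parameters of the model and tokenizer, such as vocabulary size, hidden size, and dropout rate. The configuration class is agnostic to the deep learning framework.
Tokenizer class. Each model has a matching tokenizer that stores the token-to-index mapping and handles the model-specific encoding and decoding pipeline, such as BPE (Byte Pair Encoding), SentencePiece, or, in BERT's case, WordPiece. It also makes it easy to add special tokens (such as CLS and SEP) or resize the vocabulary.
Model class. Provides a base class that implements the computation graph and the encoding process, that is, the forward pass through a stack of self-attention layers up to the last hidden state. Task-specific wrappers are then built on top of that last layer, such as the derived classes XXXForSequenceClassification and XXXForMaskedLM.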

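The Transformers authors also provide a set of Auto Classes for these components, which infer from a short alias (such as bert-base-cased) which configuration, tokenizer, and model classes to instantiate.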

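Transformers offers two broad families of model architectures: one for natural language generation (NLG) tasks, such as GPT, GPT-2, Transformer-XL, XLNet, and XLM, and one mainly for language understanding tasks, such as BERT, DistilBERT, RoBERTa, and XLM.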

The tokenizer.encode() method

After several layers of inheritance, the final implementation lives in transformers\tokenization_utils_base.py, in class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):

def encode(
        self,
        text: Union[TextInput, PreTokenizedInput, EncodedInput],
        text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        padding_side: Optional[bool] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs,
    ) -> List[int]:
        """
        Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.

        Same as doing `self.convert_tokens_to_ids(self.tokenize(text))`.

        Args:
            text (`str`, `List[str]` or `List[int]`):
                The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
                `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
                method).
            text_pair (`str`, `List[str]` or `List[int]`, *optional*):
                Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
                the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
                method).
        """
        encoded_inputs = self.encode_plus(
            text,
            text_pair=text_pair,
            add_special_tokens=add_special_tokens,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            stride=stride,
            padding_side=padding_side,
            return_tensors=return_tensors,
            **kwargs,
        )

        return encoded_inputs["input_ids"]

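What model.eval() does: in training mode the Dropout modules are active and randomly zero out activations, so the same input can produce different outputs from one call to the next (a forward pass by itself never changes the model parameters). Calling eval() switches the model to evaluation mode and deactivates the Dropout modules, which makes inference deterministic. Models loaded with from_pretrained() are already in evaluation mode by default; call model.train() before fine-tuning.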
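The effect of eval() on Dropout can be seen on a single layer. The sketch below is a minimal illustration, assuming PyTorch is installed; the standalone nn.Dropout layer, the seed, and the all-ones input are stand-ins chosen for the demo and are not taken from the BERT model itself.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only to make the demo repeatable

# Hypothetical stand-in for a real model: a single Dropout layer with p = 0.5.
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

layer.train()            # training mode: Dropout is active
train_out = layer(x)     # elements are zeroed at random, survivors are scaled by 1 / (1 - p)

layer.eval()             # evaluation mode: Dropout becomes a no-op
eval_out = layer(x)

print("train:", train_out)
print("eval: ", eval_out)
print("training flag:", layer.training)
```

In training mode every element is either 0 or 2 (that is, 1 / (1 - 0.5)); in evaluation mode the input passes through unchanged.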

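The `__call__` method in Python

Its role: when you call an object as if it were a function, for example objectA(xxx), Python dispatches the call to the object's __call__ method.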
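A small self-contained illustration of this dispatch follows. ToyTokenizer and its three-entry vocabulary are hypothetical and exist only for this sketch; they are not part of transformers.

```python
class ToyTokenizer:
    """Hypothetical class used only to illustrate __call__; not a transformers API."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Runs when the instance is called like a function: tok("...")
        unk = self.vocab["[UNK]"]
        return {"input_ids": [self.vocab.get(ch, unk) for ch in text]}


tok = ToyTokenizer({"[UNK]": 0, "你": 1, "好": 2})

# Both calls do the same thing; the first is shorthand for the second.
result_short = tok("你好吗")
result_explicit = tok.__call__("你好吗")

print(result_short)                      # {'input_ids': [1, 2, 0]}
print(result_short == result_explicit)   # True
```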

In the class PreTrainedTokenizerBase:

def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
        text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        text_pair_target: Optional[
            Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
        ] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        is_split_into_words: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        padding_side: Optional[bool] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        **kwargs,
    ) -> BatchEncoding:
    	"""
        Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            text_pair (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            text_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
                list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
                you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            text_pair_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
                list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
                you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        """
    	...
        ...

This is why you will see calls like the following:

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

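model() goes through torch.nn.Module.__call__, which runs any registered hooks and then calls model.forward().

tokenizer(), however, is not the same as tokenizer.encode(), as the return types show. In the code, tokenizer() performs some input checks and returns a BatchEncoding, whereas tokenizer.encode() returns only BatchEncoding['input_ids']. So the following also works, although token_type_ids and attention_mask are then no longer passed to the model, which falls back to its defaults for them: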

input_ids = tokenizer.encode(question, context, return_tensors="pt")
outputs = model(input_ids)

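The model = xxx.from_pretrained(model_name) question

A single pretrained model can serve different downstream tasks. A network consists of three parts: the input layer, the hidden layers in the middle, and the output layer. "Downstream task" here refers to the output layer: once we have the result of the last hidden layer, we can build different output layers on top of it to get different functionality. That is why there are several ways to instantiate a model, such as AutoModelForxxxx or BertForxxxx. As a result, the output of model() differs from class to class, and the accepted arguments may differ too; check the class's forward() method for details.

Read the source code often; most of it is documented. Take BertForQuestionAnswering as an example: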

from transformers import BertTokenizer, BertForQuestionAnswering
model = BertForQuestionAnswering.from_pretrained(model_name)
......
outputs = model(**inputs)

The class is defined as follows:

@add_start_docstrings(
    """
    Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
    layers on top of the hidden-states output to compute `span start logits` and `span end logits`).
    """,
    BERT_START_DOCSTRING,
)
class BertForQuestionAnswering(BertPreTrainedModel):
    def __init__(self, config):
        ......

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        start_positions: Optional[torch.Tensor] = None,
        end_positions: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:

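The docstring shows that this class serves question-answering tasks. Its return value is either a Tuple[torch.Tensor] or a QuestionAnsweringModelOutput, depending on the return_dict argument. By default it returns a QuestionAnsweringModelOutput, a dataclass whose fields can be read as attributes.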
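To turn start and end logits into an answer, one common approach is to pick the span whose start and end scores sum to the highest value. The sketch below uses plain Python lists with made-up numbers in place of outputs.start_logits[0] and outputs.end_logits[0]; the best_span helper and its max_answer_len limit are hypothetical choices for illustration, not the library's own post-processing.

```python
# Made-up values standing in for one sample of start_logits / end_logits.
tokens = ["[CLS]", "谁", "写", "的", "[SEP]", "鲁", "迅", "写", "的", "[SEP]"]
start_logits = [0.1, 0.2, 0.0, 0.1, 0.0, 5.2, 1.0, 0.3, 0.2, 0.0]
end_logits = [0.0, 0.1, 0.2, 0.0, 0.1, 0.9, 4.8, 0.5, 0.3, 0.1]


def best_span(start_logits, end_logits, max_answer_len=5):
    """Return (start, end) maximizing start score + end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


start, end = best_span(start_logits, end_logits)
answer = "".join(tokens[start:end + 1])
print(start, end, answer)  # 5 6 鲁迅
```

Requiring start <= end avoids the empty or reversed spans that can appear when the two argmax positions are taken independently.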

Another example:

from transformers import BertTokenizer, AutoModel
model = AutoModel.from_pretrained(model_name)
print(type(model)) # <class 'transformers.models.bert.modeling_bert.BertModel'>

The class is defined as follows:

@add_start_docstrings(
    "The bare Bert Model transformer outputting raw hidden-states without any specific head on top.",
    BERT_START_DOCSTRING,
)
class BertModel(BertPreTrainedModel):
    """

    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
    cross-attention is added between the self-attention layers, following the architecture described in [Attention is
    all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

    To behave as an decoder the model needs to be initialized with the `is_decoder` argument of the configuration set
    to `True`. To be used in a Seq2Seq model, the model needs to initialized with both `is_decoder` argument and
    `add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
    """

    _no_split_modules = ["BertEmbeddings", "BertLayer"]

    def __init__(self, config, add_pooling_layer=True):
        ......

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]:

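The docstring shows that this is the bare model without any task head, and that it can act as an encoder or, when configured accordingly, as a decoder. Its return value is a Tuple[torch.Tensor] or a BaseModelOutputWithPoolingAndCrossAttentions.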
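The difference between the bare model and a task-specific class shows up in the output objects. The sketch below assumes transformers and PyTorch are installed and builds both classes from a tiny, randomly initialized BertConfig so that nothing has to be downloaded; the configuration sizes and token ids are arbitrary, and because the weights are random only the output types and shapes are meaningful.

```python
import torch
from transformers import BertConfig, BertForQuestionAnswering, BertModel

# Tiny hypothetical configuration; the real bert-base-chinese is far larger.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

input_ids = torch.tensor([[2, 10, 11, 3, 12, 13, 3]])  # arbitrary ids below vocab_size

# Models built directly from a config start in training mode, so switch to eval().
bare = BertModel(config).eval()
qa = BertForQuestionAnswering(config).eval()

with torch.no_grad():
    bare_out = bare(input_ids)
    qa_out = qa(input_ids)

# Bare model: hidden states for every token plus a pooled vector.
print(type(bare_out).__name__, tuple(bare_out.last_hidden_state.shape), tuple(bare_out.pooler_output.shape))
# QA head: one start score and one end score per token.
print(type(qa_out).__name__, tuple(qa_out.start_logits.shape), tuple(qa_out.end_logits.shape))
```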

About @dataclass

@dataclass
class QuestionAnsweringModelOutput(ModelOutput):
    """
    Base class for outputs of question answering models.

    Args:
        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
        start_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
            Span-start scores (before SoftMax).
        end_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
            Span-end scores (before SoftMax).
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
    """

    loss: Optional[torch.FloatTensor] = None
    start_logits: torch.FloatTensor = None
    end_logits: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None

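The decorator generates an __init__ method (along with others such as __repr__ and __eq__) from the annotated class attributes, roughly equivalent to writing:

def __init__(
    self,
    loss: Optional[torch.FloatTensor] = None,
    start_logits: torch.FloatTensor = None,
    end_logits: torch.FloatTensor = None,
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None,
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None,
):
    self.loss = loss
    self.start_logits = start_logits
    self.end_logits = end_logits
    self.hidden_states = hidden_states
    self.attentions = attentions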
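The generated methods can be checked with the standard library alone. ToyQAOutput below is a hypothetical, simplified stand-in for QuestionAnsweringModelOutput that uses plain lists in place of tensors.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ToyQAOutput:
    """Hypothetical output class; plain lists stand in for tensors."""

    loss: Optional[float] = None
    start_logits: Optional[List[float]] = None
    end_logits: Optional[List[float]] = None


# __init__ was generated from the annotated fields, so keyword arguments work.
out = ToyQAOutput(start_logits=[0.1, 2.3], end_logits=[1.7, 0.2])
same = ToyQAOutput(start_logits=[0.1, 2.3], end_logits=[1.7, 0.2])

print(out)                              # uses the generated __repr__
print([f.name for f in fields(out)])    # field order follows the class body
print(out == same)                      # uses the generated __eq__
```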

About @classmethod

@classmethod
def from_pretrained(
    cls,
    pretrained_model_name_or_path: Union[str, os.PathLike],
    *init_inputs,
    cache_dir: Optional[Union[str, os.PathLike]] = None,
    force_download: bool = False,
    local_files_only: bool = False,
    token: Optional[Union[str, bool]] = None,
    revision: str = "main",
    trust_remote_code=False,
    **kwargs,
):
    ......
    ......

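This is a class method: it can be called without creating an instance, and its first parameter is the class (cls) rather than an instance (self). It can access class attributes and other class methods through cls, but it has no instance of its own, so it cannot reach instance attributes or instance methods unless it creates one. That is how from_pretrained() can build and return a new object of whichever class it was called on.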
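The pattern can be sketched with two toy classes. ToyModel, ToyModelForQA, and from_name are hypothetical names for this illustration; only the use of @classmethod as an alternative constructor mirrors from_pretrained().

```python
class ToyModel:
    """Hypothetical class; only the @classmethod pattern mirrors from_pretrained()."""

    default_hidden_size = 8  # class attribute, reachable through cls

    def __init__(self, name, hidden_size):
        self.name = name
        self.hidden_size = hidden_size

    @classmethod
    def from_name(cls, name):
        # cls is the class the method was called on, so a subclass gets its own type back.
        return cls(name, cls.default_hidden_size)


class ToyModelForQA(ToyModel):
    default_hidden_size = 16


base = ToyModel.from_name("toy")         # called on the class, no instance needed
qa = ToyModelForQA.from_name("toy-qa")

print(type(base).__name__, base.hidden_size)  # ToyModel 8
print(type(qa).__name__, qa.hidden_size)      # ToyModelForQA 16
```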

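What with torch.no_grad() does

torch.no_grad() is a PyTorch context manager that disables gradient computation for the code inside its block. Use it when gradients are not needed, most commonly during inference, where only the forward pass runs and there is no backpropagation; skipping the autograd bookkeeping speeds up computation and saves memory. It is also used around in-place updates that should not be recorded, such as gradient clipping. with torch.no_grad() typically appears together with eval() in validation and test loops. Note also that this context manager is thread local; it will not affect computation in other threads.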
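The difference is visible on the result tensors. The following is a minimal sketch, assuming PyTorch is installed; the tensor w stands in for a model parameter and the values are arbitrary.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # stands in for a model parameter
x = torch.tensor([0.5, 0.5, 0.5])

y_tracked = (w * x).sum()         # normal mode: autograd records the operations

with torch.no_grad():
    y_inference = (w * x).sum()   # same arithmetic, but no graph is built

print(y_tracked.requires_grad, y_tracked.grad_fn is not None)    # True True
print(y_inference.requires_grad, y_inference.grad_fn is None)    # False True
print(w.requires_grad)            # True: the parameter itself is unchanged
```

The numeric result is identical in both cases; only the autograd bookkeeping differs.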

logits

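In a neural network, logits usually refers to the raw outputs of the model's final (fully connected) layer, one value per neuron in that layer. These outputs have not yet passed through any activation function (such as softmax or sigmoid); depending on the goal, feeding them into different activation functions yields different kinds of results.
For example, in a multi-class classification task we use logits.argmax(1) to get the predicted label_id, that is, the class id.
torch.argmax returns the index of the maximum value. With no dim argument it searches over all elements of the flattened tensor; with dim=0 it returns the argmax of each column; with dim=1 it returns the argmax of each row. Because torch processes data in batches, many results are matrices of shape [batch_size, n], where batch_size is the batch size, as if the per-sample results were stacked row by row. dim=1 therefore computes the argmax sample by sample.
For example, during training the accuracy is computed per batch: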

acc = (logits.argmax(1) == label).float().mean()

'''
(logits.argmax(1) == label).float() yields batch_size values of 0 or 1, e.g. 0, 1, 0, 1, ...
mean() then averages them, which is exactly the fraction of 1s, i.e. the accuracy
'''
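A worked example with concrete numbers makes the dim argument and the accuracy formula easier to follow. It assumes PyTorch is installed; the logits and labels are made up for a batch of 4 samples and 3 classes.

```python
import torch

# Made-up logits of shape [batch_size, n] = [4, 3] and their true labels.
logits = torch.tensor([
    [2.0, 0.5, 0.1],
    [0.2, 1.5, 0.3],
    [0.1, 0.4, 3.0],
    [1.0, 2.0, 0.5],
])
label = torch.tensor([0, 1, 2, 0])

probs = torch.softmax(logits, dim=1)   # each row becomes a probability distribution
pred = logits.argmax(1)                # per-sample class id (row-wise argmax)
acc = (pred == label).float().mean()   # fraction of correct predictions in the batch

print(pred.tolist())               # [0, 1, 2, 1]
print(logits.argmax(0).tolist())   # [0, 3, 2]: column-wise argmax
print(logits.argmax().item())      # 8: index into the flattened tensor
print(acc.item())                  # 0.75
```

Softmax is monotonic within each row, so probs.argmax(1) gives the same class ids as logits.argmax(1); that is why predictions can be read straight from the logits.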