微调大型语言模型进行命名实体识别

news2024/11/20 15:16:39

大型语言模型的目标是理解和生成与人类语言类似的文本。它们经过大规模的训练,能够对输入的文本进行分析,并生成符合语法和语境的回复。这种模型可以用于各种任务,包括问答系统、对话机器人、文本生成、翻译等。

命名实体识别(Named Entity Recognition,简称NER)是一种常见的应用方法,可以让模型学会识别文本中的命名实体,如人名、地名、组织机构名等。

大型语言模型在训练时通过大量的文本数据学习了丰富的语言结构和上下文信息。这使得模型能够更好地理解命名实体在文本中的上下文,提高了识别的准确性。即使模型在训练过程中没有见过某个命名实体,它也可以通过上下文推断该实体的类别。这意味着模型可以处理新的、未知的实体,而无需重新训练。除此以外我们还能通过微调(fine-tuning)在特定任务上进行优化。这种迁移学习的方法使得在不同领域和任务上进行NER更加高效。

这篇文章总结了命名实体识别(NER)问题微调大型语言模型的经验。我们将以个人身份信息(PII)为例来介绍大型语言模型进行NER微调的方法。

个人可识别信息(Personal Identifiable Information,PII)

个人可识别信息(Personal Identifiable Information,PII)是指可以用于识别、联系或定位个人身份的数据或信息。这些信息可以单独使用或结合其他信息,使得可以辨认特定的个人。PII通常包括但不限于以下内容:

  • 全名
  • 电子邮件地址
  • 身份证号码
  • 驾驶证号码
  • 社会安全号码
  • 银行账号
  • 生日
  • 地址

这些信息的泄露可能会导致身份盗窃、个人隐私侵犯等问题,因此保护PII对于保障个人隐私和安全至关重要。

HIPAA隐私规定(Health Insurance Portability and Accountability Act Privacy Rules,简称HIPAA Privacy Rules)是一组法规,旨在保护医疗保健信息的隐私和安全。这些规定是由美国联邦政府颁布,适用于医疗保健提供者、健康计划、医疗支付者以及与这些实体交换医疗信息的其他组织和个人。

“Safe Harbor method” 是指在HIPAA(Health Insurance Portability and Accountability Act)隐私规定中的一种数据安全标准。这个方法允许医疗保健机构和其他涉及医疗信息的实体在某些条件下共享个人健康信息,而不会被认为违反HIPAA的隐私规定。

在Safe Harbor方法下,共享的个人健康信息必须经过匿名化处理,以使其不再能够识别特定的个人。HIPAA规定了一组特定的标识符,包括但不限于以下信息:

  • 医疗记录号码
  • 社会安全号码
  • 驾驶证号码
  • 信用卡号码

如果这些标识符被移除,或者通过某种方式使得个人健康信息无法与特定的个人相关联,那么这些信息就被视为符合Safe Harbor标准。

所以NER就派上了用武之地,可以对LLM进行微调,生成包含检测到的PII实体的结构良好的字符串,然后再进行匿名化处理来保证个人健康信息的安全性。

对大型语言模型进行微调

微调LLM主要有以下2个方面的挑战:

调优的LLM不应该产生命名实体的幻觉。应该从一组受控的实体标签中进行检测。

微调LLM应该生成结构良好的输出。LLM输出不应包含无关信息(例如,解释为什么检测到某些实体)。因为输出中的额外令牌导致每个输入的推理成本更高。并且下游任务也无法使用。

所以我们就先格式化训练数据

一个典型的NER输入和输出格式如下。

 # INPUT
 
 test_example="My name is John Doe and I can be contacted at 111-222-3334"
 
 
 # GROUND TRUTH NER DETECTIONS
 test_detections= [
   {
     "entity_type": "PERSON",
     "entity_value": "John Doe",
     "start_position": 11,
     "end_position": 19,
   },
   {
     "entity_type": "PHONE_NUMBER",
     "entity_value": "111-222-3334",
     "start_position": 46,
     "end_position": 58,
   }
 ]

输出数据可以通过多种方式进行格式化。对于典型的NER数据集,广泛采用BIO格式。

BIO 格式是命名实体识别(Named Entity Recognition,NER)任务中常用的标注格式,用于标记文本中的命名实体。BIO 格式包括三种标记:B、I 和 O。

  • B(Beginning):表示一个命名实体的开头。
  • I(Inside):表示一个命名实体的内部。
  • O(Outside):表示不是命名实体的词。
 ## BIO Tags for sentence - Alex is going to Los Angeles in California
 
 Alex I-PER
 is O
 going O
 to O
 Los I-LOC
 Angeles I-LOC
 in O
 California I-LOC

BIO格式是非常具体的。它需要理解标记是特定实体标签的“内部”(I)和“外部”(O)。并且还要标识出实体标签开始的字符,这会在制定LLM任务描述提示时增加不必要的复杂性。

我们可以尝试了以下两种输出格式

 # JSON encoded string with NER detections
 
 llm_output_str = "[{\"entity_type\": \"PERSON\",\"entity_value\": \"John Doe\",\"start_position\": 11,\"end_position\": 19,},{\"entity_type\": \"PHONE_NUMBER\",\"entity_value\": \"111-222-3334\",\"start_position\": 46,\"end_position\": 58}]"

这些字符串看起来与一般的NER数据完全相同。无需对LLM输出进行任何额外的数据处理。并且我们可以直接使用json.loads(llm_output_str)来读取模型的输出。

但是这里我们必须要保证模型输出必须为正确的JSON编码字符串,并且还要记录字符串在输入的位置,这对于模型来说是有点困难的。

或者我们让模型直接将NER的标签进行标记,比如:

 llm_output_str = "My name is <PERSON>John Doe</PERSON> and I can be contacted at <PHONE_NUMBER>111-222-3334</PHONE_NUMBER>"

让模型在输入中包含相关的<ENTITY_LABEL> </ENTITY_LABEL>标记,这样对于我们查看结果是非常方便的,但是对于编码来说还必须对LLM生成的输出进行后期处理,解析检测到的实体的实体以及开始和结束字符索引,这会增加我们的代码量。并且这种方法我们需要保证在输出时没有任何令牌产生幻觉,而且输入中的所有字符、标点和词序都需要保留,这对于LLM来说也有一些困难。

那么我们选择哪一个呢?在本文的最后,我们将看到哪种格式的输出字符串效果更好。

提示设计

现在我们有了输入和输出数据格式,下面就需要设计一个向LLM描述任务的提示符。提示设计是非常非常重要的,这回影响到LLM的输出

《 QUANTIFYING LANGUAGE MODELS’ SENSITIVITY TO SPURIOUS FEATURES IN PROMPT DESIGN or: How I learned to start worrying about prompt formatting》论文讨论了提示对于模型性能的变化,有兴趣的可以看看

对于任务描述,我们使用单独提示来生成json和格式的输出字符串。对于生成json字符串的模型,使用了以下任务描述

 You are given a user utterance that may contain Personal Identifiable 
 Information (PII). You are also given a list of entity types representing 
 personal identifiable information (PII). Your task is to detect and identify 
 all instances of the supplied PII entity types in the user utterance. Provide 
 a JSON output with keys: 'entity_type' (label of the detected entity), 
 'entity_value' (actual string value of the entity), 'start_position' 
 (start character index of the entity in the user utterance string), and 
 'end_position' (end character index of the entity in the user utterance string) 
 Ensure accuracy in identification of entities with correct start_position and 
 end_position character indices. Ensure that all entities are identified. Do 
 not perform false identifications.

对于输出字符串,我使用了以下任务描述

 You are given a user utterance that may contain Personal Identifiable 
 Information (PII). You are also given a list of entity types representing 
 Personal Identifiable Information (PII). Your task is to detect and identify 
 all instances of the supplied PII entity types in the user utterance. 
 The output must have the same content as the input. Only the tokens that match 
 the PII entities in the list should be enclosed within XML tags. The XML tag 
 comes from the PII entities described in the list below. For example, a name 
 of a person should be enclosed within <PERSON></PERSON> tags. Ensure that all 
 entities are identified. Do not perform false identifications.

提示还需要包含实体类型及其描述的列表,以确保模型只检测来自受控标签列表的实体。我选择了下面的模板

 List Of Entities
 PERSON: Name of a person
 Rx_NUMBER: Number identifying a medical prescription
 ORDER_NUMBER: Number identifying a retail order
 EMAIL_ADDRESS: Email address
 PHONE_NUMBER: Telephone or mobile number
 DATE_TIME: Dates and Times
 US_SSN: Social Security Number in the United States

我们针对上面的提示进行以下的测试

 ## Few shot example input
 "My name is John Doe and I can be contacted at 111-222-3334"
 ## Few shot example output
 "My name is <PERSON>John Doe</PERSON> and I can be contacted at <PHONE_NUMBER>111-222-3334</PHONE_NUMBER>"
 
 ## Actual input
 "My phone number is 222-333-4445 and my name is Ana Jones"
 ## Incorrect Model output - model rephrases the output to match closer to few shot example output
 "My name is <PERSON>Ana Jones</PERSON> and my phone number is <PHONE_NUMBER>222-333-4445</PHONE_NUMBER>"
 
 ## What model should have generated
 "My phone number is <PHONE_NUMBER>222-333-4445</PHONE_NUMBER> and my name is <PERSON>Ana Jones</PERSON>"

我们可以非常明显的看到模型输出中生成虚假的标记,实体都已经区分出来了,但是位置变了。我们可以在prompt中加入一些少样本的示例来让模型强制学习,但是这回增加prompt的输入令牌数。

在提示中加入Chain-Of-Thought

除了在会话中嵌入少样本示例外,我们还可以让模型以简洁的方式重新描述指令。这加强了模型对任务的理解,可以获得更好、更一致的格式化输出。我还让模型向我“解释”,给定任务描述,为什么示例输入和输出是有意义的。比如说以下的提示

 # First user message
 usr_msg1="""
 You are given a user utterance that may contain Personal Identifiable 
 Information (PII). You are also given a list of entity types representing 
 Personal Identifiable Information (PII). Your task is to detect and identify 
 all instances of the supplied PII entity types in the user utterance. 
 The output must have the same content as the input. Only the tokens that match 
 the PII entities in the list should be enclosed within XML tags. The XML tag 
 comes from the PII entities described in the list below. For example, a name 
 of a person should be enclosed within <PERSON></PERSON> tags. Ensure that all 
 entities are identified. Do not perform false identifications.
 
 List Of Entities
 PERSON: Name of a person
 Rx_NUMBER: Number identifying a medical prescription
 ORDER_NUMBER: Number identifying a retail order
 EMAIL_ADDRESS: Email address
 PHONE_NUMBER: Telephone or mobile number
 DATE_TIME: Dates and Times
 US_SSN: Social Security Number in the United States
 
 Are the instructions clear to you?
 """
 
 # First assistant message which is a reply to usr_msg1
 # I hardcode this msg once the model produced a resonably good response.
 # Note that the response comes from a non-fine-tuned version of the model.
 # The idea is to sample a good 'thought' from the base version of the model.
 asst_msg1="""
 Yes, the instructions are clear. I will identify and enclose within the 
 corresponding XML tags, all instances of the specified PII entity types in 
 the user utterance. For example, <PERSON><Name of a person></PERSON>, 
 <Rx_NUMBER><Number identifying a medical prescription></Rx_NUMBER>, etc. 
 leaving the rest of the user utterance unchanged.
 """
 
 # Here I hardcode a few shot example as a user message
 usr_msg2="""
 My name is John Doe, and can be contacted at 111-222-3334
 """
 
 # Here I hardcode the appropriate response from the assitant as the
 # correct output of the few shot example
 asst_msg2="""
 My name is <PERSON>John Doe</PERSON>, and can be contacted at <PHONE_NUMBER>111-222-3334</PHONE_NUMBER>
 """
 
 # Here I ask the assistant why its response to the last user message was the 
 # correct response
 usr_msg3="""
 Give a brief explanation of why your answer is correct.
 """
 
 # I hardcode the explaination from the model.
 # Note that the explaination comes from a non-fine-tuned version of the model
 asst_msg3="""
 I identified and enclosed within corresponding XML tags, all instances of 
 the specified PII entity types in the user utterance - a Person Name 
 "John Doe" within the <PERSON></PERSON> tag, and a Phone Number 
 "111-222-3334" within the <PHONE_NUMBER></PHONE_NUMBER> tag. 
 The rest of the user utterance was left unchanged as it did not contain any 
 other identified PII entities.
 """
 
 # The rest of the user and assistant msgs are hardcoded so that the model is
 # in a state where it expects another input from the user
 
 usr_msg4="""
 Great! I am now going to give you another user utterance. Please 
 detect PII entities in it according to the previous instructions. Do 
 not include an explanation in your answer.
 """
 
 asst_msg4="""
 Sure! Please give me the user utterance.
 """
 
 # usr_msg5 would be the actual input string on which we want to detect the 
 # PII entities

创建完整训练数据文件的样例函数如下所示

 def get_fine_tune_prompt_xml(
     rule_set: List[str], 
     input_str: str,
     label_str: str,
     tokenizer: PreTrainedTokenizerBase,
 ) -> torch.Tensor:
 
     """
     Args:
     rule_set (List[str]): List of strings representing entity labels and its
                           corresponding description
     input_str (str): Actual input string on which detections need to be
                      performed
     label_str (str): Expected output string corresponding to input_str
     tokenizer (PreTrainedTokenizerBase): A tokenizer corresponding to the model
                                          being fine-tuned
 
     Returns: 
     torch.Tensor: Tensor of tokenized input ids
     """
     rule_str = "\n".join(rule_set)
 
     usr_msg1 = "You are given a user utterance that may contain Personal Identifiable Information (PII). " \
         "You are also given a list of entity types representing Personal Identifiable Information (PII). " \
         "Your task is to detect and identify all instances of the supplied PII entity types in the user utterance. " \
         "The output must have the same content as the input. Only the tokens that match the PII entities in the " \
         "list should be enclosed within XML tags. The XML tag comes from the PII entities described in the list below. " \
         "For example, a name of a person should be enclosed within <PERSON></PERSON> tags." \
         "Ensure that all entities are identified. Do not perform false identifications." \
         f"""\n\nList Of Entities\n{rule_str}"""\
         "\n\n" \
         "Are the instructions clear to you?"
     
     asst_msg1 = "Yes, the instructions are clear. I will identify and enclose within the corresponding XML tags, " \
         "all instances of the specified PII entity types in the user utterance. For example, " \
         "<PERSON><Name of a person></PERSON>, <Rx_NUMBER><Number identifying a medical prescription></Rx_NUMBER>, etc. " \
         "leaving the rest of the user utterance unchanged."
     
     usr_msg2 = "My name is John Doe, and can be contacted at 111-222-3334"
 
     asst_msg2 = "My name is <PERSON>John Doe</PERSON>, and can be contacted at <PHONE_NUMBER>536-647-8464</PHONE_NUMBER>"
 
     usr_msg3 = "Give a brief explanation of why your answer is correct."
 
     asst_msg3 = "I identified and enclosed within corresponding XML tags, all instances of the specified PII " \
         "entity types in the user utterance - a Person Name \"John Doe\" within the <PERSON></PERSON> tag, and " \
         "a Phone Number \"536-647-8464\" within the <PHONE_NUMBER></PHONE_NUMBER> tag. The rest of the user " \
         "utterance was left unchanged as it did not contain any other identified PII entities."
     
     usr_msg4 = "Great! I am now going to give you another user utterance. Please detect PII entities in it " \
         "according to the previous instructions. Do not include an explanation in your answer."
     
     asst_msg4 = "Sure! Please give me the user utterance."
 
     messages = [
         {"role": "user", "content": usr_msg1},
         {"role": "assistant", "content": asst_msg1},
         {"role": "user", "content": usr_msg2},
         {"role": "assistant", "content": asst_msg2},
         {"role": "user", "content": usr_msg3},
         {"role": "assistant", "content": asst_msg3},
         {"role": "user", "content": usr_msg4},
         {"role": "assistant", "content": asst_msg4},
         {"role": "user", "content": input_str},
         {"role": "assistant", "content": label_str},
     ]
     
    encoded_input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
 
    return encoded_input_ids

函数的输出如下所示

 <s> [INST] You are given a user utterance that may contain Personal Identifiable Information (PII). You are also given a list of entity types representing Personal Identifiable Information (PII). Your task is to detect and identify all instances of the supplied PII entity types in the user utterance. The output must have the same content as the input. Only the tokens that match the PII entities in the list should be enclosed within XML tags. The XML tag comes from the PII entities described in the list below. For example, a name of a person should be enclosed within <PERSON></PERSON> tags. Ensure that all entities are identified. Do not perform false identifications.
 
 List Of Entities
 PERSON: Name of a person
 Rx_NUMBER: Number identifying a medical prescription
 ORDER_NUMBER: Number identifying a retail order
 EMAIL_ADDRESS: Email address
 PHONE_NUMBER: Telephone or mobile number
 DATE_TIME: Dates and Times
 US_SSN: Social Security Number in the United States
 
 Are the instructions clear to you? [/INST]Yes, the instructions are clear. I will identify and enclose within the corresponding XML tags, all instances of the specified PII entity types in the user utterance. For example, <PERSON><Name of a person></PERSON>, <Rx_NUMBER><Number identifying a medical prescription></Rx_NUMBER>, etc. leaving the rest of the user utterance unchanged.</s> [INST] My name is John Doe, and can be contacted at 111-222-3334 [/INST]My name is <PERSON>John Doe</PERSON>, and can be contacted at <PHONE_NUMBER>111-222-3334</PHONE_NUMBER></s> [INST] Give a brief explanation of why your answer is correct. [/INST]I identified and enclosed within corresponding XML tags, all instances of the specified PII entity types in the user utterance - a Person Name "John Doe" within the <PERSON></PERSON> tag, and a Phone Number "111-222-3334" within the <PHONE_NUMBER></PHONE_NUMBER> tag. The rest of the user utterance was left unchanged as it did not contain any other identified PII entities.</s> [INST] Great! I am now going to give you another user utterance. Please detect PII entities in it according to the previous instructions. Do not include an explanation in your answer. [/INST]Sure! Please give me the user utterance.</s> [INST] Hi! is Dr. Danielle Boyd at the clinic [/INST]Hi! is Dr. <PERSON>Danielle Boyd</PERSON> at the clinic</s>

tokenizer.apply_chat_template()负责将’ [INST] ‘和’ [/INST] ‘应用于用户消息,并将’ ‘(序列令牌结束)应用于辅助消息。还要注意,标记器负责将’ '(序列的开始)标记应用到提示符的开头。这些微小的细节对模型在微调过程中是否能有效地学习和收敛有巨大的影响。

自定义损失

自回归模型(像大多数llm一样)被训练来正确预测“下一个令牌”。给定我们刚刚创建的训练数据样本和微调训练设置,模型将学习预测文本所有部分的下一个标记,即任务描述、实体列表、样本示例、会话历史中硬编码的模型思维链等。

这将使模型除了学习预测正确的结果外,还学习任务描述中的令牌分布。这使得我们的学习任务有点繁琐。我们对LLM进行微调的主要目标是为给定的输入字符串生成结构良好且正确的检测。因此,我们应该只计算输出字符串中令牌的损失。所以在我们的示例训练数据中,模型应该只计算以下令牌的损失

 Hi! is Dr. <PERSON>Danielle Boyd</PERSON> at the clinic</s>

这将鼓励模型“忘记”之前的所有标记,只是“注意”主要标记并生成正确的输出字符串。我们可以使用HuggingFace的DataCollator API。

 fromdataclassesimportdataclass
 fromtransformers.tokenization_utils_baseimportPreTrainedTokenizerBase
 fromtransformers.utilsimportPaddingStrategy
 fromtypingimportAny, Callable, Dict, List, NewType, Optional, Tuple, Union
 
 @dataclass
 classCustomDataCollatorWithPadding:
     """
     Data collator that will dynamically pad the inputs received.
 
     Args:
         tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
             The tokenizer used for encoding the data.
         padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:
 
             - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
               sequence is provided).
             - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
               acceptable input length for the model if that argument is not provided.
             - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
         max_length (`int`, *optional*):
             Maximum length of the returned list and optionally padding length (see above).
         pad_to_multiple_of (`int`, *optional*):
             If set will pad the sequence to a multiple of the provided value.
 
             This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
             7.5 (Volta).
         return_tensors (`str`, *optional*, defaults to `"pt"`):
             The type of Tensor to return. Allowable values are "np", "pt" and "tf".
     """
 
     tokenizer: PreTrainedTokenizerBase
     padding: Union[bool, str, PaddingStrategy] =True
     max_length: Optional[int] =None
     pad_to_multiple_of: Optional[int] =None
     return_tensors: str="pt"
 
     def__call__(self, features: List[Dict[str, Any]]) ->Dict[str, Any]:
         batch=self.tokenizer.pad(
             features,
             padding=self.padding,
             max_length=self.max_length,
             pad_to_multiple_of=self.pad_to_multiple_of,
             return_tensors=self.return_tensors,
         )
         labels=batch["input_ids"].clone()
         
         # Set loss mask for all pad tokens
         labels[labels==self.tokenizer.pad_token_id] =-100
         
         # Compute loss mask for appropriate tokens only
         foriinrange(batch['input_ids'].shape[0]):
             
             # Decode the training input
             text_content=self.tokenizer.decode(batch['input_ids'][i][1:])  # slicing from [1:] is important because tokenizer adds bos token
             
             # Extract substrings for prompt text in the training input
             # The training input ends at the last user msg ending in [/INST]
             prompt_gen_boundary=text_content.rfind("[/INST]") +len("[/INST]")
             prompt_text=text_content[:prompt_gen_boundary]
             
             # print(f"""PROMPT TEXT:\n{prompt_text}""")
             
             # retokenize the prompt text only
             prompt_text_tokenized=self.tokenizer(
                 prompt_text,
                 return_overflowing_tokens=False,
                 return_length=False,
             )
             # compute index where prompt text ends in the training input
             prompt_tok_idx=len(prompt_text_tokenized['input_ids'])
             
             # Set loss mask for all tokens in prompt text
             labels[i][range(prompt_tok_idx)] =-100
             
             # print("================DEBUGGING INFORMATION===============")
             # for idx, tok in enumerate(labels[i]):
             #     token_id = batch['input_ids'][i][idx]
             #     decoded_token_id = self.tokenizer.decode(batch['input_ids'][i][idx])
             #     print(f"""TOKID: {token_id} | LABEL: {tok} || DECODED: {decoded_token_id}""")
                     
         batch["labels"] =labels
         returnbatch

CustomDataCollatorWithPadding类可以像下面这样传递给SFTTrainer

 trainer=SFTTrainer(
     model=model,
     train_dataset=dataset["train"],
     eval_dataset=eval_dataset["val"],
     peft_config=peft_config,
     dataset_text_field="text",
     max_seq_length=max_seq_length,
     tokenizer=tokenizer,
     args=training_arguments,
     packing=packing,
     # Using custom data collator inside SFTTrainer
     data_collator=CustomDataCollatorWithPadding(
         tokenizer=tokenizer, 
         padding="longest", 
         max_length=max_seq_length, 
         return_tensors="pt"
     )
 )

通过将标签令牌id设置为-100,我们可以将这些令牌位置的损失归零。这意味着从序列令牌的开头()到最后一个用户msg(以[/INST]结尾)的所有令牌都不会参与损失的计算。

结果

用这个设置微调了mistral /Mistral-7B-Instruct-v0.2模型。我有大约800个训练数据样本,大约400个测试样本和大约400个验证样本。

训练了3轮的模型,并在测试集上取得了相当高的精度/召回率/F1(96%以上)。

这里说一个结果,使用字符串标注的方法超过了生成JSON编码的方法,虽然JSON的格式是正确的,但是正如我们前面所述的,在预测正确的’ start_position ‘和’ end_position '字符索引方面结果并不好。

这里我没有确认模型是否也能很好地处理BIO输出格式,我个人认为应该不会太好。

我们添加了自定义的损失掩码,这是否有助于模型更好地泛化到看不见的实体?这个也没有进行测试。

如果把7B的模型改为13B或者34B等更大的模型的性能如何变化?训练和推理的成本是否值得性能的提升?这都是我们可以继续研究的问题,如果你对NER感兴趣可以自行研究,我也会在有结果后分享我的发现。

https://avoid.overfit.cn/post/0cb82b2fee6440b6b762a23bbf85bf59

作者:Nitin Kumar

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1525154.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

基于深度学习的口罩人脸识别研究进展

MTCNN模型训练输入的所有图像都是正样本&#xff08;戴口罩的照片&#xff09;&#xff0c;没有负样本作为模型输入。在后续的识别任务模块中&#xff0c;导入MTCNN模型检测结果&#xff0c;对特征点进行编码比较进行识别。 基于MTCNN的口罩人脸识别框架可分为四个阶段&#xf…

Linux 时间系统调用

UNIX及LinuxQ的时间系统是由「新纪元时间」Epoch开始计算起。Epoch是指定为1970年1月1日凌晨零点零分零秒&#xff0c;格林威治时间。目前大部份的UNX系统都是用32位来记录时间&#xff0c;正值表示为1970以后&#xff0c;负值则表示1970年以前。 对于当前时间到Epoch 我们用两…

刷题日记——干碎那个BFS!(含国科大机试2021)

例题小引——迷宫问题 问题描述: 迷宫由n行m列的单元格组成(n&#xff0c;m都小于等于50&#xff09;&#xff0c;每个单元格要么是空地&#xff0c;要么是障碍物。 现请你找到一条从起点到终点的最短路径长度。 分析——&#xff08;迷宫问题BFS解法&#xff09; 使用BFS…

鸿蒙Harmony应用开发—ArkTS声明式开发(容器组件:GridRow)

栅格布局可以为布局提供规律性的结构&#xff0c;解决多尺寸多设备的动态布局问题&#xff0c;保证不同设备上各个模块的布局一致性。 栅格容器组件&#xff0c;仅可以和栅格子组件(GridCol)在栅格布局场景中使用。 说明&#xff1a; 该组件从API Version 9开始支持。后续版本…

实战!wsl 与主机网络通信,在 wsl 中搭建服务器。学了计算机网络,但只能刷刷面试题?那也太无聊了!这篇文章可以让你检测你的计网知识!

前言&#xff08;碎碎念&#xff09;&#xff1a;每次发布文章时&#xff0c;我都是一个纠结的过程。因为我给自己写笔记时&#xff0c;只需要记录自己不清晰或者易忘的知识点就可以了&#xff0c;但一旦想要作为文章发布&#xff0c;那么我就得考虑到很多人是纯新手&#xff0…

1TGE120011R1111变频器全国发货质保一年

1TGE120011R1111 ABB ABB的1TGE120011R1111是一个属于其TGE系列的小型断路器&#xff08;也称为微型断路器或MCB&#xff09;。这个系列的断路器主要用于低压配电系统&#xff0c;为电路提供过载和短路保护。以下是这款断路器的一些特点&#xff1a; 紧凑设计&#xff1a;TGE系…

【数据库】数据库基本知识

1.数据库的四个基本概念 1.1 数据&#xff1a;描述事务的符号记录 1.2 数据库&#xff1a;概括的说&#xff0c;数据库数据具有永久存储、有组织的、可共享的大量数据的集合&#xff0c;数据库中的数据按一定的数据模型组织、描述和储存&#xff0c;具有较小的冗余度、较高的…

2、鸿蒙学习-申请调试证书和调试Profile文件

申请发布证书 发布证书由AGC颁发的、为HarmonyOS应用配置签名信息的数字证书&#xff0c;可保障软件代码完整性和发布者身份真实性。证书格式为.cer&#xff0c;包含公钥、证书指纹等信息。 说明 请确保您的开发者帐号已实名认证。每个帐号最多申请1个发布证书。 1、登录AppGa…

AI美图设计室试用,可以生成PPT,以及模特试衣

文章目录 美图设计室试用 美图设计室试用 美图设计室是美图秀秀的公司推出的AI图像处理工具&#xff0c;其功能涵盖图片编辑、抠图、海报设计、文生图等常用的AI功能。尽管很多功能需要开通会员使用&#xff0c;但一些免费功能的表现也还不错&#xff0c;值得一用。 美图设计…

【解读】NIST网络安全框架CSF 2.0

2014年&#xff0c;NIST&#xff08;美国国家标准与技术研究所&#xff0c;类似于中国的工信部&#xff09;首次发布了网络安全框架CSF&#xff08;Cybersecurity Framework)&#xff0c;十年后&#xff0c;在2024年2月26日发布了重大更新&#xff08;CSF 2.0&#xff09;&…

【蓝屏分析】WHEA_UNCORRECTABLE_ERROR 问题分析与解决

背景信息 电脑名字电脑类型厂商使用时间magicbook14 2020款 R5 4500U笔记本电脑荣耀HONOR3年9个月 内存CPUGPU硬盘焊死在主板上焊死在主板上集显PCIe 4.0 NVMe M.2 固态 软硬件错误源确定 电脑莫名频繁随机蓝屏&#xff0c;由于在软件环境上无迹可寻推测是硬件问题 蓝屏画面…

Pretrain-finetune、Prompting、Instruct-tuning训练方法的区别

来自&#xff1a;【多模态】28、LLaVA 第一版 | Visual Instruction Tuning 多模态模型的指令微调_多模态指令跟随数据-CSDN博客 几种模型训练方法的区别&#xff1a; 1、Pretrain-finetune&#xff1a;先在大量数据集上做预训练&#xff0c;然后针对某个子任务做 finetune 2…

Python图像处理指南:PIL与OpenCV的比较【第136篇—PIL】

&#x1f47d;发现宝藏 前些天发现了一个巨牛的人工智能学习网站&#xff0c;通俗易懂&#xff0c;风趣幽默&#xff0c;忍不住分享一下给大家。【点击进入巨牛的人工智能学习网站】。 Python图像处理指南&#xff1a;PIL与OpenCV的比较 图像处理在计算机视觉和图像识别等领域…

实现悲观协议,除了锁还能咋办?

相对乐观和局部悲观是一体两面的关系&#xff0c;识别它的要点就在于是否有全局有效性验证&#xff0c;这也和分布式数据库的架构特点息息相关。但是关于悲观协议&#xff0c;还有很多内容没有提及&#xff0c;下面我们就来填补这一大块空白。 悲观协议的分类 要先跳出来&…

Word使用通配符替换

1.通配符替换 使用&#xff08;[通配替换文本]&#xff09; M-MM12-00([0123456789])-0([0123456789])([0123456789])-0([0123456789])([0123456789]) 2.根据自定义格式替换 根据格式、样式替换&#xff0c;如只替换标题的内容&#xff0c;不替换征文

ASP.NET 服务器控件

目录 一、使用的软件 1、下载 2、新建文件&#xff08;写一个简单的web网页&#xff09; 二、相关知识点 1、Web窗体网页的组件 &#xff08;1&#xff09;可视化组件 &#xff08;2&#xff09;用户接口逻辑 2、Web Form网页的代码模型 &#xff08;1&#xff09;单文件…

STM32的USART能否支持9位数据格式话题

1、问题描述 STM32L051 这款单片机。平常的 USART 串口传输是 8 位数据&#xff0c;但是他的项目需要用串口传输 9 位数据。当设置为 8 位数据时&#xff0c;串口响应中断正常。但是&#xff0c;当设置为 9 位数据时&#xff0c;串口就不产生中断了。USART2 的 ISR 寄存器 RXN…

前端基础篇-深入了解 JavaScript(一)

&#x1f525;博客主页&#xff1a; 【小扳_-CSDN博客】 ❤感谢大家点赞&#x1f44d;收藏⭐评论✍ 文章目录 1.0 JavaScript 概述 2.0 JS - 引入方式 3.0 JS - 基础语法 4.0 JS - 数据类型 5.0 JS - 函数 6.0 JS - Array 数组 7.0 JS - String 字符串 1.0 JavaScript 概述…

Java推荐算法——特征加权推荐算法(以申请学校为例)

加权推荐算法 文章目录 加权推荐算法1.推荐算法的简单介绍2.加权推荐算法详细介绍3.代码实现4.总结 1.推荐算法的简单介绍 众所周知&#xff0c;推荐算法有很多种&#xff0c;例如&#xff1a; 1.加权推荐&#xff1a;分为简单的特征加权&#xff0c;以及复杂的混合加权。主要…

Qt文件读写

做一个简单的文件读写&#xff0c;我们把一个结构体内的数据写入到二进制文件中&#xff0c;并重新读取解析。代码结构如下&#xff1a; 项目名称随便起就好了。main.cpp是主函数&#xff1b;DataHandler实现文件的写与读&#xff0c;还要模拟过程&#xff1b;Definition.h放置…