Contents
Extractive QA Task Evaluation
Extractive QA Evaluation Metrics
precision, recall, f1
ROUGE
Splitting Training and Evaluation Datasets
Token Position Evaluation
Per-Token Position Evaluation
Token Positions of Input Labels
Predicted Token Positions
Evaluation
Wandb
Concurrent Logins on a Shared Machine
Class Balance of Samples
Filtering Windows When Labeling Token Labels
Adjusting the Training Input JSON Format
Insufficient GPU Memory
Processes Stopping After the Remote Server Connection Drops
Extractive QA Task Evaluation
Extractive QA Evaluation Metrics
References:
Evaluating Question Answering Evaluation
Evaluating Question Answering Evaluation - ACL Anthology
Existing metrics (BLEU, ROUGE, METEOR, and F1) are computed using n-gram similarity.
how-to-evaluate-question-answering code:
How to Evaluate a Question Answering System | deepset
Evaluation of a QA System | Haystack
slides:
https://anthonywchen.github.io/Papers/evaluatingqa/mrqa_slides.pdf
QAEval:
https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00397/106792/Towards-Question-Answering-as-an-Automatic-Metric
Code:
https://github.com/CogComp/qaeval-experiments
precision: the proportion of the candidate that matches the reference, relative to the candidate's length.
recall: the proportion of the candidate that matches the reference, relative to the reference's length.
```
Reference: I work on machine learning.
Candidate A: I work.
Candidate B: He works on machine learning.
```
Precision: A > B; recall: B > A
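For intuition, a minimal sketch that computes these two quantities on the example above (whitespace tokenization with punctuation stripped; the helper name is illustrative):

```python
# Token-level precision and recall against a single reference.
def precision_recall(candidate, reference):
    cand = candidate.lower().replace(".", "").split()
    ref = reference.lower().replace(".", "").split()
    common = set(cand) & set(ref)
    return len(common) / len(cand), len(common) / len(ref)

reference = "I work on machine learning."
print(precision_recall("I work.", reference))                        # (1.0, 0.4): high precision, low recall
print(precision_recall("He works on machine learning.", reference))  # (0.6, 0.6)
```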
F1
```python
import evaluate

metric = evaluate.load("squad")
metric.compute(predictions=predicted_answers, references=theoretical_answers)
```
```
{'exact_match': 83.0, 'f1': 88.25}
```
ROUGE (Recall Oriented Understudy for Gisting Evaluation)
https://aclanthology.org/W04-1013/
Variants: ROUGE-N (ROUGE-1 and ROUGE-2 are the most common), ROUGE-L, ROUGE-W, and ROUGE-S (the last two are rarely used). The original paper focuses mainly on recall, but in practice precision, recall, and F values can all be used.
ROUGE-N: based on n-grams; e.g., ROUGE-1 computes recall over matching unigrams, and so on. ROUGE-L: based on the longest common subsequence (LCS).
BLEU
Precision is estimated with modified n-gram precision; recall is estimated via the best match length (a brevity penalty).
Modified n-gram precision:
n-gram precision is the proportion of the candidate's n-grams that match the reference.
```
Reference: I work on machine learning.
Candidate 1: He works on machine learning.
```
Precision = 60% (3/5)
Best match length: because n-gram precision alone would favor overly short candidates, BLEU multiplies in a brevity penalty based on how close the candidate length is to the best-matching reference length.
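A minimal sketch of both ingredients (unigram only; real BLEU combines n = 1..4 and is available in libraries such as sacrebleu — this is just illustrative):

```python
import math
from collections import Counter

# Modified n-gram precision: clip each candidate token count by its count
# in the reference, so repeated tokens cannot inflate the score.
def modified_unigram_precision(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    clipped = sum(min(count, ref_counts[w]) for w, count in Counter(cand).items())
    return clipped / len(cand)

# Brevity penalty: BLEU's recall surrogate, penalizing candidates shorter
# than the (best-matching) reference length.
def brevity_penalty(candidate, reference):
    c, r = len(candidate.split()), len(reference.split())
    return 1.0 if c > r else math.exp(1 - r / c)

cand, ref = "He works on machine learning.", "I work on machine learning."
print(modified_unigram_precision(cand, ref))  # 0.6, matching the 3/5 above
print(brevity_penalty(cand, ref))             # 1.0: equal lengths, no penalty
```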
precision, recall, f1
In the labels, labels_texts, each paper's dataset descriptions form a list of strings; in the model output, prediction_strings, each paper's dataset descriptions are concatenated into a single string.
Example data:
```python
labels_texts = [["description1 in paper1", "description2 in paper1"], ["description1 in paper2"]]
prediction_strings = ["description1 in paper1. description2 in paper1", "description1 in paper2"]
```
Use the F1 score to evaluate the model output:
1. Convert labels_texts into token-level labels.
2. Train the model and generate the prediction_strings.
3. Compare the predicted tokens against the reference tokens and compute the metrics from their intersection.
```python
# Evaluate the model output
def evaluate(predictions, references):
    y_true = []
    y_pred = []
    for ref, pred in zip(references, predictions):
        ref_tokens = tokenizer.tokenize(" ".join(ref))
        pred_tokens = tokenizer.tokenize(pred)
        common = set(ref_tokens) & set(pred_tokens)
        y_true.extend([1] * len(common) + [0] * (len(ref_tokens) - len(common)))
        y_pred.extend([1] * len(common) + [0] * (len(pred_tokens) - len(common)))

    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
    return precision, recall, f1

precision, recall, f1 = evaluate(prediction_strings, labels_texts)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")
```
The line `precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')` raises `ValueError: Found input variables with inconsistent numbers of samples: [40, 42]`. This happens because the generated text is longer than the reference, so y_true and y_pred end up with different lengths.
Common ways to handle this include:
- Truncating the generated text: cut the generated tokens to the reference length.
- Padding the reference: pad the reference tokens to the generated length.
- Overlap-only comparison: compare only the overlapping part and ignore the excess (a sketch follows this list).
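A minimal sketch of the third option, scoring directly from the token-set intersection so unequal lengths never reach precision_recall_fscore_support (the function name is illustrative):

```python
# Precision/recall/F1 from the token-set overlap; no aligned 0/1 vectors needed.
def overlap_prf(ref_tokens, pred_tokens):
    common = set(ref_tokens) & set(pred_tokens)
    precision = len(common) / len(pred_tokens) if pred_tokens else 0.0
    recall = len(common) / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```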
Truncated comparison
```python
# The two token lists have different lengths and would raise an error, so truncate before comparing
min_len = min(len(ref_tokens), len(pred_tokens))
ref_tokens = ref_tokens[:min_len]
pred_tokens = pred_tokens[:min_len]
```
Output:
```
Prediction Strings: ['impossible. In this paper, we aim to solve this problem by introducing FAR-Trans, the first public dataset for FAR, containing pricing in- formation and retail investor transactions acquired from a large European financial institution.']
Precision: 1.0000, Recall: 0.9750, F1 Score: 0.9873
```
ROUGE
(Recall Oriented Understudy for Gisting Evaluation)
ROUGE: A Package for Automatic Evaluation of Summaries - ACL Anthology
文本生成评估指标简单介绍BLEU+ROUGE+Perplexity+Meteor 代码实现_meteor指标-CSDN博客
Overview: ROUGE mainly evaluates machine translation, text summarization, and other NLP tasks by measuring how well the generated text matches the target text. It emphasizes recall, i.e., how completely the generated text covers the content and information of the reference, whereas BLEU leans more toward the precision of the generated text.
It has two main forms:
ROUGE-N (N = 1, 2, 3, …)
ROUGE-L
ROUGE-N is computed as:
$$\text{ROUGE-N} = \frac{\text{Candidate} \cap \text{Reference}}{len(\text{Reference})}$$
Unlike ROUGE-L's longest common subsequence, the intersection in the numerator does not consider token order. The intersection is computed over n-grams.
Reference:
https://zhuanlan.zhihu.com/p/647310970
n denotes a combination of n consecutive words; n can be 1, 2, 3, or higher.
- 1-gram (unigram): a single word. For example, the 1-grams of the sentence "我喜欢学习自然语言处理。" are: ["我", "喜欢", "学习", "自然语言处理", "。"]
- 2-gram (bigram): two consecutive words. For example, the 2-grams of the same sentence are: ["我喜欢", "喜欢学习", "学习自然语言处理", "自然语言处理。"]
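For intuition, a minimal from-scratch sketch of ROUGE-N recall, intersecting n-gram counts without regard to order (the sentence pair is reused from the rouge_scorer example further below):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# ROUGE-N recall: overlapping n-gram count divided by the reference n-gram count.
def rouge_n_recall(candidate_tokens, reference_tokens, n):
    cand = Counter(ngrams(candidate_tokens, n))
    ref = Counter(ngrams(reference_tokens, n))
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(sum(ref.values()), 1)

ref = "the quick brown fox jumps over the lazy dog".split()
cand = "the quick brown dog jumps on the log".split()
print(rouge_n_recall(cand, ref, 1))  # 6/9 ≈ 0.667
print(rouge_n_recall(cand, ref, 2))  # 2/8 = 0.25
```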
ROUGE-L
Based on the longest common subsequence (which is order-sensitive).
Sentence-level ROUGE-L:
$$\text{ROUGE-L} = \frac{\text{LCS}(\text{Candidate}, \text{Reference})}{len(\text{Reference})}$$
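A minimal sketch of sentence-level ROUGE-L recall via a dynamic-programming LCS, which, unlike the n-gram intersection above, is order-sensitive:

```python
# Length of the longest common subsequence of two token lists.
def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(candidate_tokens, reference_tokens):
    return lcs_len(candidate_tokens, reference_tokens) / max(len(reference_tokens), 1)
```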
Rouge library:
rouge · PyPI
https://www.cnblogs.com/bonelee/p/18152511
It turned out rouge-score was already installed in the environment:
rouge-score · PyPI
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')
print(scores["rouge1"])
print(scores["rouge2"])
print(scores["rougeL"])
```
Multiple pairs cannot be scored in a single call; each sample has to be compared separately. So how to handle multiple samples? One example found online concatenates all strings in the list into one string:
自然语言处理评估指标_自然语言处理结果-CSDN博客
Text summarization tutorial:
https://github.com/hellotransformers/Natural_Language_Processing_with_Transformers/blob/main/chapter6.md
Multi-sample example:
https://stackoverflow.com/questions/67390427/rouge-score-append-a-list
```python
# Concatenate the sentences of the same document into one string
# ================== rouge ====================
import numpy as np
from rouge_score import rouge_scorer  # the original nltk import does not exist; rouge_scorer comes from rouge_score

def gouge(evaluated_sentences, reference_sentences):
    """
    :param evaluated_sentences: list of generated summary sentences
    :param reference_sentences: list of reference summary sentences
    :return: geometric mean of the ROUGE-1..4 precisions
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3', 'rouge4'])
    # score(target, prediction) expects the reference first
    scores = scorer.score(' '.join(reference_sentences), ' '.join(evaluated_sentences))
    rouge_n_scores = [scores[f'rouge{i}'].precision for i in range(1, 5)]
    return np.exp(np.mean(np.log(rouge_n_scores)))
```
```python
# Compute the metric separately for each document
# importing the native rouge library
from rouge_score import rouge_scorer

# a list of the hypothesis documents
hyp = ['This is the first sample', 'This is another example']
# a list of the references documents
ref = ['This is the first sentence', 'It is one more sentence']

# make a RougeScorer object with rouge_types=['rouge1']
scorer = rouge_scorer.RougeScorer(['rouge1'])

# a dictionary that will contain the results
results = {'precision': [], 'recall': [], 'fmeasure': []}

# for each of the hypothesis and reference documents pair
for (h, r) in zip(hyp, ref):
    # computing the ROUGE
    score = scorer.score(h, r)
    # separating the measurements
    precision, recall, fmeasure = score['rouge1']
    # add them to the proper list in the dictionary
    results['precision'].append(precision)
    results['recall'].append(recall)
    results['fmeasure'].append(fmeasure)

print(results)
```
```
{'precision': [0.8, 0.2], 'recall': [0.8, 0.25], 'fmeasure': [0.8000000000000002, 0.22222222222222224]}
```
However, concatenating different samples into a single string before computing ROUGE does not make sense semantically, so each sample is scored separately and the scores are averaged over all samples.
Install rouge_score with pip, then compute the metrics:
```python
def rouge_evaluate(predictions, refs):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    for ref, pred in zip(refs, predictions):
        score = scorer.score(" ".join(ref), pred)
        for key in rouge_scores:
            rouge_scores[key].append(score[key].fmeasure)

    avg_rouge_scores = {key: sum(scores) / len(scores) for key, scores in rouge_scores.items()}

    # Write the metrics directly to a file
    with open('output/evaluation_rouge.txt', 'w') as eval_file:
        eval_file.write(f"ROUGE Scores:\n")
        for key, score in avg_rouge_scores.items():
            eval_file.write(f"{key}: {score:.4f}\n")
    return avg_rouge_scores

rouge_results = rouge_evaluate(dataset_descriptions, labels_texts)
```
Only the mean F-measure is computed here; other values can be added later if needed.
Splitting Training and Evaluation Datasets
```python
from sklearn.model_selection import train_test_split
```
After the token labels have been built:
```python
# Split the dataset into training and evaluation sets
train_size = 0.8
train_indices, val_indices = train_test_split(list(range(len(inputs["input_ids"]))),
                                              train_size=train_size, random_state=42)
train_inputs = {key: val[train_indices] for key, val in inputs.items()}
val_inputs = {key: val[val_indices] for key, val in inputs.items()}

train_dataset = TensorDataset(train_inputs["input_ids"], train_inputs["attention_mask"], train_inputs["labels"])
val_dataset = TensorDataset(val_inputs["input_ids"], val_inputs["attention_mask"], val_inputs["labels"])
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=2, shuffle=False)
```
```python
# During training
for batch in train_dataloader:
    …
avg_epoch_loss = epoch_loss / len(train_dataloader)
```
```python
# During evaluation: evaluate on the validation set
val_predictions, val_labels = evaluate(model, val_dataloader, tokenizer)
…
precision, recall, f1, rouge_scores = calculate_metrics(val_predictions, val_labels, tokenizer)
```
Check whether paper_idx is present in each batch, and make sure the keys used by the DataLoader align with the modified input keys: ["input_ids"], ["attention_mask"], ["labels"], ["paper_idx"].
```python
val_dataset = TensorDataset(val_inputs["input_ids"], val_inputs["attention_mask"],
                            val_inputs["labels"], val_inputs["paper_idx"])
```
One remaining problem: what if the split separates sliding windows that come from the same sample? To keep the sliding windows generated from one sample together, the splitting has to change: instead of splitting the dataset after tokenization, split the raw data first and then tokenize each subset separately. That way every sliding window of a paper stays in the same training or validation split.
Split before tokenizing
```python
# Split the original data into training and evaluation subsets
train_sentences, val_sentences, train_labels_texts, val_labels_texts, train_titles, val_titles = train_test_split(
    sentences, labels_texts, titles, train_size=0.8, random_state=42
)
```
Tokenize each subset separately and create the dataloaders:
```python
# Tokenize the training and validation data separately
train_inputs = tokenize_and_align_labels(train_sentences, train_labels_texts, train_titles, tokenizer)
val_inputs = tokenize_and_align_labels(val_sentences, val_labels_texts, val_titles, tokenizer)

model.to(device)
train_inputs = {key: val.to(device) for key, val in train_inputs.items()}
val_inputs = {key: val.to(device) for key, val in val_inputs.items()}

train_dataset = TensorDataset(train_inputs["input_ids"], train_inputs["attention_mask"], train_inputs["labels"])
val_dataset = TensorDataset(val_inputs["input_ids"], val_inputs["attention_mask"], val_inputs["labels"])
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=2, shuffle=False)  # keep windows of the same paper contiguous
```
The labels used at evaluation time change accordingly:
```python
precision, recall, f1 = token_evaluate(dataset_descriptions, val_labels_texts)
rouge_results = rouge_evaluate(dataset_descriptions, val_labels_texts)
```
When producing the output:
```python
dataset_descriptions = get_extracted_description(val_predictions, val_inputs["input_ids"], val_inputs["paper_idx"])
```
Token Position Evaluation
Common approaches:
- Accuracy: whether the predicted answer is exactly correct; suitable when there is a single gold answer.
- Precision, Recall, and F1: common when the model may predict several answers. Precision is the fraction of predicted answers that are correct, recall is the fraction of gold answers that are predicted, and F1 is their harmonic mean.
- EM (Exact Match): the fraction of predictions that match a reference answer exactly.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): overlap between the generated answer and the reference, common for generation tasks; ROUGE-1 and ROUGE-L measure lexical and sequence-level match respectively.
- BLEU (Bilingual Evaluation Understudy): n-gram overlap between the generated answer and the reference; mainly a machine-translation metric, but usable as an auxiliary metric for QA.
Using output token positions as the evaluation signal is appropriate, especially in the following cases:
- Start/end position accuracy: in extractive QA the model typically predicts the start and end positions of the answer in the text, so these can directly measure whether the answer was located correctly.
- Overlap rate: the overlap between the predicted span and the gold span can be measured with Intersection over Union (IoU) or similar overlap metrics; a minimal IoU sketch follows this list.
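A minimal sketch of span IoU over token positions; the function name and arguments are illustrative, not from the training code:

```python
# Span IoU between a predicted (start, end) span and a gold span,
# measured over inclusive token positions.
def span_iou(pred_start, pred_end, gold_start, gold_end):
    inter = max(0, min(pred_end, gold_end) - max(pred_start, gold_start) + 1)
    union = (pred_end - pred_start + 1) + (gold_end - gold_start + 1) - inter
    return inter / union if union > 0 else 0.0

print(span_iou(10, 20, 12, 25))  # overlap covers positions 12..20 -> 9 / 16 = 0.5625
```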
Per-Token Position Evaluation
Here the model output is not a single contiguous span, so the position of every labeled token is compared against the predicted token positions.
Token Positions of Input Labels
1. While building token labels from label_text, also save the token positions (the index within the sliding window of every successfully matched token) in tokenized_inputs. A key named token_positions is added to hold the in-window index of each matched token.
```python
tokenized_inputs = {"input_ids": [], "attention_mask": [], "labels": [], "paper_idx": [], "token_positions": []}
```
The token_positions list only contains the indices whose token_label value is 1. Each sliding window of a paper gets its own array; some labels fall inside a given window and some do not, so some windows end up with an empty token_positions array.
```python
for j in range(len(input_ids)):  # each sliding window of the same paper
    tokenized_sentence = tokenizer.convert_ids_to_tokens(input_ids[j])
    token_label = [0] * len(tokenized_sentence)
    main_body_start = find_main_body(tokenized_sentence)
    token_positions = []
    for label_text in labels_text:
        tokenized_label = tokenizer.tokenize(label_text)
        tokenized_label = [token for token in tokenized_label if token != '<pad>']  # remove pad tokens
        label_length = len(tokenized_label)
        # Handle labels that cross sliding-window boundaries
        for k in range(main_body_start, len(tokenized_sentence) - 1):  # match from the main body onward
            end_position = min(len(tokenized_sentence) - 1, k + label_length)
            # Tail of the label may be cut off by the window end, so compare
            # only the first (end_position - k) label tokens
            if tokenized_sentence[k:end_position] == tokenized_label[0:end_position - k]:
                print("matched tokenized_label:", tokenized_sentence[k:end_position])
                token_label[k:end_position] = [1] * (end_position - k)
                for pos in range(k, end_position):
                    token_positions.append(pos)
        for label_start in range(label_length - 1):  # head of the label cut off by the window start
            match_end = main_body_start + label_length - label_start
            if tokenized_sentence[main_body_start:match_end] == tokenized_label[label_start:]:
                print("matched tokenized_label:", tokenized_sentence[main_body_start:match_end])
                token_label[main_body_start:match_end] = [1] * (label_length - label_start)
                for pos in range(main_body_start, match_end):
                    token_positions.append(pos)
```
tokenized_inputs["token_positions"].append(token_positions) |
Because the arrays in tokenized_inputs["token_positions"] have unequal lengths, they cannot be stacked into a tensor with torch.stack() or moved to the GPU together with the other keys, so only the other keys are moved to the GPU, after batching.
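A minimal sketch of that device transfer, assuming train_inputs is the dict built above: only the stackable tensor keys move to the device, while the ragged token_positions lists stay on the CPU:

```python
# token_positions cannot be stacked into a tensor, so leave it as Python lists.
tensor_keys = {"input_ids", "attention_mask", "labels", "paper_idx"}
train_inputs = {
    key: (val.to(device) if key in tensor_keys else val)
    for key, val in train_inputs.items()
}
```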
Predicted Token Positions
2. When converting the token classifications back into sentences, also output the predicted token positions: in get_prediction_string, record the positions of tokens whose predicted_tokens_classes is 'Dataset description' and return them as well. A new pre_token_positions list stores these positions, and the function finally returns dataset_description_string and pre_token_positions.
```python
pre_token_positions = []

for idx, (token, pred_class) in enumerate(zip(tokenized_sub_sentence, predicted_tokens_classes)):
    is_descrp = (pred_class == 'Dataset description')
    if is_descrp:
        pre_token_positions.append(idx)

return dataset_description_string, pre_token_positions
```
```python
dataset_description, pre_token_positions = get_prediction_string(prediction_class, predicted_input_id, is_same_paper)
```
Token positions from different windows are evaluated separately, to avoid mixing up position indices across windows. The positions do not need to be concatenated: unlike the extracted text, they carry no sentence meaning, and keeping each window as its own entry in one outer list prevents positions from different windows from colliding. Each pre_token_positions list is stored as one element of papers_pre_token_positions:
```python
papers_pre_token_positions = []

dataset_description, pre_token_positions = get_prediction_string(prediction_class, predicted_input_id, is_same_paper)
papers_pre_token_positions.append(pre_token_positions)

return dataset_descriptions, papers_pre_token_positions
```
```python
dataset_descriptions, papers_pre_token_positions = get_extracted_description(val_predictions, val_inputs["input_ids"], val_inputs["paper_idx"])
```
Evaluation
The number of labeled and predicted tokens may differ at evaluation time. Here the predictions are papers_pre_token_positions and the labels are val_inputs["token_positions"].
Precision: P = TP / (TP + FP), i.e., the fraction of data predicted as positive that is actually positive.
Recall: R = TP / (TP + FN), i.e., the fraction of actual positives that the model predicts as positive.
F1 = (2 * P * R) / (P + R)
true_positive counts correctly predicted token positions, false_positive counts wrongly predicted positions, and false_negative counts gold positions the model missed. Iterating over the predicted and gold positions of every sliding window accumulates these counts and yields precision, recall, and F1:
```python
def token_evaluate(pre_token_positions, label_token_positions):
    true_positive = 0
    false_positive = 0
    false_negative = 0

    for pred_positions, label_positions in zip(pre_token_positions, label_token_positions):
        pred_positions_set = set(pred_positions)
        label_positions_set = set(label_positions)

        true_positive += len(pred_positions_set & label_positions_set)
        false_positive += len(pred_positions_set - label_positions_set)
        false_negative += len(label_positions_set - pred_positions_set)

    precision = true_positive / (true_positive + false_positive) if (true_positive + false_positive) > 0 else 0
    recall = true_positive / (true_positive + false_negative) if (true_positive + false_negative) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1
```
```python
# Example
pre_token_positions = [[], [2], []]
label_token_positions = [[1], [2], [3]]
precision, recall, f1 = token_evaluate(pre_token_positions, label_token_positions)
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")
```
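Walking through the example: only window 2's predicted position matches a gold position (TP = 1), windows 1 and 3 each miss a gold position (FN = 2), and nothing spurious is predicted (FP = 0), so the expected output is:

```
Precision: 1.0
Recall: 0.3333333333333333
F1-score: 0.5
```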
Wandb
Wandb is a tool that automatically logs model-training runs; once configured, the plots and per-curve values of every run can be viewed conveniently on the wandb.ai website.
Logging training runs with Wandb
Install wandb:
```bash
pip install wandb
```
Log in from the command line:
```bash
wandb login
```
Enter the API key generated at registration; login succeeds.
Run the test example:
```python
import wandb
import random

# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    project="my-awesome-project",

    # track hyperparameters and run metadata
    config={
        "learning_rate": 0.02,
        "architecture": "CNN",
        "dataset": "CIFAR-100",
        "epochs": 10,
    }
)

# simulate training
epochs = 10
offset = random.random() / 5
for epoch in range(2, epochs):
    acc = 1 - 2 ** -epoch - random.random() / epoch - offset
    loss = 2 ** -epoch + random.random() / epoch + offset

    # log metrics to wandb
    wandb.log({"acc": acc, "loss": loss})

# [optional] finish the wandb run, necessary in notebooks
wandb.finish()
```
The following messages appear:
```
wandb: Run data is saved locally in D:\Projects\longformer\wandb\run-20240805_204153-4hznct2i
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run hearty-eon-1
wandb: View project at https://wandb.ai/lalagoon-north-china-electric-power-university/test-project
wandb: View run at https://wandb.ai/lalagoon-north-china-electric-power-university/test-project/runs/4hznct2i
```
Opening the URL shows the generated plots, so the setup works.
Add this to the project:
```python
import wandb

# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    project="extract-dataset-description-project",

    # track hyperparameters and run metadata
    config={
        "learning_rate": 5e-5,
        "architecture": "Longformer",
        "dataset": "Description-500",
        "epochs": 100,
    }
)
```
Log metrics for each epoch during the run:
```python
# Log the loss to wandb
wandb.log({"loss": avg_epoch_loss})
```
Close the run when done:
```python
wandb.finish()
```
The loss plots are at:
https://wandb.ai/lalagoon-north-china-electric-power-university/extract-dataset-description-project
Concurrent Logins on a Shared Machine
If another wandb account is already logged in on the server, set the API key in the shell instead of running wandb login:
```bash
export WANDB_API_KEY='xxxx'
```
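The same thing can be done from inside Python, using wandb's documented WANDB_API_KEY environment variable; a minimal sketch (set it before wandb.init):

```python
import os

# Use this run's own key instead of the account cached on the shared machine
os.environ["WANDB_API_KEY"] = "xxxx"  # replace with your actual key
```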
Class Balance of Samples
Filtering Windows When Labeling Token Labels
If a window contains no positive examples, it is not fed into training. When labeling, a flag is passed in to indicate whether the training set is being labeled; for the training set, token_positions is checked, and if it is empty the sliding window is not added to tokenized_inputs. If filtering is enabled and no label falls inside the window, skip the rest of this window's iteration so its values never reach the inputs:
```python
if filter_empty_window and len(token_positions) == 0:
    continue
```
Enable it for the training set only:
```python
train_inputs = tokenize_and_align_labels(train_papers, tokenizer, filter_empty_window=True)
```
Adjusting the Training Input JSON Format
The input format changes from separate string arrays to one JSON array, so the different fields of the same paper sit together and are easier to compare.
Read from the JSON array:
```python
descri_file_path = 'input/papers_and_datasets.json'

def read_json(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

papers_info = read_json(descri_file_path)
```
All subsequent loops then read from papers_info, replacing the old per-field reads:
```python
sentences = data['paper_texts']
labels_texts = data['dataset_descriptions']
titles = data['titles']
```
The train/test split now splits papers_info:
```python
train_papers, val_papers = train_test_split(
    papers_info, train_size=0.8, random_state=42
)
```
tokenize_and_align_labels takes the paper objects when matching token labels:
```python
train_inputs = tokenize_and_align_labels(train_papers, tokenizer)
val_inputs = tokenize_and_align_labels(val_papers, tokenizer)
```
```python
def tokenize_and_align_labels(papers, tokenizer, max_length=4096, stride=256):
```
Inside the tokenizer, loop over the papers and pull out each field:
```python
for i, paper in enumerate(papers_info):
    sentence = paper.get("paper_text")
    labels_text = paper.get("dataset_descriptions")
    title = paper.get("title")
```
The evaluation loops pull the fields out in the same way.
1. Token evaluation:
```python
def token_evaluate(predictions, val_papers, tokenizer):
```
```python
for paper, pred in zip(val_papers, predictions):  # ref holds one paper's labels; after y_true.extend it covers all samples
    ref = paper.get("dataset_descriptions")
```
```python
precision, recall, f1 = token_evaluate(dataset_descriptions, val_papers, tokenizer)
```
2. ROUGE:
```python
def rouge_evaluate(predictions, val_papers):
```
```python
for paper, pred in zip(val_papers, predictions):
    ref = paper.get("dataset_descriptions")
```
```python
rouge_results = rouge_evaluate(dataset_descriptions, val_papers)
```
Also, to make path changes across machines easier, all paths in use are collected at the top of the file:
```python
descri_file_path = 'input/papers_and_datasets.json'
loss_log_path = 'output/training_loss.txt'
loss_fig_path = 'output/training_loss.png'
model_save_path = "output/trained_model"
eval_token_path = 'output/evaluation_token.txt'
eval_rouge_path = 'output/evaluation_rouge.txt'
```
Insufficient GPU Memory
When training in batches on Colab, the inputs are moved to the GPU:
```python
input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
```
This raises:
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB. GPU
```
After commenting out the GPU transfer, RAM ran out instead, and switching to another account gave the same result; the plan is to polish the code and try it on the Huawei (NPU) machine. With heavier runs, the Huawei card also reports out-of-memory:
```
RuntimeError: NPU out of memory. Tried to allocate 578.00 MiB (NPU 0; 60.97 GiB total capacity; 8.23 GiB already allocated; 8.23 GiB current active; 362.04 MiB free; 8.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
```
Clear the cache and reduce the batch size; see:
python - How to avoid "CUDA out of memory" in PyTorch - Stack Overflow
```python
import torch
torch.cuda.empty_cache()
```
```python
torch_npu.npu.empty_cache()
```
Set max_split_size_mb.
Environment-variable method:
```bash
export 'PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:512'
```
In-code method:
```python
# Note: these _debug_* hooks are private APIs and may differ across versions
torch._C._debug_set_max_split_size_mb(512)
torch_npu._C._debug_set_max_split_size_mb(512)
```
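An alternative sketch is to set the same environment variable shown above from Python instead; the assumption is that the allocator reads it at initialization (as PYTORCH_CUDA_ALLOC_CONF does), so it must be set before torch / torch_npu are imported:

```python
import os

# Mirror the shell export above; must run before torch / torch_npu are imported
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "max_split_size_mb:512"

import torch
import torch_npu
```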
Check the remaining NPU resources on the Huawei card:
```bash
npu-smi info
```
Someone else's job is visible:
```
+---------------------------+---------------+----------------------------------+
| NPU Chip                  | Process id    | Process name | Process memory(MB)|
+===========================+===============+==================================+
| 0    0                    | 1349884       | python       | 30216             |
+===========================+===============+==================================+
```
Pin the Python program to specific cards; see:
https://www.cnblogs.com/tyty-Somnuspoppy/p/10071716.html
os.environ["NPU_VISIBLE_DEVICES"] = "1,2,3" |
device = torch.device("npu:1" if torch.npu.is_available() else "npu:2") |
Check whether each device has enough free memory:
```python
def select_available_npu(required_memory_mb):
    if torch_npu.npu.is_available():
        for i in range(1, 8):
            props = torch_npu.npu.get_device_properties(f"npu:{i}")
            # get_device_properties has no reserved_memory field; use the
            # allocator's reserved byte count instead (torch_npu mirrors
            # torch.cuda.memory_reserved — verify against your torch_npu version)
            reserved = torch_npu.npu.memory_reserved(i)
            if props.total_memory - reserved >= required_memory_mb * 1024 * 1024:
                return torch.device(f"npu:{i}")
    return torch.device("cpu")
```
Check total memory:
```python
torch_npu.npu.get_device_properties("npu:1").total_memory
```
Pick a specific NPU; note the query must go through torch_npu.npu, otherwise nothing is returned:
```python
device = torch.device("npu:1" if torch_npu.npu.is_available() else "npu:2")
```
Longformer's input sequences are long, and 4096 tokens per window is itself large, unlike the recent Infini-Transformer style that first splits the input and processes it in chunks. This Longformer setup does not support multi-card training, so the only lever is a smaller batch size, which lowers memory usage somewhat:
```python
batch_size = 8
```
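If the batch still does not fit, a common complement is gradient accumulation, which keeps the effective batch size while cutting per-step memory; a minimal sketch, not from the original training script, assuming model, optimizer, device, and train_dataloader from the code above:

```python
# Run several small batches per optimizer step: effective batch = 2 * 4 = 8,
# but per-step activation memory is that of batch_size = 2.
accumulation_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    input_ids, attention_mask, labels = (t.to(device) for t in batch)
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    (outputs.loss / accumulation_steps).backward()  # scale so gradients average correctly
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```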
With that it runs:
```
| NPU Chip | Process id | Process name | Process memory(MB) |
+==========+============+==============+=====================+
| 0    0   | 47784      | python       | 52657               |
```
Processes Stopping After the Remote Server Connection Drops
nohup references:
在Linux系统的ECS实例内,当断开SSH客户端后,如何保持进程继续运行的解决方案_云服务器 ECS(ECS)-阿里云帮助中心
Linux服务器SSH客户端断开后保持程序继续运行的方法_ssh退出后如何保持程序继续运行-CSDN博客
SSH 断开后使进程仍在后台运行 — Linux latest 文档
```bash
nohup ping www.baidu.com &
ls
cat nohup.out
ps -ef | grep ping
kill [$PID]
```
[$PID] is the PID printed earlier by the nohup command:
```
[1] 1255914
nohup: ignoring input and appending output to 'nohup.out'
```
To run the training program:
```bash
nohup python tkn_clsfy.py &
ps -ef | grep python
```
Output:
```
[1] 27308
root 27308 27285 99 13:19 pts/14 00:01:49 python tkn_clsfy.py
```