Contents
Extractive QA Task Evaluation
Extractive QA Evaluation Metrics
precision, recall, f1
ROUGE
Splitting Training and Evaluation Datasets
Token Position Evaluation
Per-Token Position Evaluation
Token Positions of Input Labels
Predicted Token Positions
Evaluation
Wandb
Concurrent Logins on a Shared Machine
Class Balance of Samples
Filtering Windows When Labeling Token Labels
Adjusting the Training Input JSON Format
Insufficient GPU Memory
Processes Stopping After the Remote Server Connection Drops
Extractive QA Task Evaluation
Extractive QA Evaluation Metrics
References:
Evaluating Question Answering Evaluation
Evaluating Question Answering Evaluation - ACL Anthology
Existing metrics (BLEU, ROUGE, METEOR, and F1) are computed using n-gram similarity.
how-to-evaluate-question-answering code:
How to Evaluate a Question Answering System | deepset
Evaluation of a QA System | Haystack
slides:
https://anthonywchen.github.io/Papers/evaluatingqa/mrqa_slides.pdf
QAEval:
https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00397/106792/Towards-Question-Answering-as-an-Automatic-Metric
Code:
https://github.com/CogComp/qaeval-experiments
precision: the proportion of the candidate that matches the reference, relative to the candidate's length.
recall: the proportion of the candidate that matches the reference, relative to the reference's length.
```
Reference: I work on machine learning.
Candidate A: I work.
Candidate B: He works on machine learning.
```
Precision: A > B; recall: B > A
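For intuition, a minimal sketch that computes these two quantities on the example above (whitespace tokenization with punctuation stripped; the helper name is illustrative):

```python
# Token-level precision and recall against a single reference.
def precision_recall(candidate, reference):
    cand = candidate.lower().replace(".", "").split()
    ref = reference.lower().replace(".", "").split()
    common = set(cand) & set(ref)
    return len(common) / len(cand), len(common) / len(ref)

reference = "I work on machine learning."
print(precision_recall("I work.", reference))                        # (1.0, 0.4): high precision, low recall
print(precision_recall("He works on machine learning.", reference))  # (0.6, 0.6)
```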
F1
```python
import evaluate

metric = evaluate.load("squad")
metric.compute(predictions=predicted_answers, references=theoretical_answers)
```
```
{'exact_match': 83.0, 'f1': 88.25}
```
ROUGE (Recall Oriented Understudy for Gisting Evaluation)
https://aclanthology.org/W04-1013/
Variants: ROUGE-N (ROUGE-1 and ROUGE-2 are the most common), ROUGE-L, ROUGE-W, and ROUGE-S (the last two are rarely used). The original paper focuses mainly on recall, but in practice precision, recall, and F values can all be used.
ROUGE-N: based on n-grams; e.g., ROUGE-1 computes recall over matching unigrams, and so on. ROUGE-L: based on the longest common subsequence (LCS).
BLEU
Precision is estimated with modified n-gram precision; recall is estimated via the best match length (a brevity penalty).
Modified n-gram precision:
n-gram precision is the proportion of the candidate's n-grams that match the reference.
```
Reference: I work on machine learning.
Candidate 1: He works on machine learning.
```
Precision = 60% (3/5)
Best match length: because n-gram precision alone would favor overly short candidates, BLEU multiplies in a brevity penalty based on how close the candidate length is to the best-matching reference length.
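A minimal sketch of both ingredients (unigram only; real BLEU combines n = 1..4 and is available in libraries such as sacrebleu — this is just illustrative):

```python
import math
from collections import Counter

# Modified n-gram precision: clip each candidate token count by its count
# in the reference, so repeated tokens cannot inflate the score.
def modified_unigram_precision(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    clipped = sum(min(count, ref_counts[w]) for w, count in Counter(cand).items())
    return clipped / len(cand)

# Brevity penalty: BLEU's recall surrogate, penalizing candidates shorter
# than the (best-matching) reference length.
def brevity_penalty(candidate, reference):
    c, r = len(candidate.split()), len(reference.split())
    return 1.0 if c > r else math.exp(1 - r / c)

cand, ref = "He works on machine learning.", "I work on machine learning."
print(modified_unigram_precision(cand, ref))  # 0.6, matching the 3/5 above
print(brevity_penalty(cand, ref))             # 1.0: equal lengths, no penalty
```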
precision, recall, f1
In the labels, labels_texts, each paper's dataset descriptions form a list of strings; in the model output, prediction_strings, each paper's dataset descriptions are concatenated into a single string.
Example data:
```python
labels_texts = [["description1 in paper1", "description2 in paper1"], ["description1 in paper2"]]
prediction_strings = ["description1 in paper1. description2 in paper1", "description1 in paper2"]
```
Use the F1 score to evaluate the model output:
1. Convert labels_texts into token-level labels.
2. Train the model and generate the prediction_strings.
3. Compare the predicted tokens against the reference tokens and compute the metrics from their intersection.
```python
# Evaluate the model output
def evaluate(predictions, references):
    y_true = []
    y_pred = []
    for ref, pred in zip(references, predictions):
        ref_tokens = tokenizer.tokenize(" ".join(ref))
        pred_tokens = tokenizer.tokenize(pred)
        common = set(ref_tokens) & set(pred_tokens)
        y_true.extend([1] * len(common) + [0] * (len(ref_tokens) - len(common)))
        y_pred.extend([1] * len(common) + [0] * (len(pred_tokens) - len(common)))

    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
    return precision, recall, f1

precision, recall, f1 = evaluate(prediction_strings, labels_texts)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")
```
The line `precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')` raises `ValueError: Found input variables with inconsistent numbers of samples: [40, 42]`. This happens because the generated text is longer than the reference, so y_true and y_pred end up with different lengths.
Common ways to handle this include:
- Truncating the generated text: cut the generated tokens to the reference length.
- Padding the reference: pad the reference tokens to the generated length.
- Overlap-only comparison: compare only the overlapping part and ignore the excess (a sketch follows this list).
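A minimal sketch of the third option, scoring directly from the token-set intersection so unequal lengths never reach precision_recall_fscore_support (the function name is illustrative):

```python
# Precision/recall/F1 from the token-set overlap; no aligned 0/1 vectors needed.
def overlap_prf(ref_tokens, pred_tokens):
    common = set(ref_tokens) & set(pred_tokens)
    precision = len(common) / len(pred_tokens) if pred_tokens else 0.0
    recall = len(common) / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```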
Truncated comparison
```python
# The two token lists have different lengths and would raise an error, so truncate before comparing
min_len = min(len(ref_tokens), len(pred_tokens))
ref_tokens = ref_tokens[:min_len]
pred_tokens = pred_tokens[:min_len]
```
Output:
```
Prediction Strings: ['impossible. In this paper, we aim to solve this problem by introducing FAR-Trans, the first public dataset for FAR, containing pricing in- formation and retail investor transactions acquired from a large European financial institution.']
Precision: 1.0000, Recall: 0.9750, F1 Score: 0.9873
```
ROUGE
(Recall Oriented Understudy for Gisting Evaluation)
ROUGE: A Package for Automatic Evaluation of Summaries - ACL Anthology
文本生成评估指标简单介绍BLEU+ROUGE+Perplexity+Meteor 代码实现_meteor指标-CSDN博客
Overview: ROUGE mainly evaluates machine translation, text summarization, and other NLP tasks by measuring how well the generated text matches the target text. It emphasizes recall, i.e., how completely the generated text covers the content and information of the reference, whereas BLEU leans more toward the precision of the generated text.
It has two main forms:
ROUGE-N (N = 1, 2, 3, …)
ROUGE-L
ROUGE-N is computed as:
$$\text{ROUGE-N} = \frac{\text{Candidate} \cap \text{Reference}}{len(\text{Reference})}$$
Unlike ROUGE-L's longest common subsequence, the intersection in the numerator does not consider token order. The intersection is computed over n-grams.
Reference:
https://zhuanlan.zhihu.com/p/647310970
n denotes a combination of n consecutive words; n can be 1, 2, 3, or higher.
- 1-gram (unigram): a single word. For example, the 1-grams of the sentence "我喜欢学习自然语言处理。" are: ["我", "喜欢", "学习", "自然语言处理", "。"]
- 2-gram (bigram): two consecutive words. For example, the 2-grams of the same sentence are: ["我喜欢", "喜欢学习", "学习自然语言处理", "自然语言处理。"]
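For intuition, a minimal from-scratch sketch of ROUGE-N recall, intersecting n-gram counts without regard to order (the sentence pair is reused from the rouge_scorer example further below):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# ROUGE-N recall: overlapping n-gram count divided by the reference n-gram count.
def rouge_n_recall(candidate_tokens, reference_tokens, n):
    cand = Counter(ngrams(candidate_tokens, n))
    ref = Counter(ngrams(reference_tokens, n))
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(sum(ref.values()), 1)

ref = "the quick brown fox jumps over the lazy dog".split()
cand = "the quick brown dog jumps on the log".split()
print(rouge_n_recall(cand, ref, 1))  # 6/9 ≈ 0.667
print(rouge_n_recall(cand, ref, 2))  # 2/8 = 0.25
```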
ROUGE-L
Based on the longest common subsequence (which is order-sensitive).
Sentence-level ROUGE-L:
$$\text{ROUGE-L} = \frac{\text{LCS}(\text{Candidate}, \text{Reference})}{len(\text{Reference})}$$
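A minimal sketch of sentence-level ROUGE-L recall via a dynamic-programming LCS, which, unlike the n-gram intersection above, is order-sensitive:

```python
# Length of the longest common subsequence of two token lists.
def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(candidate_tokens, reference_tokens):
    return lcs_len(candidate_tokens, reference_tokens) / max(len(reference_tokens), 1)
```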
Rouge library:
rouge · PyPI
https://www.cnblogs.com/bonelee/p/18152511
It turned out rouge-score was already installed in the environment:
rouge-score · PyPI
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')
print(scores["rouge1"])
print(scores["rouge2"])
print(scores["rougeL"])
```
Multiple pairs cannot be scored in a single call; each sample has to be compared separately. So how to handle multiple samples? One example found online concatenates all strings in the list into one string:
自然语言处理评估指标_自然语言处理结果-CSDN博客
Text summarization tutorial:
https://github.com/hellotransformers/Natural_Language_Processing_with_Transformers/blob/main/chapter6.md
Multi-sample example:
https://stackoverflow.com/questions/67390427/rouge-score-append-a-list
```python
# Concatenate the sentences of the same document into one string
# ================== rouge ====================
import numpy as np
from rouge_score import rouge_scorer  # the original nltk import does not exist; rouge_scorer comes from rouge_score

def gouge(evaluated_sentences, reference_sentences):
    """
    :param evaluated_sentences: list of generated summary sentences
    :param reference_sentences: list of reference summary sentences
    :return: geometric mean of the ROUGE-1..4 precisions
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3', 'rouge4'])
    # score(target, prediction) expects the reference first
    scores = scorer.score(' '.join(reference_sentences), ' '.join(evaluated_sentences))
    rouge_n_scores = [scores[f'rouge{i}'].precision for i in range(1, 5)]
    return np.exp(np.mean(np.log(rouge_n_scores)))
```
```python
# Compute the metric separately for each document
# importing the native rouge library
from rouge_score import rouge_scorer

# a list of the hypothesis documents
hyp = ['This is the first sample', 'This is another example']
# a list of the references documents
ref = ['This is the first sentence', 'It is one more sentence']

# make a RougeScorer object with rouge_types=['rouge1']
scorer = rouge_scorer.RougeScorer(['rouge1'])

# a dictionary that will contain the results
results = {'precision': [], 'recall': [], 'fmeasure': []}

# for each of the hypothesis and reference documents pair
for (h, r) in zip(hyp, ref):
    # computing the ROUGE
    score = scorer.score(h, r)
    # separating the measurements
    precision, recall, fmeasure = score['rouge1']
    # add them to the proper list in the dictionary
    results['precision'].append(precision)
    results['recall'].append(recall)
    results['fmeasure'].append(fmeasure)

print(results)
```
```
{'precision': [0.8, 0.2], 'recall': [0.8, 0.25], 'fmeasure': [0.8000000000000002, 0.22222222222222224]}
```
However, concatenating different samples into a single string before computing ROUGE does not make sense semantically, so each sample is scored separately and the scores are averaged over all samples.
Install rouge_score with pip, then compute the metrics:
```python
def rouge_evaluate(predictions, refs):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    for ref, pred in zip(refs, predictions):
        score = scorer.score(" ".join(ref), pred)
        for key in rouge_scores:
            rouge_scores[key].append(score[key].fmeasure)

    avg_rouge_scores = {key: sum(scores) / len(scores) for key, scores in rouge_scores.items()}

    # Write the metrics directly to a file
    with open('output/evaluation_rouge.txt', 'w') as eval_file:
        eval_file.write(f"ROUGE Scores:\n")
        for key, score in avg_rouge_scores.items():
            eval_file.write(f"{key}: {score:.4f}\n")
    return avg_rouge_scores

rouge_results = rouge_evaluate(dataset_descriptions, labels_texts)
```
Only the mean F-measure is computed here; other values can be added later if needed.
Splitting Training and Evaluation Datasets
```python
from sklearn.model_selection import train_test_split
```
After the token labels have been built:
```python
# Split the dataset into training and evaluation sets
train_size = 0.8
train_indices, val_indices = train_test_split(list(range(len(inputs["input_ids"]))),
                                              train_size=train_size, random_state=42)
train_inputs = {key: val[train_indices] for key, val in inputs.items()}
val_inputs = {key: val[val_indices] for key, val in inputs.items()}

train_dataset = TensorDataset(train_inputs["input_ids"], train_inputs["attention_mask"], train_inputs["labels"])
val_dataset = TensorDataset(val_inputs["input_ids"], val_inputs["attention_mask"], val_inputs["labels"])
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=2, shuffle=False)
```
```python
# During training
for batch in train_dataloader:
    …
avg_epoch_loss = epoch_loss / len(train_dataloader)
```
```python
# During evaluation: evaluate on the validation set
val_predictions, val_labels = evaluate(model, val_dataloader, tokenizer)
…
precision, recall, f1, rouge_scores = calculate_metrics(val_predictions, val_labels, tokenizer)
```
Check whether paper_idx is present in each batch, and make sure the keys used by the DataLoader align with the modified input keys: ["input_ids"], ["attention_mask"], ["labels"], ["paper_idx"].
```python
val_dataset = TensorDataset(val_inputs["input_ids"], val_inputs["attention_mask"],
                            val_inputs["labels"], val_inputs["paper_idx"])
```
One remaining problem: what if the split separates sliding windows that come from the same sample? To keep the sliding windows generated from one sample together, the splitting has to change: instead of splitting the dataset after tokenization, split the raw data first and then tokenize each subset separately. That way every sliding window of a paper stays in the same training or validation split.
Split before tokenizing
```python
# Split the original data into training and evaluation subsets
train_sentences, val_sentences, train_labels_texts, val_labels_texts, train_titles, val_titles = train_test_split(
    sentences, labels_texts, titles, train_size=0.8, random_state=42
)
```
Tokenize each subset separately and create the dataloaders:
```python
# Tokenize the training and validation data separately
train_inputs = tokenize_and_align_labels(train_sentences, train_labels_texts, train_titles, tokenizer)
val_inputs = tokenize_and_align_labels(val_sentences, val_labels_texts, val_titles, tokenizer)

model.to(device)
train_inputs = {key: val.to(device) for key, val in train_inputs.items()}
val_inputs = {key: val.to(device) for key, val in val_inputs.items()}

train_dataset = TensorDataset(train_inputs["input_ids"], train_inputs["attention_mask"], train_inputs["labels"])
val_dataset = TensorDataset(val_inputs["input_ids"], val_inputs["attention_mask"], val_inputs["labels"])
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=2, shuffle=False)  # keep windows of the same paper contiguous
```
The labels used at evaluation time change accordingly:
```python
precision, recall, f1 = token_evaluate(dataset_descriptions, val_labels_texts)
rouge_results = rouge_evaluate(dataset_descriptions, val_labels_texts)
```
When producing the output:
```python
dataset_descriptions = get_extracted_description(val_predictions, val_inputs["input_ids"], val_inputs["paper_idx"])
```
Token Position Evaluation
Common approaches:
- Accuracy: whether the predicted answer is exactly correct; suitable when there is a single gold answer.
- Precision, Recall, and F1: common when the model may predict several answers. Precision is the fraction of predicted answers that are correct, recall is the fraction of gold answers that are predicted, and F1 is their harmonic mean.
- EM (Exact Match): the fraction of predictions that match a reference answer exactly.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): overlap between the generated answer and the reference, common for generation tasks; ROUGE-1 and ROUGE-L measure lexical and sequence-level match respectively.
- BLEU (Bilingual Evaluation Understudy): n-gram overlap between the generated answer and the reference; mainly a machine-translation metric, but usable as an auxiliary metric for QA.
Using output token positions as the evaluation signal is appropriate, especially in the following cases:
- Start/end position accuracy: in extractive QA the model typically predicts the start and end positions of the answer in the text, so these can directly measure whether the answer was located correctly.
- Overlap rate: the overlap between the predicted span and the gold span can be measured with Intersection over Union (IoU) or similar overlap metrics; a minimal IoU sketch follows this list.
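A minimal sketch of span IoU over token positions; the function name and arguments are illustrative, not from the training code:

```python
# Span IoU between a predicted (start, end) span and a gold span,
# measured over inclusive token positions.
def span_iou(pred_start, pred_end, gold_start, gold_end):
    inter = max(0, min(pred_end, gold_end) - max(pred_start, gold_start) + 1)
    union = (pred_end - pred_start + 1) + (gold_end - gold_start + 1) - inter
    return inter / union if union > 0 else 0.0

print(span_iou(10, 20, 12, 25))  # overlap covers positions 12..20 -> 9 / 16 = 0.5625
```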
Per-Token Position Evaluation
Here the model output is not a single contiguous span, so the position of every labeled token is compared against the predicted token positions.
Token Positions of Input Labels
1. While building token labels from label_text, also save the token positions (the index within the sliding window of every successfully matched token) in tokenized_inputs. A key named token_positions is added to hold the in-window index of each matched token.
```python
tokenized_inputs = {"input_ids": [], "attention_mask": [], "labels": [], "paper_idx": [], "token_positions": []}
```
The token_positions list only contains the indices whose token_label value is 1. Each sliding window of a paper gets its own array; some labels fall inside a given window and some do not, so some windows end up with an empty token_positions array.
```python
for j in range(len(input_ids)):  # each sliding window of the same paper
    tokenized_sentence = tokenizer.convert_ids_to_tokens(input_ids[j])
    token_label = [0] * len(tokenized_sentence)
    main_body_start = find_main_body(tokenized_sentence)
    token_positions = []
    for label_text in labels_text:
        tokenized_label = tokenizer.tokenize(label_text)
        tokenized_label = [token for token in tokenized_label if token != '<pad>']  # remove pad tokens
        label_length = len(tokenized_label)
        # Handle labels that cross sliding-window boundaries
        for k in range(main_body_start, len(tokenized_sentence) - 1):  # match from the main body onward
            end_position = min(len(tokenized_sentence) - 1, k + label_length)
            # Tail of the label may be cut off by the window end, so compare
            # only the first (end_position - k) label tokens
            if tokenized_sentence[k:end_position] == tokenized_label[0:end_position - k]:
                print("matched tokenized_label:", tokenized_sentence[k:end_position])
                token_label[k:end_position] = [1] * (end_position - k)
                for pos in range(k, end_position):
                    token_positions.append(pos)
        for label_start in range(label_length - 1):  # head of the label cut off by the window start
            match_end = main_body_start + label_length - label_start
            if tokenized_sentence[main_body_start:match_end] == tokenized_label[label_start:]:
                print("matched tokenized_label:", tokenized_sentence[main_body_start:match_end])
                token_label[main_body_start:match_end] = [1] * (label_length - label_start)
                for pos in range(main_body_start, match_end):
                    token_positions.append(pos)
```
tokenized_inputs["token_positions"].append(token_positions) |
Because the arrays in tokenized_inputs["token_positions"] have unequal lengths, they cannot be stacked into a tensor with torch.stack() or moved to the GPU together with the other keys, so only the other keys are moved to the GPU, after batching.
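A minimal sketch of that device transfer, assuming train_inputs is the dict built above: only the stackable tensor keys move to the device, while the ragged token_positions lists stay on the CPU:

```python
# token_positions cannot be stacked into a tensor, so leave it as Python lists.
tensor_keys = {"input_ids", "attention_mask", "labels", "paper_idx"}
train_inputs = {
    key: (val.to(device) if key in tensor_keys else val)
    for key, val in train_inputs.items()
}
```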
Predicted Token Positions
2. When converting the token classifications back into sentences, also output the predicted token positions: in get_prediction_string, record the positions of tokens whose predicted_tokens_classes is 'Dataset description' and return them as well. A new pre_token_positions list stores these positions, and the function finally returns dataset_description_string and pre_token_positions.
```python
pre_token_positions = []

for idx, (token, pred_class) in enumerate(zip(tokenized_sub_sentence, predicted_tokens_classes)):
    is_descrp = (pred_class == 'Dataset description')
    if is_descrp:
        pre_token_positions.append(idx)

return dataset_description_string, pre_token_positions
```
```python
dataset_description, pre_token_positions = get_prediction_string(prediction_class, predicted_input_id, is_same_paper)
```
Token positions from different windows are evaluated separately, to avoid mixing up position indices across windows. The positions do not need to be concatenated: unlike the extracted text, they carry no sentence meaning, and keeping each window as its own entry in one outer list prevents positions from different windows from colliding. Each pre_token_positions list is stored as one element of papers_pre_token_positions:
```python
papers_pre_token_positions = []

dataset_description, pre_token_positions = get_prediction_string(prediction_class, predicted_input_id, is_same_paper)
papers_pre_token_positions.append(pre_token_positions)

return dataset_descriptions, papers_pre_token_positions
```
```python
dataset_descriptions, papers_pre_token_positions = get_extracted_description(val_predictions, val_inputs["input_ids"], val_inputs["paper_idx"])
```
Evaluation
The number of labeled and predicted tokens may differ at evaluation time. Here the predictions are papers_pre_token_positions and the labels are val_inputs["token_positions"].
Precision: P = TP / (TP + FP), i.e., the fraction of data predicted as positive that is actually positive.
Recall: R = TP / (TP + FN), i.e., the fraction of actual positives that the model predicts as positive.
F1 = (2 * P * R) / (P + R)
true_positive counts correctly predicted token positions, false_positive counts wrongly predicted positions, and false_negative counts gold positions the model missed. Iterating over the predicted and gold positions of every sliding window accumulates these counts and yields precision, recall, and F1:
```python
def token_evaluate(pre_token_positions, label_token_positions):
    true_positive = 0
    false_positive = 0
    false_negative = 0

    for pred_positions, label_positions in zip(pre_token_positions, label_token_positions):
        pred_positions_set = set(pred_positions)
        label_positions_set = set(label_positions)

        true_positive += len(pred_positions_set & label_positions_set)
        false_positive += len(pred_positions_set - label_positions_set)
        false_negative += len(label_positions_set - pred_positions_set)

    precision = true_positive / (true_positive + false_positive) if (true_positive + false_positive) > 0 else 0
    recall = true_positive / (true_positive + false_negative) if (true_positive + false_negative) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1
```
```python
# Example
pre_token_positions = [[], [2], []]
label_token_positions = [[1], [2], [3]]
precision, recall, f1 = token_evaluate(pre_token_positions, label_token_positions)
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")
```
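Walking through the example: only window 2's predicted position matches a gold position (TP = 1), windows 1 and 3 each miss a gold position (FN = 2), and nothing spurious is predicted (FP = 0), so the expected output is:

```
Precision: 1.0
Recall: 0.3333333333333333
F1-score: 0.5
```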
Wandb
Wandb is a tool that automatically logs model-training runs; once configured, the plots and per-curve values of every run can be viewed conveniently on the wandb.ai website.
Logging training runs with Wandb
Install wandb:
```bash
pip install wandb
```
Log in from the command line:
```bash
wandb login
```
Enter the API key generated at registration; login succeeds.
Run the test example:
```python
import wandb
import random

# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    project="my-awesome-project",

    # track hyperparameters and run metadata
    config={
        "learning_rate": 0.02,
        "architecture": "CNN",
        "dataset": "CIFAR-100",
        "epochs": 10,
    }
)

# simulate training
epochs = 10
offset = random.random() / 5
for epoch in range(2, epochs):
    acc = 1 - 2 ** -epoch - random.random() / epoch - offset
    loss = 2 ** -epoch + random.random() / epoch + offset

    # log metrics to wandb
    wandb.log({"acc": acc, "loss": loss})

# [optional] finish the wandb run, necessary in notebooks
wandb.finish()
```
The following messages appear:
```
wandb: Run data is saved locally in D:\Projects\longformer\wandb\run-20240805_204153-4hznct2i
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run hearty-eon-1
wandb: View project at https://wandb.ai/lalagoon-north-china-electric-power-university/test-project
wandb: View run at https://wandb.ai/lalagoon-north-china-electric-power-university/test-project/runs/4hznct2i
```
Opening the URL shows the generated plots, so the setup works.
Add this to the project:
```python
import wandb

# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    project="extract-dataset-description-project",

    # track hyperparameters and run metadata
    config={
        "learning_rate": 5e-5,
        "architecture": "Longformer",
        "dataset": "Description-500",
        "epochs": 100,
    }
)
```
Log metrics for each epoch during the run:
```python
# Log the loss to wandb
wandb.log({"loss": avg_epoch_loss})
```
Close the run when done:
```python
wandb.finish()
```
The loss plots are at:
https://wandb.ai/lalagoon-north-china-electric-power-university/extract-dataset-description-project
Concurrent Logins on a Shared Machine
If another wandb account is already logged in on the server, set the API key in the shell instead of running wandb login:
```bash
export WANDB_API_KEY='xxxx'
```
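The same thing can be done from inside Python, using wandb's documented WANDB_API_KEY environment variable; a minimal sketch (set it before wandb.init):

```python
import os

# Use this run's own key instead of the account cached on the shared machine
os.environ["WANDB_API_KEY"] = "xxxx"  # replace with your actual key
```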
Class Balance of Samples
Filtering Windows When Labeling Token Labels
If a window contains no positive examples, it is not fed into training. When labeling, a flag is passed in to indicate whether the training set is being labeled; for the training set, token_positions is checked, and if it is empty the sliding window is not added to tokenized_inputs. If filtering is enabled and no label falls inside the window, skip the rest of this window's iteration so its values never reach the inputs:
```python
if filter_empty_window and len(token_positions) == 0:
    continue
```
Enable it for the training set only:
```python
train_inputs = tokenize_and_align_labels(train_papers, tokenizer, filter_empty_window=True)
```
Adjusting the Training Input JSON Format
The input format changes from separate string arrays to one JSON array, so the different fields of the same paper sit together and are easier to compare.
Read from the JSON array:
```python
descri_file_path = 'input/papers_and_datasets.json'

def read_json(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

papers_info = read_json(descri_file_path)
```
All subsequent loops then read from papers_info, replacing the old per-field reads:
```python
sentences = data['paper_texts']
labels_texts = data['dataset_descriptions']
titles = data['titles']
```
The train/test split now splits papers_info:
```python
train_papers, val_papers = train_test_split(
    papers_info, train_size=0.8, random_state=42
)
```
tokenize_and_align_labels takes the paper objects when matching token labels:
```python
train_inputs = tokenize_and_align_labels(train_papers, tokenizer)
val_inputs = tokenize_and_align_labels(val_papers, tokenizer)
```
```python
def tokenize_and_align_labels(papers, tokenizer, max_length=4096, stride=256):
```
Inside the tokenizer, loop over the papers and pull out each field:
```python
for i, paper in enumerate(papers_info):
    sentence = paper.get("paper_text")
    labels_text = paper.get("dataset_descriptions")
    title = paper.get("title")
```
The evaluation loops pull the fields out in the same way.
1. Token evaluation:
```python
def token_evaluate(predictions, val_papers, tokenizer):
```
```python
for paper, pred in zip(val_papers, predictions):  # ref holds one paper's labels; after y_true.extend it covers all samples
    ref = paper.get("dataset_descriptions")
```
```python
precision, recall, f1 = token_evaluate(dataset_descriptions, val_papers, tokenizer)
```
2. ROUGE:
```python
def rouge_evaluate(predictions, val_papers):
```
```python
for paper, pred in zip(val_papers, predictions):
    ref = paper.get("dataset_descriptions")
```
```python
rouge_results = rouge_evaluate(dataset_descriptions, val_papers)
```
Also, to make path changes across machines easier, all paths in use are collected at the top of the file:
```python
descri_file_path = 'input/papers_and_datasets.json'
loss_log_path = 'output/training_loss.txt'
loss_fig_path = 'output/training_loss.png'
model_save_path = "output/trained_model"
eval_token_path = 'output/evaluation_token.txt'
eval_rouge_path = 'output/evaluation_rouge.txt'
```
Insufficient GPU Memory
When training in batches on Colab, the inputs are moved to the GPU:
```python
input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
```
This raises:
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB. GPU
```
After commenting out the GPU transfer, RAM ran out instead, and switching to another account gave the same result; the plan is to polish the code and try it on the Huawei (NPU) machine. With heavier runs, the Huawei card also reports out-of-memory:
```
RuntimeError: NPU out of memory. Tried to allocate 578.00 MiB (NPU 0; 60.97 GiB total capacity; 8.23 GiB already allocated; 8.23 GiB current active; 362.04 MiB free; 8.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
```
Clear the cache and reduce the batch size; see:
python - How to avoid "CUDA out of memory" in PyTorch - Stack Overflow
```python
import torch
torch.cuda.empty_cache()
```
```python
torch_npu.npu.empty_cache()
```
Set max_split_size_mb.
Environment-variable method:
```bash
export 'PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:512'
```
In-code method:
```python
# Note: these _debug_* hooks are private APIs and may differ across versions
torch._C._debug_set_max_split_size_mb(512)
torch_npu._C._debug_set_max_split_size_mb(512)
```
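An alternative sketch is to set the same environment variable shown above from Python instead; the assumption is that the allocator reads it at initialization (as PYTORCH_CUDA_ALLOC_CONF does), so it must be set before torch / torch_npu are imported:

```python
import os

# Mirror the shell export above; must run before torch / torch_npu are imported
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "max_split_size_mb:512"

import torch
import torch_npu
```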
Check the remaining NPU resources on the Huawei card:
```bash
npu-smi info
```
Someone else's job is visible:
```
+---------------------------+---------------+----------------------------------+
| NPU Chip                  | Process id    | Process name | Process memory(MB)|
+===========================+===============+==================================+
| 0    0                    | 1349884       | python       | 30216             |
+===========================+===============+==================================+
```
Pin the Python program to specific cards; see:
https://www.cnblogs.com/tyty-Somnuspoppy/p/10071716.html
os.environ["NPU_VISIBLE_DEVICES"] = "1,2,3" |
device = torch.device("npu:1" if torch.npu.is_available() else "npu:2") |
Check whether each device has enough free memory:
```python
def select_available_npu(required_memory_mb):
    if torch_npu.npu.is_available():
        for i in range(1, 8):
            props = torch_npu.npu.get_device_properties(f"npu:{i}")
            # get_device_properties has no reserved_memory field; use the
            # allocator's reserved byte count instead (torch_npu mirrors
            # torch.cuda.memory_reserved — verify against your torch_npu version)
            reserved = torch_npu.npu.memory_reserved(i)
            if props.total_memory - reserved >= required_memory_mb * 1024 * 1024:
                return torch.device(f"npu:{i}")
    return torch.device("cpu")
```
Check total memory:
```python
torch_npu.npu.get_device_properties("npu:1").total_memory
```
Pick a specific NPU; note the query must go through torch_npu.npu, otherwise nothing is returned:
```python
device = torch.device("npu:1" if torch_npu.npu.is_available() else "npu:2")
```
Longformer's input sequences are long, and 4096 tokens per window is itself large, unlike the recent Infini-Transformer style that first splits the input and processes it in chunks. This Longformer setup does not support multi-card training, so the only lever is a smaller batch size, which lowers memory usage somewhat:
```python
batch_size = 8
```
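If the batch still does not fit, a common complement is gradient accumulation, which keeps the effective batch size while cutting per-step memory; a minimal sketch, not from the original training script, assuming model, optimizer, device, and train_dataloader from the code above:

```python
# Run several small batches per optimizer step: effective batch = 2 * 4 = 8,
# but per-step activation memory is that of batch_size = 2.
accumulation_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    input_ids, attention_mask, labels = (t.to(device) for t in batch)
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    (outputs.loss / accumulation_steps).backward()  # scale so gradients average correctly
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```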
With that it runs:
```
| NPU Chip | Process id | Process name | Process memory(MB) |
+==========+============+==============+=====================+
| 0    0   | 47784      | python       | 52657               |
```
Processes Stopping After the Remote Server Connection Drops
nohup references:
在Linux系统的ECS实例内,当断开SSH客户端后,如何保持进程继续运行的解决方案_云服务器 ECS(ECS)-阿里云帮助中心
Linux服务器SSH客户端断开后保持程序继续运行的方法_ssh退出后如何保持程序继续运行-CSDN博客
SSH 断开后使进程仍在后台运行 — Linux latest 文档
```bash
nohup ping www.baidu.com &
ls
cat nohup.out
ps -ef | grep ping
kill [$PID]
```
[$PID] is the PID printed earlier by the nohup command:
```
[1] 1255914
nohup: ignoring input and appending output to 'nohup.out'
```
To run the training program:
```bash
nohup python tkn_clsfy.py &
ps -ef | grep python
```
Output:
```
[1] 27308
root 27308 27285 99 13:19 pts/14 00:01:49 python tkn_clsfy.py
```