微调大模型(Finetuning Large Language Models)—Where finetuning fits in(二)

news2024/9/26 21:49:50

1.什么时候适合用finetune

在这里插入图片描述
微调(finetuning)对人的作用包括行为改变和知识获取。行为改变方面,包括学习更一致地回应、学会专注(如适度)以及发挥能力(如更擅长对话);知识获取方面,包括增加对新特定概念的了解、纠正旧的不正确信息。总的来说,微调既能带来行为改变,也能实现知识获取。

2.准备数据集,为微调做准备

2.1准备相应的环境

import jsonlines
import itertools
import pandas as pd
from pprint import pprint

import datasets
from datasets import load_dataset

2.2读取预训练数据

# 读取c4数据集,具体数据集介绍地址为:https://huggingface.co/datasets/legacy-datasets/c4
# load_dataset设置为True表示在线流式读取数据,不至于一次性读取过大
pretrained_dataset = load_dataset("c4", "en", split="train", streaming=True)

n = 5
print("Pretrained dataset:")
top_n = itertools.islice(pretrained_dataset, n)
for i in top_n:
  print(i)

输出如下:

Pretrained dataset:
{'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.', 'timestamp': '2019-04-25T12:57:54Z', 'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/'}
{'text': 'Discussion in \'Mac OS X Lion (10.7)\' started by axboi87, Jan 20, 2012.\nI\'ve got a 500gb internal drive and a 240gb SSD.\nWhen trying to restore using disk utility i\'m given the error "Not enough space on disk ____ to restore"\nBut I shouldn\'t have to do that!!!\nAny ideas or workarounds before resorting to the above?\nUse Carbon Copy Cloner to copy one drive to the other. I\'ve done this several times going from larger HDD to smaller SSD and I wound up with a bootable SSD drive. One step you have to remember not to skip is to use Disk Utility to partition the SSD as GUID partition scheme HFS+ before doing the clone. If it came Apple Partition Scheme, even if you let CCC do the clone, the resulting drive won\'t be bootable. CCC usually works in "file mode" and it can easily copy a larger drive (that\'s mostly empty) onto a smaller drive. If you tell CCC to clone a drive you did NOT boot from, it can work in block copy mode where the destination drive must be the same size or larger than the drive you are cloning from (if I recall).\nI\'ve actually done this somehow on Disk Utility several times (booting from a different drive (or even the dvd) so not running disk utility from the drive your cloning) and had it work just fine from larger to smaller bootable clone. Definitely format the drive cloning to first, as bootable Apple etc..\nThanks for pointing this out. My only experience using DU to go larger to smaller was when I was trying to make a Lion install stick and I was unable to restore InstallESD.dmg to a 4 GB USB stick but of course the reason that wouldn\'t fit is there was slightly more than 4 GB of data.', 'timestamp': '2019-04-21T10:07:13Z', 'url': 'https://forums.macrumors.com/threads/restore-from-larger-disk-to-smaller-disk.1311329/'}
{'text': 'Foil plaid lycra and spandex shortall with metallic slinky insets. Attached metallic elastic belt with O-ring. Headband included. Great hip hop or jazz dance costume. Made in the USA.', 'timestamp': '2019-04-25T10:40:23Z', 'url': 'https://awishcometrue.com/Catalogs/Clearance/Tweens/V1960-Find-A-Way'}
{'text': "How many backlinks per day for new site?\nDiscussion in 'Black Hat SEO' started by Omoplata, Dec 3, 2010.\n1) for a newly created site, what's the max # backlinks per day I should do to be safe?\n2) how long do I have to let my site age before I can start making more blinks?\nI did about 6000 forum profiles every 24 hours for 10 days for one of my sites which had a brand new domain.\nThere is three backlinks for every of these forum profile so thats 18 000 backlinks every 24 hours and nothing happened in terms of being penalized or sandboxed. This is now maybe 3 months ago and the site is ranking on first page for a lot of my targeted keywords.\nbuild more you can in starting but do manual submission and not spammy type means manual + relevant to the post.. then after 1 month you can make a big blast..\nWow, dude, you built 18k backlinks a day on a brand new site? How quickly did you rank up? What kind of competition/searches did those keywords have?", 'timestamp': '2019-04-21T12:46:19Z', 'url': 'https://www.blackhatworld.com/seo/how-many-backlinks-per-day-for-new-site.258615/'}
{'text': 'The Denver Board of Education opened the 2017-18 school year with an update on projects that include new construction, upgrades, heat mitigation and quality learning environments.\nWe are excited that Denver students will be the beneficiaries of a four year, $572 million General Obligation Bond. Since the passage of the bond, our construction team has worked to schedule the projects over the four-year term of the bond.\nDenver voters on Tuesday approved bond and mill funding measures for students in Denver Public Schools, agreeing to invest $572 million in bond funding to build and improve schools and $56.6 million in operating dollars to support proven initiatives, such as early literacy.\nDenver voters say yes to bond and mill levy funding support for DPS students and schools. Click to learn more about the details of the voter-approved bond measure.\nDenver voters on Nov. 8 approved bond and mill funding measures for DPS students and schools. Learn more about what’s included in the mill levy measure.', 'timestamp': '2019-04-20T14:33:21Z', 'url': 'http://bond.dpsk12.org/category/news/'}

2.3本地数据如下

filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df

输出如下:
在这里插入图片描述

2.4不同方式格式化数据

examples = instruction_dataset_df.to_dict()
text = examples["question"][0] + examples["answer"][0]
text

输出如下:

"What are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?Lamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/."

又或者以判断方式格式化数据

if "question" in examples and "answer" in examples:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
  text = examples["input"][0] + examples["output"][0]
else:
  text = examples["text"][0]

2.5格式化模板

prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
text_with_prompt_template

输出如下:

"### Question:\nWhat are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?\n\n### Answer:\nLamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/."
prompt_template_q = """### Question:
{question}

### Answer:"""

num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]

  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
  finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

  text_with_prompt_template_q = prompt_template_q.format(question=question)
  finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})

构造的纯文本格式数据结果如下:

pprint(finetuning_dataset_text_only[0])

输出如下:

{'text': '### Question:\n'
         'What are the different types of documents available in the '
         "repository (e.g., installation guide, API documentation, developer's "
         'guide)?\n'
         '\n'
         '### Answer:\n'
         'Lamini has documentation on Getting Started, Authentication, '
         'Question Answer Model, Python Library, Batching, Error Handling, '
         'Advanced topics, and class documentation on LLM Engine available at '
         'https://lamini-ai.github.io/.'}

构造的QA格式数据集如下:

pprint(finetuning_dataset_question_answer[0])

输出如下:

{'answer': 'Lamini has documentation on Getting Started, Authentication, '
           'Question Answer Model, Python Library, Batching, Error Handling, '
           'Advanced topics, and class documentation on LLM Engine available '
           'at https://lamini-ai.github.io/.',
 'question': '### Question:\n'
             'What are the different types of documents available in the '
             'repository (e.g., installation guide, API documentation, '
             "developer's guide)?\n"
             '\n'
             '### Answer:'}

3.其他方式初始化读取数据

with jsonlines.open(f'lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)

输出如下:

{'answer': 'Lamini has documentation on Getting Started, Authentication, '
           'Question Answer Model, Python Library, Batching, Error Handling, '
           'Advanced topics, and class documentation on LLM Engine available '
           'at https://lamini-ai.github.io/.',
 'question': '### Question:\n'
             'What are the different types of documents available in the '
             'repository (e.g., installation guide, API documentation, '
             "developer's guide)?\n"
             '\n'
             '### Answer:'}

又或者

finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)

输出如下:

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})

4.总结

数据的准备是微调的基础,良好的数据质量是成功的一半,数据准备前置工作举足轻重。

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/2168080.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

三、人物骨骼介绍

一、迁移骨骼包 注意:要迁移所有骨骼压缩包(单独可能会有问题) 二、认识骨骼 点击骨骼 进入骨骼页面

尚庭公寓-接口定义

5. 接口定义 5.1 后台管理系统接口定义 5.1.1 公寓信息管理 5.1.1.1 属性管理 属性管理页面包含公寓和房间各种可选的属性信息,其中包括房间的可选支付方式、房间的可选租期、房间的配套、公寓的配套等等。其所需接口如下 房间支付方式管理 页面如下 所需接口如…

【Linux】Linux 的 权限

一、 Linux 权限的概念 Linux下有两种用户:超级用户(root)、普通用户。 超级用户:可以在 linux 系统下做任何事情,不受限制普通用户:在 linux 下做有限的事情。超级用户的命令提示符是“#”,普…

每日一题|2535. 数组元素和与数字和的绝对差|数位运算

简单题。先加后减,可以剪枝。 先加后减就是对于每一个数字之间完成该数字的值-数位和,然后再去下一个数字。 特别的,对于小于10的数字,减自身就是0,没必要计算,可以跳过。 class Solution(object):def d…

C++之美:代码整洁、安全又跑得快的30个要诀(好书推荐)

在编程领域,C 以其高效性和灵活性著称,但同时也因其复杂性和易出错性而闻名。如何写出既整洁、又安全且高效的 C 代码,是每个 C 开发者都需要思考的问题。《C之美:代码整洁、安全又跑得快的30个要诀》这本书为我们提供了宝贵的指导…

git clone代码报错Permission denied (publickey)

git clone gerrit SSH的Clone with commit-msg hook代码连接,报错Permission denied (publickey). 一般在C:\Users\用户名.ssh文件夹下有一个id_rsa.pub文件 把文件里的内容复制 到gerrit网站上User Settings的SSH keys里 在New SSH key里粘贴刚刚复制的内容&…

【递归】6.LPC 44 开幕式火焰

1 题目描述 题目链接:开幕式火焰 2 解答思路 递归分为三步,接下来就按照这三步来思考问题 第一步:挖掘出相同的子问题 (关系到具体函数头的设计) 第二步:只关心具体子问题做了什么 (关系…

那些年我和 ChatGPT 合谋摸鱼的日子

文章目录 那些年我和 ChatGPT 合谋摸鱼的日子1 序个言2 说正事3 这次是真正的正事4 总个结 那些年我和 ChatGPT 合谋摸鱼的日子 1 序个言 看到 CSDN 出这个活动有段时间了,奈何俗务缠身,一直没静下心来想想怎么写。今天碰巧赶上了,就顺便聊聊…

【Linux扩容根分区】LVM分区扩容过程踩坑记录

最近想要给自己使用的Linux操作系统的根分区进行扩容,解决完发现,原来问题如此简单。 特此记录,希望能帮助到有需要的人。 通过df -Th查看系统磁盘分区情况 通过vgdisplay 查看内容 实操过程中,原来红框中,Free PE …

2024年双十一买啥最划算?双十一好物推荐闭眼入!

一年一度的双十一购物狂欢节已悄然临近,这不仅是一场消费者的盛宴,更是各大品牌竞相展示实力、推出优惠的绝佳时机。在这个全民狂欢的日子里,数码产品作为科技与生活的桥梁,相信已经有不少朋友想要大买特买了。无论是追求极致性能…

《python语言程序设计》2018版第8章18题几何circle2D类(下部)

前言、从9.20激动发言到现在一直没有克制住的心情中,回到编程 比如删掉我设计的导入第二个园的x,y,radius的函数我做了之前设计的变化.建立了两个可以将x,y拿出来的函数out计算两个坐标之间的距离利用已知的两个坐标之间的距离来比对第1个园里的半径,看第2个园的坐标是否在第一…

Linux文本内容管理命令_2

find:-查找命令执行文件 which 命令 whereis 命令 type 命令----查看命令类型 alias (命令别名) cat 查看文件--更新文件时间,再次cat,时间不会改变 touch--会更新所有属性的时间,文件诞生时间不会改变 …

求n的阶乘的相反数(c语言)

1./请编写函数fun,其功能是:计算并输出下列多项式的值: // s11/1!1/2!1/3!1/4!1/5!1/6!1/7!...1/n! //例如,在主函数中从键盘给n输入15,则输出为:s 2.718282。 //注意:要求n的值大于1但不大于100。 2.我们先输入数字n,然后先讲n!的阶乘计算…

NMOS的原理

NMOS(N型金属氧化物半导体场效应晶体管)是常见的场效应晶体管(FET)的一种,其主要电极包括D极(Drain)、S极(Source)和G极(Gate),每个电…

JavaSE——lombok、juint单元测试、断言

一、lombok的使用 默认jvm不解析第三方注解,需要手动开启 链式调用 二、juint单元测试 下载juint包 public class TestDemo {// 在每一个单元测试方法执行之前执行Beforepublic void before() {// 例如可以在before部分创建IO流System.out.println("befor…

【数据结构】栈和队列(Stack Queue)

引言 在对顺序表,链表有了充分的理解之后,现在让我们学习栈和队列!!! 【链表】 👈链表 【顺序表】👈顺序表 目录 💯栈 1.栈的概念及结构 2.栈的实现 ⭐初始化栈 ⭐入栈 ⭐…

【C++】入门基础知识-1

🍬个人主页:Yanni.— 🌈数据结构:Data Structure.​​​​​​ 🎂C语言笔记:C Language Notes 🏀OJ题分享: Topic Sharing 目录 前言: C关键字 命名空间 命名空间介…

【论文翻译】AFLGuard: Byzantine-robust Asynchronous Federated Learning

提示:该论文标题为AFLGuard: Byzantine-robust Asynchronous Federated Learning,我将对其进行部分翻译,便于后续阅读。 文章目录 AFLGuard:拜占庭鲁棒的异步联邦学习一、摘要二、引言三、知识前提拜占庭鲁棒联邦学习 四、问题表述…

JVM(HotSpot):程序计数器(Program Counter Register)

文章目录 一、内存结构图二、案例解读三、工作流程四、特点 一、内存结构图 二、案例解读 我们使用javap对字节码进行反编译,来看下程序计数器怎么体现的。 IDEA写一个简单的Java代码 反编译命令 javap -verbose InitTest.class $ javap -verbose InitTest.clas…

解决Typora图片复制到CSDN无法查看问题

下载安装picgo 山东大学镜像源:https://mirrors.sdu.edu.cn/github-release/Molunerfinn_PicGo 开通阿里云对象存储oss 选择创建 填入内容 购买资源包 创建AccessKey 配置PicGo 设定bucket填入创建bucket名称 注意:设定存储区域只需要填写到区域前缀即…