LLM - Fine-Tuning Llama3 and Merging the Fine-Tuned Model: A Tutorial

Follow me on CSDN: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/141218047

Llama3

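When fine-tuning the Llama3 model, keep the following points in mind:

Choose a suitable pre-trained model: different pre-trained models have different strengths and areas of applicability, so pick the one that best fits the task.
Choose a suitable fine-tuning method: common methods include full fine-tuning, parameter-efficient fine-tuning (such as LoRA and QLoRA), adapter tuning, and prefix tuning. The best choice depends on the task and the dataset.
Choose the dataset carefully: samples from the target domain should be sufficiently representative, and the dataset should be high-quality and diverse so that the model neither overfits nor underfits.
Prevent overfitting: fine-tuning typically uses relatively few samples, so overfitting needs particular attention. Regularization and dropout can help.
Manage compute resources: fine-tuning a large model requires substantial compute, especially for models with many parameters. Allocate and manage resources so that training runs smoothly.
Tune hyperparameters: the learning rate, batch size, and other hyperparameters need tuning to find a good configuration, for example with grid search or Bayesian optimization.
Monitor and evaluate: track the model's performance throughout fine-tuning and evaluate on a validation set to confirm that performance on the target task keeps improving.

Keeping these points in mind helps you fine-tune a large model more effectively and improve its performance on a specific task.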

1. Prepare the dataset and base model

Prepare the HuggingFace dataset: https://huggingface.co/datasets/ruslanmv/ai-medical-chatbot

Install the HuggingFace download tool; using a mirror speeds up downloads noticeably:

export HF_ENDPOINT="https://hf-mirror.com"
pip install -U huggingface_hub hf-transfer

Taking ruslanmv/ai-medical-chatbot as an example, download the dataset with:

huggingface-cli download --token [your token] ruslanmv/ai-medical-chatbot --local-dir ai-medical-chatbot --repo-type dataset

When downloading a HuggingFace dataset (rather than a model), pass --repo-type dataset.

Reference: HuggingFace - Command Line Interface

The downloaded data is in Parquet format:

ai-medical-chatbot/
├── README.md
├── dialogues.parquet
└── future.jpg

Parquet is a columnar storage file format designed for storing and processing large datasets efficiently.

The base model is meta-llama/Meta-Llama-3.1-8B-Instruct.

Install the Python libraries:

pip install -U transformers datasets accelerate peft trl bitsandbytes wandb

transformers: NLP tasks
datasets: accessing and processing datasets
accelerate: optimizing model training
peft: parameter-efficient fine-tuning
trl: supervised fine-tuning (SFT) and RL-based alignment
bitsandbytes: 4-bit/8-bit quantization and memory-efficient optimizers
wandb: experiment tracking and logging

The library versions used are:

transformers==4.44.0
datasets==2.20.0
accelerate==0.33.0
peft==0.12.0
trl==0.9.6
bitsandbytes==0.43.3
wandb==0.17.6

Verify that PyTorch is available:

python

import torch
print(torch.__version__)  # 1.13.1
print(torch.cuda.is_available())  # True
exit()

2. Fine-tune the model

Prepare a WandB token and export it:

export WANDB_TOKEN=[your token]

Import the required packages:

# transformers==4.44.0
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
# peft==0.12.0
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
# trl==0.9.6
from trl import SFTTrainer, setup_chat_format

Initialize WandB to record the training run:

wb_token = os.environ["WANDB_TOKEN"]  # token exported in the shell above
wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune Llama 3 8B on Medical Dataset', 
    job_type="training", 
    anonymous="allow"
)

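Where:

project: the name of the project the new run is sent to. If no project is specified, W&B tries to infer the name from the git root or the current program file.
job_type: a label describing the kind of run. Here it is set to "training"; other typical values are evaluation, preprocessing, and inference.
anonymous: controls anonymous logging; "allow" permits anonymous runs.

Anonymous mode suits cases where you want to share code and results quickly without requiring users to create an account. However, links to anonymous runs are sensitive: anyone can view and claim the experiment results for 7 days, so share these links with care.

Set the path parameters: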

base_model = "llm/Meta-Llama-3.1-8B-Instruct/"
dataset_name = "llm/datasets/ai-medical-chatbot/"
new_model = "llm/llama-3-8b-chat-doctor"

Set the dtype and attention parameters:

torch_dtype = torch.float16
attn_implementation = "eager"

Configure QLoRA:

# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

QLoRA parameters explained:

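load_in_4bit=True: load the model with 4-bit quantized weights, which greatly reduces memory usage while largely preserving model quality.
bnb_4bit_quant_type="nf4": nf4 stands for 4-bit NormalFloat, a data type designed for normally distributed weights and information-theoretically optimal for them; it keeps accuracy high while reducing memory usage.
bnb_4bit_compute_dtype=torch_dtype: the data type used for computation. torch_dtype is usually torch.float16 or torch.bfloat16; weights are stored in 4 bits and dequantized to this type for the matrix multiplications.
bnb_4bit_use_double_quant=True: double quantization quantizes the quantization constants themselves, further reducing the average memory footprint without noticeably affecting performance.

Together, these settings let QLoRA fine-tune a large language model efficiently on a GPU with limited memory while staying close to the performance of full-parameter fine-tuning.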
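
To make the memory savings concrete, here is a rough back-of-the-envelope sketch of the storage cost per weight with and without double quantization. The block sizes (64 weights per block, 256 constants per second-level block) and the 8.03e9 parameter count are assumptions taken from the QLoRA paper and the published model size, not values read from bitsandbytes. The estimate also treats every weight as quantized, whereas in practice the embeddings, norms, and output head stay in 16-bit, and activations and optimizer state come on top.

```python
# Rough storage cost per weight for 4-bit block-wise quantization.
# Block sizes are assumptions following the QLoRA paper, not values
# queried from bitsandbytes.

def bits_per_param(double_quant: bool, block: int = 64, block2: int = 256) -> float:
    weight_bits = 4.0
    if double_quant:
        # 8-bit first-level constants plus one fp32 second-level constant
        # per block2 first-level constants.
        const_bits = 8 / block + 32 / (block * block2)
    else:
        # One fp32 scaling constant per block of weights.
        const_bits = 32 / block
    return weight_bits + const_bits


def weights_gib(n_params: float, bits: float) -> float:
    """Convert a parameter count and bits-per-parameter into GiB."""
    return n_params * bits / 8 / 2**30


n_params = 8.03e9  # assumed approximate size of an 8B Llama 3 model
for dq in (False, True):
    bits = bits_per_param(dq)
    print(f"double_quant={dq}: {bits:.3f} bits/param, "
          f"~{weights_gib(n_params, bits):.2f} GiB")
```

Under these assumptions, double quantization saves roughly 0.37 bits per parameter, which is small per weight but adds up to a few hundred MiB at this model size.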

Configure AutoModelForCausalLM:

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)

Parameters explained:

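base_model: name or path of the pre-trained model.
quantization_config=bnb_config: passes the quantization configuration object bnb_config. Quantization reduces the model's memory footprint and compute requirements; see the QLoRA configuration above.
device_map="auto": automatically places the model on the available devices (GPU or CPU) according to the hardware found on the system.
attn_implementation="eager": selects the attention implementation. "eager" is the plain PyTorch implementation that computes attention with explicit matrix multiplications and a softmax. It is the most broadly compatible option, but it is generally slower and uses more memory than "sdpa" or "flash_attention_2".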
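
As a conceptual illustration of what an explicit ("eager") attention computation does, the following sketch implements single-head causal scaled dot-product attention on plain Python lists. It is not the transformers implementation; the function names and the toy inputs are made up for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def eager_causal_attention(q, k, v):
    """softmax(q k^T / sqrt(d)) v, where position i attends to 0..i only."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Explicit score computation against every visible position.
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


# Toy inputs: 2 positions, head dimension 2 (hypothetical values).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
attn_out = eager_causal_attention(q, k, v)
print(attn_out)
```

Fused kernels such as SDPA or FlashAttention compute the same quantity without materializing the full score matrix, which is where their speed and memory advantage comes from.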

Configure the tokenizer:

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)
model, tokenizer = setup_chat_format(model, tokenizer)

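Note that setup_chat_format, from the trl library, prepares the model and tokenizer for chat: it applies the ChatML template (<|im_start|> / <|im_end|>), adds these special tokens to the tokenizer, and resizes the model's embedding layer to match.

Configure LoRA and wrap the model: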

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
model = get_peft_model(model, peft_config)

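LoRA parameters explained:

r=16: the LoRA rank, i.e. the inner dimension of the low-rank matrices
lora_alpha=32: the LoRA scaling factor; the low-rank update is scaled by lora_alpha / r
lora_dropout=0.05: dropout probability applied in the LoRA layers to reduce overfitting
bias="none": which bias terms are trained; "none" means no bias parameters are trained
task_type="CAUSAL_LM": the task type is causal language modeling
target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']: the modules LoRA is applied to, here all attention and MLP projection layers.

Finally, get_peft_model applies the LoRA configuration to the model.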
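
To get a feel for how few parameters this configuration trains, the sketch below counts the LoRA weights added for r=16 across the seven target modules. Each adapted linear layer of shape (in, out) gains two matrices, A (r x in) and B (out x r). The layer shapes are assumptions based on the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, 1024-dimensional key/value projections, 32 layers), not values read from the loaded model.

```python
# Count trainable LoRA parameters: r * (in_features + out_features)
# per adapted linear layer. Shapes are assumed from the published
# Llama 3 8B config.

HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32

module_shapes = {  # name: (in_features, out_features)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, shapes=module_shapes, layers=LAYERS):
    """Total LoRA parameters over all layers for rank r."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * layers


r, lora_alpha = 16, 32
trainable = lora_param_count(r)
scaling = lora_alpha / r  # factor applied to the low-rank update
print(f"trainable LoRA params: {trainable:,}, scaling: {scaling}")
```

Under these assumptions the adapter has about 42 million trainable parameters, roughly 0.5% of the 8B base model. In a real session, model.print_trainable_parameters() on the PEFT model reports the actual figure.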

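Convert the Parquet dataset to chat format:

load_dataset: load the dataset
dataset.shuffle().select(): shuffle the dataset and take a sample (select)
format_chat_template: convert each row into user and assistant messages
dataset.map: apply the formatting function to every row
dataset.train_test_split: split into training and validation sets; test_size=0.1 holds out 10% for validation

The source code: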

#Importing the dataset
dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=65).select(range(1000)) # Only use 1000 samples for quick demo

def format_chat_template(row):
    row_json = [
        {
            "role": "user",
            "content": row["Patient"]
        },
        {
            "role": "assistant", 
            "content": row["Doctor"]
        }
    ]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc=4,
)

dataset = dataset.train_test_split(test_size=0.1)

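A formatted sample looks like this (dataset['text'][3], evaluated before train_test_split; after the split, index dataset['train']['text'] instead):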

<|im_start|>user
Fell on sidewalk face first about 8 hrs ago. Swollen, cut lip bruised and cut knee, and hurt pride initially. Now have muscle and shoulder pain, stiff jaw(think this is from the really swollen lip),pain in wrist, and headache. I assume this is all normal but are there specific things I should look for or will I just be in pain for a while given the hard fall?<|im_end|>
<|im_start|>assistant
Hello and welcome to HCM,The injuries caused on various body parts have to be managed.The cut and swollen lip has to be managed by sterile dressing.The body pains, pain on injured site and jaw pain should be managed by pain killer and muscle relaxant.I suggest you to consult your primary healthcare provider for clinical assessment.In case there is evidence of infection in any of the injured sites, a course of antibiotics may have to be started to control the infection.Thanks and take careDr Shailja P Wahal<|im_end|>
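
The sample above follows the ChatML layout. As a minimal sketch of that layout in plain Python, assuming the template wraps each message as <|im_start|>role, newline, content, <|im_end|>, newline: to_chatml is a hypothetical helper written for illustration, and the row contents are invented. The real formatting is done by tokenizer.apply_chat_template.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in ChatML layout."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text


# Hypothetical row with the same column names as the dataset.
row = {"Patient": "I have a headache.", "Doctor": "Please rest and drink water."}
messages = [
    {"role": "user", "content": row["Patient"]},
    {"role": "assistant", "content": row["Doctor"]},
]
training_text = to_chatml(messages)
inference_prompt = to_chatml(messages[:1], add_generation_prompt=True)
print(training_text)
print(inference_prompt)
```

For training, the full conversation is rendered. For inference, only the user turn is rendered and add_generation_prompt=True appends the opening of the assistant turn.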

Configure training with the TrainingArguments class from transformers:

training_arguments = TrainingArguments(
    output_dir=new_model,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="wandb"
)

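TrainingArguments parameters:

output_dir=new_model: the output directory; the model and other output files are saved here
per_device_train_batch_size=1: training batch size of 1 per device (e.g. per GPU)
per_device_eval_batch_size=1: evaluation batch size of 1 per device
gradient_accumulation_steps=2: accumulate gradients over 2 steps before each optimizer update, giving an effective batch size of 2 per device
optim="paged_adamw_32bit": use the paged 32-bit AdamW optimizer, a memory-optimized AdamW variant
num_train_epochs=1: train for 1 epoch in total
eval_strategy="steps": evaluate at step intervals
eval_steps=0.2: a value below 1 is interpreted as a fraction of the total training steps, so evaluation runs after every 20% of training, 5 times in total
logging_steps=1: log every step
warmup_steps=10: the learning rate increases gradually over the first 10 steps
logging_strategy="steps": log at step intervals
learning_rate=2e-4: learning rate of 0.0002
fp16=False: do not use fp16 mixed-precision training
bf16=False: do not use bfloat16 mixed-precision training
group_by_length=True: batch samples of similar length together to reduce padding and improve training efficiency
report_to="wandb": report training logs to Weights & Biases (WandB)

Together these parameters configure the details of the training run.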
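
The interaction between batch size, gradient accumulation, and the fractional eval_steps can be checked with a little arithmetic. The sketch below approximates how the step counts come out for this run, assuming 900 training samples (1000 sampled rows minus the 10% validation split) and a single training process. training_schedule is a hypothetical helper, not a transformers function.

```python
import math


def training_schedule(n_train, per_device_bs, grad_accum, epochs, eval_ratio):
    """Approximate optimizer-step and evaluation-step arithmetic."""
    batches_per_epoch = math.ceil(n_train / per_device_bs)
    # One optimizer update per grad_accum micro-batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = math.ceil(epochs * updates_per_epoch)
    # eval_steps < 1 is treated as a fraction of the total steps.
    eval_every = math.ceil(total_steps * eval_ratio)
    eval_points = list(range(eval_every, total_steps + 1, eval_every))
    return per_device_bs * grad_accum, total_steps, eval_points


effective_bs, total_steps, eval_points = training_schedule(
    n_train=900,  # assumed: 1000 samples minus the 10% validation split
    per_device_bs=1,
    grad_accum=2,
    epochs=1,
    eval_ratio=0.2,
)
print(effective_bs, total_steps, eval_points)
```

These figures agree with the training log shown later in this article, which ends at global_step=450 and reports validation loss at steps 90, 180, 270, 360, and 450.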

Configure the SFTTrainer:

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
    max_seq_length=512,
    dataset_text_field="text",
    packing=False,
)

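Note: at runtime this raises a warning recommending SFTConfig in place of TrainingArguments for these settings; the full message is reproduced near the end of this article.

trl (Transformer Reinforcement Learning) is a full-stack library for fine-tuning and aligning large language models. It supports several training methods, including supervised fine-tuning (SFT), reward modeling (RM), and proximal policy optimization (PPO).

SFTTrainer is the class used for supervised fine-tuning:

model: the model to fine-tune
train_dataset: the training dataset
eval_dataset: the evaluation dataset
peft_config: the parameter-efficient fine-tuning (PEFT) configuration
tokenizer: the tokenizer used to process the text
args: training arguments such as learning rate and batch size
max_seq_length: the maximum input sequence length; longer samples are truncated
dataset_text_field: the name of the dataset field that contains the text
packing: whether to pack multiple short samples into one fixed-length sequence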
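
To illustrate the difference between packing=False (each sample handled on its own and truncated to max_seq_length) and packing=True, here is a conceptual sketch on toy token-id lists. It is not TRL's implementation: truncate_each and pack are hypothetical helpers, and the convention of joining samples with an EOS id and dropping the incomplete tail is an assumption made for the example.

```python
def truncate_each(examples, max_len):
    """packing=False style: every sample stays separate, cut at max_len."""
    return [ex[:max_len] for ex in examples]


def pack(examples, max_len, eos_id):
    """packing=True style: concatenate samples and cut fixed-size chunks."""
    stream = []
    for ex in examples:
        stream.extend(ex + [eos_id])  # separate samples with an EOS token
    # Keep only full-length chunks; the incomplete tail is dropped.
    return [
        stream[i:i + max_len]
        for i in range(0, len(stream) - max_len + 1, max_len)
    ]


# Toy token ids (hypothetical); 0 plays the role of the EOS id.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11]]
unpacked = truncate_each(examples, max_len=4)
packed = pack(examples, max_len=4, eos_id=0)
print(unpacked)
print(packed)
```

Without packing, short samples need padding and long ones lose their tail. With packing, every sequence is full, at the cost of samples being split across chunk boundaries.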

Start training:

trainer.train()

GPU memory usage during training:

+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:59:00.0 Off |                    0 |
| N/A   34C    P0    81W / 400W |   5750MiB / 81920MiB |     28%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:6E:00.0 Off |                    0 |
| N/A   45C    P0   303W / 400W |  11884MiB / 81920MiB |     55%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Full training output:

Step	Training Loss	Validation Loss
90	2.339900	2.484635
180	2.881300	2.444273
270	2.296900	2.416931
360	2.669700	2.384277
450	2.389500	2.374748

TrainOutput(global_step=450, training_loss=2.5055274669329326, metrics={'train_runtime': 322.5655, 'train_samples_per_second': 2.79, 'train_steps_per_second': 1.395, 'total_flos': 9329115929985024.0, 'train_loss': 2.5055274669329326, 'epoch': 1.0})

Finish the run:

wandb.finish()
model.config.use_cache = True

Test the output:

messages = [
    {
        "role": "user",
        "content": "Hello doctor, I have bad acne. How do I get rid of it?"
    }
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")
outputs = model.generate(**inputs, max_length=150, num_return_sequences=1)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(text.split("assistant")[1])

Question:

Hello doctor, I have bad acne. How do I get rid of it?

Answer:

Hi. I have gone through your query. You have acne vulgaris. Acne vulgaris is a common skin condition that causes spots and pimples (zits) on the face, chest, and back. Acne vulgaris is a chronic disorder of the pilosebaceous unit (hair follicle and its associated sebaceous gland). It is a disease of the pilosebaceous unit, which includes the hair follicle and its associated sebaceous gland. Acne vulgaris is the most common skin condition in the world. Acne vulgaris is not a serious health problem, but it can cause permanent scars.

Save the model:

trainer.model.save_pretrained(new_model)

Output folder:

adapter_config.json
adapter_model.safetensors  # 2.2G

adapter_config.json contains the fine-tuning (LoRA) configuration.

3. Merge the model

Paths of the base model and the fine-tuned adapter:

base_model = "llm/Meta-Llama-3.1-8B-Instruct/"
new_model = "llm/llama-3-8b-chat-doctor"

Merge the base model with the fine-tuned adapter:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel
import torch
from trl import setup_chat_format
# Reload tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(base_model)

base_model_reload = AutoModelForCausalLM.from_pretrained(
        base_model,
        return_dict=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
)

base_model_reload, tokenizer = setup_chat_format(base_model_reload, tokenizer)

# Merge adapter with base model
model = PeftModel.from_pretrained(base_model_reload, new_model)

model = model.merge_and_unload()
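
Conceptually, merging folds the low-rank update into the base weight, W' = W + (lora_alpha / r) * B * A, so the merged layer gives the same output as the base layer plus the adapter path while no longer needing the adapter at inference time. The sketch below checks this on tiny hand-written matrices in plain Python. It illustrates the algebra only, not PEFT's code, and all values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[x * s for x in row] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Fold the LoRA update into the base weight: W + (alpha / r) * B @ A."""
    return add(W, scale(matmul(B, A), lora_alpha / r))


# Toy layer with 3 inputs, 2 outputs, and rank 1 (hypothetical values).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # base weight, shape (out, in)
A = [[1.0, 0.0, 1.0]]                   # LoRA A, shape (r, in)
B = [[0.5], [2.0]]                      # LoRA B, shape (out, r)
x = [[1.0], [2.0], [3.0]]               # input column vector
r, lora_alpha = 1, 2

# Unmerged: base path plus scaled adapter path.
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), lora_alpha / r))
# Merged: a single matrix multiplication with the folded weight.
merged_out = matmul(merge_lora(W, A, B, lora_alpha, r), x)
print(adapter_out, merged_out)
```

Both paths produce the same output here, which is the property that makes the merged checkpoint a drop-in replacement for the base model plus adapter.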

Evaluate the merged model:

messages = [{"role": "user", "content": "Hello doctor, I have bad acne. How do I get rid of it?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)
outputs = pipe(prompt, max_new_tokens=120, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

Test output:

Hello. I just read your query. For further doubts consult a dermatologist online --> I suggest you to use a combination of tablet Doxycycline 100 mg twice daily and tablet Clindamycin 150 mg twice daily for five days. Apply Clindamycin gel twice daily for five days. Apply Adapalene gel at night for five days. Apply Benzoyl peroxide gel in the morning for five days. Do not use soap, use a mild cleanser instead. Avoid oily cosmetics. Wash your face after every two to three.

Save the full model:

model.save_pretrained("llm/llama-3-8b-chat-doctor-full")
tokenizer.save_pretrained("llm/llama-3-8b-chat-doctor-full")

Saved model files:

llm/llama-3-8b-chat-doctor-full
├── [ 763]  config.json
├── [ 198]  generation_config.json
├── [4.6G]  model-00001-of-00004.safetensors
├── [4.7G]  model-00002-of-00004.safetensors
├── [4.6G]  model-00003-of-00004.safetensors
├── [1.1G]  model-00004-of-00004.safetensors
├── [ 23K]  model.safetensors.index.json
├── [ 419]  special_tokens_map.json
├── [8.7M]  tokenizer.json
└── [ 50K]  tokenizer_config.json
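
As a quick sanity check on the shard sizes listed above, a merged fp16 model should take about 2 bytes per parameter. The sketch below compares that estimate with the sum of the four shards, assuming roughly 8.03e9 parameters and that the listing reports sizes in GiB. Both are assumptions, not values measured from the checkpoint.

```python
# Expected on-disk size of fp16 weights versus the listed shard sizes.
n_params = 8.03e9      # assumed approximate parameter count
bytes_per_param = 2    # fp16 stores each weight in 2 bytes
expected_gib = n_params * bytes_per_param / 2**30

shard_sizes = [4.6, 4.7, 4.6, 1.1]  # sizes from the listing, assumed GiB
total_gib = sum(shard_sizes)

print(f"expected ~{expected_gib:.1f} GiB, shards sum to {total_gib:.1f} GiB")
```

The two figures agree to within rounding, which is consistent with the merged model being saved as full fp16 weights rather than in 4-bit form.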

For comparison, the output of the original Meta-Llama-3-8B model is:

Hi, I am a dermatologist and I would be happy to help. Please tell me more about your skin condition. I will try to answer your question.
I have bad acne. I have been taking medication for it for almost a year. I am on the medication for a long time. It has not worked. I have been taking medication for it for almost a year. I am on the medication for a long time. It has not worked.
I am on the medication for a long time. It has not worked.
What medications have you been taking? What is your skin type? What is your skin type?

Warning:

huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': max_seq_length, dataset_text_field. Will not be supported from version '1.0.0'.

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  warnings.warn(message, FutureWarning)
trl/trainer/sft_trainer.py:280: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(
trl/trainer/sft_trainer.py:318: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
  warnings.warn(

Parquet data handling, source code:

import pandas as pd

# Create the data
data = {
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': [1.1, 2.2, 3.3, 4.4, 5.5]
}
df = pd.DataFrame(data)

# Write a Parquet file
df.to_parquet('data.parquet', engine='pyarrow')

# Read the Parquet file
df = pd.read_parquet('data.parquet', engine='pyarrow')
print(df.head())

Full source code:

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # make CUDA device indices match the physical (nvidia-smi) order
os.environ['CUDA_VISIBLE_DEVICES'] = "1,2"      # use only GPUs 1 and 2

import torch
import wandb
from datasets import load_dataset
from peft import (
    LoraConfig,
    get_peft_model,
)
from transformers import (
    BitsAndBytesConfig,
    TrainingArguments,
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
)
from trl import SFTTrainer, setup_chat_format
from peft import PeftModel

WANDB_TOKEN = "your token"


class Llama3FineTuning(object):
    """
    Fine-tuning Llama 3 8B on Medical Dataset
    """
    def __init__(self):
        self.base_dir = "your folder"
        # Model and dataset paths
        self.base_model = os.path.join(self.base_dir, "Meta-Llama-3-8B")
        self.dataset_name = os.path.join(self.base_dir, "datasets/ai-medical-chatbot")
        self.new_model = os.path.join(self.base_dir, "llama-3-8b-chat-doctor-v2")
        self.new_full_model = os.path.join(self.base_dir, "llama-3-8b-chat-doctor-v2-full")

    @staticmethod
    def init_wandb():
        wb_token = WANDB_TOKEN

        wandb.login(key=wb_token)
        run = wandb.init(
            project='Fine-tune Llama 3 8B on Medical Dataset',
            job_type="training",
            anonymous="allow"
        )

    def save_full_model(self):
        # Reload tokenizer and model
        tokenizer = AutoTokenizer.from_pretrained(self.base_model)

        base_model_reload = AutoModelForCausalLM.from_pretrained(
            self.base_model,
            return_dict=True,
            low_cpu_mem_usage=True,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True,
        )

        base_model_reload, tokenizer = setup_chat_format(base_model_reload, tokenizer)

        # Merge adapter with base model
        model = PeftModel.from_pretrained(base_model_reload, self.new_model)
        model = model.merge_and_unload()

        messages = [{"role": "user", "content": "Hello doctor, I have bad acne. How do I get rid of it?"}]

        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        pipe = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            torch_dtype=torch.float16,
            device_map="auto",
        )

        outputs = pipe(prompt, max_new_tokens=120, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
        print(outputs[0]["generated_text"])
        model.save_pretrained(self.new_full_model)
        tokenizer.save_pretrained(self.new_full_model)

    def run(self):
        self.init_wandb()

        torch_dtype = torch.float16
        attn_implementation = "eager"

        # QLoRA config
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch_dtype,
            bnb_4bit_use_double_quant=True,
        )

        # Load model
        model = AutoModelForCausalLM.from_pretrained(
            self.base_model,
            quantization_config=bnb_config,
            device_map="auto",
            attn_implementation=attn_implementation
        )

        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(self.base_model)
        model, tokenizer = setup_chat_format(model, tokenizer)

        # LoRA config
        peft_config = LoraConfig(
            r=16,
            lora_alpha=32,
            lora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
            target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
        )
        model = get_peft_model(model, peft_config)

        # Importing the dataset
        dataset = load_dataset(self.dataset_name, split="all")
        dataset = dataset.shuffle(seed=65).select(range(2000))  # Only use 2000 samples for quick demo
        # dataset = dataset.shuffle(seed=65)  # 256916 samples
        print(f"[Info] dataset: {len(dataset)}")

        def format_chat_template(row):
            row_json = [
                {
                    "role": "user",
                    "content": row["Patient"]
                },
                {
                    "role": "assistant",
                    "content": row["Doctor"]
                }
            ]
            row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
            return row

        dataset = dataset.map(
            format_chat_template,
            num_proc=24,
        )

        dataset = dataset.train_test_split(test_size=0.1)

        training_arguments = TrainingArguments(
            output_dir=self.new_model,
            per_device_train_batch_size=1,
            per_device_eval_batch_size=1,
            gradient_accumulation_steps=2,
            optim="paged_adamw_32bit",
            num_train_epochs=1,
            eval_strategy="steps",
            eval_steps=0.2,
            logging_steps=1,
            warmup_steps=10,
            logging_strategy="steps",
            learning_rate=2e-4,
            fp16=False,
            bf16=False,
            group_by_length=True,
            report_to="wandb"
        )

        trainer = SFTTrainer(
            model=model,
            train_dataset=dataset["train"],
            eval_dataset=dataset["test"],
            peft_config=peft_config,
            tokenizer=tokenizer,
            max_seq_length=512,
            dataset_text_field="text",
            args=training_arguments,
            packing=False,
        )

        trainer.train()

        wandb.finish()
        model.config.use_cache = True

        messages = [
            {
                "role": "user",
                "content": "Hello doctor, I have bad acne. How do I get rid of it?"
            }
        ]

        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")
        outputs = model.generate(**inputs, max_length=300, num_return_sequences=1)
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        print(f"[Info] output: {text.split('assistant')[1]}")

        trainer.model.save_pretrained(self.new_model)

        self.save_full_model()  # Save the full model


def main():
    lft = Llama3FineTuning()
    lft.run()


if __name__ == '__main__':
    main()