This post uses LoRA (Low-Rank Adaptation of Large Language Models) to fine-tune a Llama-2 language model.
Prerequisites for fine-tuning
First, download the Hugging Face version of Llama 2 or Llama 3 from Hugging Face.
I downloaded llama-2-13b-chat.
Fine-tuning methods for large language models
Full parameter fine-tuning updates all parameters in every layer of the pretrained model. It generally achieves the best performance, but it is also the most resource- and time-intensive approach: it requires the most GPU memory and takes the longest to run.
PEFT (Parameter-Efficient Fine-Tuning) lets you fine-tune a model with minimal resources and cost. Two important PEFT methods are LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA), in which the pretrained model is loaded onto the GPU as quantized 8-bit and 4-bit weights, respectively. You can most likely fine-tune the Llama 2-13B model with LoRA or QLoRA on a single consumer-grade GPU with 24GB of memory, and QLoRA requires even less GPU memory and fine-tuning time than LoRA.
In general, you should try LoRA first (or QLoRA if resources are extremely limited) and evaluate the results after fine-tuning. Only consider full fine-tuning if the performance is not satisfactory.
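As an illustration of the difference, the sketch below shows one common way to load the base model as quantized 4-bit weights for QLoRA using transformers' BitsAndBytesConfig. The specific quantization settings (nf4 quant type, float16 compute) are assumptions for this example and are not part of the original script; it also requires the bitsandbytes package and a CUDA GPU.
import torch
from transformers import LlamaForCausalLM, BitsAndBytesConfig

# Quantized 4-bit loading for QLoRA-style fine-tuning
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = LlamaForCausalLM.from_pretrained(
    r"C:\apps\ml_model\llama-2-13b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU(s)
)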
Fine-tuning Llama 2 with LoRA
1. Import the libraries
from transformers import LlamaTokenizer, LlamaForCausalLM, Trainer, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_from_disk, Dataset
Explanation: import the required libraries: transformers for the language model and tokenizer, peft for LoRA fine-tuning, and datasets for dataset handling.
2. Build the dataset
You can build this dataset from data in your own domain; a sketch of loading such data from a file follows the explanation below.
data = {
    "instruction": [
        "What is the capital of France?",
        "What is 2 + 2?",
        "How do you greet someone in English?"
    ],
    "response": [
        "The capital of France is Paris.",
        "2 + 2 equals 4.",
        "You greet someone by saying 'Hello' in English."
    ]
}
dataset = Dataset.from_dict(data)
dataset.save_to_disk("simple_dataset")
Explanation: this creates a simple dataset with instruction and response fields, converts it into a Dataset object, and saves it to disk.
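For real domain data, you will typically load examples from a file instead of an inline dict. A minimal sketch, assuming your data is stored in a JSON Lines file named my_domain_data.jsonl (a hypothetical file name) with the same instruction/response fields:
from datasets import load_dataset

# Each line of the file is a JSON object like:
# {"instruction": "...", "response": "..."}
dataset = load_dataset("json", data_files="my_domain_data.jsonl", split="train")
dataset.save_to_disk("simple_dataset")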
3. Load the model and tokenizer
model_name = r"C:\apps\ml_model\llama-2-13b-chat-hf"
print("starting to load tokenizer.")
tokenizer = LlamaTokenizer.from_pretrained(model_name)
print("starting to load model.")
model = LlamaForCausalLM.from_pretrained(model_name)
print("Finished to load model")
Explanation: specify the model path, load the Llama-2 tokenizer and model, and print log messages for the loading process.
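Loading a 13B model with the default settings keeps the weights in full precision and may leave them on the CPU. If a GPU is available, a common variant (an assumption on my part, not part of the original script) is to load the weights in half precision and let transformers place them automatically; device_map="auto" requires the accelerate package.
import torch
from transformers import LlamaForCausalLM

# Half-precision weights, automatically placed on the available GPU(s)
model = LlamaForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)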
4. Define the LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
)
Explanation: configure the LoRA parameters, such as the low-rank decomposition dimension r, the scaling factor lora_alpha, and the dropout rate lora_dropout.
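If you want explicit control over which layers receive LoRA adapters, LoraConfig also accepts a target_modules argument; for Llama-family models the attention query/value projections are a common choice. This is a sketch of an optional variant, not part of the original configuration (PEFT can usually infer reasonable defaults for known architectures).
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # apply LoRA only to the attention q/v projections
)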
5. Wrap the model
print("Starting to get peft model")
model = get_peft_model(model, lora_config)
tokenizer.pad_token = tokenizer.eos_token
Explanation: wrap the pretrained model with LoRA and set the tokenizer's padding token.
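To verify that only the small set of LoRA adapter weights will be trained, you can print the trainable parameter count on the wrapped model (continuing from the previous step); this helper is provided by PEFT.
# Prints something like: trainable params: ... || all params: ... || trainable%: ...
model.print_trainable_parameters()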
6. Load the dataset
dataset = load_from_disk("simple_dataset")
Explanation: load the previously saved dataset from disk.
7. Tokenize the dataset
def tokenize_function(examples):
    combined_texts = [
        instruction + " " + response
        for instruction, response in zip(examples["instruction"], examples["response"])
    ]
    tokenized_inputs = tokenizer(
        combined_texts,
        truncation=True,
        padding="max_length",
        max_length=128
    )
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
    return tokenized_inputs
tokenized_dataset = dataset.map(tokenize_function, batched=True)
Explanation: define a function that concatenates the instruction and response texts and tokenizes them. The result contains the input_ids, which are also copied as the labels.
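Because the labels are a straight copy of input_ids, the model is also trained to predict the padding tokens. A common refinement (an assumption on my part, not part of the original script) is to mask the padded positions with -100 so that the loss ignores them:
def tokenize_function(examples):
    combined_texts = [
        instruction + " " + response
        for instruction, response in zip(examples["instruction"], examples["response"])
    ]
    tokenized_inputs = tokenizer(
        combined_texts,
        truncation=True,
        padding="max_length",
        max_length=128,
    )
    # Replace pad token ids with -100 so the loss ignores them
    # (note: pad_token was set to eos_token above, so eos positions are masked too)
    tokenized_inputs["labels"] = [
        [(token if token != tokenizer.pad_token_id else -100) for token in ids]
        for ids in tokenized_inputs["input_ids"]
    ]
    return tokenized_inputs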
8. Define the training arguments
training_args = TrainingArguments(
    output_dir="./llama2-lora-finetuned",
    per_device_train_batch_size=1,
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=500,
    save_total_limit=2,
)
Explanation: configure the training parameters, such as the output directory, per-device training batch size, number of epochs, logging frequency, and checkpoint saving frequency.
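On a GPU, you can usually speed training up and effectively enlarge the batch size. The variant below is a sketch under the assumption that a CUDA GPU is available; gradient_accumulation_steps and fp16 are standard TrainingArguments fields, but the specific values are arbitrary.
training_args = TrainingArguments(
    output_dir="./llama2-lora-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8
    num_train_epochs=3,
    fp16=True,                      # mixed precision on CUDA GPUs
    logging_dir="./logs",
    logging_steps=10,
    save_steps=500,
    save_total_limit=2,
)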
9. Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)
Explanation: use the Hugging Face Trainer class, passing in the LoRA-wrapped model, the training arguments, and the dataset to prepare for training.
10. Train and save the model
print("Starting to training")
trainer.train()
print("Finished to training")
trainer.save_model("llama2-lora-finetuned")
print("model saved")
Explanation: start fine-tuning the model; after training completes, save the fine-tuned model to the specified directory and print log messages.
The execution output is as follows:
Saving the dataset (1/1 shards): 100%|██████████| 3/3 [00:00<00:00, 428.59 examples/s]
starting to load tokenizer.
starting to load model.
Loading checkpoint shards: 100%|██████████| 3/3 [03:10<00:00, 63.57s/it]
Finished to load model
Starting to get peft model
get peft model
Map: 100%|██████████| 3/3 [00:00<00:00, 37.04 examples/s]
C:\Users\Harry\anaconda3\envs\ai_service\lib\site-packages\accelerate\accelerator.py:451: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
warnings.warn(
C:\Users\Harry\anaconda3\envs\ai_service\lib\site-packages\pydantic\_internal\_fields.py:151: UserWarning: Field "model_server_url" has conflict with protected namespace "model_".
You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
warnings.warn(
C:\Users\Harry\anaconda3\envs\ai_service\lib\site-packages\pydantic\_internal\_config.py:322: UserWarning: Valid config keys have changed in V2:
* 'schema_extra' has been renamed to 'json_schema_extra'
warnings.warn(message, UserWarning)
Starting to training
100%|██████████| 9/9 [47:12<00:00, 314.57s/it]{'train_runtime': 2832.8354, 'train_samples_per_second': 0.003, 'train_steps_per_second': 0.003, 'train_loss': 5.6725747850206165, 'epoch': 3.0}
100%|██████████| 9/9 [47:12<00:00, 314.72s/it]
Finished to training
model saved
Process finished with exit code 0
You will find the LoRA adapter model in the llama2-lora-finetuned directory.
Full code
from transformers import LlamaTokenizer, LlamaForCausalLM, Trainer, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_from_disk
from datasets import Dataset
data = {
    "instruction": [
        "What is the capital of France?",
        "What is 2 + 2?",
        "How do you greet someone in English?"
    ],
    "response": [
        "The capital of France is Paris.",
        "2 + 2 equals 4.",
        "You greet someone by saying 'Hello' in English."
    ]
}
dataset = Dataset.from_dict(data)
dataset.save_to_disk("simple_dataset")
# Load the model and tokenizer
model_name = r"C:\apps\ml_model\llama-2-13b-chat-hf"
print("starting to load tokenizer.")
tokenizer = LlamaTokenizer.from_pretrained(model_name)
print("starting to load model.")
# Load the base model (full precision here; see the PEFT section above for 8-bit/4-bit loading)
model = LlamaForCausalLM.from_pretrained(model_name)
print("Finished to load model")
# Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # This is for language modeling
    r=16,                          # LoRA attention dimension
    lora_alpha=32,                 # LoRA scaling factor
    lora_dropout=0.1,              # Dropout for LoRA layers
)
print("Starting to get peft model")
# Wrap the model with LoRA
model = get_peft_model(model, lora_config)
tokenizer.pad_token = tokenizer.eos_token
# if tokenizer.pad_token is None:
#     tokenizer.add_special_tokens({'pad_token': 'PAD'})
#     model.resize_token_embeddings(len(tokenizer))
print("get peft model")
# Load the dataset
dataset = load_from_disk("simple_dataset")
# Tokenize the dataset
def tokenize_function(examples):
    combined_texts = [
        instruction + " " + response
        for instruction, response in zip(examples["instruction"], examples["response"])
    ]
    tokenized_inputs = tokenizer(
        combined_texts,
        truncation=True,
        padding="max_length",  # Ensure consistent padding
        max_length=128         # Adjust max_length as needed
    )
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
    return tokenized_inputs
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# Define training arguments
training_args = TrainingArguments(
    output_dir="./llama2-lora-finetuned",
    per_device_train_batch_size=1,
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=500,
    save_total_limit=2,
    # fp16=True,  # Enable mixed precision
)
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)
print("Starting to training")
# Fine-tune the model with LoRA
trainer.train()
print("Finished to training")
# Save the fine-tuned model
trainer.save_model("llama2-lora-finetuned")
print("model saved ")
Merging the LoRA fine-tuned weights into the base model to produce a complete model
Import the required libraries:
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel, PeftConfig
Explanation: import the Llama model and tokenizer from the transformers library, and the PeftModel class from the peft library for handling the LoRA fine-tuned model.
Load the base Llama 2 model:
model_name = r"C:\apps\ml_model\llama-2-13b-chat-hf"
base_model = LlamaForCausalLM.from_pretrained(model_name)
tokenizer = LlamaTokenizer.from_pretrained(model_name)
base_model.resize_token_embeddings(len(tokenizer))
Explanation: model_name is the path to the base Llama-2 model. LlamaForCausalLM.from_pretrained(model_name) loads the pretrained base model, and LlamaTokenizer.from_pretrained(model_name) loads the matching tokenizer. base_model.resize_token_embeddings(len(tokenizer)) resizes the model's token embeddings to match the tokenizer's vocabulary size.
Load the PEFT fine-tuned model:
peft_model_path = "./llama2-lora-finetuned" # Path to your fine-tuned model
peft_model = PeftModel.from_pretrained(base_model, peft_model_path)
Explanation: load the LoRA fine-tuned adapter from peft_model_path on top of the already loaded base model base_model.
Merge the LoRA weights and unload the LoRA configuration:
peft_model = peft_model.merge_and_unload()
Explanation: merge the LoRA weights into the base model. This step applies the LoRA fine-tuned weights to the base model's weights and unloads the LoRA configuration, turning the model into a complete, standard model.
Save the merged model:
peft_model.save_pretrained("./llama2-finetuned-combined")
tokenizer.save_pretrained("./llama2-finetuned-combined")
Explanation: save the merged model and the tokenizer to the path "./llama2-finetuned-combined".
Reload the merged model and tokenizer:
from transformers import LlamaForCausalLM, LlamaTokenizer
model = LlamaForCausalLM.from_pretrained("./llama2-finetuned-combined")
tokenizer = LlamaTokenizer.from_pretrained("./llama2-finetuned-combined")
Explanation: load the merged Llama-2 model and its tokenizer from the saved path.
Use the model for inference:
inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Explanation:
- The tokenizer converts the question "What is the capital of France?" into tensors the model can process.
- The model's generate() method produces the output token sequence.
- The tokenizer decodes the output tokens into readable text, which is then printed.
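By default, generate() may stop after a fairly short sequence. A common tweak (an assumption for this example, not part of the original script) is to pass an explicit generation length:
inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)  # allow up to 64 new tokens in the answer
print(tokenizer.decode(outputs[0], skip_special_tokens=True))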
Full code
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel, PeftConfig
# Load the base LLaMA 2 model
model_name = r"C:\apps\ml_model\llama-2-13b-chat-hf"
base_model = LlamaForCausalLM.from_pretrained(model_name)
tokenizer = LlamaTokenizer.from_pretrained(model_name)
base_model.resize_token_embeddings(len(tokenizer))
# Load the PEFT fine-tuned model
peft_model_path = "./llama2-lora-finetuned" # Path to your fine-tuned model
peft_model = PeftModel.from_pretrained(base_model, peft_model_path)
# Merge the LoRA weights into the base model
peft_model = peft_model.merge_and_unload()
peft_model.save_pretrained("./llama2-finetuned-combined")
tokenizer.save_pretrained("./llama2-finetuned-combined")
from transformers import LlamaForCausalLM, LlamaTokenizer
model = LlamaForCausalLM.from_pretrained("./llama2-finetuned-combined")
tokenizer = LlamaTokenizer.from_pretrained("./llama2-finetuned-combined")
# Now you can use the model for inference
inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
After running this, you will find the merged model files in the ./llama2-finetuned-combined directory.
To convert the Hugging Face model to GGUF and quantize it, see the following blog post:
Huggingface 模型转换成gguf并且量化_sft训练模型转gguf-CSDN博客
Miscellaneous
Convert the Hugging Face model to GGUF format
python convert.py C:\Users\Harry\PycharmProjects\llm-finetuning\llama2-finetuned-combined --outfile C:\Users\Harry\PycharmProjects\llm-finetuning\llama2-finetuned-combined\llama2-7b-chat_f16.gguf --outtype f16
Quantize the GGUF model
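The original post does not show the exact quantization command. The line below is a sketch assuming llama.cpp's quantize tool (the same era of llama.cpp as convert.py and main) and the q4_0 type implied by the file name used in the test command that follows:
quantize C:\Users\Harry\PycharmProjects\llm-finetuning\llama2-finetuned-combined\llama2-7b-chat_f16.gguf C:\Users\Harry\PycharmProjects\llm-finetuning\llama2-finetuned-combined\llama2-7b-chat_f16-q4_0.gguf q4_0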
Run the following command to test the model:
main -m C:\Users\Harry\PycharmProjects\llm-finetuning\llama2-finetuned-combined\llama2-7b-chat_f16-q4_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8
GitHub - huggingface/peft: 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.