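LLMs and unsloth: a detailed guide to unsloth's introduction, installation and usage, and example applications

Introduction to unsloth

0. Features and capabilities:

Features

Capabilities

1. Finetune for free

2. Unsloth.ai news

3. Links and resources

4. Key features

5. Performance benchmarking

Installing and using unsloth

1. Installation instructions

Conda installation

Pip installation

Find your CUDA version via:

For Pytorch 2.1.0: you can update Pytorch via Pip (interchange cu121 / cu118).

For Pytorch 2.1.1: use the "ampere" path for newer RTX 30xx GPUs or higher.

For Pytorch 2.2.0: use the "ampere" path for newer RTX 30xx GPUs or higher.

If you get errors, try the below first, then retry the installation from the start:

For Pytorch 2.2.1:

To troubleshoot installs, try the below (all must succeed). Xformers should mostly all be available.

2. Documentation

DPO support

3. Detailed benchmarking tables

Llama-Factory third-party benchmarking

Performance comparisons between popular models

Mistral 7b

CodeLlama 34b

1 Tesla T4

2 Tesla T4s via DDP

Performance comparisons on Tesla T4 GPUs:

Example applications of unsloth

LLMs and LLaMA3: on Colab (T4 GPU, at least 37 GB), fine-tune LLaMA-3-8b on Chinese corpus data with the unsloth framework (faster, with quantization support) and LoRA, merge the base model with the LoRA adapter, quantize to 4-bit (16-bit HF format → 16-bit GGUF format → 4-bit GGUF format), and finally export the model locally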

Introduction to unsloth

Unsloth is an open-source project that fine-tunes Llama 3, Mistral, and Gemma language models 2-5x faster than Hugging Face while using up to 80% less memory.

Official repository: GitHub - unslothai/unsloth: Finetune Llama 3, Mistral & Gemma LLMs 2-5x faster with 80% less memory

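0. Features and capabilities:

Features

All notebooks are beginner friendly: add your own dataset, click "Run All", and you get a faster fine-tuned model.
Supports several well-known LLMs, including Llama 3, Mistral, and Gemma, with faster and more memory-efficient fine-tuning.
All kernels are written in OpenAI's Triton language, with exact computation and no approximation methods.

Capabilities

Fine-tunes pretrained models.
Works with Hugging Face training loops such as Trainer and SFTTrainer.
Supports continued pretraining and text completion.
Supports DPO (Direct Preference Optimization).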

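1. Finetune for free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.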

Unsloth supports	Free Notebooks	Performance	Memory use
Llama 3 (8B)	▶️ Start for free	2x faster	60% less
Mistral (7B)	▶️ Start for free	2.2x faster	73% less
Gemma (7B)	▶️ Start for free	2.4x faster	71% less
ORPO	▶️ Start for free	1.9x faster	43% less
DPO Zephyr	▶️ Start for free	1.9x faster	43% less
Phi-3 (3.8B)	▶️ Start for free	2x faster	50% less
TinyLlama	▶️ Start for free	3.9x faster	74% less

Benchmarking compared to FA2 + Hugging Face combined.
Kaggle Notebooks for Llama-3 8b, Gemma 7b, Mistral 7b
This conversational notebook is useful for Llama-3, and there is a ChatML notebook for Mistral 7b.
This text completion notebook is for continued pretraining / raw text.

2. Unsloth.ai news

📣 NEW! Llama-3 8b is now supported! Llama-3 70b also works (change the model name in the notebook).
📣 NEW! ORPO support is here!
📣 NEW! Phi-3 3.8b support is here!
📣 NEW! Memory usage is reduced by a further 30%, and fine-tuning LLMs with 4x longer context windows is now supported! No change is required if you use the Unsloth notebooks. To enable it, change just 1 line:

model = FastLanguageModel.get_peft_model(
    model,
    use_gradient_checkpointing = "unsloth", # <<<<<<<
)

📣 CodeGemma now works, along with Gemma 7b and Gemma 2b.
📣 2x faster inference is added for all our models.

3. Links and resources

Type	Links
📚 Wiki & FAQ	Read Our Wiki
Twitter (aka X)	Follow us on X
📜 Documentation	Read The Doc
💾 Installation	unsloth/README.md
🥇 Benchmarking	Performance Tables
🌐 Released Models	Unsloth Releases
✍️ Blog	Read our Blogs

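4. Key features

All kernels are written in OpenAI's Triton language, with a manual backprop engine.
0% loss in accuracy: no approximation methods, everything is exact.
No change of hardware. Supports NVIDIA GPUs since 2018+, with a minimum CUDA Capability of 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc). Check your GPU! GTX 1070 and 1080 work, but are slow.
Works on Linux and on Windows via WSL.
Supports 4-bit and 16-bit QLoRA / LoRA fine-tuning via bitsandbytes.
The open-source version trains 5x faster; see Unsloth Pro for up to 30x faster training!
If you trained a model with Unsloth, you can use this cool sticker!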
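
The CUDA Capability requirement in this list can be checked up front. The sketch below is only an illustration and is not part of Unsloth: `classify_gpu` is a hypothetical helper that takes a (major, minor) compute capability tuple and applies the 7.0 minimum stated above. Treating 8.0 and higher as "Ampere or newer" is an assumption based on NVIDIA's capability numbering.

```python
from typing import Tuple

MIN_CAPABILITY = (7, 0)     # minimum CUDA Capability stated in the feature list
AMPERE_CAPABILITY = (8, 0)  # assumption: Ampere-generation GPUs start at 8.0


def classify_gpu(capability: Tuple[int, int]) -> str:
    """Hypothetical helper: bucket a (major, minor) compute capability."""
    if capability >= AMPERE_CAPABILITY:
        return "supported (Ampere or newer)"
    if capability >= MIN_CAPABILITY:
        return "supported (pre-Ampere)"
    return "below the stated minimum (may work, but slowly)"


if __name__ == "__main__":
    # Example capabilities: T4 is 7.5, A100 is 8.0, GTX 1080 is 6.1.
    for name, cap in [("T4", (7, 5)), ("A100", (8, 0)), ("GTX 1080", (6, 1))]:
        print(f"{name}: {classify_gpu(cap)}")
```

On a machine with PyTorch and a CUDA GPU, the tuple returned by `torch.cuda.get_device_capability()` can be passed straight to this function.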

5. Performance benchmarking

For the full list of reproducible benchmarking tables, go to our website.

1 A100 40GB	🤗Hugging Face	Flash Attention	🦥Unsloth Open Source	🦥Unsloth Pro
Alpaca	1x	1.04x	1.98x	15.64x
LAION Chip2	1x	0.92x	1.61x	20.73x
OASST	1x	1.19x	2.17x	14.83x
Slim Orca	1x	1.18x	2.22x	14.82x

The benchmarking table below was conducted by 🤗Hugging Face.

Free Colab T4	Dataset	🤗Hugging Face	Pytorch 2.1.1	🦥Unsloth	🦥 VRAM reduction
Llama-2 7b	OASST	1x	1.19x	1.95x	-43.3%
Mistral 7b	Alpaca	1x	1.07x	1.56x	-13.7%
Tiny Llama 1.1b	Alpaca	1x	2.06x	3.87x	-73.8%
DPO with Zephyr	Ultra Chat	1x	1.09x	1.55x	-18.6%

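Installing and using unsloth

1. Installation instructions

Conda installation

Select either pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1. If you have mamba, use mamba instead of conda for faster dependency solving. See this GitHub issue for help on debugging Conda installs.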

conda create --name unsloth_env python=3.10
conda activate unsloth_env

conda install pytorch-cuda=<12.1/11.8> pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

pip install --no-deps trl peft accelerate bitsandbytes

Pip installation

Do NOT use this if you have Anaconda. You must use the Conda install method, or else things will break.

Find your CUDA version via:

import torch; print(torch.version.cuda)

For Pytorch 2.1.0: you can update Pytorch via Pip (interchange cu121 / cu118).

Go to PyTorch to learn more. Select either cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100 etc), use the "ampere" path. For Pytorch 2.1.1 and Pytorch 2.2.0, use the corresponding instructions further down instead.

pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
  --index-url https://download.pytorch.org/whl/cu121
# Then run only the ONE line below that matches your CUDA version and GPU:
pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere] @ git+https://github.com/unslothai/unsloth.git"

For Pytorch 2.1.1: use the "ampere" path for newer RTX 30xx GPUs or higher.

pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton \
  --index-url https://download.pytorch.org/whl/cu121
# Then run only the ONE line below that matches your CUDA version and GPU:
pip install "unsloth[cu118-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"

For Pytorch 2.2.0: use the "ampere" path for newer RTX 30xx GPUs or higher.

pip install --upgrade --force-reinstall --no-cache-dir torch==2.2.0 triton \
  --index-url https://download.pytorch.org/whl/cu121
# Then run only the ONE line below that matches your CUDA version and GPU:
pip install "unsloth[cu118-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
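
The extras names in the pip commands above follow a visible pattern: a CUDA tag, an optional `ampere` marker, and an optional Pytorch tag. As a rough sketch, the hypothetical helper below (`unsloth_extra` and `pip_command` are illustrative names, not part of Unsloth) assembles the name from those three choices, so that only the one matching install line needs to be run. It covers only the combinations listed in this section.

```python
# Tags taken from the pip commands listed in this section.
CUDA_TAGS = {"11.8": "cu118", "12.1": "cu121"}
TORCH_TAGS = {"2.1.0": "", "2.1.1": "torch211", "2.2.0": "torch220"}


def unsloth_extra(cuda_version: str, torch_version: str, ampere: bool) -> str:
    """Hypothetical helper: build the extras name, e.g. 'cu121-ampere-torch211'."""
    if cuda_version not in CUDA_TAGS:
        raise ValueError(f"unlisted CUDA version: {cuda_version}")
    if torch_version not in TORCH_TAGS:
        raise ValueError(f"unlisted Pytorch version: {torch_version}")
    parts = [CUDA_TAGS[cuda_version]]
    if ampere:
        parts.append("ampere")
    if TORCH_TAGS[torch_version]:
        parts.append(TORCH_TAGS[torch_version])
    return "-".join(parts)


def pip_command(extra: str) -> str:
    """Format the install line in the same shape as the commands above."""
    return f'pip install "unsloth[{extra}] @ git+https://github.com/unslothai/unsloth.git"'


if __name__ == "__main__":
    print(pip_command(unsloth_extra("12.1", "2.1.1", ampere=True)))
```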

If you get errors, try the below first, then retry the installation from the start:

pip install --upgrade pip

For Pytorch 2.2.1:

# RTX 3090, 4090 Ampere GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes

# Pre Ampere RTX 2080, T4, GTX 1080 GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers trl peft accelerate bitsandbytes

To troubleshoot installs, try the below (all must succeed). Xformers should mostly all be available.

nvcc --version
python -m xformers.info
python -m bitsandbytes

2. Documentation

Go to our Wiki page for saving to GGUF, checkpointing, evaluation and more!
We support Huggingface's TRL, Trainer, Seq2SeqTrainer or even Pytorch code!
We're in 🤗Hugging Face's official docs! Check out the SFT docs and DPO docs!

from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
    "unsloth/llama-3-8b-bnb-4bit", # [NEW] 15 Trillion token Llama-3
    "unsloth/Phi-3-mini-4k-instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
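
To get a feel for how small the LoRA adapter configured above is, the following sketch counts its trainable parameters. A LoRA adapter of rank r on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) weights. The layer shapes below are assumptions about the Llama-3 8B architecture (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers) and do not come from the example itself.

```python
# Assumed Llama-3 8B shapes (not stated in the example above).
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32

# (d_in, d_out) for each module named in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_trainable_params(r: int) -> int:
    """Count LoRA weights: r * (d_in + d_out) per adapted layer, over all layers."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


if __name__ == "__main__":
    print(f"r = 16: {lora_trainable_params(16):,} trainable parameters")
```

Under these assumptions, r = 16 gives about 42 million trainable weights, roughly 0.5% of an 8-billion-parameter model.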

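DPO support

DPO (Direct Preference Optimization), PPO, and reward modelling all seem to work, as per independent third-party testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on a Tesla T4.

We're in 🤗Hugging Face's official docs! Check out the SFT docs and DPO docs!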

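from unsloth import FastLanguageModel, PatchDPOTrainer
PatchDPOTrainer() # Apply Unsloth's DPOTrainer patch
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048 # Assumed value matching the SFT example; the original snippet leaves it undefined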

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()
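
The `beta = 0.1` argument above scales the DPO objective. As a hedged illustration of the standard DPO loss for a single preference pair (a plain-Python sketch, not TRL's or Unsloth's implementation), the function below takes the summed log-probabilities of the chosen and rejected responses under the policy being trained and under the reference model. All names are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    # How much more (or less) likely each response became relative to the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy prefers the chosen response more than the reference does: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

A larger `beta` makes the same shift in log-probabilities move the loss further, which pulls the policy away from the reference more aggressively.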

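3. Detailed benchmarking tables

Click "Code" for fully reproducible examples.
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remain identical.
For the full list of benchmarking tables, go to our website.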

1 A100 40GB	🤗Hugging Face	Flash Attention 2	🦥Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	1.04x	1.98x	2.48x	5.32x	15.64x
code	Code	Code	Code	Code
seconds	1040	1001	525	419	196	67
memory MB	18235	15365	9631	8525
% saved		15.74	47.18	53.25

Llama-Factory third-party benchmarking

Link to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.

Method	Bits	TGS	GRAM	Speed
HF	16	2392	18GB	100%
HF+FA2	16	2954	17GB	123%
Unsloth+FA2	16	4007	16GB	168%
HF	4	2415	9GB	101%
Unsloth+FA2	4	3726	7GB	160%

Performance comparisons between popular models

Benchmarking tables for specific models (Mistral 7b, CodeLlama 34b etc.):

Mistral 7b

1 A100 40GB	Hugging Face	Flash Attention 2	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Mistral 7B Slim Orca	1x	1.15x	2.15x	2.53x	4.61x	13.69x
code	Code	Code	Code	Code
seconds	1813	1571	842	718	393	132
memory MB	32853	19385	12465	10271
% saved		40.99	62.06	68.74

CodeLlama 34b

1 A100 40GB	Hugging Face	Flash Attention 2	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Code Llama 34B	OOM ❌	0.99x	1.87x	2.61x	4.27x	12.82x
code	▶️ Code	Code	Code	Code
seconds	1953	1982	1043	748	458	152
memory MB	40000	33217	27413	22161
% saved		16.96	31.47	44.60

1 Tesla T4

1 T4 16GB	Hugging Face	Flash Attention	Unsloth Open	Unsloth Pro Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	1.09x	1.69x	1.79x	2.93x	8.3x
code	▶️ Code	Code	Code	Code
seconds	1599	1468	942	894	545	193
memory MB	7199	7059	6459	5443
% saved		1.94	10.28	24.39

2 Tesla T4s via DDP

2 T4 DDP	Hugging Face	Flash Attention	Unsloth Open	Unsloth Equal	Unsloth Pro	Unsloth Max
Alpaca	1x	0.99x	4.95x	4.44x	7.28x	20.61x
code	▶️ Code	Code	Code
seconds	9882	9946	1996	2227	1357	480
memory MB	9176	9128	6904	6782
% saved		0.52	24.76	26.09

Performance comparisons on Tesla T4 GPUs:

Time taken for 1 epoch

One Tesla T4 on Google Colab bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K)
Huggingface	1 T4	23h 15m	56h 28m	8h 38m	391h 41m
Unsloth Open	1 T4	13h 7m (1.8x)	31h 47m (1.8x)	4h 27m (1.9x)	240h 4m (1.6x)
Unsloth Pro	1 T4	3h 6m (7.5x)	5h 17m (10.7x)	1h 7m (7.7x)	59h 53m (6.5x)
Unsloth Max	1 T4	2h 39m (8.8x)	4h 31m (12.5x)	0h 58m (8.9x)	51h 30m (7.6x)
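
The speedup factors in parentheses are the Hugging Face time divided by each system's time. A minimal sketch that parses the "23h 15m" style durations from the table and recomputes the Alpaca column (the helper names are illustrative):

```python
import re


def to_minutes(duration: str) -> int:
    """Convert a duration such as '23h 15m' into minutes."""
    match = re.fullmatch(r"(\d+)h (\d+)m", duration.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def speedup(baseline: str, other: str) -> float:
    """Speedup of `other` over `baseline`, rounded to one decimal."""
    return round(to_minutes(baseline) / to_minutes(other), 1)


if __name__ == "__main__":
    huggingface = "23h 15m"  # Alpaca (52K) on 1 T4
    for system, taken in [("Unsloth Open", "13h 7m"),
                          ("Unsloth Pro", "3h 6m"),
                          ("Unsloth Max", "2h 39m")]:
        print(f"{system}: {speedup(huggingface, taken)}x")
```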

Peak Memory Usage

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K)
Huggingface	1 T4	7.3GB	5.9GB	14.0GB	13.3GB
Unsloth Open	1 T4	6.8GB	5.7GB	7.8GB	7.7GB
Unsloth Pro	1 T4	6.4GB	6.4GB	6.4GB	6.4GB
Unsloth Max	1 T4	11.4GB	12.4GB	11.9GB	14.4GB

Performance comparisons on 2 Tesla T4 GPUs via DDP:

Time taken for 1 epoch

Two Tesla T4s on Kaggle bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K) *
Huggingface	2 T4	84h 47m	163h 48m	30h 51m	1301h 24m *
Unsloth Pro	2 T4	3h 20m (25.4x)	5h 43m (28.7x)	1h 12m (25.7x)	71h 40m (18.1x) *
Unsloth Max	2 T4	3h 4m (27.6x)	5h 14m (31.3x)	1h 6m (28.1x)	54h 20m (23.9x) *

Peak Memory Usage on a Multi GPU System (2 GPUs)

System	GPU	Alpaca (52K)	LAION OIG (210K)	Open Assistant (10K)	SlimOrca (518K) *
Huggingface	2 T4	8.4GB | 6GB	7.2GB | 5.3GB	14.3GB | 6.6GB	10.9GB | 5.9GB *
Unsloth Pro	2 T4	7.7GB | 4.9GB	7.5GB | 4.9GB	8.5GB | 4.9GB	6.2GB | 4.7GB *
Unsloth Max	2 T4	10.5GB | 5GB	10.6GB | 5GB	10.6GB | 5GB	10.5GB | 5GB *