Local LLM Fine-tuning on Mac (M1 16GB)

A beginner-friendly Python code walkthrough (ft. MLX)

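An earlier article showed how to fine-tune an LLM using a single (free) GPU on Google Colab. While that example (and many others) runs readily on Nvidia hardware, such examples are not easily adapted to M-series Macs. This article walks through a simple way to fine-tune an LLM locally on a Mac.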

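With the rise of open-source large language models (LLMs) and efficient fine-tuning methods, building custom ML solutions has never been easier. Now, anyone with a single GPU can fine-tune an LLM on their local machine.

However, Mac users have largely been left out of this trend because of Apple's M-series chips. These chips use a unified memory framework, which removes the need for a discrete GPU with its own memory. As a result, many (GPU-centric) open-source tools for running and training LLMs are incompatible with (or do not fully utilize) the computing power of modern Macs.

I had almost given up on my dream of training LLMs locally, until I discovered the MLX Python library.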

MLX

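MLX is a Python library developed by Apple's machine learning research team for efficiently running matrix operations on Apple silicon. This matters because matrix operations are the core computations underlying neural networks.

The key benefit of MLX is that it fully exploits the unified memory paradigm of M-series chips, which enables modest systems (like my M1 16GB) to run fine-tuning jobs on large models (e.g. Mistral 7b Instruct).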
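To get a feel for why a 7b model can fit on a 16GB machine at all, a rough back-of-the-envelope estimate of weight storage helps. The sketch below assumes a round figure of 7 billion parameters (an approximation, not an exact count) and only covers the weights themselves. Activations, LoRA gradients, and quantization metadata add to this, so the real footprint is higher.

```python
# Rough estimate of the memory needed just to hold model weights.
# Assumption: ~7e9 parameters, a round number for a "7b" model.


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # hypothetical round parameter count

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights alone would take most of the 16GB, while a 4-bit quantized model leaves room for the rest of the fine-tuning job.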

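While the library does not have high-level abstractions for training models like Hugging Face does, there is an example implementation of LoRA that can be easily hacked and adapted to other use cases.

This is exactly what I do in the example below.

Example code: Fine-tuning Mistral 7b Instruct

Instead of using Hugging Face's Transformers library and Google Colab, I will use the MLX library and my local machine (2020 Mac Mini M1 16GB).

Similar to the previous example, I will fine-tune a quantized version of Mistral-7b-Instruct to respond to YouTube comments in my likeness. I use the QLoRA parameter-efficient fine-tuning method. If you are unfamiliar with QLoRA, I have written an overview of the method separately.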

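1) Setting up the environment

Before running the example code, we need to set up our Python environment. The first step is to download the code from the GitHub repo.

git clone https://github.com/ShawhinT/YouTube-Blog.git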

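The code for this example is in the LLMs/qlora-mlx subdirectory of the repo. We can navigate to this folder and create a new Python environment (here, I call it mlx-env).

# change dir (git clone places the repo in a YouTube-Blog folder)
cd YouTube-Blog/LLMs/qlora-mlx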

# create py venv
python -m venv mlx-env

Next, we activate the environment and install the requirements from the requirements.txt file. Note: mlx requires your system to have an M-series chip, Python >= 3.8, and macOS >= 13.5.
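If you are unsure whether your machine meets these requirements, a quick check from Python can help before installing anything. This is a minimal sketch using only the standard library. It assumes that a native Apple silicon Python reports its architecture as "arm64" (a Python running under Rosetta would report "x86_64" and fail the check), and the function names are my own, not part of mlx.

```python
from __future__ import annotations

import platform
import sys


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a string like '13.5.1' into (13, 5, 1); non-numeric parts are skipped."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def meets_mlx_requirements(machine: str, python_version: tuple[int, ...],
                           macos_version: str) -> bool:
    """Check the three requirements listed above (hypothetical helper)."""
    return (
        machine == "arm64"                              # M-series chip
        and tuple(python_version) >= (3, 8)             # Python >= 3.8
        and parse_version(macos_version) >= (13, 5)     # macOS >= 13.5
    )


ok = meets_mlx_requirements(
    machine=platform.machine(),
    python_version=sys.version_info[:2],
    macos_version=platform.mac_ver()[0],  # empty string on non-macOS systems
)
print("Requirements look satisfied" if ok else "This system may not support mlx")
```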

# activate venv
source mlx-env/bin/activate

# install requirements
pip install -r requirements.txt

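2) Inference with the un-finetuned model

Now that we have mlx and the other dependencies installed, let's run some Python code! We start by importing helpful libraries.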

# import modules (this is Python code now)
import subprocess
from mlx_lm import load, generate

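We will use the subprocess module to run terminal commands from Python, and the mlx-lm library to run inference on our pre-trained model.

mlx-lm is built on top of mlx and is made specifically for running models from the Hugging Face hub. Here is how to use it to generate text from an existing model.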

# define inputs
# note: prompt_builder() is defined in the prompt format snippet further down; run that first
model_path = "mlx-community/Mistral-7B-Instruct-v0.2-4bit"
prompt = prompt_builder("Great content, thank you!")
max_tokens = 140

# load model
model, tokenizer = load(model_path)

# generate response
response = generate(model, tokenizer, prompt=prompt,
                    max_tokens=max_tokens,
                    verbose=True)

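Note: any of the hundreds of models on the Hugging Face mlx-community page can readily be used for inference. If you want to use a model that is not available there (unlikely), you can use the scripts/convert.py script to convert it into a compatible format.

The prompt_builder() function takes a YouTube comment and integrates it into a prompt template, as shown below.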

# prompt format
instructions_string = """ShawGPT, functioning as a virtual data science \
consultant on YouTube, communicates in clear, accessible language, escalating \
to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, \
providing concise acknowledgments to brief expressions of gratitude or \
feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

# define prompt builder: wraps the instructions and the comment in Mistral's [INST] tags
def prompt_builder(comment):
    return f'''<s>[INST] {instructions_string} \n{comment} \n[/INST]\n'''

Here is the model's response to the comment "Great content, thank you!" without fine-tuning.

–ShawGPT: Thank you for your kind words! I'm glad you found the content helpful
and enjoyable. If you have any specific questions or topics you'd like me to 
cover in more detail, feel free to ask!

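While the response is coherent, there are two main problems with it: 1) the signature "-ShawGPT" is placed at the front of the response instead of the end (as instructed), and 2) the response is much longer than how I would actually reply to a comment like this.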
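Both problems can also be checked programmatically, which is handy when comparing many responses. The sketch below is my own illustration, not part of the example repo. It accepts the signature written with either a hyphen or an en dash, since both spellings appear in the prompt and in the replies, and it uses a simple word count as a rough proxy for length.

```python
# Accepted spellings of the signature: hyphen and en dash (\u2013).
SIGNATURES = ("-ShawGPT", "\u2013ShawGPT")


def ends_with_signature(response: str) -> bool:
    """True if the response finishes with the ShawGPT signature."""
    return response.strip().endswith(SIGNATURES)


def word_count(response: str) -> int:
    """Rough length measure: number of whitespace-separated words."""
    return len(response.split())


# The un-finetuned model's response quoted above.
base_response = (
    "\u2013ShawGPT: Thank you for your kind words! I'm glad you found the "
    "content helpful and enjoyable. If you have any specific questions or "
    "topics you'd like me to cover in more detail, feel free to ask!"
)

print("Signature at the end:", ends_with_signature(base_response))
print("Word count:", word_count(base_response))
```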

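3) Preparing training data

Before running the fine-tuning job, we must prepare training, testing, and validation datasets. Here, I use 50 real comments and responses from my YouTube channel for training, and 10 comment/response pairs each for validation and testing (70 examples in total).

A training example is given below. It is in JSON format, i.e. a key-value pair where the key is "text" and the value is the merged prompt, comment, and response.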

{"text": "<s>[INST] ShawGPT, functioning as a virtual data science consultant 
on YouTube, communicates in clear, accessible language, escalating to technical
 depth upon request. It reacts to feedback aptly and ends responses with its 
signature '\u2013ShawGPT'. ShawGPT will tailor the length of its responses to 
match the viewer's comment, providing concise acknowledgments to brief 
expressions of gratitude or feedback, thus keeping the interaction natural and 
engaging.\n\nPlease respond to the following comment.\n \nThis was a very 
thorough introduction to LLMs and answered many questions I had. Thank you. 
\n[/INST]\nGreat to hear, glad it was helpful :) -ShawGPT</s>"}

Code to generate the training, testing, and validation datasets from a .csv file is available on GitHub.
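For readers who want to see the general shape of such a script, here is a minimal sketch. It is not the code from the repo. The column names ("Comment", "Response"), the shortened instructions string, and the tiny made-up rows are all placeholders of mine, and the split sizes are scaled down from the 50/10/10 used in the article.

```python
import csv
import io
import json
import random

# Shortened placeholder for the full ShawGPT instructions used in the prompt.
INSTRUCTIONS = "ShawGPT ... Please respond to the following comment.\n"


def build_example(comment: str, response: str) -> str:
    """Return one JSON line holding the merged prompt, comment, and response."""
    text = f"<s>[INST] {INSTRUCTIONS} \n{comment} \n[/INST]\n{response}</s>"
    return json.dumps({"text": text})


def split_examples(examples, n_train, n_test, seed=0):
    """Shuffle reproducibly, then split into train, test, and validation lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]
    return train, test, valid


def write_jsonl(path, lines):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


# Made-up stand-in for the real .csv file (hypothetical column names).
csv_text = "Comment,Response\n" + "".join(
    f"Sample comment {i},Sample reply {i} -ShawGPT\n" for i in range(7)
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
examples = [build_example(row["Comment"], row["Response"]) for row in rows]

train, test, valid = split_examples(examples, n_train=5, n_test=1)
print(len(train), len(test), len(valid))
```

Each split can then be saved with write_jsonl(), for example to data/train.jsonl, data/test.jsonl, and data/valid.jsonl. Those file names are my assumption about what the training script expects, so check the script before relying on them.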

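4) Fine-tuning the model

With the training data prepared, we can fine-tune the model. Here, I use the lora.py example script created by the mlx team.

This script is saved in the scripts folder of the repo we cloned, and the train/test/val data are saved in the data folder. To run the fine-tuning job, we can run the following terminal command.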

python scripts/lora.py --model mlx-community/Mistral-7B-Instruct-v0.2-4bit \
                       --train \
                       --iters 100 \
                       --steps-per-eval 10 \
                       --val-batches -1 \
                       --learning-rate 1e-5 \
                       --lora-layers 16 \
                       --test

# --train = runs LoRA training
# --iters = number of training steps
# --steps-per-eval = number of training steps between val loss computations
# --val-batches = number of validation batches to use for val loss (-1 = entire validation set)
# --learning-rate (same as default)
# --lora-layers (same as default)
# --test = computes test loss at the end of training

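To get training to run as quickly as possible, I closed all other processes on my machine so that as much memory as possible was available to the fine-tuning process. On my M1 with 16GB of memory, the job took about 15-20 minutes to run and peaked at around 13-14GB of memory.

Note: I had to make one change to lines 340-341 of the lora.py script to avoid overfitting, which was changing the rank of the LoRA adapters from r=8 to r=4.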
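To see what this change does, recall that LoRA freezes a weight matrix of shape d_out x d_in and trains two small matrices of shapes d_out x r and r x d_in in its place, so each adapted matrix has r * (d_in + d_out) trainable parameters. The sketch below works through that arithmetic. The 4096 x 4096 shape is an illustrative assumption of mine, not a value read from lora.py.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one adapted weight matrix."""
    return rank * (d_in + d_out)


# Illustrative square projection matrix (assumed shape).
D_IN, D_OUT = 4096, 4096
full = D_IN * D_OUT

for rank in (8, 4):
    n = lora_trainable_params(D_IN, D_OUT, rank)
    print(f"r={rank}: {n:,} trainable params ({100 * n / full:.2f}% of the full matrix)")
```

Halving the rank halves the number of trainable parameters per adapted matrix, which reduces the adapter's capacity to memorize a small training set of 50 examples.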

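5) Inference with the fine-tuned model

Once training is complete, a file called adapters.npz will appear in the working directory. It contains the LoRA adapter weights.

To run inference with these adapter weights, we can again use lora.py. This time, however, instead of running the script directly from the terminal, I use the subprocess module to run it from Python. This allows me to use the prompt_builder() function defined earlier.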

# define inputs
adapter_path = "adapters.npz" # same as default
max_tokens_str = "140" # must be string

# define command
command = ['python', 'scripts/lora.py', '--model', model_path, 
                                        '--adapter-file', adapter_path, 
                                        '--max-tokens', max_tokens_str, 
                                        '--prompt', prompt]

# run command and print results continuously
run_command_with_live_output(command)

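run_command_with_live_output() is a helper function (courtesy of ChatGPT) that continuously prints the output of a terminal command as it runs. This avoids having to wait until inference is finished to see any output.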

def run_command_with_live_output(command: list[str]) -> None:
    """
    Courtesy of ChatGPT:
    Runs a command and prints its output line by line as it executes.

    Args:
        command (list[str]): The command and its arguments to be executed.

    Returns:
        None
    """
    # Merge stderr into stdout so a full, unread stderr pipe cannot stall the child process
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)

    # Print the output (including any error messages) line by line until the stream closes
    for output in process.stdout:
        print(output.strip())

    # Wait for the process to exit
    process.wait()

Here is the model's response to the same comment (Great content, thank you!), but now after fine-tuning.

Glad you enjoyed it! -ShawGPT

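This response is much better than the one before fine-tuning. The "-ShawGPT" signature is in the right place, and it sounds like something I would actually say.

But that is an easy comment to respond to. Let's look at something more challenging, like the one below.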

Comment: 
I discovered your channel yesterday and I am hucked, great job. 
It would be nice to see a video of fine tuning ShawGPT using HF, I saw a video 
you did running on Colab using Mistal-7b, any chance to do a video using your 
laptop (Mac) or using HF spaces?

Response:
Thanks, glad you enjoyed it! I'm looking forward to doing a fine tuning video 
on my laptop. I've got an M1 Mac Mini that runs the latest versions of the HF 
API. -ShawGPT

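At first glance, this is a great response. The model responds appropriately and signs off properly. It also got lucky in saying that I have an M1 Mac Mini 😉

However, there are two issues with it. First, a Mac Mini is a desktop, not a laptop. Second, the example does not directly use the HF API.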
