特定领域的Embeddings模型微调全面指南

假设你正在为医学领域构建一个问答系统。你希望确保当用户提出问题时，系统能够准确地检索相关的医学文章。但是通用的嵌入模型可能在处理医学术语的高度专业化词汇和细微差别时会遇到困难。

这时候，微调就能派上用场了！！！

在这篇博客文章中，我们将深入探讨为特定领域（如医学、法律或金融）微调嵌入模型的过程。我们将为你所在的领域生成一个特定的数据集，并利用它来训练模型，使其更好地理解该领域内的语言模式和概念。

最终，你将拥有一个针对你的领域优化的更强大的嵌入模型，从而实现更准确的检索并提高你的自然语言处理任务的结果。

理解Embeddings概念

嵌入（Embeddings）是文本或图像的强大数值表示形式，能够捕捉语义关系。想象一下，一段文本或音频在多维空间中作为一个点存在，其中相似的词或短语彼此之间的距离更近，而不相似的则相距较远。

嵌入（Embeddings）对于许多自然语言处理任务非常重要，例如：

语义相似度：找出两张图像或两段文本之间的相似程度。

文本分类：根据文本的意义将数据归类。

问答系统：寻找最相关的文档来回答一个问题。

检索增强生成（RAG）：结合检索的嵌入模型和文本生成的语言模型，以提高生成文本的质量和相关性。

Matryoshka 表征学习

Matryoshka 表示学习（MRL）是一种用于创建“可截断”嵌入向量的技术。想象一系列套娃，每个套娃内部都包含一个更小的套娃。MRL 以一种方式嵌入文本，使得早期维度（就像外层的套娃）包含最重要的信息，而后续维度则添加更多细节。这使得在需要时可以仅使用嵌入向量的一部分，从而减少存储和计算成本。

Bge-base-en
由北京人工智能研究院（BAAI）开发的BAAI/bge-base-en-v1.5模型是一个强大的文本嵌入模型。它在各种自然语言处理（NLP）任务中表现出色，并在MTEB和C-MTEB等基准测试中取得了优异成绩。bge-base-en模型对于计算资源有限的应用场景（比如我的情况）来说是一个不错的选择。

为什么微调嵌入模型？
为特定领域微调嵌入模型对于优化检索增强型生成（RAG）系统至关重要。这个过程确保模型对相似性的理解与该领域的具体背景和语言细微差别相匹配。微调后的嵌入模型能够更好地检索问题最相关的文档，从而最终提高RAG系统的准确性和相关性。

数据集格式：为微调打下基础
您可以使用各种数据集格式进行微调。

以下是几种最常见的类型：

正对（Positive Pair）：一对相关的句子（例如，问题和答案）。
三元组（Triplets）：（锚点，正例，反例）三元组，其中锚点与正例相似而与反例不相似。
带有相似度评分的正对（Pair with Similarity Score）：一对带有相似度评分以指示其关系的句子。
带有类别的文本（Texts with Classes）：带有相应类别标签的文本。
在本文中，我们将创建一个由问题和答案组成的数据集来微调我们的bge-base-en-v1.5模型。

损失函数：指导训练过程
损失函数对于训练嵌入模型至关重要。它们衡量模型预测与实际标签之间的差异，提供信号以调整模型权重。

不同的损失函数适合不同的数据集格式：

三元组损失（Triplet Loss）：与（锚点，正例，反例）三元组一起使用，鼓励模型将相似句子放置得更近，不相似的句子放置得更远。
对比损失（Contrastive Loss）：与正对和负对一起使用，鼓励相似句子接近，不相似的句子远离。
余弦相似度损失（Cosine Similarity Loss）：与带有相似度评分的句子对一起使用，鼓励模型生成的嵌入具有与提供的评分匹配的余弦相似度。
套娃损失（Matryoshka Loss）：一种专门设计用于生成可截断的套娃嵌入的损失函数。

代码示例
安装依赖项
我们首先安装必要的库。我们将使用datasets、sentence-transformers和google-generativeai来处理数据集、嵌入模型和文本生成。

apt-get -qq install poppler-utils tesseract-ocr
pip install datasets sentence-transformers google-generativeai
pip install -q --user --upgrade pillow
pip install -q unstructured["all-docs"] pi_heif
pip install -q --upgrade unstructured
pip install --upgrade nltk

我们还将安装unstructured库用于PDF解析和nltk库用于文本处理。

PDF解析与文本提取
我们将使用unstructured库从PDF文件中提取文本和表格。

import nltk
import os 
from unstructured.partition.pdf import partition_pdf
from collections import Counter
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab') 

def process_pdfs_in_folder(folder_path):
    total_text = []  # To accumulate the text from all PDFs

    # Get list of all PDF files in the folder
    pdf_files = [f for f in os.listdir(folder_path) if f.endswith('.pdf')]

    for pdf_file in pdf_files:
        pdf_path = os.path.join(folder_path, pdf_file)
        print(f"Processing: {pdf_path}")

        # Apply the partition logic
        elements = partition_pdf(pdf_path, strategy="auto")

        # Display the types of elements
        display(Counter(type(element) for element in elements))

        # Join the elements to form text and add it to total_text list
        text = "\n\n".join([str(el) for el in elements])
        total_text.append(text)

    # Return the total concatenated text
    return "\n\n".join(total_text)


folder_path = "data"
all_text = process_pdfs_in_folder(folder_path)

我们遍历指定文件夹中的每个PDF文件，并将内容划分为文本、表格和图形。

然后，我们将文本元素组合成单一的文本表示。

自定义文本分块
现在，我们使用nltk将提取出的文本划分为可管理的块。这是为了让文本更适合由大模型（LLM）进行处理所必需的。

import nltk

nltk.download('punkt')

def nltk_based_splitter(text: str, chunk_size: int, overlap: int) -> list:
    """
    Splits the input text into chunks of a specified size, with optional overlap between chunks.

    Parameters:
    - text: The input text to be split.
    - chunk_size: The maximum size of each chunk (in terms of characters).
    - overlap: The number of overlapping characters between consecutive chunks.

    Returns:
    - A list of text chunks, with or without overlap.
    """

    from nltk.tokenize import sent_tokenize

    # Tokenize the input text into individual sentences
    sentences = sent_tokenize(text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # If the current chunk plus the next sentence doesn't exceed the chunk size, add the sentence to the chunk
        if len(current_chunk) + len(sentence) <= chunk_size:
            current_chunk += " " + sentence
        else:
            # Otherwise, add the current chunk to the list of chunks and start a new chunk with the current sentence
            chunks.append(current_chunk.strip())  # Strip to remove leading spaces
            current_chunk = sentence

    # After the loop, if there is any leftover text in the current chunk, add it to the list of chunks
    if current_chunk:
        chunks.append(current_chunk.strip())

    # Handle overlap if it's specified (overlap > 0)
    if overlap > 0:
        overlapping_chunks = []
        for i in range(len(chunks)):
            if i > 0:
                # Calculate the start index for overlap from the previous chunk
                start_overlap = max(0, len(chunks[i-1]) - overlap)
                # Combine the overlapping portion of the previous chunk with the current chunk
                chunk_with_overlap = chunks[i-1][start_overlap:] + " " + chunks[i]
                # Append the combined chunk, making sure it's not longer than chunk_size
                overlapping_chunks.append(chunk_with_overlap[:chunk_size])
            else:
                # For the first chunk, there's no previous chunk to overlap with
                overlapping_chunks.append(chunks[i][:chunk_size])

        return overlapping_chunks  # Return the list of chunks with overlap

    # If overlap is 0, return the non-overlapping chunks
    return chunks

chunks = nltk_based_splitter(text=all_text, 
                                  chunk_size=2048,
                                  overlap=0)

数据集生成器
在本节中我们定义两个函数：

提示函数为Google Gemini创建一个提示，请求基于提供的文本片段生成一个问题答案对。

import google.generativeai as genai
import pandas as pd

# Replace with your valid Google API key
GOOGLE_API_KEY = "xxxxxxxxxxxx"

# Prompt generator with an explicit request for structured output
def prompt(text_chunk):
    return f"""
    Based on the following text, generate one Question and its corresponding Answer.
    Please format the output as follows:
    Question: [Your question]
    Answer: [Your answer]

    Text: {text_chunk}
    """
# Function to interact with Google's Gemini and return a QA pair
def generate_with_gemini(text_chunk:str, temperature:float, model_name:str):
    genai.configure(api_key=GOOGLE_API_KEY)
    generation_config = {"temperature": temperature}

    # Initialize the generative model
    gen_model = genai.GenerativeModel(model_name, generation_config=generation_config)

    # Generate response based on the prompt
    response = gen_model.generate_content(prompt(text_chunk))

    # Extract question and answer from response using keyword
    try:
        question, answer = response.text.split("Answer:", 1)
        question = question.replace("Question:", "").strip()
        answer = answer.strip()
    except ValueError:
        question, answer = "N/A", "N/A"  # Handle unexpected format in response

    return question, answer

generate_with_gemini 函数与 Gemini 模型交互，并使用创建的提示生成问答对。

运行问答生成
使用 process_text_chunks 函数，我们为每个文本片段使用 Gemini 模型生成问答对。

def process_text_chunks(text_chunks:list, temperature:int, model_name=str):
    """
    Processes a list of text chunks to generate questions and answers using a specified model.

    Parameters:
    - text_chunks: A list of text chunks to process.
    - temperature: The sampling temperature to control randomness in the generated outputs.
    - model_name: The name of the model to use for generating questions and answers.

    Returns:
    - A Pandas DataFrame containing the text chunks, questions, and answers.
    """
    results = []

    # Iterate through each text chunk
    for chunk in text_chunks:
        question, answer = generate_with_gemini(chunk, temperature, model_name)
        results.append({"Text Chunk": chunk, "Question": question, "Answer": answer})

    # Convert results into a Pandas DataFrame
    df = pd.DataFrame(results)
    return df
# Process the text chunks and get the DataFrame
df_results = process_text_chunks(text_chunks=chunks, 
                                 temperature=0.7, 
                                 model_name="gemini-1.5-flash")
df_results.to_csv("generated_qa_pairs.csv", index=False)

这些结果然后存储在一个Pandas DataFrame中。

加载数据集
接下来，我们将从CSV文件中生成的问答对加载到HuggingFace数据集中。我们确保数据格式正确，以便进行微调。

from datasets import load_dataset

# Load the CSV file into a Hugging Face Dataset
dataset = load_dataset('csv', data_files='generated_qa_pairs.csv')

def process_example(example, idx):
    return {
        "id": idx,  # Add unique ID based on the index
        "anchor": example["Question"],
        "positive": example["Answer"]
    }
dataset = dataset.map(process_example,
                      with_indices=True , 
                      remove_columns=["Text Chunk", "Question", "Answer"])

加载模型
我们从HuggingFace加载BAAI/bge-base-en-v1.5模型，并确保选择适当的设备进行执行（CPU或GPU）。

import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim
from datasets import load_dataset, concatenate_datasets
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss


model_id = "BAAI/bge-base-en-v1.5" 

# Load a model
model = SentenceTransformer(
    model_id, device="cuda" if torch.cuda.is_available() else "cpu"
)

定义损失函数
在这里，我们配置套娃（马特罗什卡）损失函数，指定用于截断嵌入的维度。

# Important: large to small
matryoshka_dimensions = [768, 512, 256, 128, 64] 
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

内部损失函数MultipleNegativesRankingLoss有助于模型生成适用于检索任务的嵌入。

定义训练参数
我们使用SentenceTransformerTrainingArguments来定义训练参数。这包括输出目录、训练轮数、批量大小、学习率和评估策略。

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="bge-finetuned",                 # output directory and hugging face model ID
    num_train_epochs=1,                         # number of epochs
    per_device_train_batch_size=4,              # train batch size
    gradient_accumulation_steps=16,             # for a global batch size of 512
    per_device_eval_batch_size=16,              # evaluation batch size
    warmup_ratio=0.1,                           # warmup ratio
    learning_rate=2e-5,                         # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                 # use constant learning rate scheduler
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    tf32=True,                                  # use tf32 precision
    bf16=True,                                  # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                      # evaluate after each epoch
    save_strategy="epoch",                      # save after each epoch
    logging_steps=10,                           # log every 10 steps
    save_total_limit=3,                         # save only the last 3 models
    load_best_model_at_end=True,                # load the best model when training ends
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",  # Optimizing for the best ndcg@10 score for the 128 dimension
)

注意：如果您使用的是Tesla T4并在训练过程中遇到错误，请尝试注释掉tf32=True和bf16=True这两行以禁用TF32和BF16精度。

创建评估器
我们创建一个评估器来衡量模型在训练期间的表现。评估器使用InformationRetrievalEvaluator来评估模型在Matryoshka损失中的每个维度上的检索性能。

corpus = dict(
    zip(dataset['train']['id'], 
        dataset['train']['positive'])
)  # Our corpus (cid => document)

queries = dict(
    zip(dataset['train']['id'], 
        dataset['train']['anchor'])
)  # Our queries (qid => question)

# Create a mapping of relevant document (1 in our case) for each query
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids])
for q_id in queries:
    relevant_docs[q_id] = [q_id]

matryoshka_evaluators = []
# Iterate over the different dimensions
for dim in matryoshka_dimensions:
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,  # Truncate the embeddings to a certain dimension
        score_functions={"cosine": cos_sim},
    )
    matryoshka_evaluators.append(ir_evaluator)

# Create a sequential evaluator
evaluator = SequentialEvaluator(matryoshka_evaluators)

微调前评估模型
我们在微调之前评估基础模型以获取性能基准。

results = evaluator(model)

for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")

定义训练器
我们创建一个SentenceTransformerTrainer对象，指定模型、训练参数、数据集、损失函数和评估器。

from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model, # our embedding model
    args=args,  # training arguments we defined above
    train_dataset=dataset.select_columns(
        ["positive", "anchor"]
    ),
    loss=train_loss, # Matryoshka loss
    evaluator=evaluator, # Sequential Evaluator
)

开始微调
调用trainer.train()方法启动微调过程，使用提供的数据和损失函数更新模型的权重。

# start training 
trainer.train()
# save the best model
trainer.save_model()

训练完成后，我们将表现最好的模型保存到指定的输出目录。

微调后的评估
最后，我们加载微调后的模型，并使用相同的评估器来衡量微调后性能的提升。

from sentence_transformers import SentenceTransformer

fine_tuned_model = SentenceTransformer(
    args.output_dir, device="cuda" if torch.cuda.is_available() else "cpu"
)
# Evaluate the model
results = evaluator(fine_tuned_model)

# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")