Scikit-LLM: A Tool That Combines Large Language Models with scikit-learn

2025/1/11 18:37:32

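Scikit-LLM brings large language models such as ChatGPT into scikit-learn, providing a toolkit for understanding and analyzing text.

With Scikit-LLM you can uncover patterns, sentiment, and context in many kinds of text data, such as customer feedback, social media posts, and news articles.

It combines the strengths of language models and scikit-learn, so you can extract insights from text through the familiar scikit-learn interface.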

Official GitHub: https://github.com/iryna-kondr/scikit-llm

Installing Scikit-LLM

Start by installing Scikit-LLM, a library that integrates scikit-learn with language models. You can install it with pip:

pip install scikit-llm

Getting an OpenAI API key

Scikit-LLM currently works with a specific set of OpenAI models, so you need to supply your own OpenAI API key.

First import the SKLLMConfig class from Scikit-LLM, then set your OpenAI key:

# importing SKLLMConfig to configure OpenAI API (key and Name)
from skllm.config import SKLLMConfig

# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")

# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")
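
Hard-coding the key in a script makes it easy to leak through version control. Here is a minimal sketch of one alternative, assuming the key is stored in an environment variable. The variable name OPENAI_API_KEY and the helper load_openai_key are illustrative choices, not part of Scikit-LLM:

```python
import os


def load_openai_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing.

    Both `var_name` and this helper are assumptions made for illustration.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key


# Usage with the configuration call shown above (left commented out so the
# snippet runs without Scikit-LLM installed):
# from skllm.config import SKLLMConfig
# SKLLMConfig.set_openai_key(load_openai_key())
```

Passing a plain dict as env makes the helper easy to test without touching the real environment.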

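Zero-shot GPT classifier

A useful property of ChatGPT is that it can classify text without being trained specifically for the task. All it needs is a set of descriptive labels.

ZeroShotGPTClassifier is a Scikit-LLM class that lets you create such a model the same way you would create any other scikit-learn classifier.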

# importing zeroshotgptclassifier module and classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# load the demo classification dataset bundled with Scikit-LLM
X, y = get_classification_dataset()

# defining the model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# fitting the data
clf.fit(X, y)

# predicting the data
labels = clf.predict(X)

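Scikit-LLM also checks that the response it receives actually contains a valid label. If it does not, Scikit-LLM picks a label at random, weighting the choice by how often each label appears in the training data.

In short, Scikit-LLM handles the API calls and makes sure you always get a usable label back.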
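
The fallback described above can be sketched in a few lines of plain Python. This is an illustration of frequency-weighted random selection, not Scikit-LLM's actual implementation; the function name fallback_label is hypothetical:

```python
import random
from collections import Counter


def fallback_label(response, training_labels, rng=random):
    """Return `response` if it is a known label, else a frequency-weighted random label."""
    counts = Counter(training_labels)
    cleaned = response.strip()
    if cleaned in counts:
        return cleaned
    # Invalid response: sample a label with probability proportional to its frequency.
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


y_train = ["positive", "positive", "positive", "negative"]
print(fallback_label("positive", y_train))       # valid label, returned unchanged
print(fallback_label("I am not sure", y_train))  # "positive" about 3 times out of 4
```

A frequent label is therefore more likely to be chosen as the substitute, which keeps the fallback consistent with the class distribution seen during fit.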

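What if you have no labelled data?

The interesting part is that you do not need labelled data to train the model at all. You only need to provide a list of candidate labels: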

# importing zeroshotgptclassifier module and classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# load the demo dataset bundled with Scikit-LLM; only the texts are used, for prediction
X, _ = get_classification_dataset()

# defining the model
clf = ZeroShotGPTClassifier()

# there is no training data, so pass only the candidate labels to fit
clf.fit(None, ['positive', 'negative', 'neutral'])

# predicting the labels
labels = clf.predict(X)

Neat, isn't it? You can train a classifier by specifying the candidate labels alone, with no explicitly labelled data.
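
Candidate labels are enough because, conceptually, a zero-shot classifier only has to tell the language model which answers are allowed. The sketch below shows one way such a prompt could be assembled. The wording and the helper build_zero_shot_prompt are assumptions for illustration, not the template Scikit-LLM actually sends:

```python
def build_zero_shot_prompt(text, candidate_labels):
    """Build a classification prompt listing the allowed labels (illustrative wording)."""
    label_list = ", ".join(f"'{label}'" for label in candidate_labels)
    return (
        "Classify the text into exactly one of the following categories: "
        f"{label_list}.\n"
        "Answer with the category name only.\n\n"
        f"Text: {text}"
    )


prompt = build_zero_shot_prompt(
    "The delivery was late and the box was damaged.",
    ["positive", "negative", "neutral"],
)
print(prompt)
```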

Multi-label text classification

# importing Multi-Label zeroshot module and classification dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# load the demo multi-label dataset bundled with Scikit-LLM
X, y = get_multilabel_classification_dataset()

# defining the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting the model
clf.fit(X, y)

# making predictions
labels = clf.predict(X)

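The only difference between the zero-shot and the multi-label zero-shot classifier is that when you create an instance of MultiLabelZeroShotGPTClassifier, you specify the maximum number of labels to assign to each sample (here, max_labels=3).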
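
To make the effect of max_labels concrete, here is a small sketch of how a raw multi-label answer could be validated and capped. The helper clean_multilabel_prediction is hypothetical and only illustrates the idea; it is not taken from Scikit-LLM:

```python
def clean_multilabel_prediction(raw_labels, candidate_labels, max_labels=3):
    """Keep valid, de-duplicated labels in their original order, capped at `max_labels`."""
    allowed = set(candidate_labels)
    kept = []
    for label in raw_labels:
        label = label.strip()
        if label in allowed and label not in kept:
            kept.append(label)
    return kept[:max_labels]


candidates = ["Quality", "Price", "Delivery", "Service", "Product Variety"]
raw = ["Price", "Speed", "Quality", "Price", "Service", "Delivery"]
print(clean_multilabel_prediction(raw, candidates, max_labels=3))
# → ['Price', 'Quality', 'Service']
```

"Speed" is dropped because it is not a candidate label, the repeated "Price" is ignored, and "Delivery" falls outside the three-label cap.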

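What if you have no labelled data (multi-label case)?

In the example above, MultiLabelZeroShotGPTClassifier was trained on labelled data (X and y). You can also train the classifier by providing a list of candidate labels, without labelled data. In that case, y should be of type List[List[str]].

Here is an example of training without labelled data: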

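# importing the multi-label zero-shot classifier and the demo dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# getting classification dataset for prediction only
X, _ = get_multilabel_classification_dataset()

# defining all the labels that can be predicted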
candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety"
]

# creating the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting the labels only
clf.fit(None, [candidate_labels])

# predicting the data
labels = clf.predict(X)
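
Wrapping candidate_labels in an outer list keeps y in the same List[List[str]] shape as real multi-label targets, where each sample has its own list of labels. The sketch below shows why that shape is convenient: in both cases the set of candidate labels can be recovered the same way. The helper collect_candidate_labels is hypothetical and is not Scikit-LLM code:

```python
def collect_candidate_labels(y):
    """Flatten a List[List[str]] into a sorted list of unique labels."""
    unique = set()
    for labels_for_sample in y:
        unique.update(labels_for_sample)
    return sorted(unique)


# Labelled data: one list of labels per sample
y_labelled = [["Quality", "Price"], ["Delivery"], ["Price", "Service"]]
# Labels only: a single inner list holding every candidate label
y_labels_only = [["Quality", "Price", "Delivery", "Service", "Product Variety"]]

print(collect_candidate_labels(y_labelled))
print(collect_candidate_labels(y_labels_only))
```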

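Text vectorization

Text vectorization converts text into numbers so that machines can process and analyze it more easily. GPTVectorizer is a Scikit-LLM class that converts a piece of text, whatever its length, into a fixed-size set of numbers called a vector.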

# Importing the GPTVectorizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTVectorizer

# Creating an instance of the GPTVectorizer class and assigning it to the variable 'model'
model = GPTVectorizer()  

# transforming the texts in X into fixed-dimensional vectors
vectors = model.fit_transform(X)

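Calling fit_transform on the GPTVectorizer instance with the input data X fits the model to the data and converts the texts into fixed-dimensional vectors, which are assigned to the variable vectors.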
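
The key contract is that texts of any length come out as vectors of one fixed size. The toy class below demonstrates that contract without calling any API. It is a stand-in based on token hashing, not an embedding model, and ToyHashingVectorizer is a name invented for this example:

```python
import hashlib


class ToyHashingVectorizer:
    """Stand-in for an embedding model: maps any text to `dim` numbers."""

    def __init__(self, dim=8):
        self.dim = dim

    def fit(self, X, y=None):
        # Nothing to learn; the method exists to mirror the scikit-learn interface.
        return self

    def transform(self, X):
        vectors = []
        for text in X:
            vec = [0.0] * self.dim
            for token in text.lower().split():
                # A stable hash decides which slot each token increments.
                digest = hashlib.md5(token.encode("utf-8")).digest()
                vec[digest[0] % self.dim] += 1.0
            vectors.append(vec)
        return vectors

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


texts = ["Great phone", "The battery died after two days and support never replied"]
vectors = ToyHashingVectorizer(dim=8).fit_transform(texts)
print([len(v) for v in vectors])  # → [8, 8]
```

Unlike this toy, an embedding model places texts with similar meaning close together, which is what makes the vectors useful as features for a downstream classifier.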

The next example combines GPTVectorizer with an XGBoost classifier in a scikit-learn pipeline, so that text preprocessing and classification run as a single estimator:

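# Importing the necessary modules and classes
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from skllm.datasets import get_classification_dataset
from skllm.preprocessing import GPTVectorizer

# Loading the demo dataset and splitting it into training and test sets
X, y = get_classification_dataset()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)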

# Creating an instance of LabelEncoder class
le = LabelEncoder()

# Encoding the training labels 'y_train' using LabelEncoder
y_train_encoded = le.fit_transform(y_train)

# Encoding the test labels 'y_test' using LabelEncoder
y_test_encoded = le.transform(y_test)

# Defining the steps of the pipeline as a list of tuples
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]

# Creating a pipeline with the defined steps
clf = Pipeline(steps)

# Fitting the pipeline on the training data 'X_train' and the encoded training labels 'y_train_encoded'
clf.fit(X_train, y_train_encoded)

# Predicting the labels for the test data 'X_test' using the trained pipeline
yh = clf.predict(X_test)
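
The pipeline trains on integer-encoded labels, so its predictions are integers as well and have to be mapped back to the original label strings. The sketch below reproduces the encode/decode round trip in plain Python, using the sorted-classes convention that LabelEncoder follows. The helper fit_label_mapping is illustrative only:

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer index, in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}


y_train = ["positive", "negative", "neutral", "positive"]
classes, to_index = fit_label_mapping(y_train)

encoded = [to_index[label] for label in y_train]
decoded = [classes[index] for index in encoded]

print(encoded)  # → [2, 0, 1, 2]
print(decoded)  # → ['positive', 'negative', 'neutral', 'positive']
```

With the pipeline above, the equivalent decoding step would be le.inverse_transform(yh).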

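Text summarization

GPT models are very good at summarizing text, which is why Scikit-LLM includes a module called GPTSummarizer.

You can use it in two ways: on its own, or as a preprocessing step before another operation (similar to reducing the size of the data, except that it works on text rather than numbers):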

# Importing the GPTSummarizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTSummarizer

# Importing the get_summarization_dataset function
from skllm.datasets import get_summarization_dataset

# Calling the get_summarization_dataset function
X = get_summarization_dataset()

# Creating an instance of the GPTSummarizer
s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)

# Applying the fit_transform method of the GPTSummarizer instance to the input data 'X'.
# It fits the model to the data and generates the summaries, which are assigned to the variable 'summaries'
summaries = s.fit_transform(X)
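
Because the summaries are generated by a language model, it can be worth checking that they stay within max_words. The sketch below assumes the limit is a target rather than a hard guarantee; the helper over_limit and the example summaries are made up for illustration:

```python
def over_limit(summaries, max_words):
    """Return (index, word_count) for every summary longer than `max_words`."""
    flagged = []
    for index, summary in enumerate(summaries):
        word_count = len(summary.split())
        if word_count > max_words:
            flagged.append((index, word_count))
    return flagged


# Hypothetical summaries, written by hand for this example
example_summaries = [
    "Customer praises fast delivery and helpful support.",
    "The reviewer describes a long series of problems with the product, "
    "the packaging, the courier, and the refund process.",
]
print(over_limit(example_summaries, max_words=15))  # → [(1, 19)]
```

The same check can be run on the summaries variable produced above, for example over_limit(summaries, max_words=15).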