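This notebook covers use cases where your data is unlabelled but has features that can be used to cluster it into meaningful categories. The challenge with clustering is making the features that set those clusters apart human-readable, and that is where we will use a GPT model (gpt-3.5-turbo in this notebook) to generate meaningful cluster descriptions. We can then use these descriptions to apply labels to a previously unlabelled dataset.
To feed the model we use embeddings created with the approach shown in the Multiclass classification for transactions notebook, applied to all 359 transactions in the dataset to give us a bigger pool for learning.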
Setup
# Import load_dotenv from the dotenv module,
# which loads environment variables from a .env file
from dotenv import load_dotenv
# Load the environment variables defined in the .env file
load_dotenv()
True
# Import the required libraries
from openai import OpenAI # OpenAI API client
import pandas as pd # data handling
import numpy as np # numerical computing
from sklearn.cluster import KMeans # K-Means clustering
from sklearn.manifold import TSNE # t-SNE dimensionality reduction for visualisation
import matplotlib # plotting library
import matplotlib.pyplot as plt # pyplot interface used for the scatter plots
import os # operating system utilities
from ast import literal_eval # safely parses a string into a Python literal
# Legacy way of setting the API key; not needed here because OpenAI() reads OPENAI_API_KEY from the environment
# openai.api_key = os.getenv("OPENAI_API_KEY")
# Create the OpenAI client
client = OpenAI()
# Model used to generate the cluster descriptions
COMPLETIONS_MODEL = "gpt-3.5-turbo"
# Path to the data file with precomputed embeddings
embedding_path = "data/library_transactions_with_embeddings_359.csv"
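Clustering
We will reuse the approach from the Clustering notebook, running K-Means on the feature embeddings we created earlier. We will then use the Chat Completions endpoint to generate a description for each cluster and judge how well the descriptions fit.
# Import pandas (already imported in Setup; repeated so this cell runs on its own)
import pandas as pd
# Read the CSV file at embedding_path into df
df = pd.read_csv(embedding_path)
# Show the first few rows of df
df.head()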
 | Date | Supplier | Description | Transaction value (£) | combined | n_tokens | embedding |
---|---|---|---|---|---|---|---|
0 | 21/04/2016 | M & J Ballantyne Ltd | George IV Bridge Work | 35098.0 | Supplier: M & J Ballantyne Ltd; Description: G... | 118 | [-0.013169967569410801, -0.004833734128624201,... |
1 | 26/04/2016 | Private Sale | Literary & Archival Items | 30000.0 | Supplier: Private Sale; Description: Literary ... | 114 | [-0.019571533426642418, -0.010801066644489765,... |
2 | 30/04/2016 | City Of Edinburgh Council | Non Domestic Rates | 40800.0 | Supplier: City Of Edinburgh Council; Descripti... | 114 | [-0.0054041435942053795, -6.548957026097924e-0... |
3 | 09/05/2016 | Computacenter Uk | Kelvin Hall | 72835.0 | Supplier: Computacenter Uk; Description: Kelvi... | 113 | [-0.004776035435497761, -0.005533686839044094,... |
4 | 09/05/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 64361.0 | Supplier: John Graham Construction Ltd; Descri... | 117 | [0.003290407592430711, -0.0073441751301288605,... |
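# Read the file with the embeddings
embedding_df = pd.read_csv(embedding_path)
# Parse each embedding string into a Python list, then convert it to a NumPy array
embedding_df["embedding"] = embedding_df.embedding.apply(literal_eval).apply(np.array)
# Stack all embedding vectors vertically into a single matrix (one row per transaction)
matrix = np.vstack(embedding_df.embedding.values)
# Show the shape of the matrix
matrix.shape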
(359, 1536)
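The parsing step above can be checked on a small scale. The following is a minimal sketch that assumes, as in the CSV used here, that each embedding is stored as a stringified list; `toy_df` and `toy_matrix` are hypothetical names and the three-dimensional vectors are made up purely for illustration.

```python
from ast import literal_eval

import numpy as np
import pandas as pd

# Toy stand-in for the CSV: embeddings arrive as strings, not as lists of floats.
toy_df = pd.DataFrame(
    {
        "Supplier": ["A", "B", "C"],
        "embedding": ["[0.1, 0.2, 0.3]", "[0.4, 0.5, 0.6]", "[0.7, 0.8, 0.9]"],
    }
)

# literal_eval turns each string into a Python list; np.array makes it a vector.
toy_df["embedding"] = toy_df["embedding"].apply(literal_eval).apply(np.array)

# Stack the vectors so that each row of the matrix is one record.
toy_matrix = np.vstack(toy_df["embedding"].values)
print(toy_matrix.shape)  # (3, 3)
```

With the real file the same steps give one row per transaction and one column per embedding dimension, which matches the (359, 1536) shape shown above.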
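# Use 5 clusters
n_clusters = 5
# Initialise KMeans with the number of clusters, k-means++ initialisation, a fixed random seed, and 10 initialisation runs (n_init)
kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
# Fit the KMeans model to the embedding matrix
kmeans.fit(matrix)
# Get the cluster label assigned to each data point
labels = kmeans.labels_
# Add the cluster labels to embedding_df
embedding_df["Cluster"] = labels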
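The notebook fixes the number of clusters at 5. One common way to sanity-check such a choice, which is not part of the original notebook, is to compare silhouette scores across several candidate values. The sketch below runs on synthetic two-dimensional blobs rather than the real embeddings; `toy_matrix`, `scores`, and `best_k` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embedding matrix: three well-separated 2-D blobs.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_matrix = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit K-Means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, init="k-means++", random_state=42, n_init=10)
    candidate_labels = candidate.fit_predict(toy_matrix)
    scores[k] = silhouette_score(toy_matrix, candidate_labels)

# A higher silhouette score means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real embeddings the scores are usually much closer together than on this toy data, so the result is best treated as one input alongside a visual check rather than a definitive answer.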
# Create the t-SNE object and set its parameters
tsne = TSNE(n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200)
# Reduce the embedding matrix to two dimensions
vis_dims2 = tsne.fit_transform(matrix)
# Extract the x and y coordinates of the reduced points
x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]
# Loop over the clusters and draw one scatter plot per cluster
for category, color in enumerate(["purple", "green", "red", "blue","yellow"]):
    # Select the x and y coordinates that belong to the current cluster
    xs = np.array(x)[embedding_df.Cluster == category]
    ys = np.array(y)[embedding_df.Cluster == category]
    # Plot the points of the cluster
    plt.scatter(xs, ys, color=color, alpha=0.3)
    # Compute the mean x and y coordinates of the cluster
    avg_x = xs.mean()
    avg_y = ys.mean()
    # Mark the cluster mean with a cross
    plt.scatter(avg_x, avg_y, marker="x", color=color, s=100)
# Set the plot title
plt.title("Clusters identified visualized in language 2d using t-SNE")
Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')
# Read 10 transactions per cluster, since we expect some variation within each cluster
transactions_per_cluster = 10
# Loop over the clusters
for i in range(n_clusters):
    print(f"Cluster {i} Theme:\n")
    # Take the transactions in cluster i, shorten the field prefixes, and sample a fixed number of them
    transactions = "\n".join(
        embedding_df[embedding_df.Cluster == i]
        .combined.str.replace("Supplier: ", "")
        .str.replace("Description: ", ": ")
        .str.replace("Value: ", ": ")
        .sample(transactions_per_cluster, random_state=42)
        .values
    )
    # Ask the chat model to describe what the sampled transactions have in common
    completion = client.chat.completions.create(
        model=COMPLETIONS_MODEL,
        # The prompt tells the model what sort of description we are looking for
        messages=[
            {"role": "user",
             "content": f'''We want to group these transactions into meaningful clusters so we can target the areas we are spending the most money.\nWhat do the following transactions have in common?\n\nTransactions:\n"""\n{transactions}\n"""\n\nTheme:'''}
        ],
        temperature=0,
        max_tokens=100,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    # Print the generated description
    print(completion.choices[0].message.content.replace("\n", ""))
    print("\n")
    # Draw the same sample from cluster i (same random_state) and print supplier and description
    sample_cluster_rows = embedding_df[embedding_df.Cluster == i].sample(transactions_per_cluster, random_state=42)
    for j in range(transactions_per_cluster):
        print(sample_cluster_rows.Supplier.values[j], end=", ")
        print(sample_cluster_rows.Description.values[j], end="\n")
    print("-" * 100)
    print("\n")
Cluster 0 Theme:
The common theme among these transactions is that they all involve spending money on various expenses such as electricity, non-domestic rates, IT equipment, computer equipment, and the purchase of an electric van.
EDF ENERGY, Electricity Oct 2019 3 buildings
City Of Edinburgh Council, Non Domestic Rates
EDF, Electricity
EX LIBRIS, IT equipment
City Of Edinburgh Council, Non Domestic Rates
CITY OF EDINBURGH COUNCIL, Rates for 33 Salisbury Place
EDF Energy, Electricity
XMA Scotland Ltd, IT equipment
Computer Centre UK Ltd, Computer equipment
ARNOLD CLARK, Purchase of an electric van
----------------------------------------------------------------------------------------------------
Cluster 1 Theme:
The common theme among these transactions is that they all involve payments for various goods and services. Some specific examples include student bursary costs, collection of papers, architectural works, legal deposit services, papers related to Alisdair Gray, resources on slavery abolition and social justice, collection items, online/print subscriptions, ALDL charges, and literary/archival items.
Institute of Conservation, This payment covers 2 invoices for student bursary costs
PRIVATE SALE, Collection of papers of an individual
LEE BOYD LIMITED, Architectural Works
ALDL, Legal Deposit Services
RICK GEKOSKI, Papers 1970's to 2019 Alisdair Gray
ADAM MATTHEW DIGITAL LTD, Resource - slavery abolution and social justice
PROQUEST INFORMATION AND LEARN, This payment covers multiple invoices for collection items
LM Information Delivery UK LTD, Payment of 18 separate invoice for Online/Print subscriptions Jan 20-Dec 20
ALDL, ALDL Charges
Private Sale, Literary & Archival Items
----------------------------------------------------------------------------------------------------
Cluster 2 Theme:
The common theme among these transactions is that they all involve spending money at Kelvin Hall.
CBRE, Kelvin Hall
GLASGOW CITY COUNCIL, Kelvin Hall
University Of Glasgow, Kelvin Hall
GLASGOW LIFE, Oct 20 to Dec 20 service charge - Kelvin Hall
Computacenter Uk, Kelvin Hall
XMA Scotland Ltd, Kelvin Hall
GLASGOW LIFE, Service Charges Kelvin Hall 01/07/19-30/09/19
Glasgow Life, Kelvin Hall Service Charges
Glasgow City Council, Kelvin Hall
GLASGOW LIFE, Quarterly service charge KH
----------------------------------------------------------------------------------------------------
Cluster 3 Theme:
The common theme among these transactions is that they all involve payments for facility management fees and services provided by ECG Facilities Service.
ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, Facilities Management Charge
ECG FACILITIES SERVICE, Inspection and Maintenance of all Library properties
ECG Facilities Service, Facilities Management Charge
ECG FACILITIES SERVICE, Maintenance contract - October
ECG FACILITIES SERVICE, Electrical and mechanical works
ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, CB Bolier Replacement (1),USP Batteries,Gutter Works & Cleaning of pigeon fouling
ECG Facilities Service, Facilities Management Charge
ECG Facilities Service, Facilities Management Charge
----------------------------------------------------------------------------------------------------
Cluster 4 Theme:
The common theme among these transactions is that they all involve construction or refurbishment work.
M & J Ballantyne Ltd, George IV Bridge Work
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
ARTHUR MCKAY BUILDING SERVICES, Causewayside Work
John Graham Construction Ltd, Causewayside Refurbishment
Morris & Spottiswood Ltd, George IV Bridge Work
ECG FACILITIES SERVICE, Causewayside IT Work
John Graham Construction Ltd, Causewayside Refurbishment
----------------------------------------------------------------------------------------------------
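The themes printed above could be turned into labels for the previously unlabelled rows. The following is a minimal sketch of that step: the short label strings are hypothetical summaries written by hand from the themes, and `toy_df` is a small stand-in for `embedding_df` that contains only the columns needed here.

```python
import pandas as pd

# Hypothetical short labels, hand-written from the themes printed above.
cluster_labels = {
    0: "Utilities, rates and IT equipment",
    1: "Collections and subscriptions",
    2: "Kelvin Hall",
    3: "Facilities management",
    4: "Construction and refurbishment",
}

# Toy stand-in for embedding_df: a supplier and the cluster assigned by K-Means.
toy_df = pd.DataFrame(
    {
        "Supplier": ["EDF Energy", "Glasgow Life", "ECG Facilities Service"],
        "Cluster": [0, 2, 3],
    }
)

# Map each numeric cluster id to its human-readable label.
toy_df["Label"] = toy_df["Cluster"].map(cluster_labels)
print(toy_df)
```

Keeping the mapping in a plain dictionary makes it easy to revise the wording of a label after reviewing more samples from a cluster.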
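Conclusion
We now have five new clusters that we can use to describe our data. The visualisation shows some overlap between clusters, and some tuning is needed to get this right, but the model has already made some effective inferences. In particular, it picked up that items including legal deposits were related to literary archival material, which is true even though the prompt gave it no clues about that. With some tuning we can create a base set of clusters and then use them with a multiclass classifier to generalise to other transactional datasets we might use.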
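The last step, generalising the clusters with a multiclass classifier, can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's own method: it uses synthetic 8-dimensional vectors in place of the real embeddings, integer class ids in place of cluster labels, and logistic regression as one possible classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (embedding, cluster label) pairs: 3 classes, 8 dimensions.
rng = np.random.default_rng(0)
n_classes, n_per_class, dim = 3, 40, 8
centers = np.eye(n_classes, dim) * 5.0
X = np.vstack(
    [centers[c] + rng.normal(scale=0.5, size=(n_per_class, dim)) for c in range(n_classes)]
)
y = np.repeat(np.arange(n_classes), n_per_class)

# Hold out a quarter of the rows to estimate how well the labels generalise.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Train a multiclass classifier that predicts the cluster label from the embedding.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With real data, `X` would be the embedding matrix and `y` the cluster labels, and new transactions would need to be embedded with the same embedding model before calling `clf.predict`.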