Clustering for Transaction Classification

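This notebook covers use cases where your data is unlabelled but has features that can be used to cluster it into meaningful categories. The challenge with clustering is describing, in human-readable terms, the features that set each cluster apart. This is where we use a GPT model (gpt-3.5-turbo in the code below) to generate meaningful cluster descriptions for us. We can then use these descriptions to apply labels to a previously unlabelled dataset.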

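To feed the model we use embeddings created with the approach shown in the notebook Multiclass classification for transactions Notebook, applied to all 359 transactions in the dataset to give us a bigger pool for learning.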
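As a rough illustration of the text that gets embedded, the sketch below builds one `combined` string per transaction. The "Supplier: ...; Description: ...; Value: ..." layout is an assumption inferred from the `combined` column preview and from the prefixes stripped later in this notebook, and `build_combined` is a hypothetical helper, not a function from the referenced notebook.

```python
def build_combined(supplier: str, description: str, value: float) -> str:
    """Join the transaction fields into one string suitable for embedding.

    The field layout is an assumption inferred from the `combined` column
    preview, not a confirmed recipe from the referenced notebook.
    """
    return (
        f"Supplier: {supplier.strip()}; "
        f"Description: {description.strip()}; "
        f"Value: {value}"
    )


# Example using the first transaction shown in this notebook.
example = build_combined("M & J Ballantyne Ltd", "George IV Bridge Work", 35098.0)
print(example)
```

Each such string would then be sent to an embedding model, and the resulting vector stored alongside the row.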

Setup

# Optional: load environment variables from a .env file
from dotenv import load_dotenv
load_dotenv()

True
# Imports

from openai import OpenAI
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt
import os
from ast import literal_eval

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
COMPLETIONS_MODEL = "gpt-3.5-turbo"

# This path leads to a file with the data and precomputed embeddings.
embedding_path = "data/library_transactions_with_embeddings_359.csv"


Clustering

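We reuse the approach from the Clustering notebook, applying K-Means to the feature embeddings we created previously. We then use the Chat Completions endpoint to generate cluster descriptions and judge how effective they are.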
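K-Means needs the number of clusters chosen up front (this notebook uses five). One common way to sanity-check that choice is the silhouette score. The sketch below is only an illustration on synthetic data: `points` is a stand-in for the embedding matrix, and the candidate range of 2 to 6 clusters is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embedding matrix: three well-separated blobs.
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, init="k-means++", random_state=42, n_init=10)
    candidate_labels = model.fit_predict(points)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(points, candidate_labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On real embeddings the scores are usually much closer together, so the result is better treated as one input to the decision than as a definitive answer.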

df = pd.read_csv(embedding_path)
df.head()

|   | Date | Supplier | Description | Transaction value (£) | combined | n_tokens | embedding |
|---|---|---|---|---|---|---|---|
| 0 | 21/04/2016 | M & J Ballantyne Ltd | George IV Bridge Work | 35098.0 | Supplier: M & J Ballantyne Ltd; Description: G... | 118 | [-0.013169967569410801, -0.004833734128624201,... |
| 1 | 26/04/2016 | Private Sale | Literary & Archival Items | 30000.0 | Supplier: Private Sale; Description: Literary ... | 114 | [-0.019571533426642418, -0.010801066644489765,... |
| 2 | 30/04/2016 | City Of Edinburgh Council | Non Domestic Rates | 40800.0 | Supplier: City Of Edinburgh Council; Descripti... | 114 | [-0.0054041435942053795, -6.548957026097924e-0... |
| 3 | 09/05/2016 | Computacenter Uk | Kelvin Hall | 72835.0 | Supplier: Computacenter Uk; Description: Kelvi... | 113 | [-0.004776035435497761, -0.005533686839044094,... |
| 4 | 09/05/2016 | John Graham Construction Ltd | Causewayside Refurbishment | 64361.0 | Supplier: John Graham Construction Ltd; Descri... | 117 | [0.003290407592430711, -0.0073441751301288605,... |
embedding_df = pd.read_csv(embedding_path)
embedding_df["embedding"] = embedding_df.embedding.apply(literal_eval).apply(np.array)
matrix = np.vstack(embedding_df.embedding.values)
matrix.shape

(359, 1536)
n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
kmeans.fit(matrix)
labels = kmeans.labels_
embedding_df["Cluster"] = labels

tsne = TSNE(
    n_components=2, perplexity=15, random_state=42, init="random", learning_rate=200
)
vis_dims2 = tsne.fit_transform(matrix)

x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]

for category, color in enumerate(["purple", "green", "red", "blue", "yellow"]):
    xs = np.array(x)[embedding_df.Cluster == category]
    ys = np.array(y)[embedding_df.Cluster == category]
    plt.scatter(xs, ys, color=color, alpha=0.3)

    # Mark each cluster's mean position with an "x"
    avg_x = xs.mean()
    avg_y = ys.mean()

    plt.scatter(avg_x, avg_y, marker="x", color=color, s=100)
plt.title("Clusters identified visualized in language 2d using t-SNE")


Text(0.5, 1.0, 'Clusters identified visualized in language 2d using t-SNE')

# We'll read 10 transactions per cluster as we're expecting some variation
transactions_per_cluster = 10

for i in range(n_clusters):
    print(f"Cluster {i} Theme:\n")

    # Sample transactions from this cluster and strip the field prefixes
    transactions = "\n".join(
        embedding_df[embedding_df.Cluster == i]
        .combined.str.replace("Supplier: ", "")
        .str.replace("Description: ", ": ")
        .str.replace("Value: ", ": ")
        .sample(transactions_per_cluster, random_state=42)
        .values
    )
    response = client.chat.completions.create(
        model=COMPLETIONS_MODEL,
        # We'll include a prompt to instruct the model what sort of description we're looking for
        messages=[
            {"role": "user",
             "content": f'''We want to group these transactions into meaningful clusters so we can target the areas we are spending the most money.
What do the following transactions have in common?\n\nTransactions:\n"""\n{transactions}\n"""\n\nTheme:'''}
        ],
        temperature=0,
        max_tokens=100,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    print(response.choices[0].message.content.replace("\n", ""))
    print("\n")

    # Print the same sampled rows so the theme can be checked against them
    sample_cluster_rows = embedding_df[embedding_df.Cluster == i].sample(transactions_per_cluster, random_state=42)
    for j in range(transactions_per_cluster):
        print(sample_cluster_rows.Supplier.values[j], end=", ")
        print(sample_cluster_rows.Description.values[j], end="\n")

    print("-" * 100)
    print("\n")


Cluster 0 Theme:

The common theme among these transactions is that they all involve spending money on various expenses such as electricity, non-domestic rates, IT equipment, computer equipment, and the purchase of an electric van.


EDF ENERGY, Electricity Oct 2019 3 buildings
City Of Edinburgh Council, Non Domestic Rates
EDF, Electricity
EX LIBRIS, IT equipment
City Of Edinburgh Council, Non Domestic Rates
CITY OF EDINBURGH COUNCIL, Rates for 33 Salisbury Place
EDF Energy, Electricity
XMA Scotland Ltd, IT equipment
Computer Centre UK Ltd, Computer equipment
ARNOLD CLARK, Purchase of an electric van
----------------------------------------------------------------------------------------------------


Cluster 1 Theme:

The common theme among these transactions is that they all involve payments for various goods and services. Some specific examples include student bursary costs, collection of papers, architectural works, legal deposit services, papers related to Alisdair Gray, resources on slavery abolition and social justice, collection items, online/print subscriptions, ALDL charges, and literary/archival items.


Institute of Conservation, This payment covers 2 invoices for student bursary costs
PRIVATE SALE, Collection of papers of an individual
LEE BOYD LIMITED, Architectural Works
ALDL, Legal Deposit Services
RICK GEKOSKI, Papers 1970's to 2019 Alisdair Gray
ADAM MATTHEW DIGITAL LTD, Resource - slavery abolution and social justice
PROQUEST INFORMATION AND LEARN, This payment covers multiple invoices for collection items
LM Information Delivery UK LTD, Payment of 18 separate invoice for Online/Print subscriptions Jan 20-Dec 20
ALDL, ALDL Charges
Private Sale, Literary & Archival Items
----------------------------------------------------------------------------------------------------


Cluster 2 Theme:

The common theme among these transactions is that they all involve spending money at Kelvin Hall.


CBRE, Kelvin Hall
GLASGOW CITY COUNCIL, Kelvin Hall
University Of Glasgow, Kelvin Hall
GLASGOW LIFE, Oct 20 to Dec 20 service charge - Kelvin Hall
Computacenter Uk, Kelvin Hall
XMA Scotland Ltd, Kelvin Hall
GLASGOW LIFE, Service Charges Kelvin Hall 01/07/19-30/09/19
Glasgow Life, Kelvin Hall Service Charges
Glasgow City Council, Kelvin Hall
GLASGOW LIFE, Quarterly service charge KH
----------------------------------------------------------------------------------------------------


Cluster 3 Theme:

The common theme among these transactions is that they all involve payments for facility management fees and services provided by ECG Facilities Service.


ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, Facilities Management Charge
ECG FACILITIES SERVICE, Inspection and Maintenance of all Library properties
ECG Facilities Service, Facilities Management Charge
ECG FACILITIES SERVICE, Maintenance contract - October
ECG FACILITIES SERVICE, Electrical and mechanical works
ECG FACILITIES SERVICE, This payment covers multiple invoices for facility management fees
ECG FACILITIES SERVICE, CB Bolier Replacement (1),USP Batteries,Gutter Works & Cleaning of pigeon fouling
ECG Facilities Service, Facilities Management Charge
ECG Facilities Service, Facilities Management Charge
----------------------------------------------------------------------------------------------------


Cluster 4 Theme:

The common theme among these transactions is that they all involve construction or refurbishment work.


M & J Ballantyne Ltd, George IV Bridge Work
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
John Graham Construction Ltd, Causewayside Refurbishment
ARTHUR MCKAY BUILDING SERVICES, Causewayside Work
John Graham Construction Ltd, Causewayside Refurbishment
Morris & Spottiswood Ltd, George IV Bridge Work
ECG FACILITIES SERVICE, Causewayside IT Work
John Graham Construction Ltd, Causewayside Refurbishment
----------------------------------------------------------------------------------------------------

Conclusion

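We now have five new clusters that we can use to describe our data. The visualisation shows some overlap between clusters, so some tuning is still needed, but the model has already made some effective inferences. In particular, it picked up that items including legal deposit services were related to literary and archival material, which is true even though the model was given no hints about it. With some tuning we can create a base set of clusters and then use it with a multiclass classifier to generalise to other transaction datasets we might use.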
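The generated themes can be condensed into short labels and written back onto the dataset, which is the step that turns clusters into training labels for a classifier. A minimal sketch, assuming hand-written label names (hypothetical, paraphrased from the themes printed above) and a tiny stand-in for `embedding_df`:

```python
import pandas as pd

# Hypothetical short labels, written by hand after reading the generated themes.
cluster_labels = {
    0: "Utilities, rates and IT equipment",
    1: "Collections and subscriptions",
    2: "Kelvin Hall",
    3: "Facilities management",
    4: "Construction and refurbishment",
}

# Small stand-in for `embedding_df`; the real frame has 359 rows.
demo_df = pd.DataFrame(
    {
        "Supplier": ["EDF Energy", "GLASGOW LIFE", "John Graham Construction Ltd"],
        "Cluster": [0, 2, 4],
    }
)

# Map each numeric cluster ID to its human-readable label.
demo_df["Label"] = demo_df["Cluster"].map(cluster_labels)
print(demo_df)
```

The same `map` call applied to `embedding_df["Cluster"]` would label every transaction. Reviewing a sample of rows per label before training on them is advisable, since overlapping clusters can produce noisy labels.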