
Synthetic data generation (Part 1)


Using large language models (LLMs) to generate synthetic data offers a powerful solution to a commonly faced problem: the availability of high-quality, diverse, and privacy-compliant data. This can be used in numerous scenarios such as training data-science machine-learning models (SVMs, decision trees, KNN), fine-tuning different GPT models on the data, as a solution to cold-start problems, helping build compelling demos/apps with realistic data, scenario testing, etc.

There are a number of key drivers that may see you wanting to leverage synthetic data:
1. Human data may have privacy restrictions and/or contain identifiable data which we do not want to be used.
2. Synthetic data can be much more structured, and therefore easier to manipulate, than real data.
3. In domains where data is sparse, or data of certain categories is scarce, we may want to augment the data.
4. When dealing with imbalanced datasets or datasets which lack diversity, we may want to create data to improve the richness of our datasets.

Unlike traditional data augmentation or manual data creation methods, using LLMs allows for the generation of rich, nuanced, and contextually relevant datasets that can significantly enhance their usefulness to enterprises and developers.

We split this tutorial into 2 parts. In this cookbook we will have the following agenda:
1. CSV with a structured prompt
2. CSV with a Python program
3. Multitable CSV with a Python program
4. Simply creating textual data
5. Dealing with imbalanced or non-diverse textual data
In part 2 we will look at prompting strategies for getting better textual data.

The last two in particular are especially useful for creating synthetic data to fine-tune another GPT model: for example, using higher-quality data produced by gpt-4 to fine-tune the cheaper and faster gpt-3.5 for improved performance while reducing costs.

Getting set up

%pip install openai
%pip install pandas
%pip install scikit-learn
%pip install matplotlib

from openai import OpenAI
import re
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import json
import matplotlib

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

1. CSV with a structured prompt

Here we create data in the simplest way. You can quickly generate data by addressing 3 key points: telling it the format of the data (CSV), the schema, and useful information regarding how the columns relate (the LLM will be able to deduce this from the column names, but a helping hand will improve performance).

datagen_model = "gpt-4-0125-preview"
question = """
Create a CSV file with 10 rows of housing data.
Each row should include the following fields:
- id (incrementing integer starting at 1)
- house size (m^2)
- house price
- location
- number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense). Also only respond with the CSV.
"""

response = client.chat.completions.create(
    model=datagen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
        {"role": "user", "content": question}
    ]
)
res = response.choices[0].message.content
print(res)

```csv
id,house size,house price,location,number of bedrooms
1,100,220000,Suburbs,3
2,80,180000,Suburbs,2
3,120,320000,Suburbs,4
4,65,160000,Countryside,2
5,150,500000,City Center,4
6,90,200000,Countryside,3
7,200,700000,City Center,5
8,180,600000,Suburbs,5
9,70,140000,Countryside,2
10,130,400000,City Center,3
```

2. CSV with a Python program

The problem with generating data directly is that we are limited by context in the amount of data we can generate. Instead, we can ask the LLM to generate a Python program that generates the synthetic data. This allows us to scale to much larger amounts of data, and by inspecting the Python program we can also see exactly how the data was generated.

This then lets us edit the Python program as we wish, while also giving us a good starting point.

question = """
Create a Python program to generate 100 rows of housing data.
I want you to at the end of it output a pandas dataframe with 100 rows of data.
Each row should include the following fields:
- id (incrementing integer starting at 1)
- house size (m^2)
- house price
- location
- number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
"""

response = client.chat.completions.create(
    model=datagen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
        {"role": "user", "content": question}
    ]
)
res = response.choices[0].message.content
print(res)

To generate synthetic housing data and output it as a Pandas DataFrame, we can use Python with the `pandas` and `numpy` libraries. Below is a script that creates 100 rows of housing data considering the prescribed logic for house size, price, and number of bedrooms. It also takes into account the impact of location on house price.

First, ensure you have pandas and numpy installed. You can install them via pip if you haven't already:

```
pip install pandas numpy
```

The script:

```python
import pandas as pd
import numpy as np

# Seed for reproducibility
np.random.seed(42)

# Initialize the lists
ids = list(range(1, 101))
sizes = np.random.normal(150, 50, 100).astype(int) # House sizes with a mean of 150 m^2 and a std of 50
bedrooms = np.random.choice([1, 2, 3, 4, 5], 100) # Number of bedrooms
locations = np.random.choice(['Downtown', 'Suburb', 'Countryside'], 100, p=[0.4, 0.4, 0.2]) # Location of houses with a preferential distribution

# Prices will be influenced by location, size, and bedrooms. This part is simplistic and can be made more complex.
base_price = 100000 # Base price
price_per_m2 = 1000 # Base price per m^2
extra_per_bedroom = 5000 # Extra cost per additional bedroom

prices = []

for i in range(100):
    base_location_multiplier = 1.5 if locations[i] == 'Downtown' else 1.2 if locations[i] == 'Suburb' else 1
    location_multiplier = base_location_multiplier * (1 + (sizes[i] / 1000))  # More expensive if bigger, especially downtown
    price = base_price + (sizes[i] * price_per_m2) + (bedrooms[i] * extra_per_bedroom)
    prices.append(int(price * location_multiplier))

# Create DataFrame
data = {
    'id': ids,
    'house size (m^2)': sizes,
    'number of bedrooms': bedrooms,
    'location': locations,
    'house price': prices
}

df = pd.DataFrame(data)

print(df)
```

This program initializes with a seed for reproducibility while creating randomized but plausible data for housing. The sizes are normally distributed around a mean value, and bedrooms are chosen from a set number. The pricing logic uses base values plus increases according to size, bedroom count, and a location multiplier, with downtown locations inflating prices more than suburbs or countryside locations. Adjustments are simplistic for the purpose of example and can be refined for more nuanced simulations.

We need to make sure we parse the output of this appropriately, as there may often be surrounding text around the Python code. We can also explicitly ask it to state all of the assumptions it made about the data it is generating; in this case, however, it told us these automatically. One way to do that parsing is sketched below.
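For example, a minimal sketch of that parsing (the helper name `extract_python_code` is purely illustrative, not part of the cookbook) pulls the first fenced Python block out of the model response with a regular expression:

```python
import re

def extract_python_code(text: str) -> str:
    """Return the contents of the first fenced Python code block, or the raw text if none is found."""
    match = re.search(r"```python\s*(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

code = extract_python_code(res)
print(code[:200])  # inspect the extracted program before running it
```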

3. Multitable CSV with a Python program

For more complex relationships, however, we need to make sure we specify a few more characteristics.

To create multiple different datasets which relate to each other (e.g. housing, location, house type), as before we need to specify the format, schema, and useful information. However, more useful information is now required to get good performance. It's case-specific, but a good number of things to describe are how the datasets relate to each other, the relative sizes of the datasets, making sure foreign and primary keys are set appropriately, and ideally using previously generated datasets to populate new ones so the actual data values match where necessary.

question = """
Create a Python program to generate 3 different pandas dataframes.

1. Housing data
I want 100 rows. Each row should include the following fields:
- id (incrementing integer starting at 1)
- house size (m^2)
- house price
- location
- number of bedrooms
- house type
+ any relevant foreign keys

2. Location
Each row should include the following fields:
- id (incrementing integer starting at 1)
- country
- city
- population
- area (m^2)
+ any relevant foreign keys

3. House types
- id (incrementing integer starting at 1)
- house type
- average house type price
- number of houses
+ any relevant foreign keys

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
Make sure that the dataframe generally follow common sense checks, e.g. the size of the dataframes make sense in comparison with one another.
Make sure the foreign keys match up and you can use previously generated dataframes when creating each consecutive dataframes.
You can use the previously generated dataframe to generate the next dataframe.
"""

response = client.chat.completions.create(
    model=datagen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
        {"role": "user", "content": question}
    ]
)
res = response.choices[0].message.content
print(res)

To create a Python program generating three Pandas DataFrames as described, I'll lay out a step-by-step process considering the relationships between the different types of data:

1. Install pandas if you haven't yet: `pip install pandas`
2. Import pandas and generate each DataFrame. I'll make some assumptions for the synthetic data to keep it relatively simple.

Let's start coding:

```python
import pandas as pd
import numpy as np

# Generating Location DataFrame
np.random.seed(42) # For reproducibility
location_data = {
    'id': range(1, 11),  # Assuming 10 unique locations
    'country': ['CountryA'] * 5 + ['CountryB'] * 5,
    'city': ['City' + str(i) for i in range(1, 11)],
    'population': np.random.randint(100000, 1000000, size=10),
    'area': np.random.randint(500, 20000, size=10),
}
locations_df = pd.DataFrame(location_data)

# Generating House Types DataFrame
house_types_data = {
    'id': range(1, 5),  # Assuming 4 unique house types
    'house type': ['Villa', 'Apartment', 'Townhouse', 'Bungalow'],
    'average house type price': [300000, 200000, 250000, 220000],  # Just arbitrary prices
    'number of houses': [25, 50, 15, 10],  # Total = 100 houses, matching the housing data requirement
}
house_types_df = pd.DataFrame(house_types_data)

# Generating Housing Data
housing_data = {
    'id': range(1, 101),
    'house size': np.random.randint(50, 500, size=100),
    'house price': [],  # To be calculated based on size, location, etc.
    'location_id': np.random.choice(locations_df['id'], size=100),
    'number of bedrooms': np.random.randint(1, 6, size=100),
    'house_type_id': np.random.choice(house_types_df['id'], size=100),
}
# Simple model to calculate house price based on size, type, and a base price from the location's median
base_prices = locations_df['population'] / 100000  # Simplified assumption: more populous => more expensive
housing_data['house price'] = [
    (1200 * size) + (house_types_df.loc[type_id - 1, 'average house type price']) + (base_prices[loc_id - 1] * 1000)
    for size, type_id, loc_id
    in zip(housing_data['house size'], housing_data['house_type_id'], housing_data['location_id'])
]

housing_df = pd.DataFrame(housing_data)

# Display the first few rows of each DataFrame
print(locations_df.head())
print(house_types_df.head())
print(housing_df.head())
```

Notes:
- This script assumes 10 unique locations and 4 house types for simplicity.
- House prices are arbitrarily calculated using the house size, type, and a base price influenced by the location's population. Reality would require a more complex model.
- `numpy.random.randint` is used to generate integer values. Similarly, `numpy.random.choice` is used to randomly assign locations and house types to each house, demonstrating a form of foreign key relationship.
- For simplicity, foreign keys are represented by corresponding ID fields (e.g., `location_id` in the housing data references the `id` in the location data).

This simple synthetic data generation strategy illustrates creating related data sets with Python and pandas. The synthetic data should make general sense within the constraints provided, but keep in mind that for more complex or realistic data modeling, you'd need to incorporate more detailed rules and possibly real-world data.

4. Simply creating textual data

Here we take our first look at creating textual data. This can be used to fine-tune another GPT model, for example. In this case we imagine ourselves to be a retailer trying to streamline the process of creating descriptions for items they are selling. We again need to specify the format of the data; in particular, in this case we want one which is easy to parse as an output.

The example we consider below is one in which we want to create input-output training pairs to fine-tune a GPT model on. We will have the product name and the category it belongs to as the input, and the output will be a description.

Specifying the structure of the output explicitly and giving commands not to deviate from it helps enforce the output structure. You can run this in a loop and append the data to generate more synthetic data. Again, as before, we will need to parse the data well so that our code further downstream doesn't break.

output_string = ""
for i in range(3):
    question = f"""
    I am creating input output training pairs to fine tune my gpt model. The usecase is a retailer generating a description for a product from a product catalogue. I want the input to be product name and category (to which the product belongs to) and output to be description.
    The format should be of the form:
    1.
    Input: product_name, category
    Output: description
    2.
    Input: product_name, category
    Output: description

    Do not add any extra characters around that formatting as it will make the output parsing break.
    Create as many training pairs as possible.
    """

    response = client.chat.completions.create(
        model=datagen_model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
            {"role": "user", "content": question}
        ]
    )
    res = response.choices[0].message.content
    output_string += res + "\n" + "\n"
print(output_string[:1000])  # Display a truncated response


1.
Input: Northface Waterproof Jacket, Clothing
Output: Stay dry and stylish with the Northface Waterproof Jacket. Perfect for outdoor adventurers and city dwellers alike, this jacket combines cutting-edge waterproof technology with a sleek, modern design. Ideal for unpredictable weather, it ensures you're prepared for anything Mother Nature throws your way.

2.
Input: Apple iPhone 12, Electronics
Output: Experience the next level of innovation with the Apple iPhone 12. Featuring a stunning Super Retina XDR display, a powerful A14 Bionic chip, and advanced dual-camera system, this phone is designed to push the boundaries of what's possible. With 5G capability for super-fast downloads and high-quality streaming, it's the perfect device for tech enthusiasts.

3.
Input: Adidas Ultraboost Sneakers, Footwear
Output: Revolutionize your running experience with Adidas Ultraboost Sneakers. Engineered for long-lasting comfort and superior performance, these sneakers feature the innovative Boost

Note: The output above is truncated. Now we can parse it as below to get a list of products, categories, and their descriptions. For example, let's take a look at the products it has generated.

# Parse the data with a regular expression
pattern = re.compile(r'Input:\s*(.+?),\s*(.+?)\nOutput:\s*(.+?)(?=\n\n|\Z)', re.DOTALL)
matches = pattern.findall(output_string)
products = []
categories = []
descriptions = []

for match in matches:
    product, category, description = match
    products.append(product.strip())
    categories.append(category.strip())
    descriptions.append(description.strip())
products

['Northface Waterproof Jacket',
'Apple iPhone 12',
'Adidas Ultraboost Sneakers',
'LEGO Star Wars Millennium Falcon',
'Vitamix Professional Series 750 Blender',
'Panasonic Lumix GH5 Camera',
'Moleskine Classic Notebook',
'Bodum French Press Coffee Maker',
'Classic White Sneakers',
'Multi-Purpose Blender',
'Eco-Friendly Yoga Mat',
'Organic Green Tea',
'Smart LED Light Bulb',
'Waterproof Hiking Boots',
'Bamboo Toothbrush',
'Modern Minimalist Floor Lamp',
'Classic Leather Office Chair',
'Stainless Steel French Press',
'Eco-Friendly Bamboo Cutting Board',
'Ultimate Gaming Laptop',
'Waterproof Hiking Boots',
'Compact Travel Umbrella',
"Professional Chef's Knife"]

5. Dealing with imbalanced or non-diverse textual data

Some of the most important aspects of generating high-quality synthetic data are accuracy (does the data make sense), consistency (are two separate data points for the same input roughly the same), and diversity (making sure our data distribution matches as much as possible the distribution that exists in production).

To increase the diversity of our data, we start by clustering it. This will give us information about which clusters are underrepresented (an imbalanced dataset) or which data is not addressed at all (widening the data distribution). Then we will either suggest new clusters (using self-reflection type calls to GPT) or ask the next iteration of our synthetic generation calls to explicitly target the underrepresented clusters.

We can then recursively run this generation and cluster-analysis loop to automate generating diverse synthetic data. A rough outline of what that loop looks like is shown below.
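As a sketch only, every helper below is a placeholder standing in for a step implemented later in this section, with stub bodies so the skeleton runs:

```python
import pandas as pd

def generate_examples(topics: list[str]) -> pd.DataFrame:
    # Placeholder: prompt the LLM for input/output pairs covering `topics` and parse them (as in section 4).
    return pd.DataFrame(columns=["Product", "Category", "Description"])

def cluster_examples(df: pd.DataFrame) -> pd.Series:
    # Placeholder: embed the examples and run K-means (shown below with the elbow method).
    return pd.Series([0] * len(df), dtype=int)

def find_gaps(df: pd.DataFrame, labels: pd.Series) -> list[str]:
    # Placeholder: ask the LLM which topics the clusters map to and which topics are missing or underrepresented.
    return ["vehicle", "clothing", "toiletries", "food"]

topics = ["vehicle", "clothing", "toiletries", "food"]
df = generate_examples(topics)
for _ in range(3):  # a few refinement rounds
    labels = cluster_examples(df)
    topics = find_gaps(df, labels)  # underrepresented clusters plus newly suggested topics
    df = pd.concat([df, generate_examples(topics)], ignore_index=True)
```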

For demonstration purposes, we explicitly ask the LLM to generate information about 4 different topic areas: vehicles, clothing, toiletries, and food. We will then cluster the data and see whether it managed to find those 4 topic areas.

output_string = ""
for i in range(3):
    question = f"""
    I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. the category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. and also more importantly the categories should come under 4 main topics: vehicle, clothing, toiletries, food)
    After the number of each example also state the topic area. The format should be of the form:
    1. topic_area
    Input: product_name, category
    Output: description

    Do not add any extra characters around that formatting as it will make the output parsing break.

    Here are some helpful examples so you get the style of output correct.

    1) clothing
    Input: "Shoe Name, Shoes"
    Output: "Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move."
    """

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
            {"role": "user", "content": question}
        ]
    )
    res = response.choices[0].message.content
    output_string += res + "\n" + "\n"
print(output_string[:1000])  # Display a truncated response

2) toiletries
Input: "Toothbrush X5+, Electronic toothbrush"
Output: "Experience a superior cleanse with the Toothbrush X5+. It comes equipped with an advanced sonic technology that guarantees a gentle yet effective clean every time."

3) vehicle
Input: "Pegasus Pro 300, Motorcycle"
Output: "Dominate the road with the stylish Pegasus Pro 300. This motorcycle guarantees a powerful, efficient, and thrilling performance on every ride."

4) food
Input: "Tasty Delight Instant Noodles, Instant food"
Output: "Tasty Delight Instant Noodles offer a quick, delicious meal ready in minutes. The perfect solution for those stepping up their cooking game."

5) clothing
Input: "UltraSport Men's Running Jacket, Sportswear"
Output: "UltraSport Men's Running Jacket combines functionality and style. The breathable material allows for comfortable workouts, even in colder weather."

6) toiletries
Input: "FreshBliss Shower Gel, Bath and body"
Output: "Indulge in luxury every morning with the FreshBliss Showe

Note: The output above is truncated. In the example above, we explicitly include the topic area as part of the response, as this helps condition the subsequent output and tends to give better performance. We can also give it an actual example of what the output should look like, so it gets the style of output right, and this also helps reinforce the structure.

pattern = re.compile(r'(\d+)\) (\w+(?: \w+)?)\s*Input: "(.+?), (.+?)"\s*Output: "(.+?)"', re.DOTALL)
matches = pattern.findall(output_string)


topics = []
products = []
categories = []
descriptions = []

for match in matches:
    number, topic, product, category, description = match
    topics.append(topic)
    products.append(product)
    categories.append(category)
    descriptions.append(description)


products

['Toothbrush X5+',
'Pegasus Pro 300',
'Tasty Delight Instant Noodles',
"UltraSport Men's Running Jacket",
'FreshBliss Shower Gel',
'OceanBlue Yacht 700',
'FarmFresh Organic Apples',
"Elegance Women's Velvet Dress",
"GentleCare Men's Face Wash",
'AquaBreathe',
'Lunar Ride',
'Sunrise Juice',
'TitanFlex',
'GlowRadiant',
'SolarSpeed',
'HealthyBite',
'Brushify',
'Choco Crunchy',
'Super X100',
'Le Bliz',
'Purely Lavender',
'Cheesy Delight',
'EcoSprint',
'Denim Duo',
'Fresh Dawn']

Now we will cluster the data to analyze it. We will use K-means clustering to separate the data. An important parameter of K-means to set is K, the number of clusters.

We know that there should be 4 clusters (4 topics) since we specified this in the prompt: vehicle, clothing, toiletries, food. However, in general for our data we do not know how many clusters exist. Therefore we will use the elbow method to find the optimal number of clusters.

In the elbow method, we iterate through a range of different Ks, each time storing the inertia. The inertia measures the sum of the squared distances between each point in a cluster and the centroid of that cluster, and so tells us how well-separated and dense each cluster is. If we plot K against the inertia, we are able to see how the inertia falls, and where the drop in inertia is least rapid (often making an elbow shape) we can set our optimal number of clusters. You can read more about the elbow method here.

First let's store our data into a pandas dataframe for ease of analysis.

data = {
    'Product': products,
    'Category': categories,
    'Description': descriptions
}

df = pd.DataFrame(data)

Next let's embed our data, as the embeddings are what we will cluster; if the data points are similar, they should be close to each other in vector space.

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")

    response = client.embeddings.create(input=[text], model=model)

    return response.data[0].embedding

embedding_model = "text-embedding-3-small"
df["embedding"] = df.Category.apply(lambda x: get_embedding(x, model=embedding_model))

matrix = np.vstack(df.embedding.values)

Now let's perform the elbow method.

# Determine the optimal number of clusters using the elbow method
inertias = []
range_of_clusters = range(1, 13)  # Adjust the range as necessary

for n_clusters in range_of_clusters:
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
    kmeans.fit(matrix)
    inertias.append(kmeans.inertia_)


This outputs a chart in which we have to visually tell where the optimal cluster point is. We can see below that the inertia gradually decreases rather than showing a sharp elbow, but the point of steepest decrease appears to occur around 3, 4, or 5 clusters, which lines up with our expectations given the prompt.

# Plot the elbow chart
plt.figure(figsize=(10, 6))
plt.plot(range_of_clusters, inertias, '-o')
plt.title('Elbow Method to Determine Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.xticks(range_of_clusters)
plt.show()

Elbow plot

For demonstration purposes we will pick 5 as the optimal number of clusters, to show that it doesn't matter exactly where we pick it as long as we are approximately right. There are numerous correct ways to categorize data. We also store which cluster each data point belongs to.

n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(matrix)
labels = kmeans.labels_
df["Cluster"] = labels

Now we will analyze the cluster data. There are two separate problems we will look to address: 1. imbalanced data, and 2. expanding the data distribution.

First, for the imbalanced data, we count the number of examples in each cluster. Then we select a few examples from each cluster at random and ask the LLM what topics these map to.

cluster_counts = df["Cluster"].value_counts().sort_index()
print(cluster_counts)

Cluster
0 4
1 7
2 6
3 4
4 4
Name: count, dtype: int64

We can see that the topics found here (eco-friendly transportation, luxury and leisure items, personal care products, electronic toothbrushes, and clothing) match up well enough, but not exactly, with our initial prompt of vehicles, clothing, toiletries, and food.

As we chose 5 clusters, it split toiletries into skincare and personal care, but this doesn't affect us too much further downstream.

selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3)).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
I previously generated some examples of input output trainings pairs and then I clustered them based on category. From each cluster I picked 3 example data point which you can find below.
I want you identify the broad topic areas these clusters belong to.
Previous examples:
{formatted_examples}


Your output should be strictly of the format:
Cluster: number, topic: topic
Cluster: number, topic: topic
Cluster: number, topic: topic

Do not add any extra characters around that formatting as it will make the output parsing break.
"""

response = client.chat.completions.create(
    model=datagen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed analyze clustered data"},
        {"role": "user", "content": topic_prompt}
    ]
)
res = response.choices[0].message.content

pattern = r"Cluster: (\d+), topic: ([^\n]+)"
matches = re.findall(pattern, res)
clusters = [{"cluster": int(cluster), "topic": topic} for cluster, topic in matches]
json_output = json.dumps(clusters, indent=2)
print(json_output)

[
  {
    "cluster": 0,
    "topic": "Electronic Toothbrushes"
  },
  {
    "cluster": 1,
    "topic": "Clothing and Apparel"
  },
  {
    "cluster": 2,
    "topic": "Personal Care Products"
  },
  {
    "cluster": 3,
    "topic": "Eco-friendly Transportation"
  },
  {
    "cluster": 4,
    "topic": "Luxury and Leisure Items"
  }
]

We now have the clusters and their counts, so we could prompt the LLM to generate more examples within the topics we want. However, for this example we won't take that further, as the clusters are well split; you would simply follow the procedure above for prompting the model to generate data while passing in the underrepresented topics. A minimal sketch of what such a targeted call might look like is given below.
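For instance, a hedged sketch of such a targeted call could look like the following; the chosen topic and the prompt simply reuse the pattern from earlier in this section and are assumptions rather than cookbook code:

```python
# Hypothetical follow-up: request more data for an underrepresented topic identified above
underrepresented_topic = "Eco-friendly Transportation"  # e.g. a cluster with a low count

question = f"""
I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description.
All examples must belong to the topic area: {underrepresented_topic}.
After the number of each example also state the topic area. The format should be of the form:
1. topic_area
Input: product_name, category
Output: description

Do not add any extra characters around that formatting as it will make the output parsing break.
"""

response = client.chat.completions.create(
    model=datagen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
        {"role": "user", "content": question}
    ]
)
output_string += response.choices[0].message.content + "\n\n"
```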

Next, we will try to deal with increasing the diversity of our data distribution.

First we start in a similar way, finding a few examples from each cluster at random and asking the LLM what topics these map to. In addition, in the same LLM call, we will ask it to generate more topics to increase the diversity of our data. We do this in one call to save time/costs.

selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3)).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
I previously generated some examples of input output trainings pairs and then I clustered them based on category. From each cluster I picked 3 example data point which you can find below.
I want to promote diversity in my examples across categories so follow the procedure below:
1. You must identify the broad topic areas these clusters belong to.
2. You should generate further topic areas which don't exist so I can generate data within these topics to improve diversity.


Previous examples:
{formatted_examples}


Your output should be strictly of the format:

1. Cluster topic mapping
Cluster: number, topic: topic
Cluster: number, topic: topic
Cluster: number, topic: topic

2. New topics
1. topic
2. topic
3. topic
4. topic

Do not add any extra characters around that formatting as it will make the output parsing break. It is very important you stick to that output format
"""

response = client.chat.completions.create(
    model=datagen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to analyze clustered data"},
        {"role": "user", "content": topic_prompt}
    ]
)
res = response.choices[0].message.content
print(res)


1. Cluster topic mapping
Cluster: 0, topic: Electronic/Health Products
Cluster: 1, topic: Fashion and Food
Cluster: 2, topic: Personal Care/Wellness
Cluster: 3, topic: Eco-friendly Transportation
Cluster: 4, topic: Chocolate/Motorcycles

2. New topics
1. Home Automation Gadgets
2. Educational Tools and Apps
3. Renewable Energy Solutions
4. Virtual Reality Experiences

We can see again that we explicitly state the structure the output should follow. I also tell it the purpose of generating the topics (to promote diversity), so the model has the full context.

We then parse the data into a JSON list of cluster mappings and a list of the new topics.

parts = res.split("\n\n")
cluster_mapping_part = parts[0]
new_topics_part = parts[1]

# Parse the cluster topic mapping
cluster_topic_mapping_lines = cluster_mapping_part.split("\n")[1:]  # Skip the "1. Cluster topic mapping" header line
cluster_topic_mapping = [{"cluster": int(line.split(",")[0].split(":")[1].strip()), "topic": line.split(":")[2].strip()} for line in cluster_topic_mapping_lines]

# Parse the new topics
new_topics_lines = new_topics_part.split("\n")[1:]  # Skip the "2. New topics" header line
new_topics = [line.split(". ")[1] for line in new_topics_lines]

cluster_topic_mapping, new_topics

([{'cluster': 0, 'topic': 'Electronic/Health Products'},
{'cluster': 1, 'topic': 'Fashion and Food'},
{'cluster': 2, 'topic': 'Personal Care/Wellness'},
{'cluster': 3, 'topic': 'Eco-friendly Transportation'},
{'cluster': 4, 'topic': 'Chocolate/Motorcycles'}],
['Home Automation Gadgets',
'Educational Tools and Apps',
'Renewable Energy Solutions',
'Virtual Reality Experiences'])

Finally, we can use this information to prompt the model to keep generating synthetic data. We do this by passing all the topics in the JSON list into the prompt below.

output_string = ""
for i in range(3):
    question = f"""
    I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. the category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. and also more importantly the categories should come under some main topics: {[entry['topic'] for entry in cluster_topic_mapping]})
    After the number of each example also state the topic area. The format should be of the form:
    1. topic_area
    Input: product_name, category
    Output: description

    Do not add any extra characters around that formatting as it will make the output parsing break.

    Here are some helpful examples so you get the style of output correct.

    1) clothing
    Input: "Shoe Name, Shoes"
    Output: "Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move."
    """

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
            {"role": "user", "content": question}
        ]
    )
    res = response.choices[0].message.content
    output_string += res + "\n" + "\n"
print(output_string)

You can run this in a loop to append to your previous data; in this way you can keep generating more textual synthetic data to train another GPT model, while making sure that we cater to imbalanced datasets and generate a diversity of data. A sketch of how the newly generated output could be parsed and folded back into the dataframe for the next clustering pass is shown below.
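As a rough sketch, assuming you want to fold the new examples back into `df` for another clustering pass (the regex is the same five-group pattern used earlier, renamed here so it doesn't clash with the later `pattern` variable):

```python
# Parse the newly generated output into rows matching the existing dataframe
pair_pattern = re.compile(r'(\d+)\) (\w+(?: \w+)?)\s*Input: "(.+?), (.+?)"\s*Output: "(.+?)"', re.DOTALL)
new_rows = pd.DataFrame(
    [{"Product": p, "Category": c, "Description": d}
     for _, _, p, c, d in pair_pattern.findall(output_string)]
)

# Append to the existing data and recompute embeddings so the clustering analysis can be re-run
df = pd.concat([df.drop(columns=["embedding", "Cluster"], errors="ignore"), new_rows], ignore_index=True)
df["embedding"] = df.Category.apply(lambda x: get_embedding(x, model=embedding_model))
```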

You have now completed part 1 of the synthetic data generation tutorial, where we have covered:
* CSV with a structured prompt
* CSV with a Python program
* Multitable CSV with a Python program
* Simply creating textual data
* Dealing with imbalanced or non-diverse textual data

In part 2 you will learn techniques for better prompting an LLM to enhance textual synthetic data generation.