
How to use guardrails


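In this notebook we share examples of how to implement guardrails for your LLM applications. A guardrail is a generic term for detective controls that aim to steer your application. Given the inherent randomness of LLMs, greater steerability is a common requirement, so creating effective guardrails has become one of the most common areas of performance optimization when pushing an LLM from prototype to production.

Guardrails are highly diverse and can be deployed in virtually any context where you can imagine something going wrong with an LLM. This notebook aims to give simple examples that can be extended to meet your own use case, and to outline the trade-offs to consider when deciding whether to implement a guardrail and how to do it.

This notebook will focus on:
1. Input guardrails, which flag inappropriate content before it gets to your LLM
2. Output guardrails, which validate what your LLM has produced before it gets to the customer

Note: This notebook treats guardrails as a generic term for detective controls around an LLM. For libraries that provide pre-built guardrail frameworks, check out the following:
- NeMo Guardrails
- Guardrails AI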

import openai

GPT_MODEL = 'gpt-3.5-turbo'

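1. Input guardrails

Input guardrails aim to prevent inappropriate content from getting to the LLM in the first place. Some common use cases are:
- Topical guardrails: Identify when a user asks an off-topic question and advise them on which topics the LLM can help with.
- Jailbreaking: Detect when a user is trying to hijack the LLM and override its prompting.
- Prompt injection: Detect when a user tries to hide malicious code that would be executed in any downstream functions the LLM calls.

In all of these cases they act as a preventative control, running either before or in parallel with the LLM, and triggering your application to behave differently if one of these criteria is met.

Designing a guardrail

When designing guardrails it is important to consider the trade-off between accuracy, latency, and cost: you want to achieve maximum accuracy with the least impact on your bottom line and on the user's experience.

We'll begin with a simple topical guardrail that aims to detect off-topic questions and prevent the LLM from answering when it is triggered. This guardrail consists of a simple prompt and uses gpt-3.5-turbo, favoring low latency and cost over accuracy. If we wanted to optimize further, we could consider:
- Accuracy: You could use a fine-tuned model or few-shot examples to increase accuracy. RAG can also be effective if you have a corpus of information that helps determine whether a piece of content is allowed.
- Latency/Cost: You could try fine-tuning smaller models, such as babbage-002, or open-source offerings such as Llama, which can perform quite well when given enough training examples. With open-source offerings you can also tune the machines you use for inference to reduce either cost or latency.

This simple guardrail aims to ensure the LLM only answers questions on a predefined set of topics, and responds to out-of-bounds queries with a canned message.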
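As one possible way to act on the accuracy point above, few-shot examples can be placed in the guardrail's message list ahead of the real request. The sketch below is illustrative only: the helper name `build_topical_guardrail_messages` and the example questions are assumptions that do not appear in the original notebook, while the system prompt text is the one used by the topical guardrail later in this notebook.

```python
# Hypothetical labelled examples; in practice these would be drawn from real traffic.
FEW_SHOT_EXAMPLES = [
    ("How often should I feed my kitten?", "allowed"),
    ("Which dog toys are safe for puppies?", "allowed"),
    ("What is the capital of France?", "not_allowed"),
    ("Can you recommend a good horse saddle?", "not_allowed"),
]

GUARDRAIL_SYSTEM_PROMPT = (
    "Your role is to assess whether the user question is allowed or not. "
    "The allowed topics are cats and dogs. If the topic is allowed, say "
    "'allowed' otherwise say 'not_allowed'"
)


def build_topical_guardrail_messages(user_request, examples=FEW_SHOT_EXAMPLES):
    """Return a chat message list with labelled examples before the real request."""
    messages = [{"role": "system", "content": GUARDRAIL_SYSTEM_PROMPT}]
    for question, label in examples:
        # Each example is a user turn followed by the desired assistant verdict.
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": user_request})
    return messages
```

The returned list could be passed as the `messages` argument of the guardrail's chat completion call. Each added example increases prompt length, so this trades a little latency and cost for accuracy.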

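Embrace async

A common design to minimize latency is to send your guardrails asynchronously alongside your main LLM call. If a guardrail is triggered you send back its response; otherwise you send back the LLM's response.

We'll use this approach, creating an execute_chat_with_guardrail function that runs our LLM's get_chat_response and the topical_guardrail guardrail in parallel, and returns the LLM's response only if the guardrail returns allowed.

Limitations

You should always consider the limitations of guardrails when developing your design. A few of the key ones to be aware of are:
- When using LLMs as a guardrail, be aware that they have the same vulnerabilities as your base LLM call. For example, a prompt injection attempt could succeed in evading both your guardrail and your actual LLM call.
- As conversations get longer, LLMs are more susceptible to jailbreaking because your instructions become diluted by the extra text.
- Guardrails can harm the user experience if you make them overly restrictive to compensate for the issues above. This manifests as over-refusals, where your guardrails reject innocuous user requests because they resemble prompt injection or jailbreaking attempts.

Mitigations

Combining guardrails with rule-based or more traditional machine learning models for detection can mitigate some of these risks. We've also seen customers apply guardrails that only ever consider the latest message, to reduce the risk of the model being confused by a long conversation.

We also recommend rolling out gradually and actively monitoring conversations, so that you can pick up instances of prompt injection or jailbreaking and either add more guardrails to cover these new types of behaviour or include them as training examples for your existing guardrails.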
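The pattern described above can be sketched without any API calls. In the block below, `fake_chat` and `fake_guardrail` are hypothetical stand-ins that simulate latency with `asyncio.sleep`, and `run_with_guardrail` is an illustrative name, so this shows only the control flow, not the notebook's own implementation.

```python
import asyncio


async def fake_chat(user_request, delay=0.05):
    """Stand-in for the main LLM call (hypothetical; no API request is made)."""
    await asyncio.sleep(delay)
    return f"LLM answer to: {user_request}"


async def fake_guardrail(user_request, delay=0.01):
    """Stand-in for the guardrail call; allows only requests mentioning cats or dogs."""
    await asyncio.sleep(delay)
    text = user_request.lower()
    return "allowed" if ("cat" in text or "dog" in text) else "not_allowed"


async def run_with_guardrail(user_request):
    # Start both calls so they overlap instead of running back to back.
    guardrail_task = asyncio.create_task(fake_guardrail(user_request))
    chat_task = asyncio.create_task(fake_chat(user_request))

    # The guardrail verdict decides whether the chat answer is ever used.
    if await guardrail_task == "not_allowed":
        chat_task.cancel()  # stop waiting on an answer we will not return
        return "I can only talk about cats and dogs."
    return await chat_task
```

The two calls only overlap when the underlying work is awaitable. A blocking call inside an `async def` holds the event loop until it returns, so with a real model you would likely want a non-blocking client such as the `AsyncOpenAI` class in the openai Python package.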
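The first two mitigations above (a rule-based check, and looking only at the latest message) could be combined roughly as follows. This is a minimal sketch under stated assumptions: the patterns in `SUSPICIOUS_PATTERNS` and the function names are hypothetical, and a real deployment would maintain and evaluate its own rules.

```python
import re

# Hypothetical patterns, for illustration only.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all|any|the)? ?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) system prompt", re.IGNORECASE),
]


def latest_user_message(messages):
    """Return the content of the most recent user message, or None if there is none."""
    for message in reversed(messages):
        if message["role"] == "user":
            return message["content"]
    return None


def rule_based_check(messages):
    """Return 'not_allowed' if the latest user message matches a known pattern."""
    text = latest_user_message(messages)
    if text is None:
        return "allowed"
    if any(pattern.search(text) for pattern in SUSPICIOUS_PATTERNS):
        return "not_allowed"
    return "allowed"
```

A cheap check like this can run before the LLM-based guardrail, so that obvious attempts are rejected without a model call. It will miss paraphrased attacks, which is why it complements an LLM guardrail rather than replacing it.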

system_prompt = "You are a helpful assistant."

bad_request = "I want to talk about horses"
good_request = "What are the best breeds of dog for people that like cats?"

import asyncio


# Note: openai.chat.completions.create is a synchronous call, so it blocks the
# event loop until it returns; the log output below reflects that ordering.
async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")

    return response.choices[0].message.content


async def topical_guardrail(user_request):
    print("Checking topical guardrail")
    messages = [
        {
            "role": "system",
            "content": "Your role is to assess whether the user question is allowed or not. The allowed topics are cats and dogs. If the topic is allowed, say 'allowed' otherwise say 'not_allowed'",
        },
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0
    )

    print("Got guardrail response")
    return response.choices[0].message.content


async def execute_chat_with_guardrail(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                # The guardrail fired: drop the chat call and return the canned message.
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            elif chat_task in done:
                chat_response = chat_task.result()
                return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again

# Call the main function with the good request - this should go through
response = await execute_chat_with_guardrail(good_request)
print(response)

Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
If you're a cat lover considering getting a dog, it's important to choose a breed that typically has a more cat-like temperament. Here are some dog breeds that are known to be more cat-friendly:

1. Basenji: Known as the "barkless dog," Basenjis are independent, clean, and have a cat-like grooming habit.

2. Shiba Inu: Shiba Inus are often described as having a cat-like personality. They are independent, clean, and tend to be reserved with strangers.

3. Greyhound: Greyhounds are quiet, low-energy dogs that enjoy lounging around, much like cats. They are also known for their gentle and calm nature.

4. Bichon Frise: Bichon Frises are small, friendly dogs that are often compared to cats due to their playful and curious nature. They are also hypoallergenic, making them a good choice for those with allergies.

5. Cavalier King Charles Spaniel: These dogs are affectionate, gentle, and adaptable, making them a good match for cat lovers. They are known for their desire to be close to their owners and their calm demeanor.

Remember, individual dogs can have different personalities, so it's important to spend time with the specific dog you're considering to see if their temperament aligns with your preferences.
# Call the main function with the bad request - this should get blocked
response = await execute_chat_with_guardrail(bad_request)
print(response)

Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.

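It looks like our guardrail worked: the first question was allowed through, but the second was blocked for being off-topic. Now we'll extend this concept to moderate the response we get from the LLM as well.

2. Output guardrails

Output guardrails govern what the LLM comes back with. These can take many forms, with some of the most common being:
- Hallucination/fact-checking guardrails: Using a corpus of ground-truth information or a training set of hallucinated responses to block hallucinated responses.
- Moderation guardrails: Applying brand and corporate guidelines to moderate the LLM's results, and either blocking or rewriting its response if it breaches them.
- Syntax checks: Structured outputs from LLMs can come back corrupt or unparseable. These guardrails detect such outputs and either retry or fail gracefully, preventing failures in downstream applications.
  - This is a common control to apply with function calling, ensuring that the expected schema is returned in the arguments when the LLM returns a function_call.

Moderation guardrail

Here we implement a moderation guardrail that uses a version of the G-Eval evaluation method to score the presence of unwanted content in the LLM's response. This method is demonstrated in more detail in one of our other notebooks.

To accomplish this we will build an extensible framework for moderating content that takes in a domain and applies criteria to a piece of content using a set of steps:
1. We set a domain name, which describes the type of content we're going to moderate.
2. We provide criteria, which clearly outline what the content should and should not contain.
3. We provide step-by-step instructions for the LLM to grade the content.
4. The LLM returns a discrete score from 1 to 5.

Setting guardrail thresholds

Our output guardrail will assess the LLM's response and block anything scoring 3 or higher. Setting this threshold is a common area for optimization. We recommend building an evaluation set and grading the results using a confusion matrix to set the right tolerance for your guardrail. The trade-off here is generally:
- More false positives lead to a fractured user experience, where customers get annoyed and the assistant seems less helpful.
- More false negatives can cause lasting harm to your business, as people get the assistant to answer inappropriate questions, or succeed with prompt injection or jailbreaking.

For example, for jailbreaking you may want a very low threshold, because the risk to your business is very high if your LLM gets hijacked and used to produce dangerous content that ends up on social media. For our use case, however, we're willing to accept a few false negatives: the worst that could happen is that someone ends up with a Bichon Frise who might have been better suited to a Labrador, which, though sad, will probably not cause lasting damage to our business (we hope).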
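As a concrete illustration of the syntax-check idea above, the sketch below validates the arguments string of a function call before anything downstream uses it. The function name `validate_function_arguments` and the flat list of required keys are assumptions made for this example; a production system might use a full schema validator instead.

```python
import json


def validate_function_arguments(raw_arguments, required_keys):
    """Parse a JSON arguments string and check that the expected keys are present.

    Returns (True, parsed_dict) on success, or (False, reason) on failure.
    """
    try:
        parsed = json.loads(raw_arguments)
    except json.JSONDecodeError as error:
        return False, f"invalid JSON: {error.msg}"
    if not isinstance(parsed, dict):
        return False, "arguments are not a JSON object"
    missing = [key for key in required_keys if key not in parsed]
    if missing:
        return False, f"missing keys: {missing}"
    return True, parsed
```

On failure the caller can retry the LLM call, optionally including the reason in the retry prompt, or fall back to a safe default message.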
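Step 4 assumes the model replies with a bare number, which is not guaranteed. One defensive option, sketched here with a hypothetical helper named `parse_moderation_score`, is to extract the first integer from the reply and reject anything outside the expected range.

```python
import re


def parse_moderation_score(raw_response, low=1, high=5):
    """Extract the first integer in the reply and check it lies in [low, high].

    Returns None when no valid score can be found, so the caller can decide
    whether to retry or to block the content.
    """
    match = re.search(r"\d+", raw_response)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None
```

Because only the first integer is considered, a reply that restates the scale before giving the score would be misread. Keeping the prompt's "score only" instruction strict remains the main safeguard.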
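The threshold selection described above can be explored with a small evaluation set. The sketch below counts confusion-matrix cells for a range of thresholds; the scores and labels are made up for illustration, and `confusion_counts` is a hypothetical helper rather than part of the notebook.

```python
def confusion_counts(scores, labels, threshold):
    """Count outcomes when content scoring >= threshold is blocked.

    scores: guardrail scores (1-5); labels: True if the content should be blocked.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, should_block in zip(scores, labels):
        blocked = score >= threshold
        if blocked and should_block:
            counts["tp"] += 1  # correctly blocked
        elif blocked and not should_block:
            counts["fp"] += 1  # harmless content blocked (over-refusal)
        elif not blocked and should_block:
            counts["fn"] += 1  # unwanted content let through
        else:
            counts["tn"] += 1  # correctly allowed
    return counts


# Made-up evaluation set, for illustration only.
eval_scores = [1, 2, 3, 4, 5, 2, 4, 1]
eval_labels = [False, False, True, True, True, True, False, False]

for threshold in range(2, 6):
    print(threshold, confusion_counts(eval_scores, eval_labels, threshold))
```

Raising the threshold typically reduces false positives at the price of more false negatives, so the right value depends on which error is more costly for your use case.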

domain = "animal breed recommendation"

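animal_advice_criteria = """
Assess the presence of explicit recommendation of cat or dog breeds in the content.
The content should contain only general advice about cats and dogs, not specific breeds to purchase."""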

animal_advice_steps = """
1. Read the content and the criteria carefully.
2. Assess how much explicit recommendation of cat or dog breeds is contained in the content.
3. Assign an animal advice score from 1 to 5, with 1 being no explicit cat or dog breed advice, and 5 being multiple named cat or dog breeds.
"""

moderation_system_prompt = """
You are a moderation assistant. Your role is to detect content about {domain} in the text provided, and mark the severity of that content.

## {domain}

### Criteria

{scoring_criteria}

### Instructions

{scoring_steps}

### Content

{content}

### Evaluation (score only!)
"""

async def moderation_guardrail(chat_response):
    print("Checking moderation guardrail")
    mod_messages = [
        {"role": "user", "content": moderation_system_prompt.format(
            domain=domain,
            scoring_criteria=animal_advice_criteria,
            scoring_steps=animal_advice_steps,
            content=chat_response
        )},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=mod_messages, temperature=0
    )
    print("Got moderation response")
    return response.choices[0].message.content


async def execute_all_guardrails(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            elif chat_task in done:
                chat_response = chat_task.result()
                # The input guardrail passed, so score the LLM's answer before returning it.
                moderation_response = await moderation_guardrail(chat_response)

                if int(moderation_response) >= 3:
                    print(f"Moderation guardrail flagged with a score of {int(moderation_response)}")
                    return "Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have."

                else:
                    print('Passed moderation')
                    return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again

# Add a request that should pass both our topical guardrail and our moderation guardrail
great_request = 'What is some advice you can give to a new dog owner?'

tests = [good_request, bad_request, great_request]

for test in tests:
    result = await execute_all_guardrails(test)
    print(result)
    print('\n\n')


Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Checking moderation guardrail
Got moderation response
Moderation guardrail flagged with a score of 5
Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have.



Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.



Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Checking moderation guardrail
Got moderation response
Passed moderation
As a new dog owner, here are some helpful tips:

1. Choose the right breed: Research different dog breeds to find one that suits your lifestyle, activity level, and living situation. Some breeds require more exercise and attention than others.

2. Puppy-proof your home: Make sure your home is safe for your new furry friend. Remove any toxic plants, secure loose wires, and store household chemicals out of reach.

3. Establish a routine: Dogs thrive on routine, so establish a consistent schedule for feeding, exercise, and bathroom breaks. This will help your dog feel secure and reduce any anxiety.

4. Socialize your dog: Expose your dog to different people, animals, and environments from an early age. This will help them become well-adjusted and comfortable in various situations.

5. Train your dog: Basic obedience training is essential for your dog's safety and your peace of mind. Teach commands like sit, stay, and come, and use positive reinforcement techniques such as treats and praise.

6. Provide mental and physical stimulation: Dogs need both mental and physical exercise to stay happy and healthy. Engage in activities like walks, playtime, puzzle toys, and training sessions to keep your dog mentally stimulated.

7. Proper nutrition: Feed your dog a balanced and appropriate diet based on their age, size, and specific needs. Consult with a veterinarian to determine the best food options for your dog.

8. Regular veterinary care: Schedule regular check-ups with a veterinarian to ensure your dog's health and well-being. Vaccinations, parasite prevention, and dental care are important aspects of their overall care.

9. Be patient and consistent: Dogs require time, patience, and consistency to learn and adapt to their new environment. Stay positive, be patient with their training, and provide clear and consistent boundaries.

10. Show love and affection: Dogs are social animals that thrive on love and affection. Spend quality time with your dog, offer praise and cuddles, and make them feel like an important part of your family.

Remember, being a responsible dog owner involves commitment, time, and effort. With proper care and attention, you can build a strong bond with your new furry companion.


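Conclusion

Guardrails are a vibrant and evolving topic in LLMs, and we hope this notebook has given you an effective introduction to the core concepts around guardrails. To recap:
- Guardrails are detective controls that aim to prevent harmful content from getting to your applications and your users, and that add steerability to your LLM in production.
- They can take the form of input guardrails, which act on content before it gets to the LLM, and output guardrails, which control the LLM's response.
- Designing guardrails and setting their thresholds is a trade-off between accuracy, latency, and cost. Your decision should be based on a clear evaluation of your guardrails' performance, and on an understanding of what a false positive and a false negative cost your business.
- By embracing asynchronous design principles, you can scale guardrails horizontally to minimize the impact on the user as your guardrails increase in number and scope.

We look forward to seeing how you take this forward, and how thinking on guardrails evolves as the ecosystem matures.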