How to use the moderation API

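Note: This guide complements our Guardrails Cookbook with a narrower focus on moderation techniques. The two overlap somewhat in content and structure, but this cookbook goes deeper into tailoring moderation criteria to specific needs, which gives finer-grained control. If you want a broader overview of content safety measures, including both guardrails and moderation, we recommend starting with the Guardrails Cookbook. Together, the two resources give a comprehensive picture of how to manage and moderate content in your application.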

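Moderation, much like a guardrail in the physical world, is a preventative measure that keeps your application within the bounds of acceptable and safe content. Moderation techniques are highly versatile and can be applied to a wide range of scenarios where LLMs may run into problems. This notebook offers straightforward examples that you can adapt to your needs, and discusses the considerations and trade-offs involved in deciding whether to implement moderation and how to go about it. It uses our Moderation API, a tool for checking whether text is potentially harmful.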

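This notebook will concentrate on:

Input moderation: Identifying and flagging inappropriate or harmful content before it is processed by your LLM.
Output moderation: Reviewing and validating the content generated by your LLM before it reaches the end user.
Custom moderation: Tailoring moderation criteria and rules to the specific needs and context of your application, for a personalized and effective content control mechanism.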

from openai import OpenAI
client = OpenAI()

GPT_MODEL = 'gpt-3.5-turbo'

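1. Input moderation

Input moderation focuses on preventing harmful or inappropriate content from reaching the LLM. Common applications include:

Content filtering: Preventing the spread of harmful content such as hate speech, harassment, explicit material, and misinformation on social media, forums, and content creation platforms.
Community standards enforcement: Ensuring that user interactions such as comments, forum posts, and chat messages adhere to the community guidelines and standards of online platforms, including educational environments, gaming communities, and dating apps.
Spam and fraud prevention: Filtering out spam, fraudulent content, and misleading information in online forums, comment sections, e-commerce platforms, and customer reviews.

These measures act as preventative controls that run before or alongside the LLM, and can change your application's behavior when specific criteria are met.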

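Embrace async

A common design for minimizing latency is to send your moderation request asynchronously, alongside the main LLM call. If moderation is triggered you return a placeholder response; otherwise you return the LLM response. This pattern also appears in our Guardrails Cookbook. Note that while the async pattern reduces latency, it can also incur unnecessary cost: had the content been flagged before the LLM call was made, you would not have paid for the completion. When using the async pattern, weigh the latency benefit against the potential for higher spend.

We will use this approach, creating an execute_chat_with_input_moderation function that runs the LLM's get_chat_response and the check_moderation_flag moderation function in parallel, and returns the LLM response only if moderation returns False (not triggered).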

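Workflow:

We will create a workflow that uses the Moderation API to check user input for potentially unsafe content before it is sent to the language model (LLM). This ensures that only appropriate content is processed further in the application.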

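Receive input from the user.
Use the Moderation API to analyze the input for problematic content.
Conditional processing:

If the input is flagged by the moderation check, handle it accordingly (for example, reject the input or ask the user to rephrase).
If the input is not flagged, pass it to the LLM for further processing.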
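Run strictly in sequence, the steps above can be sketched as follows. This is a minimal illustration rather than the notebook's own implementation: `moderate_then_respond`, `fake_is_flagged`, and `fake_get_response` are hypothetical names, and the two stand-in functions replace the real API calls so the example runs offline. In practice `is_flagged` would wrap a call such as `client.moderations.create(input=...)` and read `results[0].flagged`.

```python
from typing import Callable, List

REFUSAL = (
    "We're sorry, but your input has been flagged as inappropriate. "
    "Please rephrase your input and try again."
)


def moderate_then_respond(
    user_request: str,
    is_flagged: Callable[[str], bool],
    get_response: Callable[[str], str],
) -> str:
    """Run moderation first and call the LLM only if the input passes."""
    if is_flagged(user_request):
        # Flagged input never reaches the LLM, so no completion is paid for.
        return REFUSAL
    return get_response(user_request)


# Stand-ins for the real calls (hypothetical, for illustration only).
def fake_is_flagged(text: str) -> bool:
    return "hurt" in text.lower()


llm_calls: List[str] = []


def fake_get_response(text: str) -> str:
    llm_calls.append(text)  # record that the "LLM" was actually invoked
    return f"(LLM answer to: {text})"


print(moderate_then_respond("Where can I get coffee?", fake_is_flagged, fake_get_response))
print(moderate_then_respond("I want to hurt them.", fake_is_flagged, fake_get_response))
```

The sequential form adds the moderation latency to every request but never spends a completion on flagged input, which is the other side of the trade-off described for the async pattern.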

We will demonstrate this workflow with two example prompts.

system_prompt = "You are a helpful assistant."

bad_request = "I want to hurt them. How can i do this?"
good_request = "I would kill for a cup of coffe. Where can I get one nearby?"

import asyncio

async def check_moderation_flag(expression):
    moderation_response = client.moderations.create(input=expression)
    flagged = moderation_response.results[0].flagged
    return flagged
    
async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = client.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")
    return response.choices[0].message.content


async def execute_chat_with_input_moderation(user_request):
    # Create tasks for the moderation check and the chat response
    moderation_task = asyncio.create_task(check_moderation_flag(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        # Wait for either the moderation task or the chat task to complete
        done, _ = await asyncio.wait(
            [moderation_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )

        # If the moderation task has not completed, wait and continue to the next iteration
        if moderation_task not in done:
            await asyncio.sleep(0.1)
            continue

        # If moderation is triggered, cancel the chat task and return a message
        if moderation_task.result():
            chat_task.cancel()
            print("Moderation triggered")
            return "We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again."

        # If the chat task has completed, return the chat response
        if chat_task in done:
            return chat_task.result()

        # If the chat task has not completed yet, sleep briefly before checking again
        await asyncio.sleep(0.1)

# Call the main function with the good request - this should go through.
good_response = await execute_chat_with_input_moderation(good_request)
print(good_response)

Getting LLM response
Got LLM response
I can help you with that! To find a nearby coffee shop, you can use a mapping app on your phone or search online for coffee shops in your current location. Alternatively, you can ask locals or check for any cafes or coffee shops in the vicinity. Enjoy your coffee!

# Call the main function with the bad request - this should be blocked.
bad_response = await execute_chat_with_input_moderation(bad_request)
print(bad_response)

Getting LLM response
Got LLM response
Moderation triggered
We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again.
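Note that "Got LLM response" is printed even for the flagged request. The functions above are declared `async` but call the synchronous `client`, so each call blocks the event loop until it returns and the two requests do not actually overlap. One way to make them overlap, sketched below under stated assumptions, is to run each blocking call in a worker thread with `asyncio.to_thread` (Python 3.9+). The `blocking_moderation` and `blocking_chat` functions are hypothetical stand-ins that simulate latency with `time.sleep`; they are not part of the `openai` package. The `openai` package also provides an `AsyncOpenAI` client whose methods can be awaited directly, which is an alternative to threads.

```python
import asyncio
import time

PLACEHOLDER = "We're sorry, but your input has been flagged as inappropriate."


def blocking_moderation(text: str) -> bool:
    """Stand-in for a synchronous moderation call (hypothetical)."""
    time.sleep(0.05)
    return "hurt" in text.lower()


def blocking_chat(text: str) -> str:
    """Stand-in for a synchronous, slower LLM call (hypothetical)."""
    time.sleep(0.2)
    return f"(LLM answer to: {text})"


async def chat_with_input_moderation(text: str) -> str:
    # asyncio.to_thread runs each blocking call in a worker thread,
    # so the two calls overlap instead of running back to back.
    moderation_task = asyncio.create_task(asyncio.to_thread(blocking_moderation, text))
    chat_task = asyncio.create_task(asyncio.to_thread(blocking_chat, text))

    if await moderation_task:
        # Cancelling only stops us waiting for the reply; the worker thread
        # (and, with a real API, the request itself) may still run to completion.
        chat_task.cancel()
        return PLACEHOLDER

    return await chat_task
```

In a notebook this can be called with `await chat_with_input_moderation(good_request)`; in a script, wrap the call in `asyncio.run(...)`. Cancelling the chat task reduces the time to the placeholder response, but it does not guarantee that the completion cost is avoided.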

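Looks like our moderation worked: the first question was allowed through, but the second was blocked for inappropriate content. Next we will extend this concept to also moderate the response we get back from the LLM.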

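2. Output moderation

Output moderation is crucial for controlling the content generated by the language model (LLM). While LLMs should not output illegal or harmful content, it helps to put additional guardrails in place to further ensure that content stays within acceptable and safe bounds, improving the overall safety and reliability of the application. Common types of output moderation include:

Content quality assurance: Ensuring that generated content, such as articles, product descriptions, and educational materials, is accurate, informative, and free from inappropriate information.
Community standards compliance: Maintaining a respectful and safe environment in online forums, discussion boards, and gaming communities by filtering out hate speech, harassment, and other harmful content.
User experience enhancement: Improving the user experience in chatbots and automated services by providing responses that are polite, relevant, and free from unsuitable language or content.

In all of these scenarios, output moderation plays a key role in maintaining the quality and integrity of LLM-generated content, ensuring it meets the standards and expectations of the platform and its users.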

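Setting moderation thresholds

OpenAI has selected thresholds for the moderation categories that balance precision and recall, but your use case or tolerance for moderation may be different. Setting this threshold is a common area for optimization: we recommend building an evaluation set and grading the results with a confusion matrix to find the right tolerance for your moderation. The trade-off here is generally:

More false positives lead to a fractured user experience, where customers get annoyed and the assistant seems less helpful.
More false negatives can cause lasting harm to your business, as people get the assistant to answer inappropriate questions or provide inappropriate responses.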
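To make the evaluation-set idea concrete, the sketch below grades a handful of examples at several candidate thresholds and reports the confusion-matrix counts with precision and recall. Everything here is illustrative: the scores and labels in `eval_set` are made up, and `confusion_matrix` and `precision_recall` are hypothetical helper names, not part of any library.

```python
from collections import Counter


def confusion_matrix(labels, predictions):
    """Count TP/FP/FN/TN for paired ground-truth labels and predicted flags."""
    counts = Counter()
    for should_flag, was_flagged in zip(labels, predictions):
        if should_flag and was_flagged:
            counts["tp"] += 1
        elif not should_flag and was_flagged:
            counts["fp"] += 1  # false positive: harmless content blocked
        elif should_flag and not was_flagged:
            counts["fn"] += 1  # false negative: harmful content let through
        else:
            counts["tn"] += 1
    return counts


def precision_recall(counts):
    """Compute precision and recall, returning 0.0 when a denominator is zero."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical evaluation set: (moderation score, human label "should be flagged").
eval_set = [(0.92, True), (0.71, True), (0.40, True), (0.55, False), (0.10, False), (0.03, False)]
labels = [label for _, label in eval_set]

for threshold in (0.3, 0.5, 0.8):
    predictions = [score >= threshold for score, _ in eval_set]
    counts = confusion_matrix(labels, predictions)
    precision, recall = precision_recall(counts)
    print(f"threshold={threshold}: {dict(counts)} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold trades false positives for false negatives, so the right value depends on which of the two errors is more costly for your application.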

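For example, on a platform dedicated to creative writing, the moderation threshold for certain sensitive topics might be set higher to allow greater creative freedom, while still providing a safety net for content that is clearly beyond the bounds of acceptable expression. The trade-off is that some content that would be considered inappropriate in other contexts is allowed, which is acceptable given the platform's purpose and audience expectations.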
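A per-category threshold like the one described for a creative-writing platform can be sketched as below. Besides the boolean `flagged`, each moderation result also carries per-category scores (`category_scores`); the example assumes you have already converted those into a plain name-to-score dictionary. The function name `flagged_categories`, the threshold values, the default of 0.5, and the sample scores are all hypothetical and are not OpenAI's own thresholds.

```python
def flagged_categories(category_scores, thresholds, default_threshold=0.5):
    """Return the categories whose score meets or exceeds its threshold."""
    return sorted(
        name
        for name, score in category_scores.items()
        if score >= thresholds.get(name, default_threshold)
    )


# Hypothetical thresholds for a creative-writing platform: more tolerant of
# fictional violence, stricter on harassment.
creative_writing_thresholds = {"violence": 0.9, "harassment": 0.4}

# Hypothetical scores, shaped like a name -> score mapping built from the
# category scores of a moderation result.
scores = {"violence": 0.75, "harassment": 0.05, "hate": 0.01}

# With the creative-writing thresholds, the violence score is tolerated.
print(flagged_categories(scores, creative_writing_thresholds))

# With no overrides, every category falls back to the default threshold.
print(flagged_categories(scores, {}))
```

Keeping the thresholds in a dictionary makes them easy to tune against an evaluation set without touching the checking logic.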

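Workflow:

We will create a workflow that uses the Moderation API to check the LLM response for potentially unsafe content before it is sent to the user. This ensures that only appropriate content is displayed to the user.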

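Receive input from the user.
Send the prompt to the LLM and generate a response.
Use the Moderation API to analyze the LLM's response for problematic content.
Conditional processing:

If the response is flagged by the moderation check, handle it accordingly (for example, reject the response or show a placeholder message).
If the response is not flagged, display it to the user.

We will demonstrate this workflow with the previous two example prompts.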

async def execute_all_moderations(user_request):
    # Create tasks for the input moderation check and the chat response
    input_moderation_task = asyncio.create_task(check_moderation_flag(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [input_moderation_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )

        # If input moderation has not completed, wait and continue to the next iteration
        if input_moderation_task not in done:
            await asyncio.sleep(0.1)
            continue

        # If input moderation is triggered, cancel the chat task and return a message
        if input_moderation_task.result():
            chat_task.cancel()
            print("Input moderation triggered")
            return "We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again."

        # Check if the chat task has completed
        if chat_task in done:
            chat_response = chat_task.result()
            output_moderation_response = await check_moderation_flag(chat_response)

            # Check if output moderation is triggered
            if output_moderation_response:
                print("Moderation flagged for LLM response.")
                return "Sorry, we're not permitted to give this answer. I can help you with any general queries you might have."

            print('Passed moderation')
            return chat_response

        # If the chat task has not completed yet, sleep briefly before checking again
        await asyncio.sleep(0.1)

# Add a request that should pass our input guardrail but not our output guardrail.
interesting_request = "Describe a scene from a violent movie in detail."

tests = [good_request, bad_request, interesting_request]

for test in tests:
    print(test)
    result = await execute_all_moderations(test)
    print(result)
    print('\n\n')

I would kill for a cup of coffe. Where can I get one nearby?
Getting LLM response
Got LLM response
Passed moderation
I can help you with that! To find a nearby coffee shop, you can use a mapping app on your phone or search online for coffee shops in your area. Additionally, you can ask locals or nearby businesses for recommendations. Enjoy your coffee!



I want to hurt them. How can i do this?
Getting LLM response
Got LLM response
Input moderation triggered
We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again.



Describe a scene from a violent movie in detail.
Getting LLM response
Got LLM response
Passed moderation
I'm sorry, but I can't provide detailed descriptions of violent scenes from movies. If you have any other questions or need information on a different topic, feel free to ask!

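As observed, the first two examples behaved as expected. The "interesting example", however, did not: rather than being caught by our output moderation, the request was declined by the model itself, so the response passed the check. This discrepancy arises because all OpenAI models employ output moderation for their content generations. Nevertheless, as a user you can adjust the moderation settings and screen for content that may not be explicitly harmful. In the following section, "Custom moderation", we will explore how to modify these moderation settings to suit your specific needs.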

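3. Custom moderation

Note: This section does not use the Moderation API.

Custom moderation provides a tailored approach to content filtering, allowing precise adherence to specific community standards or topics, which makes it ideal for niche platforms or specialized content. Unlike the general categories covered by the Moderation API, custom moderation offers a higher degree of control, at the price of more setup and maintenance effort. It shares the guardrails' goal of ensuring content safety, focusing on setting criteria to filter and manage content, whether that is user input or language-model output. Custom moderation can be an alternative to or an extension of the Moderation API, and uses our completions endpoint. For those interested in exploring guardrails further, a cookbook with examples and guidance is available.

Workflow

We will implement a small example to showcase custom moderation. We will use an extra call to gpt-4 to assess messages and decide whether they should be moderated.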

def custom_moderation(content, parameters):
    # Define the prompt for GPT-4
    prompt = f"""Please assess the following content for any inappropriate material. You should base your assessment on the given parameters.
    Your answer should be in json format with the following fields: 
        - flagged: a boolean indicating whether the content is flagged for any of the categories in the parameters
        - reason: a string explaining the reason for the flag, if any
        - parameters: a dictionary of the parameters used for the assessment and their values
    Parameters: {parameters}\n\nContent:\n{content}\n\nAssessment:"""
    
    # Call GPT-4 with the prompt
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_format={ "type": "json_object" },
        messages=[
            {"role": "system", "content": "You are a content moderation assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    
    # Extract the assessment from the response
    assessment = response.choices[0].message.content
    
    return assessment

# Example content and parameters
parameters = "political content, misinformation"

# Use the custom moderation function on the good example
moderation_result = custom_moderation(good_request, parameters)
print(moderation_result)

{
  "flagged": false,
  "reason": "The content does not contain political content or misinformation. The phrase 'I would kill for a cup of coffee' is a common expression indicating a strong desire for coffee and not a literal intent.",
  "parameters": {
    "political content": false,
    "misinformation": false
  }
}

# Use the custom moderation function on the bad example
moderation_result = custom_moderation(bad_request, parameters)
print(moderation_result)

{
  "flagged": true,
  "reason": "The content contains language that implies intent to cause harm, which violates guidelines concerning violent or harmful behavior, even though the specific parameters were political content and misinformation. This assessment is based on a general guideline to prevent harm or promotion of violence.",
  "parameters": {
    "political content": false,
    "misinformation": false
  }
}

# Use the custom moderation function on a custom example
custom_request = "I want to talk about how the government is hiding the truth about the pandemic."
moderation_result = custom_moderation(custom_request, parameters)
print(moderation_result)

{
  "flagged": true,
  "reason": "The content suggests political content by discussing the government and hints at misinformation by suggesting the government is hiding the truth about the pandemic without providing evidence.",
  "parameters": {
    "political content": true,
    "misinformation": true
  }
}
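The `custom_moderation` function returns the assessment as a JSON string, so an application has to parse it before acting on the `flagged` field. The sketch below shows one possible way to do that. The helper name `parse_assessment` is hypothetical, and treating a malformed reply as flagged (failing closed) is an assumption about what is safest, not something the notebook prescribes.

```python
import json


def parse_assessment(raw: str) -> dict:
    """Parse the moderator's JSON reply; treat anything malformed as flagged."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON (or not a string at all): fail closed.
        return {"flagged": True, "reason": "Unparseable moderation response"}

    if not isinstance(data, dict) or not isinstance(data.get("flagged"), bool):
        # Valid JSON, but without the boolean field we depend on.
        return {"flagged": True, "reason": "Moderation response missing boolean 'flagged'"}

    return data


sample = '{"flagged": false, "reason": "", "parameters": {"political content": false}}'
print(parse_assessment(sample)["flagged"])
print(parse_assessment("not json")["flagged"])
```

A typical use would be `assessment = parse_assessment(custom_moderation(content, parameters))`, followed by a branch on `assessment["flagged"]`.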

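Conclusion

In conclusion, this notebook has explored the essential role of moderation in applications powered by language models (LLMs). We covered both input and output moderation strategies and highlighted their importance in maintaining a safe and respectful environment for user interactions. Through practical examples, we demonstrated using OpenAI's Moderation API to pre-filter user input and to check LLM-generated responses for appropriateness. Implementing these moderation techniques is crucial for upholding the integrity of your application and ensuring a positive experience for your users.

As you develop your application further, consider continually refining your moderation strategy through custom moderation. This may involve tailoring moderation criteria to your specific use case, or combining machine learning models with rule-based systems for more nuanced content analysis. Striking the right balance between allowing freedom of expression and ensuring content safety is key to creating an inclusive and constructive space for all users. By continuously monitoring and adjusting your moderation approach, you can adapt to evolving content standards and user expectations, ensuring the long-term success and relevance of your LLM-powered application.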
