Pandas 查询引擎¶

本指南向您展示如何使用我们的 PandasQueryEngine：使用LLMs将自然语言转换为Pandas Python代码。

PandasQueryEngine的输入是一个Pandas数据框，输出是一个响应。LLM推断要执行的数据框操作，以检索结果。

警告： 此工具为LLM提供了对 eval 函数的访问权限。在运行此工具的计算机上可能会发生任意代码执行。虽然对代码进行了一定程度的过滤，但不建议在生产环境中使用此工具，除非进行了严格的沙盒化或虚拟机化。

如果您在colab上打开这个笔记本，您可能需要安装LlamaIndex 🦙。

In [ ]:

Copied!

!pip install llama-index llama-index-experimental
!pip install llama-index llama-index-experimental

In [ ]:

Copied!

import logging
import sys
from IPython.display import Markdown, display

import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
import logging
import sys
from IPython.display import Markdown, display

import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

让我们从一个玩具DataFrame开始¶

在这里，让我们加载一个非常简单的包含城市和人口对的DataFrame，并在其上运行PandasQueryEngine。

通过设置verbose=True，我们可以看到生成的中间指令。

In [ ]:

Copied!

# 在一些样本数据上进行测试df = pd.DataFrame(    {        "city": ["多伦多", "东京", "柏林"],        "population": [2930000, 13960000, 3645000],    })
# 在一些样本数据上进行测试df = pd.DataFrame(    {        "city": ["多伦多", "东京", "柏林"],        "population": [2930000, 13960000, 3645000],    })

In [ ]:

Copied!

query_engine = PandasQueryEngine(df=df, verbose=True)
query_engine = PandasQueryEngine(df=df, verbose=True)

In [ ]:

Copied!

response = query_engine.query(
    "What is the city with the highest population?",
)
response = query_engine.query(
    "What is the city with the highest population?",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df['city'][df['population'].idxmax()]
```
> Pandas Output: Tokyo

In [ ]:

Copied!

display(Markdown(f"<b>{response}</b>"))
display(Markdown(f"{response}"))

Tokyo

In [ ]:

Copied!

# 获取pandas python指令print(response.metadata["pandas_instruction_str"])
# 获取pandas python指令print(response.metadata["pandas_instruction_str"])

df['city'][df['population'].idxmax()]

我们还可以采取使用LLM来合成回复的步骤。

In [ ]:

Copied!





query_engine = PandasQueryEngine(df=df, verbose=True, synthesize_response=True)
response = query_engine.query(
    "What is the city with the highest population? Give both the city and population",
)
print(str(response))
query_engine = PandasQueryEngine(df=df, verbose=True, synthesize_response=True)
response = query_engine.query(
    "What is the city with the highest population? Give both the city and population",
)
print(str(response))

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df.loc[df['population'].idxmax()]
```
> Pandas Output: city             Tokyo
population    13960000
Name: 1, dtype: object
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
The city with the highest population is Tokyo, with a population of 13,960,000.

分析泰坦尼克号数据集¶

泰坦尼克号数据集是入门机器学习中最受欢迎的表格数据集之一来源：https://www.kaggle.com/c/titanic

下载数据¶

In [ ]:

Copied!

!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'

--2024-01-13 17:45:15--  https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8002::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57726 (56K) [text/plain]
Saving to: ‘titanic_train.csv’

titanic_train.csv   100%[===================>]  56.37K  --.-KB/s    in 0.009s  

2024-01-13 17:45:15 (6.45 MB/s) - ‘titanic_train.csv’ saved [57726/57726]

In [ ]:

Copied!

df = pd.read_csv("./titanic_train.csv")
df = pd.read_csv("./titanic_train.csv")

In [ ]:

Copied!

query_engine = PandasQueryEngine(df=df, verbose=True)
query_engine = PandasQueryEngine(df=df, verbose=True)

In [ ]:

Copied!

response = query_engine.query(
    "What is the correlation between survival and age?",
)
response = query_engine.query(
    "What is the correlation between survival and age?",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df['survived'].corr(df['age'])
```
> Pandas Output: -0.07722109457217755

In [ ]:

Copied!

display(Markdown(f"<b>{response}</b>"))
display(Markdown(f"{response}"))

-0.07722109457217755

In [ ]:

Copied!

# 获取pandas python指令print(response.metadata["pandas_instruction_str"])
# 获取pandas python指令print(response.metadata["pandas_instruction_str"])

df['survived'].corr(df['age'])

附加步骤¶

分析/修改提示¶

让我们来看看提示！

In [ ]:

Copied!

from llama_index.core import PromptTemplate
from llama_index.core import PromptTemplate

In [ ]:

Copied!

query_engine = PandasQueryEngine(df=df, verbose=True)
prompts = query_engine.get_prompts()
print(prompts["pandas_prompt"].template)
query_engine = PandasQueryEngine(df=df, verbose=True)
prompts = query_engine.get_prompts()
print(prompts["pandas_prompt"].template)

You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
{df_str}

Follow these instructions:
{instruction_str}
Query: {query_str}

Expression:

In [ ]:

Copied!

print(prompts["response_synthesis_prompt"].template)
print(prompts["response_synthesis_prompt"].template)

Given an input question, synthesize a response from the query results.
Query: {query_str}

Pandas Instructions (optional):
{pandas_instructions}

Pandas Output: {pandas_output}

Response:

您也可以更新提示：

In [ ]:

Copied!

new_prompt = PromptTemplate(    """\您正在使用Python中的pandas dataframe。数据框的名称是`df`。这是`print(df.head())`的结果：{df_str}请按照以下说明操作：{instruction_str}查询：{query_str}表达式：""")query_engine.update_prompts({"pandas_prompt": new_prompt})
new_prompt = PromptTemplate(    """\您正在使用Python中的pandas dataframe。数据框的名称是`df`。这是`print(df.head())`的结果：{df_str}请按照以下说明操作：{instruction_str}查询：{query_str}表达式：""")query_engine.update_prompts({"pandas_prompt": new_prompt})

这是指令字符串（您可以通过在初始化时传入instruction_str来自定义）。

In [ ]:

Copied!

指令字符串 = """\1. 使用Pandas将查询转换为可执行的Python代码。2. 代码的最后一行应该是一个可以使用`eval()`函数调用的Python表达式。3. 代码应该代表对查询的解决方案。4. 仅打印表达式。5. 不要引用表达式。"""
指令字符串 = """\1. 使用Pandas将查询转换为可执行的Python代码。2. 代码的最后一行应该是一个可以使用`eval()`函数调用的Python表达式。3. 代码应该代表对查询的解决方案。4. 仅打印表达式。5. 不要引用表达式。"""

使用查询管道语法实现查询引擎¶

如果你想学习如何使用我们的查询管道语法和上面的提示组件构建自己的Pandas查询引擎，请查看下面的教程。

使用查询管道设置Pandas DataFrame查询引擎