
Ontotext GraphDB

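Ontotext GraphDB is a graph database and knowledge discovery tool compliant with RDF and SPARQL.

This notebook shows how to use LLMs to provide natural language querying (NLQ to SPARQL, also called text2sparql) for Ontotext GraphDB.

GraphDB LLM Functionalities

GraphDB supports the following LLM integration functionalities: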

gpt-queries

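  • Magic predicates to ask an LLM for text, a list, or a table using data from your knowledge graph (KG)
  • Query explanation
  • Result explanation, summarization, rephrasing, translation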

retrieval-graphdb-connector

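  • Indexing of KG entities in a vector database
  • Supports any text embedding algorithm and vector database
  • Uses the same powerful connector (indexing) language that GraphDB uses for Elastic, Solr, Lucene
  • Automatic synchronization of changes in RDF data to the KG entity index
  • Supports nested objects (no UI support in GraphDB version 10.5)
  • Serializes KG entities to text like this (e.g. for a Wines dataset):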
Franvino:
- is a RedWine.
- made from grape Merlo.
- made from grape Cabernet Franc.
- has sugar dry.
- has year 2012.
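
The serialization format above can be imitated in a few lines. This is only a minimal sketch to show the shape of the text that ends up in the vector index; `serialize_entity` is a hypothetical helper and not the connector's actual implementation.

```python
def serialize_entity(name, facts):
    """Render one KG entity as a short bulleted text description."""
    lines = [f"{name}:"]
    # Each fact becomes one "- ... ." line, mirroring the Wines example.
    lines.extend(f"- {fact}." for fact in facts)
    return "\n".join(lines)


franvino = serialize_entity(
    "Franvino",
    [
        "is a RedWine",
        "made from grape Merlo",
        "made from grape Cabernet Franc",
        "has sugar dry",
        "has year 2012",
    ],
)
print(franvino)
```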

talk-to-graph

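  • A simple chatbot using a defined KG entity index

In this tutorial we won't use the GraphDB LLM integration; instead, we generate SPARQL from NLQ. We use the Star Wars API (SWAPI) ontology and dataset.

Setting up

You need a running GraphDB instance. This tutorial shows how to run the database locally using the GraphDB Docker image. It provides a docker compose setup that populates GraphDB with the Star Wars dataset. All necessary files, including this notebook, can be downloaded from the GitHub repository langchain-graphdb-qa-chain-demo.

  • Install Docker. This tutorial was created using Docker version 24.0.7, which bundles Docker Compose. For earlier Docker versions you may need to install Docker Compose separately.
  • Clone the GitHub repository langchain-graphdb-qa-chain-demo into a local folder on your machine.
  • Start GraphDB by running the following commands from the same folder: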
docker build --tag graphdb .
docker compose up -d graphdb

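Wait a few seconds for the database to start on http://localhost:7200/. The Star Wars dataset starwars-data.trig is automatically loaded into the langchain repository. The local SPARQL endpoint http://localhost:7200/repositories/langchain can be used to run queries against it. You can also open the GraphDB Workbench at http://localhost:7200/sparql in your favourite web browser and run queries interactively.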
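
To check the endpoint from Python rather than the Workbench, a plain SPARQL 1.1 Protocol GET request is enough. The sketch below uses only the standard library; the helper names `build_select_request` and `run_select` are hypothetical, and `run_select` assumes the GraphDB container from the previous step is up.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:7200/repositories/langchain"


def build_select_request(endpoint, query):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(endpoint, query, timeout=10):
    """Send the query and return the list of result bindings."""
    request = build_select_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


# Example (needs the running database):
# for row in run_select(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 3"):
#     print(row)
```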

  • Set up the working environment

If you use conda, create and activate a new conda environment, e.g.:

conda create -n graph_ontotext_graphdb_qa python=3.12
conda activate graph_ontotext_graphdb_qa

Install the following libraries:

pip install jupyter==1.1.1
pip install rdflib==7.1.1
pip install langchain-community==0.3.4
pip install langchain-openai==0.2.4

Run Jupyter with:

jupyter notebook

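Specifying the ontology

For the LLM to be able to generate SPARQL, it needs to know the knowledge graph schema (the ontology). It can be provided using one of two parameters on the OntotextGraphDBGraph class:

  • query_ontology: a CONSTRUCT query that is executed on the SPARQL endpoint and returns the KG schema statements. We recommend storing the ontology in its own named graph, which makes it easier to get only the relevant statements (as in the example below). DESCRIBE queries are not supported, because DESCRIBE returns the Symmetric Concise Bounded Description (SCBD), i.e. also the incoming class links. In large graphs with millions of instances, this is not efficient. See https://github.com/eclipse-rdf4j/rdf4j/issues/4857
  • local_file: a local RDF ontology file. Supported RDF formats are Turtle, RDF/XML, JSON-LD, N-Triples, Notation-3, Trig, Trix, N-Quads.

In either case, the ontology dump should:

  • Include enough information about classes, properties, property attachment to classes (using rdfs:domain, schema:domainIncludes or OWL restrictions), and taxonomies (important individuals).
  • Not include overly verbose and irrelevant definitions and examples that do not help SPARQL construction.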
from langchain_community.graphs import OntotextGraphDBGraph

# feeding the schema using a user construct query

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    query_ontology="CONSTRUCT {?s ?p ?o} FROM <https://swapi.co/ontology/> WHERE {?s ?p ?o}",
)
# feeding the schema using a local RDF file

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    local_file="/path/to/langchain_graphdb_tutorial/starwars-ontology.nt",  # change the path here
)
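
When the ontology lives in its own named graph, the CONSTRUCT query passed as query_ontology always has the same shape, so it can be built from the graph IRI. A minimal sketch, assuming a hypothetical helper `ontology_construct_query`; the character check is a simplification of the SPARQL IRI grammar, not a full validator.

```python
def ontology_construct_query(named_graph):
    """Build a CONSTRUCT query that returns every statement in one named graph."""
    # Angle brackets and whitespace cannot appear inside <...> in SPARQL.
    if any(ch in named_graph for ch in "<> \t\n"):
        raise ValueError(f"not a usable graph IRI: {named_graph!r}")
    return f"CONSTRUCT {{?s ?p ?o}} FROM <{named_graph}> WHERE {{?s ?p ?o}}"


query = ontology_construct_query("https://swapi.co/ontology/")
print(query)
```

The resulting string is the same query used in the query_ontology example above.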

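Either way, the ontology (schema) is fed to the LLM as Turtle, since Turtle with appropriate prefixes is the most compact format and the easiest for the LLM to remember.

The Star Wars ontology is a bit unusual in that it includes many specific triples about classes, e.g. that the species :Aleena lives on a particular planet, is a subclass of :Reptile, has certain typical characteristics (average height, average lifespan, skin color), and that specific individuals (characters) are representatives of that class: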

@prefix : <https://swapi.co/vocabulary/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:Aleena a owl:Class, :Species ;
rdfs:label "Aleena" ;
rdfs:isDefinedBy <https://swapi.co/ontology/> ;
rdfs:subClassOf :Reptile, :Sentient ;
:averageHeight 80.0 ;
:averageLifespan "79" ;
:character <https://swapi.co/resource/aleena/47> ;
:film <https://swapi.co/resource/film/4> ;
:language "Aleena" ;
:planet <https://swapi.co/resource/planet/38> ;
:skinColor "blue", "gray" .

...

To keep this tutorial simple, we use an unsecured GraphDB. If GraphDB is secured, you should set the environment variables 'GRAPHDB_USERNAME' and 'GRAPHDB_PASSWORD' before initializing OntotextGraphDBGraph.

import os

os.environ["GRAPHDB_USERNAME"] = "graphdb-user"
os.environ["GRAPHDB_PASSWORD"] = "graphdb-password"

graph = OntotextGraphDBGraph(
    query_endpoint=...,
    query_ontology=...
)
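
Because the credentials travel through the environment, a small pre-flight check can catch a half-configured setup before the graph object is created. This is a sketch only: `graphdb_credentials` is a hypothetical helper, and treating a lone variable as an error is an assumption of the sketch rather than documented behaviour of OntotextGraphDBGraph.

```python
import os


def graphdb_credentials(env=os.environ):
    """Return (username, password), or None when neither variable is set."""
    username = env.get("GRAPHDB_USERNAME")
    password = env.get("GRAPHDB_PASSWORD")
    if username and password:
        return username, password
    if username or password:
        # Only one of the two is present, which is almost certainly a mistake.
        raise ValueError("Set both GRAPHDB_USERNAME and GRAPHDB_PASSWORD, or neither.")
    return None


print("secured" if graphdb_credentials() else "unsecured")
```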

Question Answering against the StarWars dataset

We can now use the OntotextGraphDBQAChain to ask some questions.

import os

from langchain.chains import OntotextGraphDBQAChain
from langchain_openai import ChatOpenAI

# We'll be using an OpenAI model which requires an OpenAI API Key.
# However, other models are available as well:
# https://python.langchain.com/docs/integrations/chat/

# Set the environment variable `OPENAI_API_KEY` to your OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-***"

# Any available OpenAI model can be used here.
# We use 'gpt-4-1106-preview' because of the bigger context window.
# The 'gpt-4-1106-preview' model_name will deprecate in the future and will change to 'gpt-4-turbo' or similar,
# so be sure to consult with the OpenAI API https://platform.openai.com/docs/models for the correct naming.

chain = OntotextGraphDBQAChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-4-1106-preview"),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,
)

Let's ask a simple one.

chain.invoke({chain.input_key: "What is the climate on Tatooine?"})[chain.output_key]


> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?climate
WHERE {
?planet rdfs:label "Tatooine" ;
:climate ?climate .
}

> Finished chain.
'The climate on Tatooine is arid.'

And a slightly more complicated one.

chain.invoke({chain.input_key: "What is the climate on Luke Skywalker's home planet?"})[
    chain.output_key
]


> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?climate
WHERE {
?character rdfs:label "Luke Skywalker" .
?character :homeworld ?planet .
?planet :climate ?climate .
}

> Finished chain.
"The climate on Luke Skywalker's home planet is arid."

We can also ask more complicated questions, such as:

chain.invoke(
    {
        chain.input_key: "What is the average box office revenue for all the Star Wars movies?"
    }
)[chain.output_key]


> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT (AVG(?boxOffice) AS ?averageBoxOfficeRevenue)
WHERE {
?film a :Film .
?film :boxOffice ?boxOfficeValue .
BIND(xsd:decimal(?boxOfficeValue) AS ?boxOffice)
}


> Finished chain.
'The average box office revenue for all the Star Wars movies is approximately 754.1 million dollars.'
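
The three calls above repeat the same invoke-and-index pattern, which can be wrapped in a small helper. A minimal sketch: `ask` is a hypothetical convenience function, and `EchoChain` is a stand-in with made-up key names so the helper can be tried without GraphDB or an API key.

```python
def ask(chain, question):
    """Send one question through a QA chain and return only the answer text."""
    return chain.invoke({chain.input_key: question})[chain.output_key]


class EchoChain:
    """Offline stand-in for a real chain (hypothetical, for illustration only)."""

    input_key = "question"
    output_key = "answer"

    def invoke(self, inputs):
        return {self.output_key: f"You asked: {inputs[self.input_key]}"}


print(ask(EchoChain(), "What is the climate on Tatooine?"))
```

With the real chain from this section, the call would be ask(chain, "What is the climate on Tatooine?").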

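Chain modifiers

The Ontotext GraphDB QA chain allows prompt refinement to further improve your QA chain and enhance the overall user experience of your app.

"SPARQL Generation" prompt

The prompt is used to generate the SPARQL query based on the user question and the KG schema.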

  • sparql_generation_prompt

    Default value:

      GRAPHDB_SPARQL_GENERATION_TEMPLATE = """
    Write a SPARQL SELECT query for querying a graph database.
    The ontology schema delimited by triple backticks in Turtle format is:
    ```
    {schema}
    ```
    Use only the classes and properties provided in the schema to construct the SPARQL query.
    Do not use any classes or properties that are not explicitly provided in the SPARQL query.
    Include all necessary prefixes.
    Do not include any explanations or apologies in your responses.
    Do not wrap the query in backticks.
    Do not include any text except the SPARQL query generated.
    The question delimited by triple backticks is:
    ```
    {prompt}
    ```
    """
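    GRAPHDB_SPARQL_GENERATION_PROMPT = PromptTemplate(
        input_variables=["schema", "prompt"],
        template=GRAPHDB_SPARQL_GENERATION_TEMPLATE,
    )

"SPARQL Fix" prompt

Sometimes the LLM may generate a SPARQL query with syntax errors, missing prefixes, and so on. The chain will try to amend this by prompting the LLM to correct it, up to a fixed number of times.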
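
The retry behaviour can be pictured as a bounded generate, validate, fix loop. The sketch below illustrates the idea and is not the chain's actual implementation; `generate_with_fixes` and the toy validator are hypothetical, and the retry limit plays the role of the max_fix_retries setting listed further down.

```python
def generate_with_fixes(generate, validate, fix, max_fix_retries=5):
    """Generate a query, then request corrections until it validates.

    generate() returns a query string, validate(query) returns an error
    message or None, and fix(query, error) returns a corrected query.
    """
    query = generate()
    error = validate(query)
    retries = 0
    while error is not None and retries < max_fix_retries:
        query = fix(query, error)
        error = validate(query)
        retries += 1
    if error is not None:
        raise ValueError(f"query still invalid after {retries} fixes: {error}")
    return query


def toy_validate(query):
    """Toy check: the query must declare at least one prefix first."""
    return None if query.startswith("PREFIX") else "missing PREFIX declaration"


fixed = generate_with_fixes(
    generate=lambda: "SELECT ?c WHERE { ?p :climate ?c }",
    validate=toy_validate,
    fix=lambda query, error: "PREFIX : <https://swapi.co/vocabulary/>\n" + query,
)
print(fixed)
```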

  • sparql_fix_prompt

    Default value:

      GRAPHDB_SPARQL_FIX_TEMPLATE = """
    This following SPARQL query delimited by triple backticks
    ```
    {generated_sparql}
    ```
    is not valid.
    The error delimited by triple backticks is
    ```
    {error_message}
    ```
    Give me a correct version of the SPARQL query.
    Do not change the logic of the query.
    Do not include any explanations or apologies in your responses.
    Do not wrap the query in backticks.
    Do not include any text except the SPARQL query generated.
    The ontology schema delimited by triple backticks in Turtle format is:
    ```
    {schema}
    ```
    """

    GRAPHDB_SPARQL_FIX_PROMPT = PromptTemplate(
    input_variables=["error_message", "generated_sparql", "schema"],
    template=GRAPHDB_SPARQL_FIX_TEMPLATE,
    )
  • max_fix_retries

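    Default value: 5

"Answering" prompt

The prompt is used to answer the question based on the results returned from the database and the initial user question. By default, the LLM is instructed to use only the information from the returned results. If the result set is empty, the LLM should state that it cannot answer the question.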

  • qa_prompt

    Default value:

      GRAPHDB_QA_TEMPLATE = """Task: Generate a natural language response from the results of a SPARQL query.
    You are an assistant that creates well-written and human understandable answers.
    The information part contains the information provided, which you can use to construct an answer.
    The information provided is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
    Make your response sound like the information is coming from an AI assistant, but don't add any information.
    Don't use internal knowledge to answer the question, just say you don't know if no information is available.
    Information:
    {context}

    Question: {prompt}
    Helpful Answer:"""
    GRAPHDB_QA_PROMPT = PromptTemplate(
    input_variables=["context", "prompt"], template=GRAPHDB_QA_TEMPLATE
    )
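
Each of the prompts above is a template with named placeholders, so a custom variant mainly needs to keep the same placeholder names. The sketch below fills a hypothetical custom generation template with plain str.format to stay dependency-free; the template wording, the angle-bracket delimiters, and the LIMIT instruction are illustrative assumptions. In LangChain the same string would presumably be wrapped in a PromptTemplate with input_variables ["schema", "prompt"] and supplied as sparql_generation_prompt.

```python
# Hypothetical custom template; it keeps the {schema} and {prompt} placeholders
# used by the default generation prompt.
CUSTOM_SPARQL_GENERATION_TEMPLATE = (
    "Write a SPARQL SELECT query for querying a graph database.\n"
    "Add LIMIT 100 unless the question asks for an aggregate.\n"
    "Ontology schema in Turtle format:\n"
    "<schema>\n{schema}\n</schema>\n"
    "Use only the classes and properties provided in the schema.\n"
    "Include all necessary prefixes and return only the SPARQL query.\n"
    "Question:\n<question>\n{prompt}\n</question>\n"
)

filled = CUSTOM_SPARQL_GENERATION_TEMPLATE.format(
    schema=":Planet a owl:Class .",
    prompt="What is the climate on Tatooine?",
)
print(filled)
```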

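Once you have finished experimenting with QA over GraphDB, you can shut down the Docker environment by running docker compose down -v --remove-orphans from the directory that contains the Docker compose file.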

