retrieve_utils

UNSTRUCTURED_FORMATS

These formats are parsed by the 'unstructured' library, if it is installed.

split_text_to_chunks

def split_text_to_chunks(text: str,
                         max_tokens: int = 4000,
                         chunk_mode: str = "multi_lines",
                         must_break_at_empty_line: bool = True,
                         overlap: int = 0)

Split a long text into chunks of at most max_tokens tokens each.
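
As a rough illustration of the idea, and not the library's implementation, the sketch below greedily packs whole lines into chunks under a token budget. It assumes a whitespace-separated word is a stand-in for a token (the real function counts model tokens), and the name `sketch_split_lines` is hypothetical.

```python
from typing import List


def sketch_split_lines(text: str, max_tokens: int = 4000) -> List[str]:
    """Greedily pack whole lines into chunks of at most max_tokens words.

    Assumption: a "token" is a whitespace-separated word. A single line
    longer than the budget is kept as its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for line in text.splitlines():
        cost = len(line.split())
        # Start a new chunk when adding this line would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Breaking only at line boundaries keeps each chunk readable on its own, at the cost of chunks that may fall somewhat short of the budget.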

extract_text_from_pdf

def extract_text_from_pdf(file: str) -> str

Extract text from a PDF file.

split_files_to_chunks

def split_files_to_chunks(
    files: list,
    max_tokens: int = 4000,
    chunk_mode: str = "multi_lines",
    must_break_at_empty_line: bool = True,
    custom_text_split_function: Callable = None
) -> Tuple[List[str], List[dict]]

Split a list of files into chunks of at most max_tokens tokens each.

get_files_from_dir

def get_files_from_dir(dir_path: Union[str, List[str]],
                       types: list = TEXT_FORMATS,
                       recursive: bool = True)

Return a list of all the files in a given directory, URL, file path, or a list of them.
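
The local-directory case can be pictured with the sketch below. It is a simplified stand-in, not the library's code: it ignores URLs and list inputs, and the name `sketch_list_files` is hypothetical.

```python
import os
from typing import List


def sketch_list_files(dir_path: str, types: List[str], recursive: bool = True) -> List[str]:
    """Return sorted paths under dir_path whose extension is in types."""
    # Normalize "txt" and ".txt" to the same ".txt" form.
    wanted = {"." + t.lower().lstrip(".") for t in types}
    found: List[str] = []
    for root, dirs, names in os.walk(dir_path):
        for name in names:
            if os.path.splitext(name)[1].lower() in wanted:
                found.append(os.path.join(root, name))
        if not recursive:
            dirs.clear()  # stop os.walk from descending into subdirectories
    return sorted(found)
```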

parse_html_to_markdown

def parse_html_to_markdown(html: str, url: str = None) -> str

Parse HTML into Markdown.

get_file_from_url

def get_file_from_url(url: str, save_path: str = None) -> Tuple[str, str]

Download a file from a URL.

is_url

def is_url(string: str)

Return True if the string is a valid URL.
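
One common way to make this kind of check, shown here as a sketch rather than as the library's exact code, is to require both a scheme and a network location after parsing. The name `sketch_is_url` is hypothetical.

```python
from urllib.parse import urlparse


def sketch_is_url(string: str) -> bool:
    """Treat a string as a URL when it has both a scheme and a network location."""
    try:
        parts = urlparse(string)
    except ValueError:
        # urlparse can raise on malformed input such as a bad IPv6 literal.
        return False
    return bool(parts.scheme and parts.netloc)
```

Under this rule a relative path such as `docs/readme.md` is not a URL, which is the distinction needed when an input may be a directory, a file, or a link.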

create_vector_db_from_dir

def create_vector_db_from_dir(dir_path: Union[str, List[str]],
                              max_tokens: int = 4000,
                              client: API = None,
                              db_path: str = "tmp/chromadb.db",
                              collection_name: str = "all-my-documents",
                              get_or_create: bool = False,
                              chunk_mode: str = "multi_lines",
                              must_break_at_empty_line: bool = True,
                              embedding_model: str = "all-MiniLM-L6-v2",
                              embedding_function: Callable = None,
                              custom_text_split_function: Callable = None,
                              custom_text_types: List[str] = TEXT_FORMATS,
                              recursive: bool = True,
                              extra_docs: bool = False) -> API

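Create a vector database from all the files in a given directory. The directory can also be a single file or a URL pointing to a single file. Chromadb-compatible APIs are supported for creating the vector database; this function is not needed if you have already prepared your own vector database.

Arguments:

dir_path Union[str, List[str]] - The path to the directory, file, or URL, or a list of them.
max_tokens Optional, int - The maximum number of tokens per chunk. Default is 4000.
client Optional, API - The chromadb client. Default is None.
db_path Optional, str - The path to the chromadb. Default is "tmp/chromadb.db". For versions <=0.2.24 the default was /tmp/chromadb.db.
collection_name Optional, str - The name of the collection. Default is "all-my-documents".
get_or_create Optional, bool - Whether to get or create the collection. Default is False. If True, the collection is returned if it already exists. If the collection already exists and get_or_create is False, a ValueError is raised.
chunk_mode Optional, str - The chunk mode. Default is "multi_lines".
must_break_at_empty_line Optional, bool - Whether to break at empty lines. Default is True.
embedding_model Optional, str - The embedding model to use. Default is "all-MiniLM-L6-v2". Ignored if embedding_function is not None.
embedding_function Optional, Callable - The embedding function to use. Default is None, in which case SentenceTransformer with the given embedding_model is used. To use OpenAI, Cohere, HuggingFace, or other embedding functions, pass one here, following the examples in https://docs.trychroma.com/embeddings.
custom_text_split_function Optional, Callable - A custom function that splits a string into a list of strings. Default is None, in which case the default function in autogen.retrieve_utils.split_text_to_chunks is used.
custom_text_types Optional, List[str] - The list of file types to process. Default is TEXT_FORMATS.
recursive Optional, bool - Whether to search for documents in dir_path recursively. Default is True.
extra_docs Optional, bool - Whether to add more documents to the collection. Default is False.

Returns:

The chromadb client.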
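
The custom_text_split_function argument is described as a callable that takes a string and returns a list of strings. A minimal sketch of such a callable follows; the name `split_on_blank_lines` is hypothetical, and the commented call only indicates where it would be passed, assuming autogen and chromadb are installed and that `./docs` exists.

```python
import re
from typing import List


def split_on_blank_lines(text: str) -> List[str]:
    """Split text into paragraphs separated by one or more blank lines."""
    parts = re.split(r"\n\s*\n", text)
    # Drop empty pieces and trim surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]


# Hypothetical usage, not executed here (requires autogen and chromadb):
# from autogen.retrieve_utils import create_vector_db_from_dir
# client = create_vector_db_from_dir(
#     dir_path="./docs",  # hypothetical directory
#     custom_text_split_function=split_on_blank_lines,
#     get_or_create=True,
# )
```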

query_vector_db

def query_vector_db(query_texts: List[str],
                    n_results: int = 10,
                    client: API = None,
                    db_path: str = "tmp/chromadb.db",
                    collection_name: str = "all-my-documents",
                    search_string: str = "",
                    embedding_model: str = "all-MiniLM-L6-v2",
                    embedding_function: Callable = None) -> QueryResult

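Query a vector database. Chromadb-compatible APIs are supported; this function is not needed if you have prepared your own vector database and query function.

Arguments:

query_texts List[str] - The list of strings used to query the vector database.
n_results Optional, int - The number of results to return. Default is 10.
client Optional, API - The chromadb-compatible client. Default is None, in which case a chromadb client is used.
db_path Optional, str - The path to the vector database. Default is "tmp/chromadb.db". For versions <=0.2.24 the default was /tmp/chromadb.db.
collection_name Optional, str - The name of the collection. Default is "all-my-documents".
search_string Optional, str - The search string. Only documents that contain an exact match of this string are retrieved. Default is "".
embedding_model Optional, str - The embedding model to use. Default is "all-MiniLM-L6-v2". Ignored if embedding_function is not None.
embedding_function Optional, Callable - The embedding function to use. Default is None, in which case SentenceTransformer with the given embedding_model is used. To use OpenAI, Cohere, HuggingFace, or other embedding functions, pass one here, following the examples in https://docs.trychroma.com/embeddings.

Returns:

The query result, in the following format: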

class QueryResult(TypedDict):
    ids: List[IDs]
    embeddings: Optional[List[List[Embedding]]]
    documents: Optional[List[List[Document]]]
    metadatas: Optional[List[List[Metadata]]]
    distances: Optional[List[List[float]]]
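
Each field holds one inner list per query text, so results are usually read query by query. The sketch below pairs each returned document with its distance; the sample dictionary is made-up data shaped like the structure above, and the helper name `pair_documents_with_distances` is hypothetical.

```python
from typing import Dict, List, Tuple

# Made-up result for two query texts, shaped like the QueryResult above.
sample_result: Dict[str, object] = {
    "ids": [["doc_0", "doc_3"], ["doc_1"]],
    "embeddings": None,
    "documents": [["first chunk", "fourth chunk"], ["second chunk"]],
    "metadatas": None,
    "distances": [[0.12, 0.47], [0.30]],
}


def pair_documents_with_distances(result: dict) -> List[List[Tuple[str, float]]]:
    """For each query, return (document, distance) pairs sorted by distance."""
    # Optional fields may be None, so fall back to empty lists.
    documents = result.get("documents") or []
    distances = result.get("distances") or []
    return [
        sorted(zip(docs, dists), key=lambda pair: pair[1])
        for docs, dists in zip(documents, distances)
    ]


pairs = pair_documents_with_distances(sample_result)
```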
