SemanticChunker
- class langchain_experimental.text_splitter.SemanticChunker(embeddings: Embeddings, buffer_size: int = 1, add_start_index: bool = False, breakpoint_threshold_type: Literal['percentile', 'standard_deviation', 'interquartile', 'gradient'] = 'percentile', breakpoint_threshold_amount: float | None = None, number_of_chunks: int | None = None, sentence_split_regex: str = '(?<=[.?!])\\s+', min_chunk_size: int | None = None)
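Split text based on semantic similarity.
Taken from Greg Kamradt's wonderful notebook: FullStackRetrieval-com/RetrievalTutorials
All credit goes to him.
At a high level, this splits the text into sentences, groups the sentences into windows of 3 (each sentence plus one neighbor on either side, given the default buffer_size of 1), and then merges neighboring groups that are similar in embedding space.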
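The high-level steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real `Embeddings` model, the helper names are invented for this example, and `threshold` is passed in directly instead of being derived from `breakpoint_threshold_type`.

```python
import math
import re
from collections import Counter

# Default sentence_split_regex from the class signature.
SENTENCE_SPLIT_REGEX = r"(?<=[.?!])\s+"


def split_sentences(text):
    """Split text after '.', '?' or '!' followed by whitespace."""
    return [s for s in re.split(SENTENCE_SPLIT_REGEX, text.strip()) if s]


def window_sentences(sentences, buffer_size=1):
    """Join each sentence with up to buffer_size neighbors on each side."""
    windows = []
    for i in range(len(sentences)):
        lo = max(0, i - buffer_size)
        hi = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[lo:hi]))
    return windows


def toy_embed(text):
    """Hypothetical embedding: word counts (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(text, threshold, buffer_size=1):
    """Start a new chunk wherever adjacent windows are farther apart than threshold."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = [toy_embed(w) for w in window_sentences(sentences, buffer_size)]
    chunks = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

For example, `semantic_chunks("Cats purr. Cats nap. Cats play. Stocks rise. Stocks fall. Stocks trade.", threshold=0.2)` separates the three cat sentences from the three stock sentences, because the largest distance between neighboring windows falls at the topic change.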
Methods
__init__(embeddings[, buffer_size, ...])
atransform_documents(documents, **kwargs): Asynchronously transform a list of documents.
create_documents(texts[, metadatas]): Create documents from a list of texts.
split_documents(documents): Split documents.
split_text(text)
transform_documents(documents, **kwargs): Transform a sequence of documents by splitting them.
- Parameters:
embeddings (Embeddings)
buffer_size (int)
add_start_index (bool)
breakpoint_threshold_type (Literal['percentile', 'standard_deviation', 'interquartile', 'gradient'])
breakpoint_threshold_amount (float | None)
number_of_chunks (int | None)
sentence_split_regex (str)
min_chunk_size (int | None)
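This page names the `breakpoint_threshold_type` options but does not say how each one turns a list of distances into a cut-off. The sketch below shows one plausible reading for three of them. The formulas and the fallback values used when `amount` is None (95, 3 and 1.5) are assumptions made for illustration, the helper names are hypothetical, and the 'gradient' option is left out.

```python
import statistics


def percentile(values, q):
    """Linearly interpolated q-th percentile (0 to 100) of values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def breakpoint_threshold(distances, kind="percentile", amount=None):
    """Return the distance above which a split would be made (illustrative only)."""
    if kind == "percentile":
        # Split at the largest distances, e.g. the top 5 percent.
        return percentile(distances, 95.0 if amount is None else amount)
    if kind == "standard_deviation":
        # Split where the distance exceeds mean + k standard deviations.
        k = 3.0 if amount is None else amount
        return statistics.fmean(distances) + k * statistics.pstdev(distances)
    if kind == "interquartile":
        # Split where the distance exceeds mean + k * interquartile range.
        k = 1.5 if amount is None else amount
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.fmean(distances) + k * iqr
    raise ValueError(f"unsupported threshold type in this sketch: {kind}")
```

A lower `amount` yields a lower threshold and therefore more, smaller chunks; a higher one yields fewer, larger chunks.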
- __init__(embeddings: Embeddings, buffer_size: int = 1, add_start_index: bool = False, breakpoint_threshold_type: Literal['percentile', 'standard_deviation', 'interquartile', 'gradient'] = 'percentile', breakpoint_threshold_amount: float | None = None, number_of_chunks: int | None = None, sentence_split_regex: str = '(?<=[.?!])\\s+', min_chunk_size: int | None = None)
- Parameters:
embeddings (Embeddings)
buffer_size (int)
add_start_index (bool)
breakpoint_threshold_type (Literal['percentile', 'standard_deviation', 'interquartile', 'gradient'])
breakpoint_threshold_amount (float | None)
number_of_chunks (int | None)
sentence_split_regex (str)
min_chunk_size (int | None)
- async atransform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document]
Asynchronously transform a list of documents.
- create_documents(texts: List[str], metadatas: List[dict] | None = None) → List[Document]
Create documents from a list of texts.
- Parameters:
texts (List[str])
metadatas (List[dict] | None)
- Return type:
List[Document]
Examples using SemanticChunker