`langchain_community.document_transformers.beautiful_soup_transformer`.BeautifulSoupTransformer

class langchain_community.document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer

Transform HTML content by extracting specific tags and removing unwanted ones.

Example:

from langchain_community.document_transformers import BeautifulSoupTransformer

bs4_transformer = BeautifulSoupTransformer()
docs_transformed = bs4_transformer.transform_documents(docs)

Initialize the transformer.

This checks whether the BeautifulSoup4 package is installed. If it is not, an ImportError is raised.

Methods

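`__init__`()	Initialize the transformer.
`atransform_documents`(documents, **kwargs)	Asynchronously transform a list of documents.
`extract_tags`(html_content, tags, *[, ...])	Extract specific tags from the given HTML content.
`remove_unnecessary_lines`(content)	Clean up content by removing unnecessary lines.
`remove_unwanted_classnames`(html_content, ...)	Remove unwanted class names from the given HTML content.
`remove_unwanted_tags`(html_content, unwanted_tags)	Remove unwanted tags from the given HTML content.
`transform_documents`(documents[, ...])	Transform a list of Document objects by cleaning their HTML content.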

Return type: None

__init__() → None

Initialize the transformer.

This checks whether the BeautifulSoup4 package is installed. If it is not, an ImportError is raised.

Return type: None
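
The import check described above follows a common optional-dependency pattern. A minimal sketch of that pattern, using only the standard library; `require_package` is a hypothetical helper for illustration, not part of LangChain, and the library's own check may be written differently:

```python
import importlib.util


def require_package(module_name: str, pip_name: str) -> None:
    """Raise ImportError with an install hint if `module_name` is missing.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is required. "
            f"Install it with `pip install {pip_name}`."
        )


# Use a module name that should not exist, to show the failure path.
try:
    require_package("surely_missing_module_xyz", "surely-missing")
    missing_reported = False
except ImportError as exc:
    missing_reported = True
    print(exc)
```

Failing at construction time, rather than on the first method call, surfaces a missing dependency before any documents are processed.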

async atransform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document]

Asynchronously transform a list of documents.

Args:
    documents: A sequence of Documents to be transformed.
Returns:
    A list of transformed Documents.

Parameters

documents (Sequence[Document]) –
kwargs (Any) –

Return type

Sequence[Document]
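
This page does not say whether the asynchronous variant does its own work or defers to the synchronous method; that may depend on the installed version. As a hedged illustration of one way to keep a blocking transform off the event loop, the sketch below uses `asyncio.to_thread` (Python 3.9+) with a hypothetical stand-in function `transform_sync` in place of a call such as `transform_documents`:

```python
import asyncio
from typing import List


def transform_sync(pages: List[str]) -> List[str]:
    """Stand-in for a blocking transform; here it only strips whitespace."""
    return [page.strip() for page in pages]


async def transform_async(pages: List[str]) -> List[str]:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(transform_sync, pages)


result = asyncio.run(transform_async(["  <p>Hello</p>  "]))
print(result)
```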

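static extract_tags(html_content: str, tags: Union[List[str], Tuple[str, ...]], *, remove_comments: bool = False) → str

Extract specific tags from the given HTML content.

Args:
    html_content: The original HTML content string.
    tags: A list of tags to be extracted from the HTML.
Returns:
    A string combining the content of the extracted tags.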

Parameters

html_content (str) –
tags (Union[List[str], Tuple[str, ...]]) –
remove_comments (bool) –

Return type

str
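
A short usage sketch for `extract_tags`, assuming `langchain_community` and `beautifulsoup4` are installed (the `try`/`except` only lets the snippet run without them). The sample HTML is made up, and the exact spacing of the returned string may vary between versions:

```python
html = "<h1>Ignored heading</h1><p>Hello</p><ul><li>World</li></ul>"

try:
    from langchain_community.document_transformers import BeautifulSoupTransformer

    # Static method: no transformer instance is needed.
    text = BeautifulSoupTransformer.extract_tags(html, ["p", "li"])
except ImportError:
    text = None  # langchain_community or beautifulsoup4 is not installed

print(text)
```

Only text inside the listed tags is kept, so the heading text is not expected in the result.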

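static remove_unnecessary_lines(content: str) → str

Clean up the content by removing unnecessary lines.

Args:
    content: A string that may contain unnecessary lines or spaces.
Returns:
    A cleaned string with the unnecessary lines removed.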

Parameters: content (str) –
Return type: str
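
What counts as an "unnecessary" line is not spelled out here. A minimal sketch of one plausible cleanup, assuming it means blank lines and leading or trailing whitespace; `collapse_lines` is a hypothetical stand-in, not the library function, whose exact rule may differ:

```python
def collapse_lines(content: str) -> str:
    """Strip each line, drop the empty ones, and join the rest with spaces."""
    stripped = (line.strip() for line in content.split("\n"))
    return " ".join(line for line in stripped if line)


messy = "  First line  \n\n\n   Second line\n"
print(collapse_lines(messy))
```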

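static remove_unwanted_classnames(html_content: str, unwanted_classnames: Union[List[str], Tuple[str, ...]]) → str

Remove unwanted class names from the given HTML content.

Args:
    html_content: The original HTML content string.
    unwanted_classnames: A list of class names to be removed from the HTML.
Returns:
    A cleaned HTML string with the unwanted class names removed.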

Parameters

html_content (str) –
unwanted_classnames (Union[List[str], Tuple[str, ...]]) –

Return type

str
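
A short usage sketch for `remove_unwanted_classnames`, assuming `langchain_community` and `beautifulsoup4` are installed (the `try`/`except` only lets the snippet run without them). The sample HTML is made up. The description above does not say whether the whole element carrying the class is dropped or only the class attribute, so check the output against the installed version:

```python
html = '<div class="ad">Buy now</div><p class="body">Keep me</p>'

try:
    from langchain_community.document_transformers import BeautifulSoupTransformer

    # Static method: pass the HTML string and the class names to remove.
    cleaned = BeautifulSoupTransformer.remove_unwanted_classnames(html, ["ad"])
except ImportError:
    cleaned = None  # langchain_community or beautifulsoup4 is not installed

print(cleaned)
```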

static remove_unwanted_tags(html_content: str, unwanted_tags: Union[List[str], Tuple[str, ...]]) → str

Remove unwanted tags from the given HTML content.

Args:
    html_content: The original HTML content string.
    unwanted_tags: A list of tags to be removed from the HTML.
Returns:
    A cleaned HTML string with the unwanted tags removed.

Parameters

html_content (str) –
unwanted_tags (Union[List[str], Tuple[str, ...]]) –

Return type

str
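
A short usage sketch for `remove_unwanted_tags`, assuming `langchain_community` and `beautifulsoup4` are installed (the `try`/`except` only lets the snippet run without them). The sample HTML is made up:

```python
html = "<p>Keep this</p><script>alert(1)</script><style>p {color: red}</style>"

try:
    from langchain_community.document_transformers import BeautifulSoupTransformer

    # Returns HTML (not plain text) with the listed tags stripped out.
    cleaned = BeautifulSoupTransformer.remove_unwanted_tags(html, ["script", "style"])
except ImportError:
    cleaned = None  # langchain_community or beautifulsoup4 is not installed

print(cleaned)
```

Unlike `extract_tags`, the return value here is still HTML, so the two methods can be chained: remove first, then extract.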

transform_documents(documents: Sequence[Document], unwanted_tags: Union[List[str], Tuple[str, ...]] = ('script', 'style'), tags_to_extract: Union[List[str], Tuple[str, ...]] = ('p', 'li', 'div', 'a'), remove_lines: bool = True, *, unwanted_classnames: Union[Tuple[str, ...], List[str]] = (), remove_comments: bool = False, **kwargs: Any) → Sequence[Document]

Transform a list of Document objects by cleaning their HTML content.

Args:
    documents: A sequence of Document objects containing HTML content.
    unwanted_tags: A list of tags to be removed from the HTML.
    tags_to_extract: A list of tags whose content will be extracted.
    remove_lines: If set to True, unnecessary lines will be removed.
    unwanted_classnames: A list of class names to be removed from the HTML.
    remove_comments: If set to True, comments will be removed.
Returns:
    A sequence of Document objects with transformed content.

Parameters

documents (Sequence[Document]) –
unwanted_tags (Union[List[str], Tuple[str, ...]]) –
tags_to_extract (Union[List[str], Tuple[str, ...]]) –
remove_lines (bool) –
unwanted_classnames (Union[Tuple[str, ...], List[str]]) –
remove_comments (bool) –
kwargs (Any) –

Return type

Sequence[Document]

Examples using BeautifulSoupTransformer
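
An end-to-end sketch of `transform_documents` with explicit keyword arguments, assuming `langchain_community`, `langchain_core`, and `beautifulsoup4` are installed (the `try`/`except` only lets the snippet run without them). The sample HTML and the helper name `clean` are made up for illustration:

```python
from typing import Optional

html = """<html><head><style>p {color: red}</style></head>
<body>
  <script>alert("hi")</script>
  <div class="ad">Buy now</div>
  <p>First paragraph.</p>
  <ul><li>Item one</li></ul>
</body></html>"""


def clean(raw_html: str) -> Optional[str]:
    """Return the cleaned text, or None if the packages are unavailable."""
    try:
        from langchain_community.document_transformers import BeautifulSoupTransformer
        from langchain_core.documents import Document

        transformer = BeautifulSoupTransformer()  # raises ImportError without bs4
    except ImportError:
        return None

    docs = [Document(page_content=raw_html)]
    transformed = transformer.transform_documents(
        docs,
        unwanted_tags=("script", "style"),  # dropped before extraction
        tags_to_extract=("p", "li"),        # only text in these tags is kept
        remove_lines=True,
        unwanted_classnames=("ad",),
    )
    return transformed[0].page_content


cleaned = clean(html)
print(cleaned)
```

Narrowing `tags_to_extract` from the default `('p', 'li', 'div', 'a')` to just the tags that hold the main text is a simple way to keep navigation and wrapper content out of the result.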
