Beautiful Soup
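Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (hence the name, after "tag soup"). It builds a parse tree for each parsed page, which can be used to extract data from HTML,[3] making it useful for web scraping.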
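As a minimal sketch of the malformed-markup handling described above, the snippet below parses a fragment whose <p> and <b> tags are never closed. It assumes the `bs4` package is installed and uses Python's built-in "html.parser" backend; the sample markup is hypothetical and other backends may repair the tree differently.

```python
from bs4 import BeautifulSoup

# Hypothetical "tag soup": <p> and <b> are opened but never closed.
broken_html = "<div><p>Hello, <b>tag soup</div>"

# Parse with the standard-library backend (an assumption; lxml or html5lib
# could be used instead and may build a slightly different tree).
soup = BeautifulSoup(broken_html, "html.parser")

# The parse tree is still navigable despite the missing closing tags.
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()

print(bold_text)
print(paragraph_text)
```

The parser closes the dangling tags when it reaches the closing </div>, so the text inside them stays reachable through the tree.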
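Beautiful Soup offers fine-grained control over HTML content, enabling extraction of specific tags, removal of unwanted tags, and content cleaning.
It is suited to cases where you want to extract specific information and clean up the HTML content according to your needs.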
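For example, we can scrape the text content within <p>, <li>, <div>, and <a> tags from the HTML content:
- <p>: The paragraph tag. It defines a paragraph in HTML and is used to group related sentences and/or phrases.
- <li>: The list item tag. It is used within ordered (<ol>) and unordered (<ul>) lists to define the individual items in the list.
- <div>: The division tag. It is a block-level element used to group other inline or block-level elements.
- <a>: The anchor tag. It defines a hyperlink.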
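The extraction and cleaning steps described above can be sketched directly with `bs4`, without LangChain. This is an illustration under stated assumptions, not the transformer's actual implementation: the sample HTML and the helper name `extract_tag_text` are hypothetical, and only <p> and <li> are extracted so that text nested inside a <div> is not returned twice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; not taken from any real site.
SAMPLE_HTML = """
<html><body>
  <script>console.log("tracking");</script>
  <div><p>First paragraph.</p></div>
  <ul><li>Item one</li><li>Item two</li></ul>
</body></html>
"""


def extract_tag_text(html, tags, unwanted=("script", "style")):
    """Drop unwanted tags, then return the text of each requested tag."""
    soup = BeautifulSoup(html, "html.parser")
    # Removal: delete script/style elements together with their contents.
    for element in soup.find_all(list(unwanted)):
        element.decompose()
    # Extraction: collect whitespace-normalized text in document order.
    return [el.get_text(" ", strip=True) for el in soup.find_all(list(tags))]


texts = extract_tag_text(SAMPLE_HTML, ["p", "li"])
print(texts)  # ['First paragraph.', 'Item one', 'Item two']
```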
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer
# Load HTML
loader = AsyncChromiumLoader(["https://www.wsj.com"])
html = loader.load()
# Transform
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(
    html, tags_to_extract=["p", "li", "div", "a"]
)
docs_transformed[0].page_content[0:500]
'Conservative legal activists are challenging Amazon, Comcast and others using many of the same tools that helped kill affirmative-action programs in colleges.1,2099 min read U.S. stock indexes fell and government-bond prices climbed, after Moody’s lowered credit ratings for 10 smaller U.S. banks and said it was reviewing ratings for six larger ones. The Dow industrials dropped more than 150 points.3 min read Penn Entertainment’s Barstool Sportsbook app will be rebranded as ESPN Bet this fall as '