Textractor


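The Textractor pipeline extracts and splits text from documents. It uses Apache Tika (if Java is available) and BeautifulSoup4. See the Apache Tika documentation for the list of supported document formats.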

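Each document goes through the following processing steps:

Content is retrieved if it is not local
If the document mime-type is not plain text or HTML, it is run through Tika and converted to XHTML
XHTML is converted to Markdown and returned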
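
The routing described in the steps above can be sketched roughly as follows. This is a minimal illustration and not txtai's implementation: `needs_tika` and `plan` are hypothetical helpers, the step labels are made up for readability, and guessing the MIME type from the file name with the standard library `mimetypes` module is an assumption about how detection might work.

```python
import mimetypes

# MIME types assumed to be handled without Apache Tika
DIRECT = {"text/plain", "text/html"}


def needs_tika(path):
    """Hypothetical helper: True if the guessed MIME type is not plain text or HTML."""
    mimetype, _ = mimetypes.guess_type(path)
    return mimetype not in DIRECT


def plan(path):
    """Hypothetical helper: lists the processing steps for a local path or URL."""
    steps = []

    # Step 1: remote content has to be retrieved first
    if path.startswith(("http://", "https://")):
        steps.append("retrieve")

    # Step 2: anything other than plain text or HTML goes through Tika
    if needs_tika(path):
        steps.append("tika-to-xhtml")

    # Step 3: the result is converted to Markdown and returned
    steps.append("xhtml-to-markdown")
    return steps
```

Calling `plan("report.pdf")` would include the Tika step, while `plan("page.html")` would skip it.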

Without Apache Tika, this pipeline only supports plain text and HTML. Other document types require Tika and Java to be installed. Another option is to start Apache Tika via a Docker image.
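
A quick way to see which mode applies on a given machine is to check whether Java is reachable. The sketch below is an assumption-laden illustration: `java_available` and `supported_inputs` are hypothetical helpers, and looking for a `java` executable on `PATH` is only one possible check, not necessarily the one the library performs.

```python
import shutil


def java_available():
    """Hypothetical check: True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


def supported_inputs(tika_enabled=True):
    """Describes what can be parsed, following the rule stated above."""
    if tika_enabled and java_available():
        return "all formats supported by Apache Tika"

    # Without Tika (or without Java), only the direct formats remain
    return "plain text and HTML only"


print(supported_inputs())
```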

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Textractor

# Create and run pipeline
textract = Textractor()
textract("https://github.com/neuml/txtai")

See the link below for a more detailed example.

Notebook	Description
Extract text from documents	Extract text from PDF, Office, HTML and more

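Configuration-driven example

Pipelines can be run with Python or with configuration. In configuration, a pipeline is instantiated using the lowercase form of its class name. Configuration-driven pipelines are run through workflows or the API.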
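
The lowercase-class-name convention can be pictured as a lookup table from configuration keys to classes. The sketch below is only an illustration of that idea: the `Textractor` class here is a stand-in, and `REGISTRY` and `create` are hypothetical names, not part of txtai.

```python
class Textractor:
    """Stand-in class for illustration only; not the txtai implementation."""

    def __init__(self, **kwargs):
        self.options = kwargs


# Hypothetical registry keyed by lowercase class name
REGISTRY = {cls.__name__.lower(): cls for cls in (Textractor,)}


def create(config):
    """Instantiates one pipeline per top-level config key found in the registry."""
    pipelines = {}
    for name, options in config.items():
        if name in REGISTRY:
            # An empty YAML value (e.g. `textractor:`) loads as None
            pipelines[name] = REGISTRY[name](**(options or {}))

    return pipelines


# Keys that are not registered pipelines, such as `workflow`, are skipped
pipelines = create({"textractor": None, "workflow": {}})
```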

config.yml

# Create pipeline using lowercase class name
textractor:

# Run pipeline with workflow
workflow:
  textract:
    tasks:
      - action: textractor

Run with workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("textract", ["https://github.com/neuml/txtai"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"textract", "elements":["https://github.com/neuml/txtai"]}'
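
The same request can be built from Python with the standard library. This is a minimal sketch: `workflow_request` is a hypothetical helper, and the URL and payload simply mirror the curl call above. The request is only constructed here; sending it assumes the API server is running locally.

```python
import json
import urllib.request


def workflow_request(name, elements, url="http://localhost:8000/workflow"):
    """Hypothetical helper: builds the POST request for a named workflow."""
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = workflow_request("textract", ["https://github.com/neuml/txtai"])

# Sending requires the API server to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```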

Methods

Python documentation for the pipeline.

`__init__(sentences=False, lines=False, paragraphs=False, minlength=None, join=False, tika=True, sections=False, headers=None)`

Source code in txtai/pipeline/data/textractor.py

def __init__(self, sentences=False, lines=False, paragraphs=False, minlength=None, join=False, tika=True, sections=False, headers=None):
    if not TIKA:
        raise ImportError('Textractor pipeline is not available - install "pipeline" extra to enable')

    super().__init__(sentences, lines, paragraphs, minlength, join, sections)

    # Determine if Apache Tika (default if Java is available) or Beautiful Soup should be used
    # Beautiful Soup only supports HTML, Tika supports a wide variety of file formats.
    self.tika = self.checkjava() if tika else False

    # HTML to Text extractor
    self.extract = Extract(self.paragraphs, self.sections)

    # HTTP headers
    self.headers = headers if headers else {}

`__call__(text)`

Segments text into semantic units.

This method accepts text as a string or a list. If the input is a string, the return type is text|list. If the input is a list, a list is returned; this could be a list of text or a list of lists, depending on the tokenization strategy.

Parameters:

Name	Type	Description	Default
`text`		text|list	required

Returns:

Type	Description
	segmented text

Source code in txtai/pipeline/data/segmentation.py

def __call__(self, text):
    """
    Segments text into semantic units.

    This method supports text as a string or a list. If the input is a string, the return
    type is text|list. If text is a list, a list is returned, this could be a
    list of text or a list of lists depending on the tokenization strategy.

    Args:
        text: text|list

    Returns:
        segmented text
    """

    # Get inputs
    texts = [text] if not isinstance(text, list) else text

    # Extract text for each input file
    results = []
    for value in texts:
        # Get text
        value = self.text(value)

        # Parse and add extracted results
        results.append(self.parse(value))

    return results[0] if isinstance(text, str) else results
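
The string-versus-list behavior above can be tried out with a simplified stand-in. In this sketch, `segment` and `run` are hypothetical functions: `segment` only imitates paragraph splitting on blank lines plus the `minlength` and `join` options, as an assumption about how those options interact, while `run` mirrors the input wrapping and single-result unwrapping shown in `__call__`.

```python
import re


def segment(text, paragraphs=False, minlength=None, join=False):
    """Simplified stand-in for the parsing step; splits on blank lines when paragraphs=True."""
    if not paragraphs:
        return text.strip()

    # Split on blank lines, then drop empty and too-short segments
    parts = [part.strip() for part in re.split(r"\n{2,}", text)]
    parts = [part for part in parts if part and (not minlength or len(part) >= minlength)]

    return " ".join(parts) if join else parts


def run(text, **kwargs):
    """Accepts a string or a list and unwraps the result for a single string input."""
    texts = [text] if not isinstance(text, list) else text
    results = [segment(value, **kwargs) for value in texts]
    return results[0] if isinstance(text, str) else results
```

With `paragraphs=True`, a string input yields a list of segments and a list input yields a list of lists; with the default options, a string input yields a string.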