
Segmentation


The Segmentation pipeline segments text into semantic units.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Segmentation

# Create and run pipeline
segment = Segmentation(sentences=True)
segment("This is a test. And another test.")
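Other tokenization strategies are selected the same way through constructor flags. A short sketch with illustrative inputs (exact splits depend on the underlying tokenizer):

from txtai.pipeline import Segmentation

# Split on newlines
segment = Segmentation(lines=True)
segment("First line\nSecond line")

# Split on blank lines between paragraphs
segment = Segmentation(paragraphs=True)
segment("First paragraph.\n\nSecond paragraph.")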

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
segmentation:
  sentences: true

# Run pipeline with workflow
workflow:
  segment:
    tasks:
      - action: segmentation

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("segment", ["This is a test. And another test."]))
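Note that app.workflow returns a generator, which is why the call above is wrapped in list() to materialize the segmented results.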

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"segment", "elements":["This is a test. And another test."]}'
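Assuming the configuration above, the API responds with the workflow results serialized as JSON, one segmented result per input element. The response shape below is illustrative:

[["This is a test.","And another test."]]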

Methods

Python documentation for the pipeline.

__init__(sentences=False, lines=False, paragraphs=False, minlength=None, join=False, sections=False)

Creates a new Segmentation pipeline.

Parameters:

| Name | Type | Description | Default |
|------------|------|-----------------------------------------------------------------------------------------------------------------------------|---------|
| sentences | | tokenize text into sentences if True, defaults to False | False |
| lines | | tokenizes text into lines if True, defaults to False | False |
| paragraphs | | tokenizes text into paragraphs if True, defaults to False | False |
| minlength | | require at least minlength characters per text element, defaults to None | None |
| join | | joins tokenized sections back together if True, defaults to False | False |
| sections | | tokenizes text into sections if True, defaults to False. Splits using section or page breaks, depending on what's available | False |
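These parameters compose. For example, paragraph segmentation can be combined with minlength filtering and join to normalize noisy text. A minimal sketch, with an illustrative input and threshold:

from txtai.pipeline import Segmentation

# Split into paragraphs, drop elements shorter than 10 characters,
# then join the surviving paragraphs back into a single string
segment = Segmentation(paragraphs=True, minlength=10, join=True)
segment("First paragraph.\n\nSecond paragraph.\n\nok")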
Source code in txtai/pipeline/data/segmentation.py
def __init__(self, sentences=False, lines=False, paragraphs=False, minlength=None, join=False, sections=False):
    """
    Creates a new Segmentation pipeline.

    Args:
        sentences: tokenize text into sentences if True, defaults to False
        lines: tokenizes text into lines if True, defaults to False
        paragraphs: tokenizes text into paragraphs if True, defaults to False
        minlength: require at least minlength characters per text element, defaults to None
        join: joins tokenized sections back together if True, defaults to False
        sections: tokenizes text into sections if True, defaults to False. Splits using section or page breaks, depending on what's available
    """

    if not NLTK:
        raise ImportError('Segmentation pipeline is not available - install "pipeline" extra to enable')

    self.sentences = sentences
    self.lines = lines
    self.paragraphs = paragraphs
    self.sections = sections
    self.minlength = minlength
    self.join = join
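As the constructor shows, this pipeline requires NLTK and raises an ImportError when it is unavailable. Installing the pipeline extra referenced in the error message pulls in the dependency:

pip install txtai[pipeline]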

__call__(text)

Segments text into semantic units.

This method supports text as a string or a list. If the input is a string, the return type is text|list. If text is a list, a list is returned; this could be a list of text or a list of lists depending on the tokenization strategy.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|----------|
| text | | text\|list | required |

Returns:

| Type | Description |
|------|----------------|
| | segmented text |
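The return shape mirrors the input: a string yields a single segmented result, while a list yields one result per element. A minimal sketch, assuming sentence segmentation (the commented outputs are illustrative):

from txtai.pipeline import Segmentation

segment = Segmentation(sentences=True)

# String input returns the segmented result directly
segment("This is a test. And another test.")
# ["This is a test.", "And another test."]

# List input returns one result per element, here a list of lists
segment(["This is a test. And another test."])
# [["This is a test.", "And another test."]]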

Source code in txtai/pipeline/data/segmentation.py
def __call__(self, text):
    """
    Segments text into semantic units.

    This method supports text as a string or a list. If the input is a string, the return
    type is text|list. If text is a list, a list is returned; this could be a
    list of text or a list of lists depending on the tokenization strategy.

    Args:
        text: text|list

    Returns:
        segmented text
    """

    # Get inputs
    texts = [text] if not isinstance(text, list) else text

    # Extract text for each input file
    results = []
    for value in texts:
        # Get text
        value = self.text(value)

        # Parse and add extracted results
        results.append(self.parse(value))

    return results[0] if isinstance(text, str) else results
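One detail worth noting: per the comments in the loop, self.text resolves each input before parsing, which is how the pipeline handles inputs such as file paths in addition to raw strings. This is inferred from the snippet above; see the full source in txtai/pipeline/data/segmentation.py for the exact behavior.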