Tabular

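The Tabular pipeline splits tabular data into rows and columns. It is most useful for creating (id, text, tag) tuples to load into an embeddings index.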
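To make the tuple shape concrete, here is a minimal standard-library sketch of the idea, not txtai's implementation. The helper name `rows_to_tuples`, the `". "` separator, and the sample data are assumptions made for illustration only.

```python
import csv
import io


def rows_to_tuples(csv_text, idcolumn, textcolumns, separator=". "):
    """Hypothetical helper: turn CSV text into (id, text, tag) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    tuples = []
    for row in reader:
        # Join the non-empty text columns into a single text field
        # (the separator is an assumption of this sketch)
        text = separator.join(row[column] for column in textcolumns if row[column])
        # The tag slot is left as None in this sketch
        tuples.append((row[idcolumn], text, None))
    return tuples


# Made-up sample data, for illustration only
SAMPLE = "id,title,body\n1,Hello,First row\n2,World,Second row\n"
rows = rows_to_tuples(SAMPLE, "id", ["title", "body"])
```

Each tuple carries the row id, the combined text, and an empty tag. Note that `csv.DictReader` yields every value as a string, so the ids here are `"1"` and `"2"` rather than integers.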

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Tabular

# Create and run pipeline
tabular = Tabular("id", ["text"])
tabular("path to csv file")

See the notebook below for a more detailed example.

Notebook: Transform tabular data with composable workflows
Description: Transform, index and search tabular data

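Configuration-driven example

Pipelines are run with Python or configuration. A pipeline can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.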

config.yml

# Create pipeline using lower case class name
tabular:
  idcolumn: id
  textcolumns:
    - text

# Run pipeline with workflow
workflow:
  tabular:
    tasks:
      - action: tabular

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("tabular", ["path to csv file"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"tabular", "elements":["path to csv file"]}'
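The same request can be built from Python. The sketch below uses only the standard library and mirrors the curl call above (same URL, header and JSON body). The helper names `build_workflow_request` and `run_workflow` are hypothetical, and the response is assumed to be JSON.

```python
import json
import urllib.request


def build_workflow_request(name, elements, base_url="http://localhost:8000"):
    """Hypothetical helper: build the POST request sent by the curl example."""
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/workflow",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_workflow(name, elements, base_url="http://localhost:8000"):
    """Send the request; requires the API server to be running."""
    request = build_workflow_request(name, elements, base_url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the server replies with a JSON document
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network
request = build_workflow_request("tabular", ["path to csv file"])
```

Calling `run_workflow("tabular", ["path to csv file"])` would then send the request once the server started with uvicorn is listening.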

Methods

Python documentation for the pipeline.

__init__(idcolumn=None, textcolumns=None, content=False)

Creates a new Tabular pipeline.

Parameters:

    idcolumn: column name to use for row id (default: None)
    textcolumns: list of columns to combine as a text field (default: None)
    content: if True, a dict per row is generated with all fields. If content is a list, a subset of fields is included in the generated rows. (default: False)

Source code in txtai/pipeline/data/tabular.py

def __init__(self, idcolumn=None, textcolumns=None, content=False):
    """
    Creates a new Tabular pipeline.

    Args:
        idcolumn: column name to use for row id
        textcolumns: list of columns to combine as a text field
        content: if True, a dict per row is generated with all fields. If content is a list, a subset of fields
                 is included in the generated rows.
    """

    if not PANDAS:
        raise ImportError('Tabular pipeline is not available - install "pipeline" extra to enable')

    self.idcolumn = idcolumn
    self.textcolumns = textcolumns
    self.content = content
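The three documented values of `content` can be pictured with a small sketch. This is an illustration of the docstring's wording only, not the library's code: `select_content` is a hypothetical helper, and how the pipeline attaches the resulting dicts to its output is not shown in this section.

```python
def select_content(row, content):
    """Hypothetical helper mirroring the documented `content` semantics."""
    if content is True:
        # content=True: keep every field of the row
        return dict(row)
    if isinstance(content, list):
        # content=[...]: keep only the listed fields
        return {key: value for key, value in row.items() if key in content}
    # content=False: no per-row dict is generated
    return None


# Made-up row, for illustration only
row = {"id": 1, "text": "hello", "author": "ann", "year": 2021}
all_fields = select_content(row, True)
subset = select_content(row, ["author", "year"])
nothing = select_content(row, False)
```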

__call__(data)

Splits data into rows and columns.

Parameters:

    data: input data (required)

Returns:

    list of (id, text, tag)

Source code in txtai/pipeline/data/tabular.py

def __call__(self, data):
    """
    Splits data into rows and columns.

    Args:
        data: input data

    Returns:
        list of (id, text, tag)
    """

    items = [data] if not isinstance(data, list) else data

    # Combine all rows into single return element
    results = []
    dicts = []

    for item in items:
        # File path
        if isinstance(item, str):
            _, extension = os.path.splitext(item)
            extension = extension.replace(".", "").lower()

            if extension == "csv":
                df = pd.read_csv(item)

            results.append(self.process(df))

        # Dict
        if isinstance(item, dict):
            dicts.append(item)

        # List of dicts
        elif isinstance(item, list):
            df = pd.DataFrame(item)
            results.append(self.process(df))

    if dicts:
        df = pd.DataFrame(dicts)
        results.extend(self.process(df))

    return results[0] if not isinstance(data, list) else results
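The shape of the return value depends on the shape of the input, which is easy to miss when reading the listing. The sketch below mirrors only the dict and list branching above, without pandas or file handling. `call_shape` and `fake_process` are hypothetical stand-ins, and `fake_process` is assumed to return one (id, text, tag) tuple per row since `process` itself is not shown here.

```python
def fake_process(rows):
    """Stand-in for self.process: one (id, text, tag) tuple per row."""
    return [(row["id"], row["text"], None) for row in rows]


def call_shape(data, process=fake_process):
    """Mirror of the dict / list-of-dicts branching in __call__."""
    items = [data] if not isinstance(data, list) else data
    results, dicts = [], []
    for item in items:
        if isinstance(item, dict):
            # Plain dicts are collected and processed together later
            dicts.append(item)
        elif isinstance(item, list):
            # Each list of dicts becomes its own nested result
            results.append(process(item))
    if dicts:
        # Collected dicts are flattened into the results
        results.extend(process(dicts))
    return results[0] if not isinstance(data, list) else results


a = {"id": 1, "text": "a"}
b = {"id": 2, "text": "b"}

single = call_shape(a)            # one dict in, one tuple out
flat = call_shape([a, b])         # list of dicts in, flat list of tuples out
nested = call_shape([[a], [b]])   # list of lists in, list of lists out
```

Under these assumptions a single dict yields a single tuple, a list of dicts yields a flat list of tuples, and a list of lists yields one nested list per inner list.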