
Translation


The Translation pipeline translates text between languages. It supports over 100 languages and has automatic source language detection built in. The pipeline detects the language of each input text, loads a model for the source-target language combination, and translates the text into the target language.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Translation

# Create and run pipeline
translate = Translation()
translate("This is a test translation into Spanish", "es")

See the link below for a more detailed example.

Notebook  Description
Translate text between languages  Streamline machine translation and language detection  Open In Colab

Configuration-driven example

Pipelines can be run with Python or configuration. Pipelines can be instantiated in configuration using the lowercase name of the pipeline class. Configuration-driven pipelines can be run with workflows or the API.

config.yml

# Create pipeline using lower case class name
translation:

# Run pipeline with workflow
workflow:
  translate:
    tasks:
      - action: translation
        args: ["es"]

Run with Workflows

from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("translate", ["This is a test translation into Spanish"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"translate", "elements":["This is a test translation into Spanish"]}'

Methods

Python documentation for the pipeline.

__init__(path=None, quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True)

Constructs a new language translation pipeline.

Parameters:

path: optional path to model, accepts Hugging Face model hub id or local path, uses default model for task if not provided (default: None)
quantize: if model should be quantized (default: False)
gpu: True/False if GPU should be enabled, also supports a GPU device id (default: True)
batch: batch size used to incrementally process content (default: 64)
langdetect: set a custom language detection function, method must take a list of strings and return language codes for each, uses default language detector if not provided (default: None)
findmodels: True/False if the Hugging Face Hub will be searched for source-target translation models (default: True)
Source code in txtai/pipeline/text/translation.py
def __init__(self, path=None, quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True):
    """
    Constructs a new language translation pipeline.

    Args:
        path: optional path to model, accepts Hugging Face model hub id or local path,
              uses default model for task if not provided
        quantize: if model should be quantized, defaults to False
        gpu: True/False if GPU should be enabled, also supports a GPU device id
        batch: batch size used to incrementally process content
        langdetect: set a custom language detection function, method must take a list of strings and return
                    language codes for each, uses default language detector if not provided
        findmodels: True/False if the Hugging Face Hub will be searched for source-target translation models
    """

    # Call parent constructor
    super().__init__(path if path else "facebook/m2m100_418M", quantize, gpu, batch)

    # Language detection
    self.detector = None
    self.langdetect = langdetect
    self.findmodels = findmodels

    # Language models
    self.models = {}
    self.ids = self.modelids()

__call__(texts, target='en', source=None, showmodels=False)

Translates text from source language into target language.

This method supports texts as a string or a list. If the input is a string, the return type is a string. If the input is a list, the return type is a list.

Parameters:

texts: text|list (required)
target: target language code (default: "en")
source: source language code, detects language if not provided (default: None)

Returns:

list of translated text

Source code in txtai/pipeline/text/translation.py
def __call__(self, texts, target="en", source=None, showmodels=False):
    """
    Translates text from source language into target language.

    This method supports texts as a string or a list. If the input is a string,
    the return type is string. If text is a list, the return type is a list.

    Args:
        texts: text|list
        target: target language code, defaults to "en"
        source: source language code, detects language if not provided

    Returns:
        list of translated text
    """

    values = [texts] if not isinstance(texts, list) else texts

    # Detect source languages
    languages = self.detect(values) if not source else [source] * len(values)
    unique = set(languages)

    # Build a dict from language to list of (index, text)
    langdict = {}
    for x, lang in enumerate(languages):
        if lang not in langdict:
            langdict[lang] = []
        langdict[lang].append((x, values[x]))

    results = {}
    for language in unique:
        # Get all indices and text values for a language
        inputs = langdict[language]

        # Translate text in batches
        outputs = []
        for chunk in self.batch([text for _, text in inputs], self.batchsize):
            outputs.extend(self.translate(chunk, language, target, showmodels))

        # Store output value
        for y, (x, _) in enumerate(inputs):
            if showmodels:
                model, op = outputs[y]
                results[x] = (op.strip(), language, model)
            else:
                results[x] = outputs[y].strip()

    # Return results in same order as input
    results = [results[x] for x in sorted(results)]
    return results[0] if isinstance(texts, str) else results
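The bookkeeping in `__call__` can be isolated: inputs are bucketed by detected language, translated one bucket at a time, then restored to the original input order. Below is a standalone sketch of that logic, with an identity function standing in for the model call; `group_and_restore` is a name introduced here for illustration.

```python
def group_and_restore(values, languages, translate=lambda batch, lang: batch):
    # Bucket inputs by language, keeping each text's original index
    langdict = {}
    for x, lang in enumerate(languages):
        langdict.setdefault(lang, []).append((x, values[x]))

    # "Translate" each bucket, then map outputs back to input positions
    results = {}
    for lang, inputs in langdict.items():
        outputs = translate([text for _, text in inputs], lang)
        for y, (x, _) in enumerate(inputs):
            results[x] = outputs[y]

    # Results come back in the same order as the inputs
    return [results[x] for x in sorted(results)]
```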