自定义C++和CUDA操作符

创建于：2024年6月18日 | 最后更新：2024年12月30日 | 最后验证：2024年11月5日

What you will learn

如何将用C++/CUDA编写的自定义操作符与PyTorch集成
如何使用torch.library.opcheck测试自定义运算符

Prerequisites

PyTorch 2.4 或更高版本
对C++和CUDA编程的基本理解

PyTorch 提供了一个庞大的运算符库，这些运算符作用于张量（例如 torch.add、torch.sum 等）。然而，您可能希望为 PyTorch 引入一个新的自定义运算符。本教程演示了使用 C++/CUDA 编写自定义运算符的推荐路径。

在我们的教程中，我们将演示如何编写一个与PyTorch子系统组合的融合乘加C++和CUDA操作符。该操作的语义如下：

def mymuladd(a: Tensor, b: Tensor, c: float):
    return a * b + c

你可以找到本教程的端到端工作示例这里。

设置构建系统

如果您正在开发自定义的C++/CUDA代码，它必须被编译。请注意，如果您正在与一个已经绑定到预编译的C++/CUDA代码的Python库进行接口交互，您可能会考虑编写一个自定义的Python操作符（自定义Python操作符）。

使用 torch.utils.cpp_extension 来编译自定义的 C++/CUDA 代码以用于 PyTorch C++ 扩展可以通过 setuptools “提前”构建，或者通过 load_inline “即时”构建；我们将重点介绍“提前”构建的方式。

使用cpp_extension就像编写以下setup.py一样简单：

from setuptools import setup, Extension
from torch.utils import cpp_extension

setup(name="extension_cpp",
      ext_modules=[
          cpp_extension.CppExtension("extension_cpp", ["muladd.cpp"])],
      cmdclass={'build_ext': cpp_extension.BuildExtension})

如果您需要编译CUDA代码（例如，.cu文件），则应使用 torch.utils.cpp_extension.CUDAExtension。请参阅extension-cpp以了解如何设置此功能。

从 PyTorch 2.6 开始，你现在可以为多个 CPython 版本构建一个单一的 wheel 文件（类似于你为纯 Python 包所做的操作）。特别是，如果你的自定义库遵循 CPython 稳定有限 API 或完全避免使用 CPython，你可以通过 setuptools 的 py_limited_api 标志，针对最低支持的 CPython 版本构建一个与 Python 无关的 wheel 文件，如下所示：

from setuptools import setup, Extension
from torch.utils import cpp_extension

setup(name="extension_cpp",
      ext_modules=[
          cpp_extension.CppExtension(
            "extension_cpp",
            ["python_agnostic_code.cpp"],
            py_limited_api=True)],
      cmdclass={'build_ext': cpp_extension.BuildExtension},
      options={"bdist_wheel": {"py_limited_api": "cp39"}}
)

请注意，您必须在setup中指定py_limited_api=True，并且还要将其作为"bdist_wheel"命令的选项，同时指定支持的最低Python版本（在本例中为3.9）。此setup将构建一个可以在多个Python版本python>=3.9上安装的wheel。请参阅torchao以获取示例。

注意

你必须独立验证构建的wheel是否真正与Python无关。指定py_limited_api并不会检查任何保证，因此有可能构建一个看起来与Python无关但实际上会在另一个Python环境中崩溃，或者更糟的是，静默地错误的wheel。注意避免使用不稳定的CPython API，例如来自libtorch_python的API（特别是pytorch/python绑定），并且只使用来自libtorch的API（aten对象、操作符和调度器）。例如，为了从Python访问自定义操作，库应该通过调度器注册这些操作（下面会介绍！）。

定义自定义操作并添加后端实现

首先，让我们编写一个计算 mymuladd 的 C++ 函数：

at::Tensor mymuladd_cpu(at::Tensor a, const at::Tensor& b, double c) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_INTERNAL_ASSERT(a.device().type() == at::DeviceType::CPU);
  TORCH_INTERNAL_ASSERT(b.device().type() == at::DeviceType::CPU);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = result.data_ptr<float>();
  for (int64_t i = 0; i < result.numel(); i++) {
    result_ptr[i] = a_ptr[i] * b_ptr[i] + c;
  }
  return result;
}

为了从PyTorch的Python前端使用这个功能，我们需要使用TORCH_LIBRARY API将其注册为PyTorch操作符。这将自动将该操作符绑定到Python。

操作员注册是一个两步过程：

定义操作符 - 此步骤确保PyTorch知道新的操作符。
注册后端实现 - 在此步骤中，各种后端的实现（如CPU和CUDA）与操作符相关联。

定义一个操作符

要定义一个运算符，请按照以下步骤操作：

为操作符选择一个命名空间。我们建议命名空间是您顶级项目的名称；在我们的教程中，我们将使用“extension_cpp”。
提供一个模式字符串，用于指定操作符的输入/输出类型以及输入张量是否会被改变。除了张量和浮点数外，我们还支持更多类型；详情请参阅The Custom Operators Manual。
- 如果您正在编写一个可以改变其输入张量的操作符，请参阅此处 (创建可变操作符) 以了解如何指定。

TORCH_LIBRARY(extension_cpp, m) {
   // Note that "float" in the schema corresponds to the C++ double type
   // and the Python float type.
   m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
 }

这使得操作符可以通过torch.ops.extension_cpp.mymuladd在Python中使用。

为操作符注册后端实现

使用 TORCH_LIBRARY_IMPL 为操作符注册一个后端实现。

TORCH_LIBRARY_IMPL(extension_cpp, CPU, m) {
  m.impl("mymuladd", &mymuladd_cpu);
}

如果你也有一个CUDA实现的myaddmul，你可以在一个单独的TORCH_LIBRARY_IMPL块中注册它：

__global__ void muladd_kernel(int numel, const float* a, const float* b, float c, float* result) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < numel) result[idx] = a[idx] * b[idx] + c;
}

at::Tensor mymuladd_cuda(const at::Tensor& a, const at::Tensor& b, double c) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_INTERNAL_ASSERT(a.device().type() == at::DeviceType::CUDA);
  TORCH_INTERNAL_ASSERT(b.device().type() == at::DeviceType::CUDA);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = result.data_ptr<float>();

  int numel = a_contig.numel();
  muladd_kernel<<<(numel+255)/256, 256>>>(numel, a_ptr, b_ptr, c, result_ptr);
  return result;
}

TORCH_LIBRARY_IMPL(extension_cpp, CUDA, m) {
  m.impl("mymuladd", &mymuladd_cuda);
}

为操作符添加`torch.compile`支持

要为操作符添加torch.compile支持，我们必须添加一个FakeTensor内核（也称为“元内核”或“抽象实现”）。FakeTensors是具有元数据（如形状、数据类型、设备）但没有数据的张量：操作符的FakeTensor内核指定了如何根据输入张量的元数据计算输出张量的元数据。FakeTensor内核应返回您选择的具有正确张量元数据（形状/步幅/dtype/设备）的虚拟张量。

我们建议通过Python中的torch.library.register_fake API来完成此操作，尽管也可以从C++中完成（详情请参阅 The Custom Operators Manual ）。

# Important: the C++ custom operator definitions should be loaded first
# before calling ``torch.library`` APIs that add registrations for the
# C++ custom operator(s). The following import loads our
# C++ custom operator definitions.
# Note that if you are striving for Python agnosticism, you should use
# the ``load_library(...)`` API call instead. See the next section for
# more details.
from . import _C

@torch.library.register_fake("extension_cpp::mymuladd")
def _(a, b, c):
    torch._check(a.shape == b.shape)
    torch._check(a.dtype == torch.float)
    torch._check(b.dtype == torch.float)
    torch._check(a.device == b.device)
    return torch.empty_like(a)

设置混合Python/C++注册

在本教程中，我们在C++中定义了一个自定义操作符，在C++中添加了CPU/CUDA实现，并在Python中添加了FakeTensor内核和反向公式。这些注册加载（或导入）的顺序很重要（以错误的顺序导入会导致错误）。

要使用带有混合Python/C++注册的自定义操作符，我们必须首先加载包含自定义操作符定义的C++库，然后调用torch.library注册API。这可以通过以下两种方式之一实现：

如果您正在按照本教程操作，导入我们创建的Python C扩展模块将加载C++自定义操作符定义。
如果你的C++自定义操作符位于共享库对象中，你也可以使用torch.ops.load_library("/path/to/library.so")来加载它。这是实现Python无关性的推荐路径，因为你不需要导入Python C扩展模块。参见torchao __init__.py中的示例。

为操作符添加训练（自动梯度）支持

使用torch.library.register_autograd为操作符添加训练支持。优先使用此方法，而不是直接使用Python的torch.autograd.Function或C++的torch::autograd::Function；你必须以非常特定的方式使用这些方法，以避免无声的错误（更多详情请参见The Custom Operators Manual）。

def _backward(ctx, grad):
    a, b = ctx.saved_tensors
    grad_a, grad_b = None, None
    if ctx.needs_input_grad[0]:
        grad_a = grad * b
    if ctx.needs_input_grad[1]:
        grad_b = grad * a
    return grad_a, grad_b, None

def _setup_context(ctx, inputs, output):
    a, b, c = inputs
    saved_a, saved_b = None, None
    if ctx.needs_input_grad[0]:
        saved_b = b
    if ctx.needs_input_grad[1]:
        saved_a = a
    ctx.save_for_backward(saved_a, saved_b)

# This code adds training support for the operator. You must provide us
# the backward formula for the operator and a `setup_context` function
# to save values to be used in the backward.
torch.library.register_autograd(
    "extension_cpp::mymuladd", _backward, setup_context=_setup_context)

请注意，反向传播必须是PyTorch理解的运算符的组合。如果您希望在反向传播中使用另一个自定义的C++或CUDA内核，它必须被包装成一个自定义运算符。

如果我们有自己的自定义mymul内核，我们需要将其包装成一个自定义操作符，然后从反向调用它：

// New! a mymul_cpu kernel
at::Tensor mymul_cpu(const at::Tensor& a, const at::Tensor& b) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_CHECK(a.device().type() == at::DeviceType::CPU);
  TORCH_CHECK(b.device().type() == at::DeviceType::CPU);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = result.data_ptr<float>();
  for (int64_t i = 0; i < result.numel(); i++) {
    result_ptr[i] = a_ptr[i] * b_ptr[i];
  }
  return result;
}

TORCH_LIBRARY(extension_cpp, m) {
  m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
  // New! defining the mymul operator
  m.def("mymul(Tensor a, Tensor b) -> Tensor");
}


TORCH_LIBRARY_IMPL(extension_cpp, CPU, m) {
  m.impl("mymuladd", &mymuladd_cpu);
  // New! registering the cpu kernel for the mymul operator
  m.impl("mymul", &mymul_cpu);
}

def _backward(ctx, grad):
    a, b = ctx.saved_tensors
    grad_a, grad_b = None, None
    if ctx.needs_input_grad[0]:
        grad_a = torch.ops.extension_cpp.mymul.default(grad, b)
    if ctx.needs_input_grad[1]:
        grad_b = torch.ops.extension_cpp.mymul.default(grad, a)
    return grad_a, grad_b, None


def _setup_context(ctx, inputs, output):
    a, b, c = inputs
    saved_a, saved_b = None, None
    if ctx.needs_input_grad[0]:
        saved_b = b
    if ctx.needs_input_grad[1]:
        saved_a = a
    ctx.save_for_backward(saved_a, saved_b)


# This code adds training support for the operator. You must provide us
# the backward formula for the operator and a `setup_context` function
# to save values to be used in the backward.
torch.library.register_autograd(
    "extension_cpp::mymuladd", _backward, setup_context=_setup_context)

测试一个操作符

使用torch.library.opcheck来测试自定义操作是否正确注册。请注意，此函数不测试梯度在数学上是否正确 – 计划为此编写单独的测试，无论是手动测试还是使用 torch.autograd.gradcheck。

def sample_inputs(device, *, requires_grad=False):
    def make_tensor(*size):
        return torch.randn(size, device=device, requires_grad=requires_grad)

    def make_nondiff_tensor(*size):
        return torch.randn(size, device=device, requires_grad=False)

    return [
        [make_tensor(3), make_tensor(3), 1],
        [make_tensor(20), make_tensor(20), 3.14],
        [make_tensor(20), make_nondiff_tensor(20), -123],
        [make_nondiff_tensor(2, 3), make_tensor(2, 3), -0.3],
    ]

def reference_muladd(a, b, c):
    return a * b + c

samples = sample_inputs(device, requires_grad=True)
samples.extend(sample_inputs(device, requires_grad=False))
for args in samples:
    # Correctness test
    result = torch.ops.extension_cpp.mymuladd(*args)
    expected = reference_muladd(*args)
    torch.testing.assert_close(result, expected)

    # Use opcheck to check for incorrect usage of operator registration APIs
    torch.library.opcheck(torch.ops.extension_cpp.mymuladd.default, args)

创建可变操作符

您可能希望编写一个自定义操作符来改变其输入。使用Tensor(a!)在模式中指定每个可变的Tensor；否则，将会有未定义的行为。如果有多个可变的Tensor，请为每个可变的Tensor使用不同的名称（例如，Tensor(a!)，Tensor(b!)，Tensor(c!)）。

让我们编写一个myadd_out(a, b, out)操作符，它将a+b的内容写入out。

// An example of an operator that mutates one of its inputs.
void myadd_out_cpu(const at::Tensor& a, const at::Tensor& b, at::Tensor& out) {
  TORCH_CHECK(a.sizes() == b.sizes());
  TORCH_CHECK(b.sizes() == out.sizes());
  TORCH_CHECK(a.dtype() == at::kFloat);
  TORCH_CHECK(b.dtype() == at::kFloat);
  TORCH_CHECK(out.dtype() == at::kFloat);
  TORCH_CHECK(out.is_contiguous());
  TORCH_INTERNAL_ASSERT(a.device().type() == at::DeviceType::CPU);
  TORCH_INTERNAL_ASSERT(b.device().type() == at::DeviceType::CPU);
  TORCH_INTERNAL_ASSERT(out.device().type() == at::DeviceType::CPU);
  at::Tensor a_contig = a.contiguous();
  at::Tensor b_contig = b.contiguous();
  const float* a_ptr = a_contig.data_ptr<float>();
  const float* b_ptr = b_contig.data_ptr<float>();
  float* result_ptr = out.data_ptr<float>();
  for (int64_t i = 0; i < out.numel(); i++) {
    result_ptr[i] = a_ptr[i] + b_ptr[i];
  }
}

在定义操作符时，我们必须在模式中指定它会改变输出张量：

TORCH_LIBRARY(extension_cpp, m) {
  m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
  m.def("mymul(Tensor a, Tensor b) -> Tensor");
  // New!
  m.def("myadd_out(Tensor a, Tensor b, Tensor(a!) out) -> ()");
}

TORCH_LIBRARY_IMPL(extension_cpp, CPU, m) {
  m.impl("mymuladd", &mymuladd_cpu);
  m.impl("mymul", &mymul_cpu);
  // New!
  m.impl("myadd_out", &myadd_out_cpu);
}

注意

不要将任何变异的张量作为操作符的输出返回，因为这会导致与PyTorch子系统（如torch.compile）不兼容。

结论

在本教程中，我们介绍了将自定义C++和CUDA操作符与PyTorch集成的推荐方法。TORCH_LIBRARY/torch.library API相当底层。有关如何使用API的更多信息，请参阅The Custom Operators Manual。