随机数生成

Numba 提供了一个可以在 GPU 上执行的随机数生成算法。然而，由于 NVIDIA 实现 cuRAND 的技术问题，Numba 的 GPU 随机数生成器并非基于 cuRAND。相反，Numba 的 GPU RNG 是 xoroshiro128+ 算法的一个实现。xoroshiro128+ 算法的周期为 2**128 - 1，比 cuRAND 中默认使用的 XORWOW 算法的周期短，但 xoroshiro128+ 仍然通过了随机数生成器质量的 BigCrush 测试。

在使用GPU上的任何随机数生成器（RNG）时，确保每个线程都有自己的RNG状态，并且这些状态已初始化为生成不重叠的序列是非常重要的。numba.cuda.random模块提供了一个主机函数来完成此操作，以及获取均匀或正态分布随机数的CUDA设备函数。

备注

Numba（类似于 cuRAND）使用 Box-Muller 变换 <https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform> 从均匀生成器生成正态分布的随机数。然而，Box-Muller 生成一对随机数，当前的实现只返回其中之一。因此，生成正态分布值的速度是均匀分布值的一半。

numba.cuda.random.create_xoroshiro128p_states(n, seed, subsequence_start=0, stream=0)[源代码]

返回一个初始化的用于 n 个随机数生成器的新设备数组。

这将初始化 RNG 状态，使得数组中的每个状态对应于主序列中相隔 2**64 步的子序列。因此，只要没有 CUDA 线程请求超过 2**64 个随机数，此函数生成的所有 RNG 状态都保证是独立的。

subsequence_start 参数可以用来将第一个 RNG 状态提前 2**64 步的倍数。

参数:

n (int) – 要创建的 RNG 状态的数量
seed (uint64) – 生成器列表的起始种子
subsequence_start (uint64)
stream (CUDA stream) – stream 用于在初始化内核上运行

numba.cuda.random.init_xoroshiro128p_states(states, seed, subsequence_start=0, stream=0)[源代码]

在GPU上初始化并行生成器的RNG状态。

这将初始化 RNG 状态，使得数组中的每个状态对应于主序列中相隔 2**64 步的子序列。因此，只要没有 CUDA 线程请求超过 2**64 个随机数，此函数生成的所有 RNG 状态都保证是独立的。

subsequence_start 参数可以用来将第一个 RNG 状态提前 2**64 步的倍数。

参数:

states (1D DeviceNDArray, dtype=xoroshiro128p_dtype) – RNG 状态数组
seed (uint64) – 生成器列表的起始种子

numba.cuda.random.xoroshiro128p_normal_float32(states, index)[源代码]

返回一个正态分布的 float32 并推进 states[index]。

返回值是通过 Box-Muller 变换从均值为 0、标准差为 1 的高斯分布中抽取的。这使得随机数生成器序列前进了两步。

参数:

states (1D array, dtype=xoroshiro128p_dtype) – RNG 状态数组
index (int64) – 要更新的状态中的偏移量

返回类型:

float32

numba.cuda.random.xoroshiro128p_normal_float64(states, index)[源代码]

返回一个正态分布的 float32 并推进 states[index]。

返回值是通过 Box-Muller 变换从均值为 0、标准差为 1 的高斯分布中抽取的。这使得随机数生成器序列前进了两步。

参数:

states (1D array, dtype=xoroshiro128p_dtype) – RNG 状态数组
index (int64) – 要更新的状态中的偏移量

返回类型:

float64

numba.cuda.random.xoroshiro128p_uniform_float32(states, index)[源代码]

返回一个范围在 [0.0, 1.0) 内的 float32 并推进 states[index]。

参数:

states (1D array, dtype=xoroshiro128p_dtype) – RNG 状态数组
index (int64) – 要更新的状态中的偏移量

返回类型:

float32

numba.cuda.random.xoroshiro128p_uniform_float64(states, index)[源代码]

返回一个范围在 [0.0, 1.0) 内的 float64 并推进 states[index]。

参数:

states (1D array, dtype=xoroshiro128p_dtype) – RNG 状态数组
index (int64) – 要更新的状态中的偏移量

返回类型:

float64

一个简单的例子

以下是一个使用随机数生成器的示例程序:

from __future__ import print_function, absolute_import

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np

@cuda.jit
def compute_pi(rng_states, iterations, out):
    """Find the maximum value in values and store in result[0]"""
    thread_id = cuda.grid(1)

    # Compute pi by drawing random (x, y) points and finding what
    # fraction lie inside a unit circle
    inside = 0
    for i in range(iterations):
        x = xoroshiro128p_uniform_float32(rng_states, thread_id)
        y = xoroshiro128p_uniform_float32(rng_states, thread_id)
        if x**2 + y**2 <= 1.0:
            inside += 1

    out[thread_id] = 4.0 * inside / iterations

threads_per_block = 64
blocks = 24
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)

compute_pi[blocks, threads_per_block](rng_states, 10000, out)
print('pi:', out.mean())

管理RNG状态大小和使用3D网格的示例

RNG 状态的数量随着使用 RNG 的线程数量而增加，因此通常最好在使用 RNG 时结合使用跨步循环，以保持状态大小可管理。

在下面的示例中，初始化一个包含随机数的大型3D数组时，如果每个输出元素使用一个线程，将导致453,617,100个RNG状态。这将花费很长时间来初始化，并且无法充分利用GPU。相反，它使用一个固定大小的3D网格，总共包含2,097,152个（(16 ** 3) * (8 ** 3)）线程，这些线程在输出数组上进行跨步操作。3D线程索引``startx``、starty``和``startz``被线性化为一个1D索引``tid，以索引到2,097,152个RNG状态中。

从 test_ex_3d_grid 的 numba/cuda/tests/doc_example/test_random.py

from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)
import numpy as np

@cuda.jit
def random_3d(arr, rng_states):
    # Per-dimension thread indices and strides
    startx, starty, startz = cuda.grid(3)
    stridex, stridey, stridez = cuda.gridsize(3)

    # Linearized thread index
    tid = (startz * stridey * stridex) + (starty * stridex) + startx

    # Use strided loops over the array to assign a random value to each entry
    for i in range(startz, arr.shape[0], stridez):
        for j in range(starty, arr.shape[1], stridey):
            for k in range(startx, arr.shape[2], stridex):
                arr[i, j, k] = xoroshiro128p_uniform_float32(rng_states, tid)

# Array dimensions
X, Y, Z = 701, 900, 719

# Block and grid dimensions
bx, by, bz = 8, 8, 8
gx, gy, gz = 16, 16, 16

# Total number of threads
nthreads = bx * by * bz * gx * gy * gz

# Initialize a state for each thread
rng_states = create_xoroshiro128p_states(nthreads, seed=1)

# Generate random numbers
arr = cuda.device_array((X, Y, Z), dtype=np.float32)
random_3d[(gx, gy, gz), (bx, by, bz)](arr, rng_states)