Using the Ray Debugger

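Ray has a built-in debugger for distributed applications. You can set breakpoints in Ray tasks and actors; when a breakpoint is hit, you can drop into a PDB session and use it to:

Inspect variables in that context
Step within that task or actor
Move up or down the stack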

Warning

The Ray Debugger is deprecated. Use the Ray Distributed Debugger instead.

Getting Started

Note

On Python 3.6, the breakpoint() function is not supported; use ray.util.pdb.set_trace() instead.

Take the following example:

import ray

@ray.remote
def f(x):
    breakpoint()
    return x * x

futures = [f.remote(i) for i in range(2)]
print(ray.get(futures))

Put the program into a file named debugging.py and execute it with:

python debugging.py

Each task stops at the breakpoint when it reaches the breakpoint() line. To attach to the debugger, run the following command on the head node of the cluster:

ray debug

The ray debug command prints output like this:

2021-07-13 16:30:40,112     INFO scripts.py:216 -- Connecting to Ray instance at 192.168.2.61:6379.
2021-07-13 16:30:40,112     INFO worker.py:740 -- Connecting to existing Ray cluster at address: 192.168.2.61:6379
Active breakpoints:
index | timestamp           | Ray task | filename:lineno
0     | 2021-07-13 23:30:37 | ray::f() | debugging.py:6
1     | 2021-07-13 23:30:37 | ray::f() | debugging.py:6
Enter breakpoint index or press enter to refresh:

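Enter 0 and press Enter to jump to the first breakpoint. You are dropped into PDB at the breakpoint, where help lists the available commands. Run bt to see a backtrace of the execution: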

(Pdb) bt
  /home/ubuntu/ray/python/ray/workers/default_worker.py(170)<module>()
-> ray.worker.global_worker.main_loop()
  /home/ubuntu/ray/python/ray/worker.py(385)main_loop()
-> self.core_worker.run_task_loop()
> /home/ubuntu/tmp/debugging.py(7)f()
-> return x * x

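You can inspect the value of x with print(x). Use ll to view the current source code, and up and down to change stack frames. For now, continue execution with c.

After execution resumes, press Control + D to return to the list of breakpoints. Select the other breakpoint and press c again to continue execution.

The Ray program debugging.py has now finished and should have printed [0, 1]. Congratulations, you have finished your first Ray debugging session!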

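Running on a Cluster

The Ray debugger supports setting breakpoints inside tasks and actors that run across a Ray cluster. To attach to these breakpoints from the head node with ray debug, make sure to pass the --ray-debugger-external flag to ray start when starting the cluster (likely in your cluster.yaml file or your k8s Ray cluster spec).

Note that this flag causes the workers to listen for PDB commands on an external-facing IP address, so use it *only* if your cluster is behind a firewall.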
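As a minimal sketch of where the flag goes, the helper below assembles the ray start command line for a head node or a worker node. The helper name ray_start_command and its parameters are hypothetical, the address is taken from the sample log above, and the sketch assumes the flag is passed on every node's ray start invocation. It only builds the command string and does not start Ray.

```python
import shlex


def ray_start_command(head_address=None, debugger_external=True):
    """Build a `ray start` command line (hypothetical helper, not part of Ray).

    With head_address=None the command starts a head node; otherwise it
    starts a worker node that joins the cluster at head_address.
    """
    args = ["ray", "start"]
    if head_address is None:
        args.append("--head")
    else:
        args.append(f"--address={head_address}")
    if debugger_external:
        # Only enable this when the cluster sits behind a firewall.
        args.append("--ray-debugger-external")
    return shlex.join(args)


head_cmd = ray_start_command()
worker_cmd = ray_start_command("192.168.2.61:6379")
print(head_cmd)
print(worker_cmd)
```

The resulting strings can be pasted into the start commands of a cluster configuration; where exactly those commands live depends on how the cluster is launched.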

Debugger Commands

The Ray debugger supports the same commands as PDB.
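For readers less familiar with those commands, here is a small sketch that drives the standard-library pdb with scripted input, so it runs without a terminal or a Ray cluster. It uses plain pdb rather than the Ray debugger; the function square and the command script are made up for illustration.

```python
import io
import pdb


def square(x):
    result = x * x
    return result


# Scripted debugger input: print x, step over one line, print result, continue.
commands = io.StringIO("p x\nnext\np result\ncontinue\n")
transcript = io.StringIO()

# nosigint and readrc=False keep the session independent of the local setup.
debugger = pdb.Pdb(stdin=commands, stdout=transcript, nosigint=True, readrc=False)
value = debugger.runcall(square, 3)

print(transcript.getvalue())  # the recorded (Pdb) session
print("returned:", value)
```

The same p, next, and continue commands (along with bt, ll, up, and down) can be typed at the prompt after attaching with ray debug.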

Stepping between Ray tasks

You can use the debugger to step between Ray tasks. Take the following recursive function as an example:

import ray

@ray.remote
def fact(n):
    if n == 1:
        return n
    else:
        n_ref = fact.remote(n - 1)
        return n * ray.get(n_ref)

@ray.remote
def compute():
    breakpoint()
    result_ref = fact.remote(5)
    result = ray.get(result_ref)

ray.get(compute.remote())

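After running the program by executing the Python file and calling ray debug, select the breakpoint by pressing 0 and then Enter. This produces the following output: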

Enter breakpoint index or press enter to refresh: 0
> /home/ubuntu/tmp/stepping.py(16)<module>()
-> result_ref = fact.remote(5)
(Pdb)

Use the remote command in the Ray debugger to step into the call. Inside the function, print the value of n with p(n), which gives the following output:

-> result_ref = fact.remote(5)
(Pdb) remote
*** Connection closed by remote host ***
Continuing pdb session in different process...
--Call--
> /home/ubuntu/tmp/stepping.py(5)fact()
-> @ray.remote
(Pdb) ll
  5  ->     @ray.remote
  6         def fact(n):
  7             if n == 1:
  8                 return n
  9             else:
 10                 n_ref = fact.remote(n - 1)
 11                 return n * ray.get(n_ref)
(Pdb) p(n)
5
(Pdb)

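Now step into the next remote call by running remote again, and print n. From here you can either keep recursing into the function by calling remote a few more times, or jump to the location where ray.get is called on the result with the get debugger command. Use get again to jump back to the original call site, and print the result with p(result):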

Enter breakpoint index or press enter to refresh: 0
> /home/ubuntu/tmp/stepping.py(14)<module>()
-> result_ref = fact.remote(5)
(Pdb) remote
*** Connection closed by remote host ***
Continuing pdb session in different process...
--Call--
> /home/ubuntu/tmp/stepping.py(5)fact()
-> @ray.remote
(Pdb) p(n)
5
(Pdb) remote
*** Connection closed by remote host ***
Continuing pdb session in different process...
--Call--
> /home/ubuntu/tmp/stepping.py(5)fact()
-> @ray.remote
(Pdb) p(n)
4
(Pdb) get
*** Connection closed by remote host ***
Continuing pdb session in different process...
--Return--
> /home/ubuntu/tmp/stepping.py(5)fact()->120
-> @ray.remote
(Pdb) get
*** Connection closed by remote host ***
Continuing pdb session in different process...
--Return--
> /home/ubuntu/tmp/stepping.py(14)<module>()->None
-> result_ref = fact.remote(5)
(Pdb) p(result)
120
(Pdb)
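To see why the session prints 5, then 4, and finally 120, it can help to trace the same recursion without Ray. The sketch below is a plain-Python stand-in for the remote fact task; the calls list is added purely for illustration and records the argument of each nested call, which corresponds to what p(n) shows after each remote step.

```python
calls = []


def fact(n):
    # Record the argument, mirroring what p(n) shows after each `remote` step.
    calls.append(n)
    if n == 1:
        return n
    return n * fact(n - 1)


result = fact(5)
print(calls)   # [5, 4, 3, 2, 1]
print(result)  # 120
```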

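Post-Mortem Debugging

Often we do not know in advance where an error happens, so we cannot set a breakpoint. In these cases, we can automatically drop into the debugger when an error occurs or an exception is thrown. This is called post-mortem debugging.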
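The idea can be sketched with the standard-library pdb alone. The example below is not how Ray implements RAY_PDB; it catches an exception, opens a debugger on its traceback with scripted input, and inspects the frame where the error was raised. The function to_floats and the command script are hypothetical, and the reset/interaction calls mirror what pdb.post_mortem does internally.

```python
import io
import pdb


def to_floats(vector):
    floats = []
    for item in vector:
        floats.append(float(item))  # raises ValueError for "a"
    return floats


transcript = io.StringIO()
try:
    to_floats([1.2, 1.0, 1.1, "a"])
except ValueError as exc:
    # Scripted debugger input: print the failing item, then the whole vector.
    commands = io.StringIO("p item\np vector\nquit\n")
    debugger = pdb.Pdb(stdin=commands, stdout=transcript, nosigint=True, readrc=False)
    debugger.reset()
    # Start the session in the innermost frame of the traceback.
    debugger.interaction(None, exc.__traceback__)

print(transcript.getvalue())
```

The local variables are still available after the failure, which is what makes this kind of debugging useful when the failing input is not known in advance.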

We will show how this works using a Ray Serve application. To get started, install the required dependencies:

pip install "ray[serve]" scikit-learn

Next, copy the following code into a file called serve_debugging.py:

import time

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

import ray
from ray import serve

serve.start()

# Train model
iris_dataset = load_iris()
model = GradientBoostingClassifier()
model.fit(iris_dataset["data"], iris_dataset["target"])

# Define the Ray Serve model.
@serve.deployment
class BoostingModel:
    def __init__(self):
        self.model = model
        self.label_list = iris_dataset["target_names"].tolist()

    async def __call__(self, starlette_request):
        payload = (await starlette_request.json())["vector"]
        print(f"Worker: received request with data: {payload}")

        prediction = self.model.predict([payload])[0]
        human_name = self.label_list[prediction]
        return {"result": human_name}

# Deploy model
serve.run(BoostingModel.bind(), route_prefix="/iris")

time.sleep(3600.0)

Let's start the program with post-mortem debugging activated (RAY_PDB=1):

RAY_PDB=1 python serve_debugging.py

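With RAY_PDB=1 set, Ray drops into the debugger when an exception occurs instead of propagating it further. Let's see how this works! First, query the model with an invalid request: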

python -c 'import requests; response = requests.get("http://localhost:8000/iris", json={"vector": [1.2, 1.0, 1.1, "a"]})'

When the serve_debugging.py driver hits the breakpoint, it tells you to run ray debug. After doing so, you see output like the following:

Active breakpoints:
index | timestamp           | Ray task                                     | filename:lineno
0     | 2021-07-13 23:49:14 | ray::RayServeWrappedReplica.handle_request() | /home/ubuntu/ray/python/ray/serve/backend_worker.py:249
Traceback (most recent call last):

  File "/home/ubuntu/ray/python/ray/serve/backend_worker.py", line 242, in invoke_single
    result = await method_to_call(*args, **kwargs)

  File "serve_debugging.py", line 24, in __call__
    prediction = self.model.predict([payload])[0]

  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/_gb.py", line 1188, in predict
    raw_predictions = self.decision_function(X)

  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/_gb.py", line 1143, in decision_function
    X = check_array(X, dtype=DTYPE, order="C", accept_sparse='csr')

  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)

  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 673, in check_array
    array = np.asarray(array, order=order, dtype=dtype)

  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)

ValueError: could not convert string to float: 'a'

Enter breakpoint index or press enter to refresh:

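Now press 0 and then Enter to enter the debugger. With ll you can see the context, and with print(a) you can print the array that causes the problem. As you can see, it contains a string ('a') instead of a number as the last element.

In a similar manner as above, you can also debug Ray actors. Happy debugging!

Debugging APIs

See Debugging.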