2019 Dask User Survey Results


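This notebook presents the results of the 2019 Dask User Survey, which ran earlier this summer. Thanks to everyone who took the time to fill out the survey! The results help us better understand the Dask community and will guide future development efforts.

The raw data, as well as the start of an analysis, can be found in the following binder:

Let us know if you find any problems in the data.

Highlights

We received 259 responses. Overall, respondents care most about improved documentation, ease of use (including ease of deployment), and scaling. Although Dask brings together many different communities (big arrays versus big dataframes, traditional HPC users versus cloud-native resource managers), there was broad agreement on what matters most for Dask.

Now we'll go through individual survey questions, highlighting particularly interesting results.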

[ ]:

%matplotlib inline

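import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import textwrap
import re

# Answer options kept when cross-tabulating API usage against cluster managers
api_choices = ['Array', 'Bag', 'DataFrame', 'Delayed', 'Futures', 'ML', 'Xarray']
cluster_manager_choices = [
    "SSH",
    "Kubernetes",
    "HPC",
    "My workplace has a custom solution for this",
    "Hadoop / Yarn / EMR",
]

def shorten(label):
    # Truncate long answer labels so they fit on a plot axis
    return textwrap.shorten(label, 50)


def fmt_percent(ax):
    # Values are already on a 0-100 scale; a formatter labels ticks like "25.00%"
    # without pinning tick labels to positions that may later change
    ax.xaxis.set_major_formatter(PercentFormatter(decimals=2))
    sns.despine()
    return ax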


# Load the responses; fold the joke "even when I sleep" answer into "Every day"
df = (
    pd.read_csv("data/2019-user-survey-results.csv.gz", parse_dates=['Timestamp'])
      .replace({"How often do you use Dask?": "I use Dask all the time, even when I sleep"}, "Every day")
)

How do you use Dask?

For learning resources, almost every respondent uses the documentation.

[ ]:

ax = (
    df['What Dask resources have you used for support in the last six months?']
    .str.split(";").explode()
    .value_counts().head(6)
    .div(len(df)).mul(100).plot.barh()
);
fmt_percent(ax).set(title="Support Resource Usage");
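
Several cells in this notebook rely on multi-select answers being stored as a single semicolon-separated string per respondent. As a minimal sketch of that pattern, using made-up answers rather than the survey data (the `responses` series and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy answers; None stands for a respondent who skipped the question.
responses = pd.Series(
    ["Documentation;GitHub", "Documentation", "Documentation;Tutorial;GitHub", None],
    name="resources",
)

# One row per (respondent, choice); the respondent's index label is repeated.
choices = responses.str.split(";").explode()

# Percentage of all respondents, including those who skipped, picking each choice.
percent = choices.value_counts().div(len(responses)).mul(100)
print(percent)
```

Because one respondent can select several options, these percentages can add up to more than 100.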

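Most respondents use Dask at least occasionally. Fortunately, a decent number of respondents who are just looking into Dask still took the time to complete the survey.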

[ ]:

usage_order = [
    'Every day',
    'Occasionally',
    'Just looking for now',
]
ax = df['How often do you use Dask?'].value_counts().loc[usage_order].div(len(df)).mul(100).plot.barh()
fmt_percent(ax).set(title="How often do you use Dask?");

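I was curious how the use of learning resources changes as users become more experienced. We might expect those just getting started with Dask to begin with examples.dask.org, where they can try Dask without installing anything.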

[ ]:

resources = df['What Dask resources have you used for support in the last six months?'].str.split(";").explode()
top = resources.value_counts().head(6).index
resources = resources[resources.isin(top)]

m = (
    pd.merge(df[['How often do you use Dask?']], resources, left_index=True, right_index=True)
      .replace(re.compile("GitHub.*"), "GitHub")
)

fig, ax = plt.subplots(figsize=(10, 10))

sns.countplot(hue="What Dask resources have you used for support in the last six months?",
              y='How often do you use Dask?',
              order=usage_order,
              data=m, ax=ax)
sns.despine()

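Overall, documentation is still the leader across user groups.

Usage of the Dask tutorial and the dask examples is relatively consistent across groups. The main difference between regular and new users is that regular users are more likely to engage on GitHub.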
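
The grouped chart above depends on an index-aligned merge: exploding a multi-select column repeats each respondent's index label, so joining on the index pairs every selected resource with that respondent's usage frequency. A small sketch with a hypothetical two-respondent table (the column names `usage` and `resources` are invented for brevity):

```python
import pandas as pd

# Hypothetical two-respondent survey, not the real data.
survey = pd.DataFrame({
    "usage": ["Every day", "Occasionally"],
    "resources": ["Documentation;GitHub", "Documentation"],
})

exploded = survey["resources"].str.split(";").explode()

# Joining on the index repeats each respondent's usage once per selected resource.
merged = pd.merge(survey[["usage"]], exploded, left_index=True, right_index=True)

# Count (usage, resource) pairs, the quantity a grouped count plot displays.
pair_counts = merged.groupby("usage")["resources"].value_counts()
print(pair_counts)
```

A respondent who selected several resources contributes several rows, so the bars count selections rather than people.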

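From StackOverflow questions and GitHub issues, we have a vague idea of which parts of the library are used. The survey shows that (at least among our respondents) DataFrame and Delayed are the most commonly used APIs.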

[ ]:

api_counts = (
    df['Dask APIs'].str.split(";").explode().value_counts()
    .div(len(df)).mul(100)
)
ax = api_counts.sort_values().nlargest(8).plot.barh()
fmt_percent(ax).set(xlabel="Percent")
sns.despine();

[ ]:

print("About {:0.2%} of our respondents are using Dask on a Cluster.".format(df['Local machine or Cluster?'].str.contains("Cluster").mean()))

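But most respondents also use Dask on their laptop. This highlights the importance of Dask scaling down, either for prototyping with a LocalCluster, or for out-of-core analysis using LocalCluster or one of the single-machine schedulers.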

[ ]:

order = [
    'Personal laptop',
    'Large workstation',
    'Cluster of 2-10 machines',
    'Cluster with 10-100 machines',
    'Cluster with 100+ machines'
]
df['Local machine or Cluster?'].str.split(";").explode().value_counts().loc[order].plot.barh();
sns.despine()
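
One detail worth keeping in mind for the cluster share printed earlier is how skipped answers enter the denominator. The sketch below uses made-up answers and shows both conventions; which one suits the survey is a judgment call, and the variable names are ours:

```python
import pandas as pd

# Hypothetical answers; None stands for a respondent who skipped the question.
where = pd.Series([
    "Personal laptop",
    "Personal laptop;Cluster of 2-10 machines",
    "Cluster with 100+ machines",
    None,
])

# Share among respondents who answered the question.
share_of_answered = where.dropna().str.contains("Cluster").mean()

# Share among all respondents, counting a skipped answer as "not on a cluster".
share_of_everyone = where.str.contains("Cluster", na=False).mean()

print("{:0.2%} of those who answered".format(share_of_answered))
print("{:0.2%} of all respondents".format(share_of_everyone))
```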

Most respondents use Dask interactively at least some of the time.

[ ]:

mapper = {
    "Interactive:  I use Dask with Jupyter or IPython when playing with data;Batch: I submit scripts that run in the future": "Both",
    "Interactive:  I use Dask with Jupyter or IPython when playing with data": "Interactive",
    "Batch: I submit scripts that run in the future": "Batch",

}

ax = df["Interactive or Batch?"].map(mapper).value_counts().div(len(df)).mul(100).plot.barh()
sns.despine()
fmt_percent(ax)
ax.set(title='Interactive or Batch?');

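Most respondents think that more documentation and examples would be the most valuable improvement to the project. This is especially pronounced among new users. But even among those who use Dask every day, more people rate "more examples" as more valuable than "new features" or "performance improvements".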

[ ]:

help_by_use = (
    df.groupby("How often do you use Dask?")['Which would help you most right now?']
    .value_counts()
    .unstack()
)

s = (
    help_by_use
        .style
        # axis="columns" shades within each row, which is what the caption describes
        .background_gradient(axis="columns")
        .set_caption("Normalized by row. Darker means that a higher proportion of "
                     "users with that usage frequency prefer that priority.")
)
s
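
The table above holds counts, and the usage groups differ in size, so it can help to look at each row as proportions of that row's total. A minimal sketch on invented counts (the numbers and the `row_share` name are illustrative assumptions, not survey results):

```python
import pandas as pd

# Hypothetical counts: rows are usage groups, columns are priorities.
counts = pd.DataFrame(
    {"More examples": [30, 60], "New features": [10, 20], "Performance": [10, 40]},
    index=["Every day", "Occasionally"],
)

# Divide each row by its own total so that every row sums to 1.
row_share = counts.div(counts.sum(axis=1), axis=0)
print(row_share.round(2))
```

Dividing along `axis=0` aligns the row totals with the row labels, so each usage group is scaled by its own size.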

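Perhaps users of certain Dask APIs feel differently from the group as a whole? We perform a similar analysis grouped by API usage rather than frequency of use.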

[ ]:

help_by_api = (
    pd.merge(
        df['Dask APIs'].str.split(';').explode(),
        df['Which would help you most right now?'],
        left_index=True, right_index=True)
    .groupby('Which would help you most right now?')['Dask APIs'].value_counts()
    .unstack(fill_value=0).T
    .loc[['Array', 'Bag', 'DataFrame', 'Delayed', 'Futures', 'ML', 'Xarray']]

)
(
    help_by_api
        .style
        .background_gradient(axis="columns")
        .set_caption("Normalized by row. Darker means that a higher proportion of "
                     "users of that API prefer that priority.")
)

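Nothing really stands out. Futures users (who we expect to be relatively advanced) may prioritize features and performance over documentation. But everyone agrees that more examples are the highest priority.

Common Feature Requests

For specific features, we listed some things that we (as developers) thought might be important.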

[ ]:

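# Keep only the bracketed option from each "question [option]" column name.
# Splitting on "[" avoids str.lstrip, which strips a set of characters, not a prefix.
common = (df[df.columns[df.columns.str.startswith("What common feature")]]
          .rename(columns=lambda x: x.split("[", 1)[-1].rstrip("]")))

# Count answers per question, then reshape to one row per (question, importance)
counts = (
    common.apply(lambda s: s.value_counts())
    .T.stack().reset_index()
    .rename(columns={'level_0': 'Question', 'level_1': "Importance", 0: "count"})
)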

order = ["Not relevant for me", "Somewhat useful", 'Critical to me']
g = (
    sns.FacetGrid(counts, col="Question", col_wrap=2, aspect=1.5, sharex=False, height=3)
    .map(sns.barplot, "Importance", "count", order=order)
)

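The clearest standout is how many people consider "Better NumPy/Pandas support" to be "most critical". In hindsight, a follow-up fill-in field would have helped us understand what each respondent meant by that. The most parsimonious interpretation is "cover more of the NumPy / pandas API".

"Ease of deployment" had a high proportion of "critical to me". Again in hindsight, I notice a bit of ambiguity. Does this mean people want Dask to be easier to deploy? Or does it mean that Dask, which they currently find easy to deploy, is critically important? Regardless, we can prioritize simplicity in deployment.

Relatively few respondents care about things like "Managing many users", though we expect this would be relatively popular among system administrators, who are a smaller population.

And of course, we have people pushing Dask to its limits, for whom "Improving scaling" is critically important.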
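
A note on trimming the shared question text from the feature-request column names: `str.lstrip` treats its argument as a set of characters to remove, not as a prefix, so it can eat into the option label itself. The sketch below uses an invented option, "Windows support", which is not one of the survey's options, to show the difference:

```python
# Hypothetical column name in the "question [option]" form; "Windows support"
# is an invented option used only to trigger the pitfall.
prefix = "What common feature requests do you care about most?"
column = prefix + " [Windows support]"

# lstrip removes any leading characters found in its argument, so it keeps
# going past the "[" and also removes the "W" of the option label.
stripped = column.lstrip(prefix + "[").rstrip("]")

# Splitting on the first "[" keeps the option label intact.
split = column.split("[", 1)[-1].rstrip("]")

print(repr(stripped))
print(repr(split))
```

Labels whose first character does not occur in the question text are unaffected, which is why the pitfall is easy to miss.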

What other systems do you use?

A relatively high percentage of respondents use Python 3 (97%, compared to 84% in the most recent Python Developers Survey).

[ ]:

df['Python 2 or 3?'].dropna().astype(int).value_counts(normalize=True).apply("{:0.2%}".format)

We were a bit surprised to find that SSH is the most popular "cluster resource manager".

[ ]:

df['If you use a cluster, how do you launch Dask? '].dropna().str.split(";").explode().value_counts().head(6)

How does the choice of cluster resource manager compare with API usage?

[ ]:

managers = (
    df['If you use a cluster, how do you launch Dask? '].str.split(";").explode().dropna()
        .replace(re.compile("HPC.*"), "HPC")
    .loc[lambda x: x.isin(cluster_manager_choices)]
)

apis = (
    df['Dask APIs'].str.split(";").explode().dropna()
    .loc[lambda x: x.isin(api_choices)]
)
wm = pd.merge(apis, managers, left_index=True, right_index=True).replace("My workplace has a custom solution for this", "Custom")

x = wm.groupby("Dask APIs")["If you use a cluster, how do you launch Dask? "].value_counts().unstack().T
x.style.background_gradient(axis="columns")

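HPC users are relatively heavy users of dask.array and xarray.

Somewhat surprisingly, Dask's heaviest users find Dask stable enough. Perhaps they've pushed past the bugs and found workarounds (the chart below shows raw response counts per usage group).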

[ ]:

fig, ax = plt.subplots(figsize=(9, 6))
sns.countplot(x="How often do you use Dask?", hue="Is Dask stable enough for you?", data=df, ax=ax,
              order=usage_order[::-1]);  # a reversed list; reversed() would give a one-shot iterator
sns.despine()
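
The count plot above shows raw counts, so larger usage groups dominate visually. A row-normalized view, where each usage group sums to 100%, can be sketched with `pd.crosstab`. The data here is invented and the short column names `usage` and `stable` are stand-ins for the real survey questions:

```python
import pandas as pd

# Hypothetical responses, not the survey data.
toy = pd.DataFrame({
    "usage": ["Every day", "Every day", "Occasionally",
              "Occasionally", "Occasionally", "Occasionally"],
    "stable": ["Yes", "Yes", "Yes", "No", "No", "Yes"],
})

# normalize="index" divides each row by its total, giving per-group shares.
stability_share = pd.crosstab(toy["usage"], toy["stable"], normalize="index").mul(100)
print(stability_share.round(1))
```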

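Takeaways

We should prioritize improving and expanding our documentation and examples. This may be accomplished by Dask maintainers seeking examples from the community. Many of the examples on https://examples.dask.org were developed by domain specialists who use Dask.
Improved scaling to larger problems is important, but we shouldn't sacrifice the single-machine use case to get there.
Both interactive and batch workflows are important.
Dask's various sub-communities are more similar than they are different.

Thanks again to all the respondents. We look forward to repeating this process to identify trends over time.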
