2020 Dask User Survey Results


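This notebook presents the results of the 2020 Dask User Survey, which ran earlier this summer. Thanks to everyone who took the time to fill out the survey! These results help us better understand the Dask community and will guide future development efforts.

The raw data, as well as the start of an analysis, can be found in the accompanying binder.

Let us know if you find anything in the data.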

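Highlights

We had 240 responses to the survey (slightly fewer than the 260 we received last year). Overall, the results were similar to last year's. The biggest shift in the community is a stronger demand for better performance.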

[ ]:

%matplotlib inline

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import textwrap
import re


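# 2019 responses; fold the long-form answer into "Every day".
df2019 = (
    pd.read_csv("data/2019-user-survey-results.csv.gz", parse_dates=["Timestamp"])
      .replace({"How often do you use Dask?": "I use Dask all the time, even when I sleep"}, "Every day")
)

# 2020 responses; the timestamps use a custom format, so parse them explicitly.
df2020 = (
    pd.read_csv("data/2020-user-survey-results.csv.gz")
      .assign(Timestamp=lambda df: pd.to_datetime(df['Timestamp'], format="%Y/%m/%d %H:%M:%S %p %Z").astype('datetime64[ns]'))
      .replace({"How often do you use Dask?": "I use Dask all the time, even when I sleep"}, "Every day")
)
df2020.head()

# Questions shared by both surveys, plus those added or dropped in 2020.
# Index.intersection replaces the deprecated `&` set operation on Index objects.
common = df2019.columns.intersection(df2020.columns)
added = df2020.columns.difference(df2019.columns)
dropped = df2019.columns.difference(df2020.columns)

# Stack both years and index by (Year, Timestamp) so one year can be selected with df.loc[year].
df = pd.concat([df2019, df2020])
df['Year'] = df.Timestamp.dt.year
df = df.set_index(['Year', 'Timestamp']).sort_index()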
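
The later cells select one year's answers with `df.loc[year, column]`, which relies on the `(Year, Timestamp)` index built above. The following is a minimal sketch of that pattern on made-up data; the `toy*` frames and their values are hypothetical and are not taken from the survey files.

```python
import pandas as pd

# Hypothetical toy responses standing in for the two survey files.
toy2019 = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2019-06-01 09:00", "2019-06-02 10:30"]),
    "How often do you use Dask?": ["Every day", "Occasionally"],
})
toy2020 = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2020-07-01 08:15", "2020-07-03 14:00"]),
    "How often do you use Dask?": ["Occasionally", "Just looking for now"],
})

# Same steps as the notebook: concatenate, derive Year, index by (Year, Timestamp).
toy = pd.concat([toy2019, toy2020])
toy["Year"] = toy["Timestamp"].dt.year
toy = toy.set_index(["Year", "Timestamp"]).sort_index()

# Selecting on the first index level returns that year's answers,
# indexed by the remaining Timestamp level.
answers_2020 = toy.loc[2020, "How often do you use Dask?"]
print(answers_2020)
```

Keeping `Year` as the outer index level also lets it serve as the `hue` column in the plots after a `reset_index()`.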

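Most of the questions are the same as in 2019. We added a couple of questions about deployment and dashboard usage. Let's look at those first.

Among respondents who use a Dask package to deploy a cluster (about 53% of respondents), there is a wide spread of methods.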

[ ]:

k = 'Do you use Dask projects to deploy?'
# Answers are multi-select and separated by ";": split them into one row per choice.
d = df2020[k].dropna().str.split(";").explode()
# Keep only the choices that were selected more than 10 times.
top = d.value_counts()
top = top[top > 10].index
sns.countplot(y=k, data=d[d.isin(top)].to_frame(), order=top);
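
Several questions in this survey allow multiple selections, stored as one `;`-separated string per respondent. As a rough illustration of how such answers are tallied, here is a small sketch on invented data; the answer strings are hypothetical examples, not survey responses.

```python
import pandas as pd

# Hypothetical multi-select answers; None marks a respondent who skipped the question.
answers = pd.Series([
    "dask-jobqueue;dask-kubernetes",
    "dask-kubernetes",
    None,
    "dask-yarn;dask-kubernetes",
])

# Drop skipped answers, split each string into a list, and emit one row per selection.
selected = answers.dropna().str.split(";").explode()

# Tally how often each option was selected.
counts = selected.value_counts()
print(counts)
```

Because one respondent can contribute several rows after `explode`, the counts sum to the number of selections rather than the number of respondents.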

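Most people access the dashboard through a web browser. Those not using the dashboard are likely (hopefully) just using Dask on a single machine with the threaded scheduler (though the dashboard works fine on a single machine as well).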

[ ]:

k = "How do you view Dask's dashboard?"
sns.countplot(y=k, data=df2020[k].dropna().str.split(";").explode().to_frame());

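Learning materials for Dask are fairly similar to last year. The most notable differences come from our survey form offering more options (our YouTube channel and "Gitter chat"). Other than that, https://examples.dask.org might be relatively more popular.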

[ ]:

k = 'What Dask resources have you used for support in the last six months?'

resource_map = {
    "Tutorial": "Tutorial at tutorial.dask.org",
    "YouTube": "YouTube channel",
    "gitter": "Gitter chat"
}

d = df[k].str.split(';').explode().replace(resource_map)
top = d.value_counts()[:8].index
d = d[d.isin(top)]

fig, ax = plt.subplots(figsize=(8, 8))
ax = sns.countplot(y=k, hue="Year", data=d.reset_index(), ax=ax);
ax.set(ylabel="", title=k);

How do you use Dask?

Just like last year, we'll look at resource usage grouped by how often respondents use Dask.

[ ]:

resource_palette = (
    df['What Dask resources have you used for support in the last six months?']
      .str.split(";")
      .explode()
      .replace(resource_map)
      .replace(re.compile("GitHub.*"), "GitHub")
      .value_counts()[:6]
      .index
)
resource_palette = dict(zip(resource_palette, sns.color_palette("Paired")))

usage_order = ['Every day', 'Occasionally', 'Just looking for now']

def resource_plot(df, year, ax):
    resources = (
        df.loc[year, 'What Dask resources have you used for support in the last six months?']
          .str.split(";")
          .explode()
          .replace(resource_map)
    )
    top = resources.value_counts().head(6).index
    resources = resources[resources.isin(top)]

    m = (
        pd.merge(df.loc[year, ['How often do you use Dask?']], resources, left_index=True, right_index=True)
          .replace(re.compile("GitHub.*"), "GitHub")
    )

    ax = sns.countplot(hue="What Dask resources have you used for support in the last six months?",
                  y='How often do you use Dask?',
                  order=usage_order,
                  data=m, ax=ax,
                  hue_order=list(resource_palette),
                  palette=resource_palette)
    sns.despine()
    return ax

fig, axes = plt.subplots(figsize=(20, 10), ncols=2)
ax1 = resource_plot(df, 2019, axes[0])
ax2 = resource_plot(df, 2020, axes[1])
ax1.set_title("2019")
ax2.set_title("2020");

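A few observations:

GitHub issues are relatively less popular among moderate and heavy Dask users, which perhaps reflects better documentation or stability (assuming people go to the issue tracker when they can't find the answer in the docs or when they hit a bug).
https://examples.dask.org is now notably more popular among occasional users.
In response to last year's survey, we invested time in improving https://tutorial.dask.org, which we previously felt was lacking. Its usage is about the same as last year (when it was already very popular), so it's unclear whether we should focus more attention there.

API usage is about the same as last year (recall that about 20 fewer people took the survey this year and that people can select multiple options, so relative differences are the most interesting). We added new choices for RAPIDS, Prefect, and XGBoost, each of which is somewhat popular (in the neighborhood of dask.Bag).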

[ ]:

apis = df['Dask APIs'].str.split(";").explode()
top = apis.value_counts().loc[lambda x: x > 10]
apis = apis[apis.isin(top.index)].reset_index()

sns.countplot(y="Dask APIs", hue="Year", data=apis);
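
Since the two years had different numbers of respondents, raw counts are not directly comparable. One possible way to look at relative differences is to divide each year's counts by that year's number of respondents. The sketch below does this on invented data; the `toy` frame and its values are assumptions for illustration and this normalization is not part of the original analysis.

```python
import pandas as pd

# Hypothetical responses: 3 respondents in 2019 and 2 in 2020.
toy = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020, 2020],
    "Dask APIs": ["Array;DataFrame", "DataFrame", "Bag", "DataFrame;Bag", "Array"],
})

# Number of respondents per year (before splitting multi-select answers).
respondents = toy.groupby("Year").size()

# One row per (respondent, selected API).
selections = (
    toy.assign(**{"Dask APIs": toy["Dask APIs"].str.split(";")})
       .explode("Dask APIs")
)

# Selections per year and API, as a Year x API table.
counts = selections.groupby(["Year", "Dask APIs"]).size().unstack(fill_value=0)

# Share of each year's respondents who selected each API.
share = counts.div(respondents, axis=0)
print(share.round(2))
```

Rows of `share` can sum to more than 1 because a respondent may select several APIs.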

Just like last year, about 65% of our users are using Dask on a cluster at least some of the time.

[ ]:

df['Local machine or Cluster?'].dropna().str.contains("Cluster").astype(int).groupby("Year").mean()
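
The one-liner above works because the mean of a 0/1 indicator is a proportion. Here is a small sketch of the same computation on invented data; the index, the `respondent` level name, and the answers are hypothetical.

```python
import pandas as pd

# Hypothetical answers indexed by (Year, respondent); None marks a skipped question.
index = pd.MultiIndex.from_tuples(
    [(2019, 0), (2019, 1), (2020, 0), (2020, 1), (2020, 2), (2020, 3)],
    names=["Year", "respondent"],
)
where = pd.Series(
    [
        "Personal laptop",
        "Personal laptop;Cluster of 2-10 machines",
        "Cluster with 100+ machines",
        "Large workstation",
        None,
        "Personal laptop;Cluster with 10-100 machines",
    ],
    index=index,
)

# 1 if any selected option mentions a cluster, else 0; skipped answers are dropped first.
uses_cluster = where.dropna().str.contains("Cluster").astype(int)

# The per-year mean of the indicator is the share of respondents using a cluster.
share = uses_cluster.groupby(level="Year").mean()
print(share)
```

Dropping missing answers first means the share is relative to respondents who answered the question, not to everyone who took the survey.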

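But the majority of respondents also use Dask on their laptop. This highlights the importance of Dask scaling down, either for prototyping with a LocalCluster, or for out-of-core (larger-than-memory) analysis using a LocalCluster or one of the single-machine schedulers.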

[ ]:

order = [
    'Personal laptop',
    'Large workstation',
    'Cluster of 2-10 machines',
    'Cluster with 10-100 machines',
    'Cluster with 100+ machines'
]
d = df['Local machine or Cluster?'].str.split(";").explode().reset_index()
sns.countplot(y="Local machine or Cluster?", data=d, hue="Year", order=order);

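Just like last year, most respondents thought that more documentation and examples would be the most valuable improvement to the project.

One interesting change comes from looking at "Which would help you most right now?" split by API group (dask.dataframe, dask.array, etc.). Last year, "More examples in my field" was the most important item for all API groups (first table below). In 2020 the picture is somewhat different (second table below).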

[ ]:

help_by_api = (
    pd.merge(
        df.loc[2019, 'Dask APIs'].str.split(';').explode(),
        df.loc[2019, 'Which would help you most right now?'],
        left_index=True, right_index=True)
    .groupby('Which would help you most right now?')['Dask APIs'].value_counts()
    .unstack(fill_value=0).T
    .loc[['Array', 'Bag', 'DataFrame', 'Delayed', 'Futures', 'ML', 'Xarray']]

)
(
    help_by_api
        .style
        .background_gradient(axis="columns")
        .set_caption("2019 normalized by row. Darker means that a higher proportion of "
                     "users of that API prefer that priority.")
)

[ ]:

help_by_api = (
    pd.merge(
        df.loc[2020, 'Dask APIs'].str.split(';').explode(),
        df.loc[2020, 'Which would help you most right now?'],
        left_index=True, right_index=True)
    .groupby('Which would help you most right now?')['Dask APIs'].value_counts()
    .unstack(fill_value=0).T
    .loc[['Array', 'Bag', 'DataFrame', 'Delayed', 'Futures', 'ML', 'Xarray']]

)
(
    help_by_api
        .style
        .background_gradient(axis="columns")
        .set_caption("2020 normalized by row. Darker means that a higher proportion of "
                     "users of that API prefer that priority.")
)

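Examples are again the most important item (for all API groups except Futures). But "Performance improvements" is now the second most important area (and the most important one for Futures). How should we interpret this? A charitable interpretation is that Dask's users are scaling to larger problems and running into new scaling challenges. A less charitable interpretation is that our users' workflows are the same but Dask is getting slower!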
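
The tables above show raw counts with a per-row color gradient. An alternative way to compare API groups of different sizes is to compute row-normalized shares explicitly, for example with `pd.crosstab(..., normalize="index")`. The sketch below uses invented data and is an illustration only, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical responses: a multi-select API answer and a single-choice priority.
toy = pd.DataFrame({
    "Dask APIs": ["Array;Futures", "DataFrame", "Futures", "Array"],
    "Which would help you most right now?": [
        "More examples in my field",
        "More examples in my field",
        "Performance improvements",
        "Performance improvements",
    ],
})

# One row per (respondent, API) so each API a respondent uses counts once.
exploded = (
    toy.assign(**{"Dask APIs": toy["Dask APIs"].str.split(";")})
       .explode("Dask APIs")
)

# Rows are APIs, columns are priorities; normalize="index" makes each row sum to 1.
table = pd.crosstab(
    exploded["Dask APIs"],
    exploded["Which would help you most right now?"],
    normalize="index",
)
print(table)
```

With shares instead of counts, a small API group such as Futures can be compared directly against a large one such as DataFrame.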

Common Feature Requests

For specific features, we made a list of things that we (as developers) thought might be important.

[ ]:

# Keep the per-feature columns and shorten each header to the feature name inside the brackets.
# Splitting on "[" avoids str.lstrip, which strips a set of characters rather than a prefix.
common = (df[df.columns[df.columns.str.startswith("What common feature")]]
          .rename(columns=lambda x: x.split("[", 1)[-1].rstrip("]")))

# Per-year counts of each importance level for every feature.
a = common.loc[2019].apply(lambda s: s.value_counts()).T.stack().reset_index().rename(columns={'level_0': 'Question', 'level_1': "Importance", 0: "count"}).assign(Year=2019)
b = common.loc[2020].apply(lambda s: s.value_counts()).T.stack().reset_index().rename(columns={'level_0': 'Question', 'level_1': "Importance", 0: "count"}).assign(Year=2020)

counts = pd.concat([a, b], ignore_index=True)

# Long format: one row per (respondent, feature) with the chosen importance.
d = common.stack().reset_index().rename(columns={"level_2": "Feature", 0: "Importance"})
order = ["Not relevant for me", "Somewhat useful", 'Critical to me']
# Recent seaborn releases require x to be passed by keyword.
sns.catplot(x='Importance', row="Feature", kind="count", col="Year", data=d, sharex=False, order=order);

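The relative importance of each feature didn't change much compared to last year. Perhaps the biggest movement was in "Ease of deployment", where "Critical to me" is now relatively more popular (though it was already the most popular answer last year).

What other systems do you use?

SSH continues to be the most popular "cluster resource manager". This was the big surprise last year, so we put in some work to make it nicer. Aside from that, not much has changed.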

[ ]:

c = df['If you use a cluster, how do you launch Dask? '].dropna().str.split(";").explode()
top = c.value_counts().index[:6]
sns.countplot(y="If you use a cluster, how do you launch Dask? ", data=c[c.isin(top)].reset_index(), hue="Year");

Dask users are about as happy with its stability as they were last year.

[ ]:

# fig, ax = plt.subplots(figsize=(9, 6))
sns.countplot(y="Is Dask stable enough for you?", hue="Year", data=df.reset_index())
sns.despine()

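Takeaways

Overall, most things are similar to last year.
Documentation, especially domain-specific examples, continues to be very important.
More users are pushing Dask further. Investing in performance is likely to be valuable.

Thanks again to all the respondents. We look forward to repeating this process to identify trends over time.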
