====
Dask
====
.. grid:: 1 1 2 2
.. grid-item::
:columns: 12 12 6 6
*Dask is a Python library for parallel and distributed computing.* Dask is:
- **Easy** to use and set up (it's just a Python library)
- **Powerful** at providing scale, and unlocking complex algorithms
- and **Fun** 🎉
.. grid-item::
:columns: 12 12 6 6
.. raw:: html
How to Use Dask
---------------
Dask provides several APIs. Choose one that works best for you:
.. tab-set::
.. tab-item:: Tasks
Dask Futures parallelize arbitrary for-loop style Python code,
providing:
- **Flexible** tooling allowing you to construct custom
pipelines and workflows
- **Powerful** scaling techniques, processing several thousand
tasks per second
- **Responsive** feedback allowing for intuitive execution,
and helpful dashboards
Dask futures form the foundation for other Dask work
Learn more at :bdg-link-primary:`Futures Documentation `
or see an example at :bdg-link-primary:`Futures Example `
.. grid:: 1 1 2 2
.. grid-item::
:columns: 12 12 7 7
.. code-block:: python
from dask.distributed import LocalCluster
client = LocalCluster().get_client()
# Submit work to happen in parallel
results = []
for filename in filenames:
data = client.submit(load, filename)
result = client.submit(process, data)
results.append(result)
# Gather results back to local computer
results = client.gather(results)
.. grid-item::
:columns: 12 12 5 5
.. figure:: images/futures-graph.png
:align: center
.. tab-item:: DataFrames
Dask Dataframes parallelize the popular pandas library, providing:
- **Larger-than-memory** execution for single machines, allowing you
to process data that is larger than your available RAM
- **Parallel** execution for faster processing
- **Distributed** computation for terabyte-sized datasets
Dask Dataframes are similar in this regard to Apache Spark, but use the
familiar pandas API and memory model. One Dask dataframe is simply a
collection of pandas dataframes on different computers.
Learn more at :bdg-link-primary:`DataFrame Documentation `
or see an example at :bdg-link-primary:`DataFrame Example `
.. grid:: 1 1 2 2
.. grid-item::
:columns: 12 12 7 7
.. code-block:: python
import dask.dataframe as dd
# Read large datasets in parallel
df = dd.read_parquet("s3://mybucket/data.*.parquet")
df = df[df.value < 0]
result = df.groupby(df.name).amount.mean()
result = result.compute() # Compute to get pandas result
result.plot()
.. grid-item::
:columns: 12 12 5 5
.. figure:: images/dask-dataframe.svg
:align: center
.. tab-item:: Arrays
Dask Arrays parallelize the popular NumPy library, providing:
- **Larger-than-memory** execution for single machines, allowing you
to process data that is larger than your available RAM
- **Parallel** execution for faster processing
- **Distributed** computation for terabyte-sized datasets
Dask Arrays allow scientists and researchers to perform intuitive and
sophisticated operations on large datasets but use the
familiar NumPy API and memory model. One Dask array is simply a
collection of NumPy arrays on different computers.
Learn more at :bdg-link-primary:`Array Documentation `
or see an example at :bdg-link-primary:`Array Example `
.. grid:: 1 1 2 2
.. grid-item::
.. code-block:: python
import dask.array as da
x = da.random.random((10000, 10000))
y = (x + x.T) - x.mean(axis=1)
z = y.var(axis=0).compute()
.. grid-item::
:columns: 12 12 5 5
.. figure:: images/dask-array.svg
:align: center
Xarray wraps Dask array and is a popular downstream project, providing
labeled axes and simultaneously tracking many Dask arrays together,
resulting in more intuitive analyses. Xarray is popular and accounts
for the majority of Dask array use today especially within geospatial
and imaging communities.
Learn more at :bdg-link-primary:`Xarray Documentation `
or see an example at :bdg-link-primary:`Xarray Example `
.. grid:: 1 1 2 2
.. grid-item::
.. code-block:: python
import xarray as xr
ds = xr.open_mfdataset("data/*.nc")
da.groupby('time.month').mean('time').compute()
.. grid-item::
:columns: 12 12 5 5
.. figure:: https://docs.xarray.dev/en/stable/_static/logos/Xarray_Logo_RGB_Final.png
:align: center
.. tab-item:: Bags
Dask Bags are simple parallel Python lists, commonly used to process
text or raw Python objects. They are ...
- **Simple** offering easy map and reduce functionality
- **Low-memory** processing data in a streaming way that minimizes memory use
- **Good for preprocessing** especially for text or JSON data prior
ingestion into dataframes
Dask bags are similar in this regard to Spark RDDs or vanilla
Python data structures and iterators. One Dask bag is simply a
collection of Python iterators processing in parallel on different computers.
Learn more at :bdg-link-primary:`Bag Documentation `
or see an example at :bdg-link-primary:`Bag Example `
.. code-block:: python
import dask.bag as db
# Read large datasets in parallel
lines = db.read_text("s3://mybucket/data.*.json")
records = (lines
.map(json.loads)
.filter(lambda d: d["value"] > 0)
)
df = records.to_dask_dataframe()
How to Install Dask
-------------------
Installing Dask is easy with ``pip`` or ``conda``.
Learn more at :bdg-link-primary:`Install Documentation `
.. tab-set::
.. tab-item:: pip
.. code-block::
pip install "dask[complete]"
.. tab-item:: conda
.. code-block::
conda install dask
How to Deploy Dask
------------------
You can use Dask on a single machine, or deploy it on distributed hardware.
Learn more at :bdg-link-primary:`Deploy Documentation `
.. tab-set::
.. tab-item:: Local
Dask can set itself up easily in your Python session if you create a
``LocalCluster`` object, which sets everything up for you.
.. code-block:: python
from dask.distributed import LocalCluster
cluster = LocalCluster()
client = cluster.get_client()
# Normal Dask work ...
Alternatively, you can skip this part, and Dask will operate within a
thread pool contained entirely with your local process.
.. tab-item:: Cloud
`Coiled `_
is a commercial SaaS product that deploys Dask clusters on cloud platforms like AWS, GCP, and Azure.
.. code-block:: python
import coiled
cluster = coiled.Cluster(
n_workers=100,
region="us-east-2",
worker_memory="16 GiB",
spot_policy="spot_with_fallback",
)
client = cluster.get_client()
Learn more at :bdg-link-primary:`Coiled Documentation `
.. tab-item:: HPC
The `Dask-Jobqueue project `_ deploys
Dask clusters on popular HPC job submission systems like SLURM, PBS, SGE, LSF,
Torque, Condor, and others.
.. code-block:: python
from dask_jobqueue import PBSCluster
cluster = PBSCluster(
cores=24,
memory="100GB",
queue="regular",
account="my-account",
)
cluster.scale(jobs=100)
client = cluster.get_client()
Learn more at :bdg-link-primary:`Dask-Jobqueue Documentation `
.. tab-item:: Kubernetes
The `Dask Kubernetes project `_ provides
a Dask Kubernetes Operator for deploying Dask on Kubernetes clusters.
.. code-block:: python
from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(
name="my-dask-cluster",
image="ghcr.io/dask/dask:latest",
resources={"requests": {"memory": "2Gi"}, "limits": {"memory": "64Gi"}},
)
cluster.scale(10)
client = cluster.get_client()
Learn more at :bdg-link-primary:`Dask Kubernetes Documentation `
Learn with Examples
-------------------
Dask use is widespread, across all industries and scales. Dask is used
anywhere Python is used and people experience pain due to large scale data, or
intense computing.
You can learn more about Dask applications at the following sources:
- `Dask Examples `_
- `Dask YouTube Channel `_
Additionally, we encourage you to look through the reference documentation on
this website related to the API that most closely matches your application.
Dask was designed to be **easy to use** and **powerful**. We hope that it's
able to help you have fun with your work.
.. toctree::
:maxdepth: 1
:hidden:
:caption: Getting Started
Install Dask
10-minutes-to-dask.rst
deploying.rst
Best Practices
faq.rst
.. toctree::
:maxdepth: 1
:hidden:
:caption: How to Use
array.rst
bag.rst
DataFrame
Delayed
futures.rst
ml.rst
.. toctree::
:maxdepth: 1
:hidden:
:caption: Internals
understanding-performance.rst
scheduling.rst
graphs.rst
debugging-performance.rst
internals.rst
.. toctree::
:maxdepth: 1
:hidden:
:caption: Reference
api.rst
cli.rst
develop.rst
changelog.rst
configuration.rst
how-to/index.rst
presentations.rst
maintainers.rst
.. _`Anaconda Inc`: https://www.anaconda.com
.. _`3-clause BSD license`: https://github.com/dask/dask/blob/main/LICENSE.txt
.. _`#dask tag`: https://stackoverflow.com/questions/tagged/dask
.. _`GitHub issue tracker`: https://github.com/dask/dask/issues
.. _`xarray`: https://xarray.pydata.org/en/stable/
.. _`scikit-image`: https://scikit-image.org/docs/stable/
.. _`scikit-allel`: https://scikits.appspot.com/scikit-allel
.. _`pandas`: https://pandas.pydata.org/pandas-docs/version/0.17.0/
.. _`distributed scheduler`: http://www.aidoczh.com/dask-distributed/