Why Frisky

You don't need Frisky.

Frisky is a fast version of Dask and most people don't need Dask. Parallel computing is for when you're too lazy to solve a problem properly, so you bludgeon the problem with hardware. You're better off using a database, or Polars, or asking your LLM to craft you something custom. Heck, you could even use Spark.

Frisky exists because I wanted to see how fast I could make a Dask-like thing today. Frisky is an art project. A very fast art project.

But Frisky is fun, and that's honestly the best reason I can find to use it. Oh, and agents seem to enjoy using it. You're probably here for speed though, so let's talk about speed first.

Speed! Rust!
AI!
New Dashboard!
Speed Again! Fast Disk and Network!
Graph Construction!
Lightweight!
Moving Fast! Breaking Things!

Speed! Rust!

Is it still cool to reimplement libraries in Rust? I hope so, because gosh-darn-it that's exactly what I did.

Frisky is written in Rust (🎉). Frisky's core is roughly 100x faster than the core of Dask. That's not to say that your code will be 100x faster (it almost certainly won't be), but you'll no longer be able to blame task scheduling.

The Frisky scheduler can run around 250,000-400,000 tasks per second, which works out to an overhead of roughly 3 µs per task.

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "title": "Scheduler throughput (higher is better)", "data": {"values": [ {"scheduler": "Frisky", "tasks_per_s": 300000, "grp": "Frisky"}, {"scheduler": "Dask", "tasks_per_s": 3000, "grp": "other"} ]}, "height": 220, "mark": "bar", "encoding": { "x": {"field": "scheduler", "type": "nominal", "title": null, "axis": {"labelAngle": 0}}, "y": {"field": "tasks_per_s", "type": "quantitative", "title": "tasks / second"}, "color": {"field": "grp", "type": "nominal", "legend": null, "scale": {"domain": ["Frisky", "other"], "range": ["#e3a008", "#9aa0a6"]}}, "tooltip": [ {"field": "scheduler", "type": "nominal", "title": "Scheduler"}, {"field": "tasks_per_s", "type": "quantitative", "title": "Tasks / second", "format": ","} ] } }

Measured

Frisky: an offline Rust harness on an Apple M4 drives the real scheduler state machine with simulated worker completions, achieving 2.5-4.5 µs/task. In our matching end-to-end benchmark, Dask ran at ~2,600-3,000 tasks/s.
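As a quick sanity check, throughput and per-task overhead are just reciprocals of each other. The snippet below is only arithmetic on the figures quoted above; the helper names are made up for illustration and are not part of Frisky.

```python
def us_per_task(tasks_per_second: float) -> float:
    """Convert scheduler throughput into per-task overhead in microseconds."""
    return 1e6 / tasks_per_second


def tasks_per_second(us: float) -> float:
    """Convert per-task overhead in microseconds into throughput."""
    return 1e6 / us


# Quoted Frisky range: 250,000-400,000 tasks/s
frisky_us = (us_per_task(400_000), us_per_task(250_000))

# Quoted Dask range: ~2,600-3,000 tasks/s
dask_us = (us_per_task(3_000), us_per_task(2_600))

# Ratio of the two bars in the chart (300,000 vs 3,000 tasks/s)
speedup = 300_000 / 3_000

print(f"Frisky: {frisky_us[0]:.1f}-{frisky_us[1]:.1f} µs/task")
print(f"Dask:   {dask_us[0]:.0f}-{dask_us[1]:.0f} µs/task")
print(f"Scheduler speedup: {speedup:.0f}x")
```

So the Dask figure corresponds to a few hundred microseconds per task, which is where the "roughly 100x" comes from.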

AI!

Even if Rust has lost its lustre, AI is surely in vogue (at least according to my LinkedIn feed), and agents love Frisky.

The biggest benefit of using Rust isn't speed, it's that we can measure every detail of the calculation guilt-free, generating mountains of valuable telemetry and context. Frisky collects this telemetry and feeds it into downstream analytics, which in turn enable the AI development cycle at the heart of Frisky's progress.

You point your favorite AI agent at a Frisky dashboard and the frisky CLI, and it pulls in all the context it wants:

$ frisky --help                               # Agents read this
$ frisky observe overview YOUR_DASHBOARD_URL  # Then they do this

Overview  http://localhost:8787
  state   workers 6 (0 idle)   waiting 34   processing 1347   memory 7684   erred 0   queued 0
          → stuck? observe blocked

  perf    wall-clock 14.0s   workers 6   tasks 14683   spans 200000
  memory 0.5 GB / 17.2 GB (3% peak)   spilled 0.0 GB   unspilled 0.0 GB   network recv 1.0 GB
          → per-worker memory observe workers   prefix costs observe prefixes http://localhost:8787   in-flight transfers observe transfers

Components
           │0s       2         4         6         8         10        12      14s│     Total
           ├──────────────────────────────────────────────────────────────────────┤
   compute │▇▇██████████████████████████████████▃█████████████████████████████    │   275.6 s
   network │▁▂▂▃▂▄▇▂▂▂▂▂▂▃▂▅▄▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂██▇██▇▇▇▄▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▂▁▁▁    │    57.9 s
 scheduler │▁▁▁▁▂▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁  ▁▁│     3.0 s
     other │▂▂▂▃▄▅▄▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▄▃▃▂▃▃▃▂▂▂▂▂▃▃▃▂▃▃▃▂▂▂▃▃▃▂▃▂▂▂▂│    94.7 s
           └──────────────────────────────────────────────────────────────────────┘
→ zoom / full view: observe timeline http://localhost:8787 --view component

Costliest span types — over time
                     │0s        2         4          6         8          10        12   13s│     Total
                     ├──────────────────────────────────────────────────────────────────────┤
worker.exec.call     │▇▇████████████████████████████████████▇█▇▇▇███████████████████████████│   268.6 s
tcp.send.queue       │▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▇▃▃▆▆▅▅▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│    17.2 s
tcp.send.write_queue │▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆▃▃▅▆▄▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│     8.5 s
tcp.recv.queue       │▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▆▁▃▆▅▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│     8.5 s
tcp.send.compress    │▁▁▁▁▁▂▃▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▆▆▄▄▅▄▃▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│     7.7 s
tcp.send.serialize   │▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▃▂▃▄▃▄▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│     6.2 s
worker.transfer.recv │▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▂▂▄▄▄▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│     5.5 s
worker.deserialize   │▁▁▁▁▂▂▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▄▅▃▂▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁│     4.4 s
                     └──────────────────────────────────────────────────────────────────────┘
Memory               │▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│      2% peak
Network              │▁▁▁▁▁▁▁▁▄▄▄▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▅▅▇▆▆▆▆▆█▇▇▇▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│   313.3 MB/s

                Costliest span types
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Name                 ┃  Total ┃ Per-wkr ┃ Max wkr ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ worker.exec.call     │ 268.6s │   44.8s │   45.7s │
│ tcp.send.queue       │  17.2s │    2.9s │    6.0s │
│ tcp.send.write_queue │   8.5s │    1.4s │    3.2s │
│ tcp.recv.queue       │   8.5s │    1.4s │    3.9s │
│ tcp.send.compress    │   7.7s │    1.3s │    2.5s │
│ tcp.send.serialize   │   6.2s │    1.0s │    2.1s │
│ worker.transfer.recv │   5.5s │    0.9s │    1.2s │
│ worker.deserialize   │   4.4s │    0.7s │    1.0s │
└──────────────────────┴────────┴─────────┴─────────┘
Per-wkr = mean over the workers that ran it; Max wkr = the busiest single worker. Max wkr ≫ Per-wkr means a few workers carry the work (imbalance).
→ one op over time: observe timeline http://localhost:8787 --view detailed --prefix worker.exec.call   raw: observe spans http://localhost:8787 --name worker.exec.call

                   Workers unlike their peers
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Worker          ┃ Excess ┃ Top deviations (vs median worker) ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 127.0.0.1:58058 │     3s │ tcp.recv.queue +3s (20σ)          │
│ 127.0.0.1:58053 │     0s │ tcp.send.write +0s (4σ)           │
└─────────────────┴────────┴───────────────────────────────────┘
Each worker's top-3 most unusual span types vs the median worker; Excess = total extra seconds across them.
→ rank all: observe stragglers http://localhost:8787   one worker: observe timeline http://localhost:8787 --worker 127.0.0.1:58058
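To make the "Max wkr ≫ Per-wkr" hint concrete, here is a small sketch that ranks the span types from the table above by the ratio of the busiest worker to the mean worker. The numbers are copied from that table; the ratio itself is just one assumed way to read the two columns, not something the frisky CLI reports.

```python
# (span type, Per-wkr seconds, Max wkr seconds), copied from the table above
rows = [
    ("worker.exec.call", 44.8, 45.7),
    ("tcp.send.queue", 2.9, 6.0),
    ("tcp.send.write_queue", 1.4, 3.2),
    ("tcp.recv.queue", 1.4, 3.9),
    ("tcp.send.compress", 1.3, 2.5),
    ("tcp.send.serialize", 1.0, 2.1),
    ("worker.transfer.recv", 0.9, 1.2),
    ("worker.deserialize", 0.7, 1.0),
]


def imbalance(per_worker: float, max_worker: float) -> float:
    """Ratio of the busiest worker to the mean worker; 1.0 means perfectly even."""
    return max_worker / per_worker


# Most lopsided span types first
ranked = sorted(rows, key=lambda r: imbalance(r[1], r[2]), reverse=True)

for name, per_worker, max_worker in ranked:
    print(f"{name:<22} {imbalance(per_worker, max_worker):4.1f}x")
```

Compute (worker.exec.call) comes out almost perfectly even, while tcp.recv.queue is the most lopsided, which matches the worker flagged in the "Workers unlike their peers" table.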

You can see an example of this above and learn more at the agents page. Agents often exclaim upon seeing the results. Here are some examples taken from my session history:

This is rich.
This is a decisive win. The sparklines reveal what binary completely hid…
This is working beautifully
Exactly what I need to see the run shape
It immediately answered "why isn't progress happening".
Jackpot!
-- Claude and Codex

I think I'm decently smart (and obviously at least a little arrogant), which is why I was able to fit the entire Dask state machine in my head back in the day. Oh boy, have I met my match. Agents are better than me at this, so I've structured Frisky to be developed by agents, giving them tons of feedback loops. I've got to say, they're doing a marvelous job. I hope that you enjoy the fruits of their labor.

I also hope that you find a way to work this way too. Frisky can help you feed your agents really good context about your computations. Performance optimization is really easy with good feedback.

New Dashboard!

Our mountains of telemetry also produce an entrancing dashboard (users loved Dask's dashboard, and Frisky's dashboard makes Dask's look like it was written in 2015 (it was!)).

There's nothing more fun than pumping mountains of telemetry onto a dashboard. Humans love charts, and Frisky's dashboard has more charts than even I understand. I actually had to switch to WebGL just to display things fast enough.

I don't know how much you'll get from it (it needs more thought at the moment) but damn is it fun to watch. If you're viewing this on desktop (not mobile) you should see a live dashboard below. Click the buttons at the top for different pages.


Note

This is a recorded loop, so the numbers repeat and live actions (opening logs, drilling into a specific task) aren't wired up. To see your own workload, run frisky demo (or any cluster) and open its dashboard.

Speed Again! Fast Disk and Network!

OK, so rarely is scheduling the actual bottleneck. It's more common to feel pain around four things:

1. Disk: Running out of memory / writing to disk
2. Network: Communication between workers
3. Graph Generation: Large graph generation client-side and uploading it to the scheduler
4. Being Dumb: Dumb user code

Let's talk about 1-2 now, then we'll get to 3-4 later. This is about disk and sockets.

I no longer worry about running out of memory. Modern disk is fast enough if used well, and Frisky is designed to keep disk and socket pipelines busy. Doing disk well is hard. You've got to do obvious things, like compression, but also lots of non-obvious things like:

Pipeline writes, don't write concurrently (even if the internet or LLMs tell you to)

Instead, pipeline a constant stream of data to your disk. Baby modern SSDs like the spinning disks of yore.
Byteshuffle your data before compressing (but only sometimes)
Sample blocks to choose the right compression for a buffer
Write directly to disk from network when you're memory bound
Account for every byte in transit to avoid OOMing
… and so on
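Two of the items above, byteshuffling and sampling blocks to pick a compression strategy, fit in a short sketch. This is an illustration in plain Python with zlib, not Frisky's implementation (which is in Rust and may use different codecs); every function name here is hypothetical.

```python
import struct
import zlib


def byteshuffle(buf: bytes, itemsize: int) -> bytes:
    """Regroup bytes so byte 0 of every item comes first, then byte 1, and so on."""
    if len(buf) % itemsize:
        raise ValueError("buffer length must be a multiple of itemsize")
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def unshuffle(buf: bytes, itemsize: int) -> bytes:
    """Invert byteshuffle."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n:(i + 1) * n]
    return bytes(out)


def sample_blocks(buf: bytes, itemsize: int, block: int = 4096, count: int = 4) -> bytes:
    """Concatenate a few evenly spaced, item-aligned blocks as a cheap stand-in for buf."""
    block -= block % itemsize
    if len(buf) <= block * count:
        return buf
    step = (len(buf) // count) // itemsize * itemsize
    return b"".join(buf[i * step:i * step + block] for i in range(count))


def should_shuffle(buf: bytes, itemsize: int) -> bool:
    """Try both layouts on a small sample and keep whichever compresses better."""
    sample = sample_blocks(buf, itemsize)
    plain = len(zlib.compress(sample, 1))
    shuffled = len(zlib.compress(byteshuffle(sample, itemsize), 1))
    return shuffled < plain


def compress(buf: bytes, itemsize: int):
    """Return (shuffled_flag, compressed_bytes)."""
    shuffled = should_shuffle(buf, itemsize)
    payload = byteshuffle(buf, itemsize) if shuffled else buf
    return shuffled, zlib.compress(payload)


def decompress(shuffled: bool, blob: bytes, itemsize: int) -> bytes:
    payload = zlib.decompress(blob)
    return unshuffle(payload, itemsize) if shuffled else payload


# Slowly varying 64-bit integers: the high bytes are nearly constant,
# so grouping them together gives the compressor long runs to work with.
ints = struct.pack("<10000q", *range(10_000))
flag, blob = compress(ints, itemsize=8)
assert decompress(flag, blob, itemsize=8) == ints
print(f"shuffled={flag}  raw={len(ints)} B  compressed={len(blob)} B")
```

The "only sometimes" part is why the decision is made per buffer: for data without a fixed-width numeric layout, shuffling can make compression worse, and the sample lets you find that out cheaply.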

Maybe you're saying "I'm smart and know HPC. These are standard solutions." You're right, but they're also uncommon in modern frameworks.
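For readers who haven't seen the pipelining idea from the list above, here is a minimal sketch: a single writer thread fed by a bounded queue, so the disk sees one steady sequential stream and producers get backpressure when the disk falls behind. This is an assumption-laden illustration in Python, not how Frisky does it; `write_pipelined` and `max_pending` are hypothetical names.

```python
import os
import queue
import tempfile
import threading

_DONE = object()  # sentinel telling the writer thread to stop


def _drain(f, pending: queue.Queue) -> None:
    """Writer loop: the only code that touches the file, one chunk at a time."""
    while True:
        chunk = pending.get()
        if chunk is _DONE:
            return
        f.write(chunk)


def write_pipelined(path, chunks, max_pending: int = 8) -> int:
    """Stream chunks to path through one writer thread; return bytes queued.

    The bounded queue caps how many chunks sit in memory waiting for disk.
    Error propagation from the writer thread is omitted for brevity.
    """
    pending = queue.Queue(maxsize=max_pending)
    total = 0
    with open(path, "wb") as f:
        writer = threading.Thread(target=_drain, args=(f, pending))
        writer.start()
        try:
            for chunk in chunks:
                pending.put(chunk)  # blocks when the writer is behind
                total += len(chunk)
        finally:
            pending.put(_DONE)
            writer.join()
    return total


# Small demonstration with a temporary file
data = [bytes([i]) * 1000 for i in range(50)]
fd, demo_path = tempfile.mkstemp()
os.close(fd)
written = write_pipelined(demo_path, iter(data), max_pending=4)
with open(demo_path, "rb") as f:
    round_trip = f.read()
os.remove(demo_path)
print(f"wrote {written} bytes through a queue of at most 4 chunks")
```

The design choice is that producers never write to disk themselves. They only hand chunks to the queue, which also gives a crude bound on bytes in flight (at most `max_pending` chunks).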

Modern hardware is faster than I expected. My MacBook Air (the absolute cheapest model I could buy) runs 20x faster than the hardware I could get when I designed Dask. Hardware is faster than our software realizes. It's time for our software to catch up.

Graph Construction!

One of the most common complaints from Dask users is the delay between submitting a computation and seeing anything happen on the dashboard. For a large graph this can be minutes of time without any feedback. It's not always the slowest part, but it is the most anxiety-inducing.

In these minutes of stillness, Dask is furiously doing the following:

Constructing millions of Python tasks to run
Optimizing that graph
Determining a memory-minimizing trajectory through that graph
Sending that graph to the scheduler
Integrating those Python tasks into scheduler state

This happens in libraries like Xarray and Dask Dataframe/Array before we get to Frisky.
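To get a feel for the first step in the list above, here is a toy sketch that builds a graph in the classic dict-of-tuples style that Dask has historically used, where each task is a tuple of a callable and its arguments. It is a simplified illustration, not Dask's or Frisky's actual code; `build_graph` and `evaluate` are hypothetical helpers.

```python
import time
from operator import add


def build_graph(n: int) -> dict:
    """Build a dict-style task graph: n inputs, n increments, one final sum."""
    graph = {("x", i): i for i in range(n)}
    graph.update({("y", i): (add, ("x", i), 1) for i in range(n)})
    graph["total"] = (sum, [("y", i) for i in range(n)])
    return graph


def evaluate(graph: dict, expr, cache: dict):
    """Tiny recursive evaluator, just enough to show the graph is runnable."""
    if isinstance(expr, list):
        return [evaluate(graph, e, cache) for e in expr]
    if isinstance(expr, tuple) and expr and callable(expr[0]):
        func, *args = expr
        return func(*(evaluate(graph, a, cache) for a in args))
    try:
        is_key = expr in graph
    except TypeError:  # unhashable literal, so it cannot be a key
        is_key = False
    if is_key:
        if expr not in cache:
            cache[expr] = evaluate(graph, graph[expr], cache)
        return cache[expr]
    return expr  # plain literal


# Correctness check on a small graph
small = build_graph(1_000)
result = evaluate(small, "total", {})

# Construction cost alone, before any optimizing, ordering, or uploading
start = time.perf_counter()
big = build_graph(100_000)
elapsed = time.perf_counter() - start
print(f"{len(big):,} tasks built in {elapsed * 1e3:.0f} ms; small total = {result}")
```

Every task here is a handful of Python objects, and each later step (optimizing, ordering, serializing, integrating into scheduler state) has to walk all of them again, which is why large graphs leave you staring at a quiet dashboard.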

To resolve this problem I've re-implemented Dask Array at github.com/mrocklin/dask-array. Originally I built this for query optimization (another topic I won't get into here), but because it's my personal version of Dask Array it also moves a bit faster, and it now does all of the above steps in Rust, in a Frisky-native way.

You can use it with Xarray too if you're on git main versions of Dask and Xarray. Just do this.

from dask_array.xarray import register
register()

For Legacy Dask collections (what I'm now calling everything else), Frisky can't speed this up, but it can tell you what's going on. We now include Client activity on the Dashboard, which feels much better than staring at a [*] cell in Jupyter!

Lightweight!

Frisky moves tremendous amounts of data, but it's quite compact in three ways:

The code is tight (it's not AI slop)
The processes are lightweight (start up is trivial)
The package is small (easy to depend on)

Despite being AI engineered, it's about the same code complexity as the hand-crafted core of Dask.

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "title": "Implementation size (lower is better)", "data": {"values": [ {"project": "dask.distributed", "language": "Python", "kloc": 44.2, "ord": 2}, {"project": "Frisky", "language": "Rust", "kloc": 40.5, "ord": 1}, {"project": "Frisky", "language": "Python", "kloc": 6.8, "ord": 2} ]}, "height": 220, "mark": "bar", "encoding": { "x": {"field": "project", "type": "nominal", "title": null, "axis": {"labelAngle": 0}, "sort": ["dask.distributed", "Frisky"]}, "y": {"field": "kloc", "type": "quantitative", "title": "thousands of code lines"}, "color": {"field": "language", "type": "nominal", "title": "Language", "scale": {"domain": ["Rust", "Python"], "range": ["#e3a008", "#3776ab"]}}, "order": {"field": "ord", "type": "quantitative"}, "tooltip": [ {"field": "project", "type": "nominal", "title": "Project"}, {"field": "language", "type": "nominal", "title": "Language"}, {"field": "kloc", "type": "quantitative", "title": "Code lines (K)"} ] } }

Measured

Code lines only (cloc, comments and blanks excluded, tests excluded), split by language. Frisky: Rust core 40,466 + Python package 6,781. dask.distributed: Python 44,177.

You can spin up an entire in-process Frisky cluster, do some work, and spin everything down all faster than you can blink an eye.

import frisky

with frisky.LocalCluster(processes=False) as cluster:
    with cluster.get_client() as client:
        futures = client.map(lambda x: x + 1, range(10))
        results = client.gather(futures)

That whole round trip (start the cluster, submit ten tasks, distribute them to the workers, do work, gather results back, shut everything down) takes about 6 ms. The first cluster in a fresh process pays a one-time ~30 ms import cost; every one after that stays in the single-digit-millisecond range.

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "title": "Cluster start + 10 tasks + stop (lower is better)", "data": {"values": [ {"event": "Frisky: start, 10 tasks, stop", "ms": 6, "grp": "Frisky"}, {"event": "Blink of an eye", "ms": 100, "grp": "other"} ]}, "height": 120, "mark": "bar", "encoding": { "y": {"field": "event", "type": "nominal", "title": null, "sort": {"field": "ms", "order": "ascending"}}, "x": {"field": "ms", "type": "quantitative", "title": "milliseconds"}, "color": {"field": "grp", "type": "nominal", "legend": null, "scale": {"domain": ["Frisky", "other"], "range": ["#e3a008", "#9aa0a6"]}}, "tooltip": [ {"field": "event", "type": "nominal", "title": "Operation"}, {"field": "ms", "type": "quantitative", "title": "Milliseconds"} ] } }

Measured

~2.5 ms to start and stop a warm in-process cluster; running ten client.map tasks through it adds ~3 ms, for ~6 ms end to end (Apple M4). A human blink is ~100 ms.
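If you want to reproduce this on your own machine, a rough sketch of a timing harness follows. It separates the first (cold) run from the warm runs, since the first one includes the import cost. The frisky calls are the same ones shown in the example above; `time_runs` and `frisky_round_trip` are hypothetical helper names, and your numbers will depend on your hardware.

```python
import statistics
import time


def time_runs(fn, repeats: int = 20):
    """Call fn repeatedly; return (first_run_ms, median_of_remaining_runs_ms)."""
    if repeats < 2:
        raise ValueError("need at least two runs to separate cold from warm")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1e3)
    return timings[0], statistics.median(timings[1:])


def frisky_round_trip():
    """Start an in-process cluster, run ten tasks, gather, and shut down."""
    import frisky  # imported lazily so the first call includes the import cost

    with frisky.LocalCluster(processes=False) as cluster:
        with cluster.get_client() as client:
            futures = client.map(lambda x: x + 1, range(10))
            return client.gather(futures)


try:
    cold_ms, warm_ms = time_runs(frisky_round_trip)
    print(f"first run: {cold_ms:.1f} ms   warm median: {warm_ms:.1f} ms")
except ImportError:
    print("frisky is not installed; skipping the measurement")
```

Using the median of the warm runs rather than the mean keeps a single slow run (a garbage collection pause, say) from skewing the result.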

And as a dependency Frisky is trivial. Much smaller than standard libraries in the Python stack.

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "title": "Wheel size (lower is better)", "data": {"values": [ {"package": "Frisky", "mb": 4.2}, {"package": "pandas", "mb": 10.9}, {"package": "Prefect", "mb": 15.0}, {"package": "NumPy", "mb": 16.6}, {"package": "Polars", "mb": 56.4}, {"package": "Ray", "mb": 73.8} ]}, "transform": [{"calculate": "datum.package == 'Frisky' ? 'Frisky' : 'other'", "as": "grp"}], "height": 240, "mark": "bar", "encoding": { "y": {"field": "package", "type": "nominal", "title": null, "sort": {"field": "mb", "order": "ascending"}}, "x": {"field": "mb", "type": "quantitative", "title": "megabytes"}, "color": {"field": "grp", "type": "nominal", "legend": null, "scale": {"domain": ["Frisky", "other"], "range": ["#e3a008", "#9aa0a6"]}}, "tooltip": [ {"field": "package", "type": "nominal", "title": "Package"}, {"field": "mb", "type": "quantitative", "title": "Wheel size (MB)"} ] } }

Measured

CPython 3.12 / manylinux x86_64 wheel sizes reported by PyPI.

Moving Fast! Breaking Things!

As Dask developed more users and more downstream dependencies and more companies involved it became, for me, less fun. I found that I enjoy the early-to-middle stages of software development where change is easy.

Change is easy in Frisky, for both good and bad. If you're looking to base your Fortune 500 company's software stack on a distributed computing platform then Frisky is a terrible choice. If you're looking to have fun with rapidly changing distributed computing then Frisky may make sense.

But really, you don't need Frisky

You almost certainly don't need parallelism, and if you do, there are more pragmatic choices.

Frisky isn't a need. Frisky is fun. I had fun building Frisky and I hope that you have fun playing with it.