Performance

Frisky is lightweight and fast. It scales down to run well on a laptop, and can happily saturate 1000-machine clusters. Frisky is written in Rust and targets microsecond overheads and bare-metal performance.

Local Machines

Frisky is lightweight.

You can bring it up in-process and tear it down in a few milliseconds:

import frisky

def increment(x):
    return x + 1

%%time
with frisky.LocalCluster(processes=False) as cluster:
    with cluster.get_client() as client:
        pass

This is faster than a Python Thread Pool. Frisky has less overhead than a Python thread pool too.

import frisky

def increment(x):
    return x + 1

%%time
with frisky.LocalCluster(processes=False) as cluster, cluster.get_client() as client:
    futures = client.map(increment, range(1_000_000))
    results = client.gather(futures)

from concurrent.futures  import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    results = executor.map(increment, range(1_000_000))

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "data": {"values": [ {"executor": "Frisky", "tasks_per_s": 900000}, {"executor": "ThreadPool", "tasks_per_s": 850000} ]}, "height": 220, "mark": "bar", "encoding": { "x": {"field": "executor", "type": "nominal", "title": null, "axis": {"labelAngle": 0}}, "y": {"field": "tasks_per_s", "type": "quantitative", "title": "Tasks / second (1M trivial tasks)"}, "color": {"field": "executor", "type": "nominal", "legend": null, "scale": {"domain": ["Frisky", "ThreadPool"], "range": ["#0b8784", "#9aa0a6"]}} } }

Placeholder data

These bars are illustrative placeholders, not measurements yet.

This is a silly comparison though. Mostly our points here are "Frisky is lightweight" and "Frisky avoids adding overhead".

Scale

Frisky's scale is limited by the task throughput of the centralized scheduler. Frisky can process roughly 200,000-500,000 tasks per second, so if a task takes 100ms on average then Frisky starts to saturate around 20,000-50,000 CPU-cores.

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "data": {"values": [ {"scheduler": "Frisky", "tasks_per_s": 400000}, {"scheduler": "Dask", "tasks_per_s": 2000} ]}, "height": 220, "mark": "bar", "encoding": { "x": {"field": "scheduler", "type": "nominal", "title": null, "axis": {"labelAngle": 0}}, "y": {"field": "tasks_per_s", "type": "quantitative", "title": "Scheduler throughput (tasks / second)"}, "color": {"field": "scheduler", "type": "nominal", "legend": null, "scale": {"domain": ["Frisky", "Dask"], "range": ["#0b8784", "#9aa0a6"]}} } }

Placeholder data

These bars are illustrative placeholders, not measurements yet.

Hardware

Very few distributed computations will be limited by Frisky's task throughput. Instead, we often find that performance is negatively impacted by disk and network throughput.

Frisky takes extra care around network and disk pipes to ensure that they feed data through at optimal bare-metal rates. We achieve this with compression and pipelining.

Disk

When a worker runs low on memory it spills task data to local disk. Raw byte throughput here is set by the hardware — the local SSD's sustained write rate, once the workload is large enough to get past the OS page cache.

Frisky's advantage over Dask is that it compresses spilled data (with LZ4) before writing it, while Dask writes it raw. So the speedup is exactly how compressible your data is. Dense float arrays (random or even smooth) carry high-entropy mantissas that LZ4 can't shrink, so both frameworks are limited by raw disk bandwidth and run neck-and-neck. Low-cardinality, categorical, sparse, or string data compresses well, so Frisky writes a fraction of the bytes and moves that much more data per second.

The chart below shows effective spill throughput — logical bytes pushed out of memory per second of disk-write time — for a dense-float payload (the disk wall, where both tie) and a low-cardinality payload (where Frisky's compression wins):

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "data": {"values": [ {"payload": "dense float64", "framework": "Frisky", "mbps": 600}, {"payload": "dense float64", "framework": "Dask", "mbps": 600}, {"payload": "low-cardinality int", "framework": "Frisky", "mbps": 2100}, {"payload": "low-cardinality int", "framework": "Dask", "mbps": 600} ]}, "height": 240, "mark": "bar", "encoding": { "x": {"field": "payload", "type": "nominal", "title": null, "axis": {"labelAngle": 0}}, "xOffset": {"field": "framework"}, "y": {"field": "mbps", "type": "quantitative", "title": "Effective spill throughput (MB/s)"}, "color": {"field": "framework", "type": "nominal", "title": null, "scale": {"domain": ["Frisky", "Dask"], "range": ["#0b8784", "#9aa0a6"]}} } }

Placeholder data

These bars are illustrative placeholders, not measurements yet.

Network

On a cluster, worker-to-worker transfer bandwidth is the other common limit. This is best measured on real hardware rather than a laptop (loopback transfers run at memory speed and tell you nothing about a real network), so these numbers come from a Coiled cluster.

TODO: run the cross-worker transfer experiment on Coiled and produce a chart.