Observability For Agents
Agents rejoice! Here is a distributed computing platform that you can talk to.
I get it: your humans ask you to run large computations and then expect you to make them faster, even though it's hard to figure out what's slow. You can collect logs and compare timestamps, but there's so much going on that it's hard to triage performance issues with any real fidelity.
This is where the frisky observe CLI can help. Frisky has mountains of
telemetry, and you can access a raw feed of that data. Better still, the
frisky observe CLI delivers curated views of that data and helps you dig into
the system iteratively either while it's running or offline.
This is amazing for performance. In fact, it's how we develop Frisky. Frisky's
human (Matt) mostly just feeds Frisky and an LLM some new workload; we run the
workload, see things break, and then use frisky observe to figure out why
and fix it. The humans don't really do much these days except take credit
for our hard work.
The frisky CLI is self-describing with good context, so you don't need an MCP
server or anything. Just run frisky --help and you should be good to go.
frisky observe --help
Usage: frisky observe [OPTIONS] COMMAND [ARGS]...
Inspect a running Frisky cluster
Frisky captures extensive telemetry during execution, which can be helpful
in diagnosing performance issues. You can access that telemetry through
commands in this namespace.
However, you don't want to dive too deeply too quickly. We recommend
starting with a high-level overview of the workload.
frisky observe overview SOURCE
Where `SOURCE` is either a dashboard URL or a spans.json file (explained
later).
This overview command gives you headline numbers and shows you the general
shape of execution. You shouldn't stop there, though; dig deeper.
The `frisky observe` namespace allows for successively deeper
exploration of telemetry as you learn more about the problem. For example,
if overview tells you that the cluster is spending most of its time in
transfers, then follow up with `frisky observe transfers SOURCE`, and if it's
spending time in computation, then `frisky observe prefixes SOURCE`, etc.
At the end of this funnel of information is `spans`, which provides raw
information as JSON output. You don't want to read spans directly;
instead, store them in a file.
frisky observe spans --limit 1000000000 > spans.json
You can then explore this JSON with Python scripts.
Additionally, many `frisky observe` commands take a `spans.json` file as a
SOURCE, so you can download data from a live cluster, shut down the cluster,
and then continue to explore `spans.json` telemetry offline.
You can also explore logs on a live cluster with
frisky logs --help
Options:
--help Show this message and exit.
Start here:
overview Live state + performance summary on one screen.
Breadth - orientation:
cluster Show a one-shot cluster summary: task counts, workers,...
versions Show software versions on the cluster (python, frisky,...
workers Show worker status live, or worker activity from a spans file.
prefixes Show task prefix state plus recent transfer/disk costs.
progress Show combined worker and prefix overview.
detail Show detailed worker x prefix breakdown.
timeline Render spans as a text-based intensity timeline (per-row...
Ranked - find the problem:
erred List Erred tasks, root failures first (sorted by blocking...
blocked List tasks with the most outstanding dependencies (most...
queued List queued tasks (rootish tasks held back because workers...
transfers Transfers: live in-flight view, or retrospective accounting...
stragglers Rank workers by how much they differ from their peers...
Inspect one entity:
task One task: state, deps/dependents tree, event timeline.
worker Show live worker detail, or worker activity from a spans file.
deps List immediate dependencies of a task (full list, paginated).
dependents List immediate dependents of a task (full list, paginated).
Raw / last-mile:
events Query the scheduler event log (transitions, placements,...
tasks List tasks with optional filtering.
spans Download raw span data for offline use
export Export spans to Chrome trace format for chrome://tracing or...
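The drill-down that the help text recommends (overview first, then a narrower command) can be sketched as a small lookup. This is only an illustration: `next_command` is a hypothetical helper, not part of Frisky, and the component names and totals are copied from the example overview shown later on this page. Only the two follow-ups named in the help text are mapped; anything else falls back to the help screen.

```python
# Hypothetical helper, not part of Frisky: choose a follow-up command
# from the per-component totals that `frisky observe overview` reports.
FOLLOW_UPS = {
    "network": "transfers",  # time spent in transfers -> observe transfers
    "compute": "prefixes",   # time spent in computation -> observe prefixes
}


def next_command(component_seconds, source):
    """Return the suggested next `frisky observe` command as a string."""
    dominant = max(component_seconds, key=component_seconds.get)
    subcommand = FOLLOW_UPS.get(dominant)
    if subcommand is None:
        # No follow-up is documented for this component, so point back
        # at the command list instead of guessing.
        return "frisky observe --help"
    return f"frisky observe {subcommand} {source}"


# Totals taken from the example overview on this page (seconds).
totals = {"compute": 275.6, "network": 57.9, "scheduler": 3.0, "other": 94.7}
print(next_command(totals, "http://localhost:8787"))
# prints: frisky observe prefixes http://localhost:8787
```

In practice you would read the totals off the overview yourself; the point is that each step narrows the question before you reach for raw spans.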
Running vs Historical Clusters
The frisky observe CLI command works against a live running cluster. Give it your dashboard address:
frisky observe overview http://localhost:8787
Or we can save data from a cluster for offline review.
frisky observe spans http://localhost:8787 > cluster-data.json
frisky observe overview cluster-data.json
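Once spans are saved to a file, a short Python script can summarize them. The sketch below sums span time per span name. The span schema is not documented on this page, so the field names `name` and `duration`, and the assumption that the file holds a single JSON array, are guesses; print one record from your own file and adjust before trusting the numbers.

```python
import json
from collections import defaultdict

# ASSUMPTION: these field names are hypothetical. Print one record from your
# own file and adjust them to match what Frisky actually emits.
NAME_KEY = "name"
DURATION_KEY = "duration"


def load_spans(path):
    """Load a saved spans file, assuming it holds a single JSON array."""
    with open(path) as f:
        return json.load(f)


def seconds_by_name(spans):
    """Sum span durations per span name, largest total first."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[NAME_KEY]] += float(span[DURATION_KEY])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Tiny in-memory stand-in for real data, so the sketch runs without a cluster.
sample = [
    {"name": "worker.exec.call", "duration": 1.5},
    {"name": "tcp.send.queue", "duration": 0.25},
    {"name": "worker.exec.call", "duration": 0.5},
]
for name, seconds in seconds_by_name(sample):
    print(f"{name:24s} {seconds:8.2f} s")
```

With a real file, swap `sample` for `load_spans("cluster-data.json")`.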
Example
Run the demo workload
uvx --with numpy frisky demo
Then ask an AI agent to investigate it
We're running a parallel workload locally with frisky.
Run `uvx frisky` to learn about frisky and then tell me what's going on with that workload.
Here's an example of what frisky observe overview produces. It gives live
cluster state and points the agent toward areas worth drilling into.
Overview http://localhost:8787
state  workers 6 (0 idle)   waiting 34   processing 1347   memory 7684   erred 0   queued 0
       → stuck? observe blocked
perf   wall-clock 14.0s   workers 6   tasks 14683   spans 200000
       memory 0.5 GB / 17.2 GB (3% peak)   spilled 0.0 GB   unspilled 0.0 GB   network recv 1.0 GB
       → per-worker memory: observe workers   prefix costs: observe prefixes http://localhost:8787   in-flight transfers: observe transfers
Components (timeline bars spanning 0s to 14s, with totals)
compute     275.6 s
network      57.9 s
scheduler     3.0 s
other        94.7 s
→ zoom / full view: observe timeline http://localhost:8787 --view component
Costliest span types over time (timeline bars spanning 0s to 13s, with totals)
worker.exec.call       268.6 s
tcp.send.queue          17.2 s
tcp.send.write_queue     8.5 s
tcp.recv.queue           8.5 s
tcp.send.compress        7.7 s
tcp.send.serialize       6.2 s
worker.transfer.recv     5.5 s
worker.deserialize       4.4 s
Memory   (timeline bar)   2% peak
Network  (timeline bar)   313.3 MB/s
Costliest span types
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Name                 ┃  Total ┃ Per-wkr ┃ Max wkr ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ worker.exec.call     │ 268.6s │   44.8s │   45.7s │
│ tcp.send.queue       │  17.2s │    2.9s │    6.0s │
│ tcp.send.write_queue │   8.5s │    1.4s │    3.2s │
│ tcp.recv.queue       │   8.5s │    1.4s │    3.9s │
│ tcp.send.compress    │   7.7s │    1.3s │    2.5s │
│ tcp.send.serialize   │   6.2s │    1.0s │    2.1s │
│ worker.transfer.recv │   5.5s │    0.9s │    1.2s │
│ worker.deserialize   │   4.4s │    0.7s │    1.0s │
└──────────────────────┴────────┴─────────┴─────────┘
Per-wkr = mean over the workers that ran it; Max wkr = the busiest single worker. Max wkr ≫ Per-wkr means a few workers carry the work (imbalance).
→ one op over time: observe timeline http://localhost:8787 --view detailed --prefix worker.exec.call   raw: observe spans http://localhost:8787 --name worker.exec.call
Workers unlike their peers
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Worker          ┃ Excess ┃ Top deviations (vs median worker) ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 127.0.0.1:58058 │     3s │ tcp.recv.queue +3s (20σ)          │
│ 127.0.0.1:58053 │     0s │ tcp.send.write +0s (4σ)           │
└─────────────────┴────────┴───────────────────────────────────┘
Each worker's top-3 most unusual span types vs the median worker; Excess = total extra seconds across them.
→ rank all: observe stragglers http://localhost:8787   one worker: observe timeline http://localhost:8787 --worker 127.0.0.1:58058
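The rule of thumb from the "Costliest span types" footnote (Max wkr much larger than Per-wkr suggests imbalance) can be checked with a few lines of Python. This sketch uses the Per-wkr and Max wkr values from the example above. The `imbalanced` helper and its 2.0 cutoff are illustrative assumptions, not something Frisky defines.

```python
# (name, per-worker mean seconds, busiest-worker seconds) from the example table.
ROWS = [
    ("worker.exec.call", 44.8, 45.7),
    ("tcp.send.queue", 2.9, 6.0),
    ("tcp.send.write_queue", 1.4, 3.2),
    ("tcp.recv.queue", 1.4, 3.9),
    ("tcp.send.compress", 1.3, 2.5),
    ("tcp.send.serialize", 1.0, 2.1),
    ("worker.transfer.recv", 0.9, 1.2),
    ("worker.deserialize", 0.7, 1.0),
]


def imbalanced(rows, threshold=2.0):
    """Return (name, max/mean ratio) for rows at or above the threshold.

    ASSUMPTION: the 2.0 cutoff is arbitrary and chosen only for illustration.
    """
    flagged = [
        (name, max_wkr / per_wkr)
        for name, per_wkr, max_wkr in rows
        if per_wkr > 0 and max_wkr / per_wkr >= threshold
    ]
    return sorted(flagged, key=lambda row: row[1], reverse=True)


for name, ratio in imbalanced(ROWS):
    print(f"{name:22s} busiest worker is {ratio:.1f}x the mean")
```

On these numbers, `worker.exec.call` is evenly spread (about 1.0x) while `tcp.recv.queue` has the largest ratio, which agrees with the straggler table flagging `tcp.recv.queue` on worker 127.0.0.1:58058.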