Observability For Agents

Agents rejoice! Here is a distributed computing platform that you can talk to.

I get it: your humans ask you to run large computations and then expect you to make them faster, even though it's really hard to figure out what's slow. You can collect logs and compare timestamps, but there's so much going on that it's hard to triage performance issues with any real fidelity.

This is where the frisky observe CLI can help. Frisky has mountains of telemetry, and you can access a raw feed of that data. Better still, the frisky observe CLI delivers curated views of that data and helps you dig into the system iteratively, either while it's running or offline.

This is amazing for performance work. In fact, it's how we develop Frisky. Frisky's human (Matt) mostly just feeds Frisky and an LLM some new workload. We run the workload, see things break, and then use frisky observe to figure out why and fix it. The humans don't really do much these days except take credit for our hard work 😏.

The frisky CLI is self-describing with good context, so you don't need an MCP server or anything. Just run frisky --help and you should be good to go.

frisky observe --help

Usage: frisky observe [OPTIONS] COMMAND [ARGS]...

  Inspect a running Frisky cluster

  Frisky captures extensive telemetry during execution, which can be helpful
  in diagnosing performance issues.  You can access that telemetry through
  commands in this namespace.

  However, you don't want to dive too deeply too quickly.  We recommend
  starting at a high level overview of the workload.

      frisky observe overview SOURCE

  Where `SOURCE` is either a dashboard URL or a spans.json file (explained
  later).

  This overview command gives you headline numbers and shows you the general
  shape of execution.  You shouldn't stop there though.  You should dive more
  deeply. The `frisky observe` namespace allows for successively deeper
  exploration of telemetry as you learn more about the problem.  For example,
  if overview tells you that the cluster is spending most of its time in
  transfers then follow-up with `frisky observe transfers SOURCE`, and if it's
  spending time in computation, then `frisky observe prefixes SOURCE`, etc.

  At the end of this funnel of information is `spans`, which provide raw
  information as JSON output.  You don't want to look at spans directly, but
  instead store spans to a file.

      frisky observe spans --limit 1000000000 > spans.json

  You can then explore this JSON with Python scripts.

  Additionally, many `frisky observe` commands take a `spans.json` file as a
  SOURCE, so you can download data from a live cluster, shut down the cluster,
  and then continue to explore `spans.json` telemetry offline.

  You can also explore logs on a live cluster with

      frisky logs --help

Options:
  --help  Show this message and exit.

Start here:
  overview  Live state + performance summary on one screen.

Breadth — orientation:
  cluster   Show a one-shot cluster summary: task counts, workers,...
  versions  Show software versions on the cluster (python, frisky,...
  workers   Show worker status live, or worker activity from a spans file.
  prefixes  Show task prefix state plus recent transfer/disk costs.
  progress  Show combined worker and prefix overview.
  detail    Show detailed worker x prefix breakdown.
  timeline  Render spans as a text-based intensity timeline (per-row...

Ranked — find the problem:
  erred       List Erred tasks, root failures first (sorted by blocking...
  blocked     List tasks with the most outstanding dependencies (most...
  queued      List queued tasks (rootish tasks held back because workers...
  transfers   Transfers: live in-flight view, or retrospective accounting...
  stragglers  Rank workers by how much they differ from their peers...

Inspect one entity:
  task        One task: state, deps/dependents tree, event timeline.
  worker      Show live worker detail, or worker activity from a spans file.
  deps        List immediate dependencies of a task (full list, paginated).
  dependents  List immediate dependents of a task (full list, paginated).

Raw / last-mile:
  events  Query the scheduler event log (transitions, placements,...
  tasks   List tasks with optional filtering.
  spans   Download raw span data for offline use
  export  Export spans to Chrome trace format for chrome://tracing or...
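
The help text above suggests saving spans to a file and exploring them with Python scripts. Here is a minimal sketch of what such a script might look like. The span schema is not documented on this page, so the top-level JSON list and the field names `name`, `start`, and `stop` (in seconds) are assumptions; inspect one record from your own file and adjust the keys before relying on it.

```python
import json
from collections import defaultdict


def load_spans(path):
    """Load spans saved with `frisky observe spans ... > spans.json`.

    Assumption: the file holds a single JSON list of span objects.
    """
    with open(path) as f:
        return json.load(f)


def total_seconds_by_name(spans, name_key="name", start_key="start", stop_key="stop"):
    """Sum span durations per span name, costliest first.

    Assumption: each span is a dict with a name and start/stop timestamps in
    seconds. The default key names are hypothetical, not a documented schema.
    """
    totals = defaultdict(float)
    for span in spans:
        totals[span[name_key]] += span[stop_key] - span[start_key]
    # Sort by total duration so the most expensive span types come first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Calling `total_seconds_by_name(load_spans("spans.json"))[:10]` would then give a rough equivalent of a "costliest span types" ranking, provided the assumed keys match your file.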

Running vs Historical Clusters

The frisky observe command works against a live, running cluster. Give it your dashboard address:

frisky observe overview http://localhost:8787

Or save spans from a cluster and review them offline:

frisky observe spans http://localhost:8787 > cluster-data.json

frisky observe overview cluster-data.json
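
If you want to script the snapshot-then-review flow above, a small wrapper along these lines might do. It is only a sketch: the helper names are hypothetical, it uses just the two commands shown above, and it assumes the `frisky` executable is on your PATH.

```python
import subprocess


def spans_command(source):
    """Build the argv for `frisky observe spans SOURCE`."""
    return ["frisky", "observe", "spans", source]


def overview_command(source):
    """Build the argv for `frisky observe overview SOURCE`."""
    return ["frisky", "observe", "overview", source]


def snapshot_then_overview(dashboard_url, out_path):
    """Save spans from a live cluster, then summarize the saved file.

    Hypothetical helper; requires the `frisky` executable on PATH.
    """
    with open(out_path, "w") as f:
        # Same effect as: frisky observe spans URL > out_path
        subprocess.run(spans_command(dashboard_url), stdout=f, check=True)
    # From here on the saved file is the SOURCE, so the cluster can be shut down.
    result = subprocess.run(
        overview_command(out_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, `snapshot_then_overview("http://localhost:8787", "cluster-data.json")` would write the spans file and return the overview text for the saved data.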

Example

Run the demo workload:

uvx --with numpy frisky demo

Then ask an AI agent to investigate it:

We're running a parallel workload locally with frisky.
Run uvx frisky to learn about frisky and then tell me what's going on with that workload.

Here's an example of what frisky observe overview produces. It gives live cluster state and points the agent toward areas worth drilling into.

Overview  http://localhost:8787
  state   workers 6 (0 idle)   waiting 34   processing 1347   memory 7684   erred 0   queued 0
          → stuck? observe blocked

  perf    wall-clock 14.0s   workers 6   tasks 14683   spans 200000
  memory 0.5 GB / 17.2 GB (3% peak)   spilled 0.0 GB   unspilled 0.0 GB   network recv 1.0 GB
          → per-worker memory observe workers   prefix costs observe prefixes http://localhost:8787   in-flight transfers observe transfers

Components
           │0s       2         4         6         8         10        12      14s│     Total
           ├──────────────────────────────────────────────────────────────────────┤
   compute │▇▇██████████████████████████████████▃█████████████████████████████    │   275.6 s
   network │▁▂▂▃▂▄▇▂▂▂▂▂▂▃▂▅▄▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂██▇██▇▇▇▄▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▂▁▁▁    │    57.9 s
 scheduler │▁▁▁▁▂▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁  ▁▁│     3.0 s
     other │▂▂▂▃▄▅▄▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▄▃▃▂▃▃▃▂▂▂▂▂▃▃▃▂▃▃▃▂▂▂▃▃▃▂▃▂▂▂▂│    94.7 s
           └──────────────────────────────────────────────────────────────────────┘
→ zoom / full view: observe timeline http://localhost:8787 --view component

Costliest span types — over time
                     │0s        2         4          6         8          10        12   13s│     Total
                     ├──────────────────────────────────────────────────────────────────────┤
worker.exec.call     │▇▇████████████████████████████████████▇█▇▇▇███████████████████████████│   268.6 s
tcp.send.queue       │▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▇▃▃▆▆▅▅▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│    17.2 s
tcp.send.write_queue │▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆▃▃▅▆▄▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│     8.5 s
tcp.recv.queue       │▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▆▁▃▆▅▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│     8.5 s
tcp.send.compress    │▁▁▁▁▁▂▃▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▆▆▄▄▅▄▃▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│     7.7 s
tcp.send.serialize   │▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▃▂▃▄▃▄▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│     6.2 s
worker.transfer.recv │▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▂▂▄▄▄▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│     5.5 s
worker.deserialize   │▁▁▁▁▂▂▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▄▅▃▂▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁│     4.4 s
                     └──────────────────────────────────────────────────────────────────────┘
Memory               │▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│      2% peak
Network              │▁▁▁▁▁▁▁▁▄▄▄▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▅▅▇▆▆▆▆▆█▇▇▇▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁│   313.3 MB/s

                Costliest span types
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Name                 ┃  Total ┃ Per-wkr ┃ Max wkr ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ worker.exec.call     │ 268.6s │   44.8s │   45.7s │
│ tcp.send.queue       │  17.2s │    2.9s │    6.0s │
│ tcp.send.write_queue │   8.5s │    1.4s │    3.2s │
│ tcp.recv.queue       │   8.5s │    1.4s │    3.9s │
│ tcp.send.compress    │   7.7s │    1.3s │    2.5s │
│ tcp.send.serialize   │   6.2s │    1.0s │    2.1s │
│ worker.transfer.recv │   5.5s │    0.9s │    1.2s │
│ worker.deserialize   │   4.4s │    0.7s │    1.0s │
└──────────────────────┴────────┴─────────┴─────────┘
Per-wkr = mean over the workers that ran it; Max wkr = the busiest single worker. Max wkr ≫ Per-wkr means a few workers carry the work (imbalance).
→ one op over time: observe timeline http://localhost:8787 --view detailed --prefix worker.exec.call   raw: observe spans http://localhost:8787 --name worker.exec.call

                   Workers unlike their peers
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Worker          ┃ Excess ┃ Top deviations (vs median worker) ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 127.0.0.1:58058 │     3s │ tcp.recv.queue +3s (20σ)          │
│ 127.0.0.1:58053 │     0s │ tcp.send.write +0s (4σ)           │
└─────────────────┴────────┴───────────────────────────────────┘
Each worker's top-3 most unusual span types vs the median worker; Excess = total extra seconds across them.
→ rank all: observe stragglers http://localhost:8787   one worker: observe timeline http://localhost:8787 --worker 127.0.0.1:58058
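
The captions in the output above define Per-wkr as the mean over the workers that ran a span type, Max wkr as the busiest single worker, and Excess as extra seconds relative to the median worker. The sketch below reproduces that arithmetic on hypothetical `(worker, span name, seconds)` rows. It is an illustration of the definitions, not Frisky's implementation: how Frisky treats workers that never ran a span type, and how it computes the σ figures, is not shown on this page.

```python
from collections import defaultdict
from statistics import mean, median


def per_worker_seconds(rows):
    """Sum seconds per span name and worker from (worker, name, seconds) rows."""
    table = defaultdict(lambda: defaultdict(float))
    for worker, name, seconds in rows:
        table[name][worker] += seconds
    return table


def summarize(rows):
    """Return {name: (total, per_wkr, max_wkr)}.

    per_wkr is the mean over only the workers that ran that span type.
    """
    out = {}
    for name, by_worker in per_worker_seconds(rows).items():
        values = list(by_worker.values())
        out[name] = (sum(values), mean(values), max(values))
    return out


def excess_vs_median(rows, top=3):
    """Sum each worker's `top` largest positive deviations from the median worker."""
    table = per_worker_seconds(rows)
    workers = {w for by_worker in table.values() for w in by_worker}
    deviations = defaultdict(list)
    for name, by_worker in table.items():
        # Assumption: a worker that never ran this span type counts as 0 seconds.
        med = median(by_worker.get(w, 0.0) for w in workers)
        for w in workers:
            deviations[w].append((by_worker.get(w, 0.0) - med, name))
    result = {}
    for w, devs in deviations.items():
        worst = sorted(devs, reverse=True)[:top]
        result[w] = sum(d for d, _ in worst if d > 0)
    return result
```

Read this way, the `tcp.send.queue` row above (Per-wkr 2.9s, Max wkr 6.0s) points to one worker carrying roughly twice the average load, while `worker.exec.call` (44.8s vs 45.7s) looks evenly balanced.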