
feat: async run_experiment via RunHandle + cancellation + status widget #10

Open
hinderling wants to merge 7 commits into pertzlab:main from hinderling:feat/async-run-handle

Conversation


@hinderling hinderling commented May 15, 2026

Summary

Move the MDA feed loop onto a worker thread, expose live status through a RunHandle (psygnal Signal), and add a napari dock widget that mirrors + steers the current run. Replaces the synchronous-blocking run_experiment / continue_experiment API.

Draft: breaks the public API. Notebook updates required (see below) before merging. The async demo notebook included here is a test artifact — it must be removed before merge (see Demo notebook section).

Why

The controller's feed loop ran on the main thread, so:

  • napari froze for the duration of every run (no Qt-event processing).
  • run_experiment blocked the calling cell — no interactive monitoring / cancellation without Ctrl-C (which sometimes left device state half-set).
  • Status was opaque: "what timepoint are we on, are we lagging?" was unanswerable.
  • No clean way to cancel or pause a long run.

Moving the loop onto its own thread fixes all of these: napari is responsive by construction, the cell returns immediately, and cancellation / pause / live status become natural.

What changed

New: faro/core/run_status.py

  • RunStatus — immutable snapshot dataclass: state, current_event_index, current_fov, n_events_total, n_events_consumed, n_frames_received, started_at / finished_at, lag_ms, background_errors, fatal_error, …
  • RunHandle — owns the worker thread + cooperative cancel/pause events, carries the run's (sorted) event list. Methods: status(), wait(), cancel(), pause(), resume(), is_running(), is_paused(). Signal: statusChanged (psygnal) emitting the latest RunStatus.
  • RunState: pending → running ⇄ pausing/paused → done/error (cancelling on cancel).
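The RunStatus/RunHandle split above can be sketched roughly as follows. This is a minimal stdlib-only illustration of the cooperative cancel/pause mechanics, with field and method names taken from this PR's description; the real implementation (psygnal statusChanged signal, the full field set, the worker-thread wiring) will differ.

```python
# Hypothetical sketch of RunStatus + RunHandle (names from the PR text;
# not the actual faro implementation).
import threading
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class RunStatus:
    state: str = "pending"          # pending/running/pausing/paused/done/error
    n_events_total: int = 0
    n_events_consumed: int = 0
    n_frames_received: int = 0
    fatal_error: Optional[str] = None

class RunHandle:
    def __init__(self, n_events_total: int):
        self._status = RunStatus(n_events_total=n_events_total)
        self._lock = threading.Lock()
        self.cancel_event = threading.Event()
        self.pause_event = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def status(self) -> RunStatus:
        with self._lock:
            return self._status

    def _update(self, **changes) -> RunStatus:
        # Immutable snapshot semantics: each update replaces the dataclass.
        with self._lock:
            self._status = replace(self._status, **changes)
            return self._status     # real code would emit statusChanged here

    def cancel(self) -> None:
        self.pause_event.clear()    # a cancel while paused must release the loop
        self.cancel_event.set()

    def pause(self) -> None:
        self.pause_event.set()

    def resume(self) -> None:
        self.pause_event.clear()

    def is_running(self) -> bool:
        return self._thread is not None and self._thread.is_alive()

    def wait(self, timeout: Optional[float] = None) -> None:
        if self._thread is not None:
            self._thread.join(timeout)
```

Note how cancel() clears the pause event before setting the cancel event, matching the "cancelling on cancel" transition: a run cancelled while paused still unblocks the feed loop.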

faro/core/controller.py

  • Controller.runStarted = Signal(object) fires on each new run/continue carrying the fresh RunHandle.
  • run_experiment / continue_experiment spawn a worker thread and return the handle immediately; validation still runs synchronously on the caller. Events are sorted once and stashed on the handle so the widget renders them in execution order.
  • _run_worker centralises pre-flight setup and wraps the feed loop so failures land in handle.fatal_error instead of crashing the user.
  • _run_mda_with_events polls cancel_event and pause_event each iteration — pause halts feeding after the in-flight backpressure window drains; resume continues.
  • fix: the engine queue is recreated per run. A cancelled run aborts the engine mid-drain, leaving a stale STOP_EVENT behind; reusing the queue made the next run's engine consume that sentinel and stall after a few events ("stuck at 3/80").
  • fix: _bump_status_for_frame skips IMG_STIM snaps — a stim emission is the SLM-illuminated snap paired with its imaging frame; counting it double-updated lag/elapsed and drifted the frame count off the RTMEvent count.
  • napari preview: the controller no longer carries its own preview-layer machinery, and live mode no longer has to be manually disconnected before a run. napari-micromanager's own _NapariMDAHandler keeps routing frames into the preview layer throughout the run; the controller just stops continuous sequence acquisition once at MDA start to avoid a snap-buffer race. Notebooks can drop the old "break the CoreViewerLink before running" dance.
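The per-iteration cancel/pause polling described above can be illustrated with a stripped-down feed loop. This is a sketch under assumed names (feed_one, set_state are stand-ins); the real _run_mda_with_events also handles the backpressure window, engine cancellation, and status emission per dequeued RTMEvent.

```python
# Illustrative feed loop: poll cancel_event and pause_event each iteration.
# Names and structure are assumptions; the actual controller code differs.
import threading
import time

def feed_loop(events, feed_one, cancel_event: threading.Event,
              pause_event: threading.Event, set_state):
    for event in events:
        if cancel_event.is_set():
            set_state("cancelling")
            return "cancelled"
        if pause_event.is_set():
            # Pause halts *feeding*; the engine drains what is already queued.
            set_state("paused")
            while pause_event.is_set():
                if cancel_event.is_set():   # cancel while paused still exits
                    return "cancelled"
                time.sleep(0.05)
            set_state("running")
        feed_one(event)
    return "done"
```

A fresh queue per run (the "stuck at 3/80" fix) matters precisely because this loop can exit via the cancelled branch with events and the STOP_EVENT sentinel still sitting in the old queue.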

New: faro/widgets/experiment_status.py

ExperimentStatusWidget — a napari dock panel that mirrors and controls the current run:

  • State chip, legend (imaging / stim / ref).
  • Event strip — one cell per RTMEvent, color-coded by type, past=opaque / future=dimmed progress fill, current cell bordered. Scales to thousands of events.
  • FOV map — one dot per unique stage position, equal-aspect, visit-order path, active dot recolored to the current event type.
  • Stats — event N/M, elapsed, scheduled, lag (red > 5 s), remaining, errors.
  • Pause / Resume + Stop buttons.
  • Theme-adaptive (napari light/dark), auto-rebinds on every new run via runStarted.
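The "auto-rebinds on every new run via runStarted" behaviour boils down to dropping the previous handle's subscription before taking the new one. The sketch below shows that pattern with a tiny stand-in signal class instead of psygnal/Qt; the widget and handle attribute names are assumptions taken from this PR.

```python
# Minimal stand-in signal, used only to illustrate the rebind pattern
# (the real widget uses psygnal signals and Qt widgets).
class MiniSignal:
    def __init__(self):
        self._slots = []
    def connect(self, fn):
        self._slots.append(fn)
    def disconnect(self, fn):
        self._slots.remove(fn)
    def emit(self, *args):
        for fn in list(self._slots):
            fn(*args)

class StatusWidgetSketch:
    def __init__(self, controller):
        self._handle = None
        controller.runStarted.connect(self._on_run_started)

    def _on_run_started(self, handle):
        if self._handle is not None:
            # Clean up the previous run's subscription before rebinding,
            # so a finished handle can't keep driving the widget.
            self._handle.statusChanged.disconnect(self._refresh)
        self._handle = handle
        handle.statusChanged.connect(self._refresh)

    def _refresh(self, status):
        self.last_status = status   # real widget updates strip / map / stats
```

With this shape, continue_experiment "just works": each new handle emitted through runStarted replaces the old subscription, and stale emissions from a cancelled run no longer reach the UI.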

Async/Qt fixes folded in

  • PYMM_SIGNALS_BACKEND=psygnal forced in faro/microscope/base.py — with a QApplication loaded, pymmcore-plus otherwise picks the Qt signal backend and queues frameReady to the main thread; if the main thread is blocked (handle.wait()), frames never reach the controller. Forcing psygnal keeps the data path direct/synchronous on the engine thread.
  • Widget connects statusChanged with thread="main" + drives psygnal.qt.start_emitting_from_queue() so worker-thread emits reach QWidgets safely.
  • uv.lock: bumped pymmcore-widgets past an upstream fix (_presets_widget crashing on an empty device label during MDA events).
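The backend override is a one-liner, but ordering matters: the environment variable has to be set before pymmcore-plus is imported, or the Qt backend may already be selected. The widget-side calls are shown as comments because they need psygnal and a running QApplication; treat the exact call sites as assumptions from this PR.

```python
# Force the psygnal signal backend so frameReady stays direct/synchronous
# on the engine thread even when a QApplication exists. Must run before
# pymmcore-plus is imported (faro does this in faro/microscope/base.py).
import os
os.environ["PYMM_SIGNALS_BACKEND"] = "psygnal"

# Widget side (requires psygnal + Qt; shown for orientation only):
#   handle.statusChanged.connect(widget._refresh, thread="main")
#   from psygnal.qt import start_emitting_from_queue
#   start_emitting_from_queue()  # main-thread QTimer drains psygnal's queue
```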

BREAKING: notebook updates required

Before

ctrl.run_experiment(events, stim_mode="current")   # blocked here
ctrl.finish_experiment()

After — choose one:

(a) Blocking equivalent (smallest diff):

ctrl.run_experiment(events, stim_mode="current").wait()
ctrl.finish_experiment()

(b) Non-blocking, with status / cancel / pause:

handle = ctrl.run_experiment(events, stim_mode="current")
# other cells can run; handle.status() / handle.cancel() / handle.pause()
handle.wait()                  # block at the end if desired
ctrl.finish_experiment()

Optional napari widget:

from faro.widgets import ExperimentStatusWidget
viewer.window.add_dock_widget(ExperimentStatusWidget(ctrl), name="Experiment")

Demo notebook (test artifact — remove before merge)

experiments/02_demo_sim_optogenetic/demo_sim_optogenetic_napari_async.ipynb is included only to exercise this PR against the virtual-microscope optogenetic backend (async run, pause/resume, cancel/restart, the status widget, multi-FOV). It doubles as a worked example of what the migrated notebooks could look like. It should be deleted before this PR merges — the real deliverable is the API + widget, not this notebook.

What to check / test before merging

  • Every notebook in experiments/* that calls run_experiment / continue_experiment — migrate to .wait() or the non-blocking flow. Confirm none rely on the old blocking return.
  • Notebooks that manually tear down the napari live link / CoreViewerLink before a run — that workaround is no longer needed; verify removing it and that the preview layer keeps updating during the run.
  • tests/hardware/* — update for the new RunHandle return type; run on the Moench rig.
  • Multi-channel imaging: the widget's frame counter / strip cursor assume ~1 imaging frame per RTMEvent. For multi-channel plans n_frames_received outpaces the RTMEvent count — verify the strip/stats still read sensibly or gate the assumption.
  • continue_experiment + the widget: confirm the strip/map rebuild correctly for the appended events and the FOV map merges positions.
  • Headless / no-Qt runs (CI, non-microscope dev machine) — import faro stays Qt-free; .wait() path works without a QApplication.
  • Cancel-then-restart and pause/resume on real hardware (verified on the simulator; engine-abort semantics differ per device).
  • Bump the virtual-microscope lockfile pin (uv lock --upgrade-package virtual-microscope) to pick up the fixes now on its default branch (JIT pre-warm; SimCameraDevice digital ROI / MDA-teardown fix). Without this the demo notebook's first ~4 s of frames stall and the napari Snap preview freezes after a run. Commit the uv.lock change separately (it is not async/widget code).

Related (separate repo)

Two virtual-microscope fixes were needed for the demo notebook and have already landed on its default branch (virtual-env):

  • JIT pre-warm — pre-warms the numba physics-step JIT before the RealtimeEngine starts; otherwise the first ~4 s of snaps stall behind a compile holding the sim lock, so frames arrive in a burst instead of paced.
  • SimCameraDevice digital ROI — implements real ROI cropping. It also fixes an MDA-teardown bug: the camera previously raised NotImplementedError from set_roi, which aborted MDARunner._finish_run before it emitted sequenceFinished; napari-micromanager then never cleared _mda_running, so the Snap preview silently stopped updating after a run.

These are not part of this PR — faro just needs the lockfile bump above to pick them up.

Verification

Exercised end-to-end against the virtual-microscope optogenetic backend (napari + napari-micromanager + the widget):

  • Live status flows worker → widget on the main thread (psygnal queued delivery); strip / FOV map / stats update in real time.
  • Cancel mid-run, then restart from the notebook — reaches steady state, no stall.
  • Pause halts feeding after the backpressure window drains; resume runs to completion.
  • Frame count tracks RTMEvents 1:1 for single-channel plans; stim snaps no longer double-count.
  • 87 unit tests pass.

Compatibility notes

  • Headless / no Qt: works — psygnal delivers slots synchronously without Qt. Widget package is opt-in (import faro.widgets); import faro / import faro.core stay Qt-free.
  • MDA engines other than pymmcore-plus: no regression — the controller still talks to hardware exclusively through AbstractMicroscope.

Screenshot

Screenshot 2026-05-16 at 11 16 31 AM

hinderling and others added 7 commits May 16, 2026 11:35

Move the MDA feed loop onto a worker thread, expose live status through a
RunHandle + psygnal Signal, and add a minimal napari widget that mirrors
the current run.

Breaking change:
  ctrl.run_experiment(events, ...) and ctrl.continue_experiment(...) now
  return a RunHandle immediately instead of blocking until the run is
  done. Existing notebooks that did `ctrl.run_experiment(events, ...)`
  must be updated to either `handle = ctrl.run_experiment(events, ...);
  handle.wait()` for the old blocking semantics, or to use the new
  non-blocking flow (poll handle.status(), subscribe to
  handle.statusChanged, call handle.cancel() to stop early).

What's in this commit:

- faro/core/run_status.py (new):
  * RunStatus -- immutable snapshot dataclass with state, event/FOV
    indices, frame count, lag_ms, error info.
  * RunHandle -- owns the worker thread + cooperative cancel event,
    exposes status()/wait()/cancel()/is_running() + a psygnal
    statusChanged signal that emits the latest RunStatus on each update.
    Subscribers on the main thread see queued-connection delivery via
    psygnal's Qt integration.

- faro/core/controller.py:
  * Controller exposes a class-level runStarted = Signal(object). Fires
    on every new run/continue so widgets can re-bind.
  * run_experiment / continue_experiment spawn a worker thread, return
    the handle, emit runStarted. Validation still happens synchronously
    so a bad event list raises on the calling thread.
  * _run_worker centralises pre-flight setup (writer init -- including
    the potentially-slow zarr rmtree on overwrite -- and Analyzer
    construction) and wraps the feed loop in try/except so worker-side
    failures land in handle.fatal_error rather than crashing the user.
  * _run_mda_with_events accepts the handle, checks handle.cancel_event
    at each loop iteration and in the backpressure throttle, asks the
    engine to cancel the in-flight event when set, and emits status
    updates on each RTMEvent dequeue.
  * _on_frame_ready (and ControllerSimulated._on_frame_ready) call a
    shared _bump_status_for_frame helper that increments
    n_frames_received and computes lag_ms vs event.min_start_time.
  * Now off the main thread, all the prior Qt-pumping helpers
    (_pump_qt_and_sleep, _qt_join, _wait_for_frame_pumping_qt) and the
    superqt ensure_main_thread import are obsolete and removed. The
    preview-layer machinery (viewer=, _on_preview_frame, _apply_preview,
    PREVIEW_LAYER_NAME) is also removed -- napari-micromanager's own
    _NapariMDAHandler already routes generator events into the preview
    layer.
  * finish_experiment now waits for the current handle before shutting
    down the Analyzer.
  * _pending_sentinels guarded by a Lock since extend_experiment now
    runs on the calling thread while the feed loop runs on the worker.

- faro/widgets/experiment_status.py (new):
  * ExperimentStatusWidget -- read-out of state, FOV, event index,
    frame count, lag, elapsed time, error count. Has a Stop button
    that calls handle.cancel(). Subscribes to controller.runStarted
    so it automatically re-binds when a new run begins; cleans up the
    previous handle's signal subscription on each rebind.

Verified end-to-end via a Qt smoke test:
  - Live updates flow from the worker thread to the widget on the main
    thread (psygnal+Qt queued delivery).
  - Stop button triggers handle.cancel(); the worker's cancel-check
    fires within one iteration and the run exits at the next event
    boundary.
  - Starting a new run re-binds the widget to the new handle and resets
    the progress bar / counters.

The OmeZarrWriter init in _run_worker still pulled image height/width
via self._mic.mmc.getImageHeight/Width -- a pymmcore-plus-specific
call that breaks any non-pymmcore microscope.

Use the AbstractMicroscope-level convention: subclasses populate
self.image_height / self.image_width on the microscope instance (Moench
already does this in init_scope). Fall back to mmc if the attributes
aren't present but mmc is, so existing pymmcore-only microscopes keep
working without code changes. Raise a clear error when neither path is
available.

Three independent bugs surfaced when running the new async
run_experiment + ExperimentStatusWidget against a napari viewer
(reproduced with the optogenetic virtual_microscope backend):

1. pymmcore-plus's signals_backend() auto-selects the *qt* backend
   whenever a QApplication is loaded. core.mda.events.frameReady then
   becomes a QtCore.SignalInstance and cross-thread emits land in
   Qt.QueuedConnection, where they're delivered only when the main
   thread pumps events. With Controller.run_experiment now spawning a
   worker and RunHandle.wait() joining on it, the main thread is
   typically idle-blocked exactly when the engine is firing frames --
   so the controller's _on_frame_ready never ran, the engine completed
   "successfully" with zero frames received, and the pipeline never
   saw any data. Force PYMM_SIGNALS_BACKEND=psygnal in
   faro/microscope/base.py so the data path stays direct/synchronous
   on the engine thread regardless of whether Qt is loaded. The
   widget-side path (RunHandle.statusChanged) still uses psygnal's
   own queued delivery -- see fix #2.

2. ExperimentStatusWidget connected handle.statusChanged with the
   default (direct) connection. Status updates emitted from the worker
   thread therefore ran the widget's _refresh slot synchronously
   off-main, calling QLabel.setText / QProgressBar.setValue from a
   non-GUI thread. Under napari that lands in vispy's OpenGL
   compositor and aborts with "Cannot make QOpenGLContext current in
   a different thread" -> SIGABRT (kernel hard-crash in VSCode
   Jupyter). Switch to connect(..., thread="main") so psygnal queues
   the call into its main-thread queue.

3. psygnal's queued callbacks live in QueuedCallback._GLOBAL_QUEUE,
   which nothing drains by default -- the widget would be invoked on
   the main thread, but only when something explicitly calls
   psygnal.emit_queued(). RunHandle's docstring claims auto-Qt
   delivery; that's not how psygnal actually works. Call
   psygnal.qt.start_emitting_from_queue() in the widget's __init__,
   which installs a main-thread QTimer that fires emit_queued() on
   every Qt event-loop tick. Idempotent and global, so multiple
   widgets / multiple runs are safe.

Lockfile: bump pymmcore-widgets (8c8f76e -> 48ff414) so the unrelated
upstream crash in pymmcore_widgets._presets_widget._on_property_changed
when handed an empty device label (virtual_microscope's shutter)
is included. Without that bump, the MDA engine itself aborts on the
first setShutterOpen() once frames actually start flowing.

Verified end-to-end against virtual_microscope's optogenetic backend:
- headless async run: 5/5 frames (regression check, unchanged)
- napari.Viewer() + handle.wait():     5/5 frames (was 0/5)
- napari + napari-micromanager + widget: 5/5 frames, no crash, exit 0
- widget visibly updates progress / frames / state mid-experiment
  (sampled QLabel.text() while pumping Qt events)
- 87 unit tests still pass

Sibling of demo_sim_optogenetic.ipynb that exercises the new async
run_experiment + RunHandle + ExperimentStatusWidget end-to-end against
virtual_microscope's optogenetic backend, with a live napari viewer
dock-attached.

Walks through: handle = ctrl.run_experiment(...) is non-blocking, the
kernel is free; poll handle.status() while it runs; subscribe to
handle.statusChanged from the kernel side; cancel via the widget Stop
button or handle.cancel(); handle.wait() blocks if you want the
old synchronous semantics; continue_experiment() re-binds the widget
automatically via runStarted.

Phases are concatenated with combine(..., axis="t") per the new
RTMSequence API.

Backend changes that make an async run inspectable and steerable --
the data the new ExperimentStatusWidget renders, plus two bug fixes
surfaced while building it.

run_status.py
  - RunHandle.events: optional snapshot of the (sorted) RTMEvents the
    handle is driving, so widgets can render per-event visualisations
    (event strip, FOV map) that need the full plan up front.
  - Pause/resume: RunState gains "pausing"/"paused"; RunHandle gains
    pause()/resume()/is_paused() and a pause_event the feed loop polls.
    cancel() now also clears the pause event so a cancel while paused
    still releases the feed loop.

controller.py
  - run_experiment / continue_experiment sort events once (by
    min_start_time, then position) and stash the sorted list on the
    handle, so the order the worker processes matches what the widget
    displays.
  - Feed loop honors pause_event: before pulling the next RTMEvent it
    checks the flag, flips state to "paused", and idles until resume()
    -- the MDA engine drains whatever is already queued, then waits.
  - fix: the engine queue (self._queue) is recreated per run. The
    finally-block feeds a STOP_EVENT sentinel to stop the engine; on a
    *cancelled* run cancel_mda() aborts the engine, which may stop
    without draining the queue, leaving stale events + the sentinel
    behind. Reusing that queue made the next run's engine consume the
    stale sentinel and exit after a few events ("stuck at 3/80"). A
    fresh queue per run fixes it.
  - fix: _bump_status_for_frame skips IMG_STIM frames. A stim emission
    is the SLM-illuminated snap paired with its imaging frame; counting
    it double-updated the status (lag/elapsed refreshing twice per stim
    event) and made n_frames_received drift away from the RTMEvent
    count. Imaging + ref frames are the meaningful data frames.

Verified end-to-end against the optogenetic virtual-microscope backend:
cancel mid-run then restart reaches steady state (no stall); pause
halts feeding after the backpressure window drains and resume continues
to completion; frame count tracks RTMEvents 1:1 for single-channel plans.

Rework the minimal status widget into a full run dashboard, driven by
the RunHandle data exposed in the previous commit.

Components (top to bottom):
  - State chip -- RUNNING / PAUSED / DONE / ... as plain text in a
    translucent-neutral rounded chip (no per-state fill: a colored
    banner competed with the imaging/stim/ref legend colors).
  - Legend chips -- imaging / stim / ref; the chip matching the current
    event type is fully opaque, the others dimmed.
  - EventStrip -- one cell per RTMEvent, color-coded by type. Past +
    current cells opaque (progress fill), future cells dimmed. Same-type
    runs are coalesced into single fills so thousands of events render
    with correct alpha instead of over-stacking at sub-pixel widths.
    Empty state draws a "(no events loaded)" placeholder.
  - FovMap -- one dot per unique FOV position, equal-aspect (a straight
    line of FOVs stays a line), grey visit-order path, active dot
    recolored to the current event type. Pinned square via resizeEvent.
    Paints its own rounded panel background; "FOV X/Y" counter in the
    corner.
  - Stats form -- event N/M, elapsed, scheduled, lag, remaining, errors.
    Times formatted hh:mm:ss with the leading unit suffixed and dropped
    when zero; lag turns red past 5 s. Wrapped in a shaded panel echoing
    napari's layer-controls boxes.
  - Pause/Resume + Stop buttons.

Threading / theming details:
  - statusChanged is connected with thread="main" and the widget calls
    psygnal.qt.start_emitting_from_queue() so worker-thread emits are
    delivered on the GUI thread (drives QWidgets safely under napari).
  - A 250 ms QTimer ticks the elapsed/remaining clocks between status
    emissions so time fields don't freeze between frames.
  - The strip cursor tracks n_frames_received (actual snaps), not
    n_events_consumed (the feed loop runs 3-4 ahead via backpressure,
    which made the strip jump several cells at run start).
  - Colors/fonts derive from the Qt palette so the widget adapts to
    napari's light/dark theme; corner radii match napari widgets.

Add a second stage position (20, 20, 0) to the baseline / stim /
recovery sequences so the demo exercises a 2-FOV acquisition -- the
ExperimentStatusWidget's FOV map then shows both positions and the
visit-order path between them. Drop the frame interval 1.5s -> 1s.

@hinderling hinderling force-pushed the feat/async-run-handle branch from d473b9b to 3c0e798 Compare May 16, 2026 09:53
@hinderling hinderling marked this pull request as ready for review May 16, 2026 11:29
@hinderling

@alandolt can you have a look and see whether you spot any general issues with this architecture change? Still a few open TODOs before merging, but the main idea is there, I think. It would be great to have your input before I start migrating the other notebooks etc. I think this will also be useful longer-term, e.g. running experiments on different microscopes simultaneously with BO, in combination with pymmcore-proxy.

