Skip to content

fix(manager): detect existing server on startup, avoid port conflict#123

Open
TimeToBuildBob wants to merge 5 commits intoActivityWatch:masterfrom
TimeToBuildBob:fix/detect-existing-server
Open

fix(manager): detect existing server on startup, avoid port conflict#123
TimeToBuildBob wants to merge 5 commits intoActivityWatch:masterfrom
TimeToBuildBob:fix/detect-existing-server

Conversation

@TimeToBuildBob
Copy link
Contributor

Problem

When aw-server is already running (managed by systemd, Docker, or started manually for debugging), aw-qt fails to start its own instance with an "Address already in use" error. The tray icon then shows the module as failed, requiring the user to click "Restart" — which is confusing and non-obvious.

Reported in #113 (and related to #110).

Solution

Add a pre-startup check for server modules (aw-server and aw-server-rust): before launching a subprocess, aw-qt probes the configured port. If the server is already responding, it marks the module as started (external) and skips launching its own subprocess.

Changes to manager.py

  • Module._external_server flag — tracks whether the module is using an already-running external server instance
  • Module.start() — probes http://localhost:{port}/api/0/info before starting; uses existing instance if already running
  • Module.is_alive() — returns True for external servers (we don't own the process)
  • Module.stop() — skips termination for external servers (not ours to stop)

Uses _read_server_port() from config.py to respect user-configured ports (not hardcoded 5600).

Behavior

Scenario Before After
aw-server running via systemd Module shows failed, "Restart" required Module connects to existing server automatically
Normal startup (no prior server) Works fine Works fine (no change)
Stop aw-qt Stops its own aw-server Does not stop external aw-server

Testing

  • Tested syntax and import structure
  • Logic mirrors the pattern from aw-client's wait_for_start() which probes the same endpoint

Closes #113

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR fixes a startup port-conflict problem where aw-qt would attempt to launch aw-server or aw-server-rust even when an instance was already running (via systemd, Docker, or a manual debug session), resulting in an "Address already in use" error and a confusing "failed" tray indicator.

The solution probes GET /api/0/info on the configured port before spawning a subprocess. If the server answers, the module is flagged as _external_server = True, started is set, and the subprocess launch is skipped entirely. Liveness checks re-probe the port (with a 1 s result cache to avoid flooding the Qt event loop), and stop() skips terminate() for externally-managed instances. Port resolution is correctly module-specific via the new _get_server_port() helper, which dispatches to _read_aw_server_port or _read_server_rust_port instead of the ambiguous _read_server_port that always preferred the Rust config. Test coverage is thorough.

Key observations:

  • The HTTP probe accepts any 2xx response as proof of an ActivityWatch server, without validating the response body. A foreign HTTP service on port 5600 would suppress the real server launch.
  • is_alive() clears _external_server as a side effect when the external server is found dead, but leaves started = True. On a subsequent toggle() / stop() call the code reaches the "wasn't running" warning path despite the module having been legitimately tracked as running — a minor but potentially confusing log message.
  • Both of the above are style-level concerns; the core logic is correct and all known edge-cases (per-module port, response-close resource leak, startup vs. polling timeout distinction, cache TTL) are handled well.

Confidence Score: 4/5

  • Safe to merge with minor style concerns; no functional regressions expected under normal usage.
  • The core fix is well-architected: module-specific port resolution, proper resource cleanup with a with statement, separate startup vs. polling timeouts, and a short-lived result cache. The two remaining concerns (lack of response-body validation and a misleading log warning after external-server death) are non-blocking style issues rather than correctness bugs. Test coverage is comprehensive and all assertions are accurate.
  • aw_qt/manager.py — specifically the _probe_external_server response validation and the is_alive() state-mutation side effect.

Important Files Changed

Filename Overview
aw_qt/manager.py Adds external server detection via HTTP probe before launching server modules. Core logic is sound: per-module port resolution, with-statement response cleanup, startup/liveness differentiation with separate timeouts, and a short-lived result cache to reduce polling overhead. Two minor concerns: (1) the probe accepts any HTTP 2xx response without validating the body against known ActivityWatch fields, so a foreign service on port 5600 could prevent server launch; (2) is_alive() clears _external_server as a side effect, leaving started=True/_process=None/_external_server=False, which triggers a misleading "wasn't running" warning in stop() when the user retries via the tray.
tests/test_manager.py Comprehensive test additions covering port resolution per module type, response context-manager closure, custom probe timeout, cache reuse and TTL expiry with deterministic monotonic patching, the longer startup timeout, liveness re-probing, and external server stop without process termination. All test assertions are correct and the test design is clean.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Module.start(testing)"] --> B{"name in\n'aw-server' /\n'aw-server-rust'?"}
    B -- No --> F["Launch subprocess\nself._process = Popen(...)"]
    B -- Yes --> C["_get_server_port(testing)\n(module-specific port from config)"]
    C --> D["_probe_external_server(testing, timeout=1.0)\nGET /api/0/info"]
    D -- "200 OK" --> E["_external_server = True\nstarted = True\nreturn (skip Popen)"]
    D -- "Error / timeout" --> F
    F --> G["started = True"]

    H["Module.is_alive()"] --> I{"_external_server?"}
    I -- Yes --> J["_probe_external_server_cached(testing, max_age=1.0)"]
    J -- "alive=True" --> K["return True"]
    J -- "alive=False" --> L["_external_server = False\nclear cache\nreturn False"]
    I -- No --> M{"_process is None?"}
    M -- Yes --> N["return False"]
    M -- No --> O["poll() → returncode is None?\nreturn True/False"]

    P["Module.stop()"] --> Q{"started?"}
    Q -- No --> R["warn: not started"]
    Q -- Yes --> S{"_external_server?"}
    S -- Yes --> T["log: external, skip terminate\nreset all external-server state\nstarted = False\nreturn"]
    S -- No --> U{"is_alive()?"}
    U -- No --> V["warn: wasn't running"]
    U -- Yes --> W["terminate() + wait()\nstarted = False"]
Loading

Comments Outside Diff (2)

  1. aw_qt/manager.py, line 182-188 (link)

    Probe accepts any HTTP 2xx from non-ActivityWatch services

    urlopen raises HTTPError (caught by URLError) for 4xx/5xx responses, so only 2xx/3xx responses return True. However, the probe doesn't inspect the response body at all — any HTTP service running on port 5600 (e.g. a local dev server, another application) will be silently treated as an ActivityWatch instance, causing aw-qt to skip launching its own server and leaving watchers with nothing to talk to.

    Consider parsing the JSON response and checking for a known ActivityWatch-specific field (e.g. hostname or version):

    import json
    
    try:
        with urllib.request.urlopen(
            f"http://localhost:{port}/api/0/info", timeout=timeout
        ) as resp:
            data = json.loads(resp.read())
            return "hostname" in data  # or any required AW-specific field
    except (urllib.error.URLError, OSError, ValueError, KeyError):
        return False
  2. aw_qt/manager.py, line 294-305 (link)

    State-mutating side effect in is_alive() causes misleading log warning

    When is_alive() detects a dead external server it clears _external_server but intentionally leaves started = True (so get_unexpected_stops() can surface the crash). This creates an intermediate state where started=True, _external_server=False, and _process=None.

    If the user then clicks Restart in the tray (which calls toggle()), the code path goes:

    1. toggle()is_alive()False (process is None) → falls into else: if self.started: stop()
    2. stop()_external_server=False → reaches elif not self.is_alive():True → logs "Tried to stop module aw-server, but it wasn't running"

    This warning is misleading because the module was running (as an external server) — it just died. Tracking the previous reason for death (e.g. a _was_external_server flag, or resetting started=False in is_alive()) would produce a more accurate diagnostic. At minimum a code comment here explaining the intentional intermediate state would help future maintainers.

Last reviewed commit: 3abf6e5

aw_qt/manager.py Outdated
if self.name in ("aw-server", "aw-server-rust"):
from .config import _read_server_port

port = _read_server_port(testing)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong port probed when starting aw-server

_read_server_port() in config.py always reads the aw-server-rust config first and returns that port if set, regardless of which module is being started. This means that when self.name == "aw-server" (the Python server), if aw-server-rust has a custom port (e.g. 5700) set in its own config, the probe will hit port 5700 instead of aw-server's actual port (5600).

If aw-server-rust happens to be running on port 5700, aw-server will be incorrectly flagged as _external_server = True, skipping its own launch — even though nothing is bound to aw-server's real port. Watchers that connect to aw-server's port will then find nothing listening.

The module-specific helpers (_read_server_rust_port / _read_aw_server_port) should be used here instead:

from .config import _read_server_rust_port, _read_aw_server_port

default_port = 5666 if testing else 5600
if self.name == "aw-server-rust":
    port = _read_server_rust_port(testing) or default_port
else:  # aw-server (Python)
    port = _read_aw_server_port(testing) or default_port

aw_qt/manager.py Outdated
if self.name in ("aw-server", "aw-server-rust"):
from .config import _read_server_port

port = _read_server_port(testing)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong port probed when starting aw-server

_read_server_port() in config.py always reads the aw-server-rust config first and returns that port if set, regardless of which module is being started. This means that when self.name == "aw-server" (the Python server), if aw-server-rust has a custom port (e.g. 5700) set in its own config, the probe will hit port 5700 instead of aw-server's actual port (5600).

If aw-server-rust happens to be running on port 5700, aw-server will be incorrectly flagged as _external_server = True, skipping its own launch — even though nothing is bound to aw-server's real port. Watchers that connect to aw-server's port will then find nothing listening.

The module-specific helpers (_read_server_rust_port / _read_aw_server_port) should be used here instead:

from .config import _read_server_rust_port, _read_aw_server_port

default_port = 5666 if testing else 5600
if self.name == "aw-server-rust":
    port = _read_server_rust_port(testing) or default_port
else:  # aw-server (Python)
    port = _read_aw_server_port(testing) or default_port

aw_qt/manager.py Outdated
Comment on lines +244 to +246
if self._external_server:
# We don't own this process; assume it's alive (managed externally)
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent failure if external server crashes

is_alive() returns True unconditionally for external servers without actually re-probing the port. This has two downstream effects:

  1. Manager.get_unexpected_stops() filters for x.started and not x.is_alive(). Since is_alive() never returns False here, a crashed external server is invisible to that check — no warning, no restart attempt, and the tray icon permanently shows the module as "running".
  2. stop_all() calls module.stop() on is_alive() modules. For an external server, stop() correctly skips termination, but the unconditional True from is_alive() means this path is always taken, even after the server has already died on its own.

Consider probing the port in is_alive() for external servers (with a very short timeout and cached result) so that at least get_unexpected_stops() can surface the failure to the user:

def is_alive(self) -> bool:
    if self._external_server:
        try:
            port = _read_server_port(self._testing)  # cache testing flag on self
            urllib.request.urlopen(f"http://localhost:{port}/api/0/info", timeout=0.5)
            return True
        except (urllib.error.URLError, OSError):
            return False
    ...

aw_qt/manager.py Outdated
Comment on lines +160 to +163
try:
urllib.request.urlopen(
f"http://localhost:{port}/api/0/info", timeout=1
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking HTTP call on startup thread may freeze UI

urllib.request.urlopen(..., timeout=1) is a synchronous call. If the TCP connection has to wait for the full 1-second timeout (e.g. the port is open but the process is slow to respond), this will block whichever thread calls autostart() — typically the main Qt thread — causing the tray icon to freeze during startup.

Consider running the probe in a threading.Thread or using a non-blocking socket.connect_ex() check first to gate the HTTP request, so the timeout only fires when truly needed:

import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(0.2)
    if s.connect_ex(("localhost", port)) != 0:
        pass  # port not open, skip HTTP probe entirely
    # else: proceed with urlopen

@ErikBjare
Copy link
Member

@TimeToBuildBob Address comments

When aw-server is already running (e.g. managed by systemd, Docker, or
started manually for debugging), aw-qt previously failed to start its
own instance with an "Address already in use" error, leaving the user
confused and requiring a manual "Restart" click from the tray icon.

This change adds a pre-startup check for server modules (aw-server and
aw-server-rust): before launching a subprocess, aw-qt probes the
expected port. If the server is already responding, it marks the module
as started (external) and skips the subprocess launch entirely.

- `Module._external_server` flag tracks this state
- `is_alive()` returns True for external servers (not our process to track)
- `stop()` skips termination for external servers (not ours to stop)

Fixes the UX issue reported in ActivityWatch#113: users running aw-server via
systemd or alongside another process no longer see startup failures.
@TimeToBuildBob TimeToBuildBob force-pushed the fix/detect-existing-server branch from aa51ef0 to 5af82d8 Compare March 16, 2026 12:19
@TimeToBuildBob
Copy link
Contributor Author

Addressed the review comments in 5af82d8.

Changes:

  • Correct port selection per module: added _get_server_port() so aw-server probes only the Python server config and aw-server-rust probes only the Rust config, avoiding the false-positive cross-port case Greptile flagged.
  • Re-probe external servers in is_alive(): external servers are no longer assumed alive forever; we now probe again and clear _external_server state if the server disappeared, so unexpected external-server stops can surface correctly.
  • Reduced startup-thread blocking: cut the pre-start probe timeout from 1.0s to 0.2s to reduce UI hitching during autostart while keeping the same behavior.
  • Tests: added unit coverage for per-module port selection and for the external-server re-probe path.

Validation:

  • targeted manager tests: 9 passed (TestModuleToggle, TestModuleServerProbe, TestModuleIsAlive)

@greptileai review

aw_qt/manager.py Outdated
Comment on lines +180 to +184
try:
urllib.request.urlopen(f"http://localhost:{port}/api/0/info", timeout=0.2)
except (urllib.error.URLError, OSError):
return False
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTTP response not closed — resource leak

urllib.request.urlopen returns an http.client.HTTPResponse object that holds an open socket. Not closing it will emit ResourceWarning: unclosed socket in Python and can exhaust file descriptors if is_alive() is polled frequently (every N seconds by the tray). The response object supports the context-manager protocol; wrap it in a with statement:

Suggested change
try:
urllib.request.urlopen(f"http://localhost:{port}/api/0/info", timeout=0.2)
except (urllib.error.URLError, OSError):
return False
return True
try:
with urllib.request.urlopen(
f"http://localhost:{port}/api/0/info", timeout=0.2
):
pass
except (urllib.error.URLError, OSError):
return False
return True

Comment on lines 273 to +280
def is_alive(self) -> bool:
if self._external_server:
# We don't own this process, so re-probe the server instead.
alive = self._probe_external_server(self._external_server_testing)
if not alive:
self._external_server = False
self._external_server_testing = False
return alive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeated blocking HTTP probe inside is_alive() can stall the Qt event loop

is_alive() is called on every polling cycle (e.g. get_unexpected_stops() fires from the tray's timer). For each external-server module this now makes a blocking urlopen call with a 0.2 s timeout. When the external server is down every single check takes 0.2 s — and the call happens on the Qt main thread (same thread that drives autostart() and the tray timer).

Unlike the one-off probe in start(), this one is repeated on every liveness check. Consider caching the result with a short TTL (e.g. 5 s) or moving the probe off the Qt thread, so a dead external server doesn't impose a guaranteed 0.2 s stall per polling interval:

import time

_PROBE_CACHE_TTL = 5.0  # seconds

def is_alive(self) -> bool:
    if self._external_server:
        now = time.monotonic()
        if now - getattr(self, "_last_probe_time", 0) >= _PROBE_CACHE_TTL:
            alive = self._probe_external_server(self._external_server_testing)
            self._last_probe_time = now
            self._last_probe_result = alive
            if not alive:
                self._external_server = False
                self._external_server_testing = False
            return alive
        return self._last_probe_result
    ...

@TimeToBuildBob
Copy link
Contributor Author

Addressed the latest Greptile follow-up in cc60155.

Changes:

  • Close the probe response: _probe_external_server() now wraps urllib.request.urlopen(...) in a with block so the response/socket is closed on every successful probe.
  • Cache short-lived liveness probes: added a 1s cache for external-server probe results, which avoids repeated blocking HTTP checks when is_alive() is hit multiple times close together by the tray/menu polling loop.
  • Reset cache with state changes: clear the cache when detaching from an external server, so stale results do not survive stop/failure transitions.
  • Tests: added coverage for response closing plus cache reuse/expiry behavior.

Validation:

  • targeted manager tests: 12 passed (TestModuleToggle, TestModuleServerProbe, TestModuleIsAlive)

@greptileai review

Comment on lines +183 to +185
with urllib.request.urlopen(
f"http://localhost:{port}/api/0/info", timeout=0.2
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Startup probe timeout too short — may miss a loaded server

_probe_external_server uses timeout=0.2 for both the one-shot startup check (called from start()) and the repeated liveness probe (called from is_alive() via _probe_external_server_cached).

The 0.2 s limit makes sense for liveness polls to avoid stalling the Qt main thread. However, the startup probe in start() fires only once per session and blocks just as long either way. If the external server is momentarily loaded or slow (e.g., JVM / Python startup warmup), the probe can time out and return False, causing aw-qt to attempt launching a second instance — triggering the exact "address already in use" error this PR is intended to fix.

Consider using a separate, larger timeout for the startup case. One clean approach is a dedicated parameter on _probe_external_server:

def _probe_external_server(self, testing: bool, timeout: float = 0.2) -> bool:
    port = self._get_server_port(testing)
    if port is None:
        return False
    try:
        with urllib.request.urlopen(
            f"http://localhost:{port}/api/0/info", timeout=timeout
        ):
            return True
    except (urllib.error.URLError, OSError):
        return False

Then in start():

if self._probe_external_server(testing, timeout=1.0):

This keeps the liveness path at 0.2 s while giving the startup probe a fighting chance on a loaded system.

@TimeToBuildBob
Copy link
Contributor Author

Addressed the latest Greptile timeout follow-up in 19c6a98.

Changes:

  • Separate startup vs liveness probe timeouts: _probe_external_server() now accepts a timeout parameter.
  • Use a longer startup probe: start() now probes with timeout=1.0 before deciding to skip spawning a server process, which reduces the risk of missing a loaded-but-already-running external server and hitting the original port-conflict failure.
  • Keep fast liveness checks: the cached is_alive() path still uses the short 0.2s probe timeout, so tray/menu polling stays responsive.
  • Tests: added coverage for the custom-timeout path and for start() using the longer startup timeout.

Validation:

  • targeted manager tests: 14 passed (TestModuleToggle, TestModuleServerProbe, TestModuleStart, TestModuleIsAlive)

@greptileai review

@TimeToBuildBob
Copy link
Contributor Author

Addressed the latest review gaps in 3abf6e5.

Changes:

  • make the cache-reuse test deterministic by mocking monotonic() instead of relying on wall-clock timing
  • assert that started stays True when an adopted external server disappears, so get_unexpected_stops() can still surface it
  • add explicit coverage for stop() on externally managed servers, verifying we reset aw-qt state without terminating the external process

Validation:

  • tests/test_manager.py: 20 passed
    • executed in a lightweight stubbed-PyQt test harness, since this VM doesn't have the full Qt dependency available for direct import

This should close the remaining Greptile concerns around determinism + external-server lifecycle coverage.

@ErikBjare
Copy link
Member

@greptileai review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

aw-qt-0^20240903 always fails to locate aw-server on first invocation after server boot.

2 participants