Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
57 changes: 57 additions & 0 deletions plans/2026-06-27-service-subscriptions-memory-leak.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
## Root cause: orphaned `Sub` entries in the service client's `subscriptions` map

**The leak is service-specific and was introduced by PR #1667 "messaging services" (`f0b7a4be`).** A long-lived messaging-service connection accumulates per-queue `Sub` records in its `Client.subscriptions` map that are **never removed** when the associated queues are deleted or unassociated — only the counter is decremented. Over normal queue churn the map grows monotonically for the entire lifetime of the service connection.

### The proof — an asymmetry between two handlers in `serverThread`

Both individual queue subscriptions and service subscriptions store a `Sub` per queue in `Client.subscriptions` (= `clientSubs` for the SMP subscriber thread, wired at `Server.hs:189`). When a queue ends/is deleted, the two paths diverge:

**Individual subscriber — entry IS removed** (`Server.hs:332`, `346`):
```haskell
CSAEndSub qId -> atomically (endSub c qId) >>= a unsub_ -- :332
...
endSub c qId = TM.lookupDelete qId (clientSubs c) >>= (removeWhenNoSubs c $>) -- :346
```

**Service subscriber — entry is NOT removed** (`Server.hs:336-340`):
```haskell
CSAEndServiceSub qId -> atomically $ do
modifyTVar' (clientServiceSubs c) decrease -- decrements serviceSubsCount
modifyTVar' totalServiceSubs decrease -- decrements global count
where decrease = subtractServiceSubs (1, queueIdHash qId)
-- never touches (clientSubs c) — the Sub for qId stays forever
```

### Where the orphaned entries are added (both new in this PR)
- `Server.hs:1860-1862` — on service subscribe (`SSUB`), one `Sub` inserted per queue that has a pending message.
- `Server.hs:2039-2043` (`newServiceDeliverySub`) — on **every** `SEND` to a service-associated queue with no existing sub, a `Sub` is inserted into the service client's `subscriptions`. After delivery the thread state resets to `NoSub` (`:2069`) but the map entry remains as a "already delivering" marker (`:1856-1859`).

### Why they leak
The only places the service client's `subscriptions` map is cleared are:
- `clientDisconnected` — `swapTVar subscriptions M.empty` (`Server.hs:1097`) — only on disconnect.
- `CSADecreaseSubs` — `swapTVar (clientSubs c) M.empty` (`Server.hs:343`) — only on full service takeover by another connection.
- `delQueueAndMsgs` — `TM.lookupDelete entId $ subscriptions clnt` (`Server.hs:2164`) — but `clnt` here is **the recipient deleting its own queue, not the service client**. The service's entry for that queue is reached only via the `CSDeleted → endServiceSub → CSAEndServiceSub` path (`Server.hs:306, 313, 336`), which decrements the counter but leaves the map entry.

**Concrete scenario (fully traced):** Service `S` subscribes (`SSUB`) and stays connected for days. Recipient `R` owns service-associated queue `Q`. A `SEND` to `Q` inserts a `Sub` into `S.subscriptions[Q]` (`:2042`). `R` later deletes `Q` → `delQueueAndMsgs` runs on `R`'s connection, removes `Q` from `R.subscriptions`, decrements counters, enqueues `CSDeleted Q (Just S)` (`:2167`) → `serverThread` runs `CSAEndServiceSub Q` for `S` (`:336`), decrementing `S.serviceSubsCount` but **leaving `S.subscriptions[Q]` in place**. Net: one orphaned `Sub` (record + 2 TVars) per service-associated queue ever deleted/unassociated, never reclaimed until `S` disconnects. The logical counter `serviceSubsCount` correctly drops, so the map size diverges from the counter — making the leak invisible to the existing service-sub metric.

### Verdict
This is a deterministic, static-provable memory leak — no production logging needed to confirm the existence; the asymmetry between `CSAEndSub` (removes) and `CSAEndServiceSub` (doesn't) is the smoking gun. It is specific to messaging-service certificate clients, which is exactly the population added by the services/certificate PR.

### Secondary findings (lower impact, same PR area, not the primary cause)
- **`forkClient` register-after-fork race** (`Server.hs:1356-1359`): if the forked action's `finally` delete (`:1358`) runs before the parent's `IM.insert` (`:1359`), a `Weak ThreadId` of a dead thread is left in `endThreads` until disconnect. Pre-existing, tiny per-entry, but exercised far more by the PR's higher END/DELD volume.
- **Wrong-client counter decrement** (`Server.hs:2166`): `delQueueAndMsgs` decrements `serviceSubsCount` of the *deleting* client, not the service; harmless for non-service deleters (floored at 0) but corrupts accounting if a service deletes its own queue.

---

### Recommended fix (mirror `endSub` in the service path)
Make `CSAEndServiceSub` also delete the per-queue `Sub` and cancel its delivery thread, exactly as `CSAEndSub`/`endSub` do for individual subscribers. Roughly:

```haskell
CSAEndServiceSub qId -> do
s_ <- atomically $ do
modifyTVar' (clientServiceSubs c) decrease
modifyTVar' totalServiceSubs decrease
TM.lookupDelete qId (clientSubs c) <* removeWhenNoSubs c
forM_ unsub_ $ \unsub -> mapM_ unsub s_
where decrease = subtractServiceSubs (1, queueIdHash qId)
```
22 changes: 13 additions & 9 deletions src/Simplex/Messaging/Server.hs
Original file line number Diff line number Diff line change
Expand Up @@ -329,21 +329,25 @@ smpServer started cfg@ServerConfig {transports, transportConfig = tCfg, startOpt
endPreviousSubscriptions = mapM_ $ \(c, subAction, evt) -> do
atomically $ modifyTVar' pendingEvents $ IM.alter (Just . maybe [evt] (evt <|)) (clientId c)
case subAction of
CSAEndSub qId -> atomically (endSub c qId) >>= a unsub_
where
a (Just unsub) (Just s) = unsub s
a _ _ = pure ()
CSAEndServiceSub qId -> atomically $ do
modifyTVar' (clientServiceSubs c) decrease
modifyTVar' totalServiceSubs decrease
where
decrease = subtractServiceSubs (1, queueIdHash qId)
CSAEndSub qId -> atomically (endSub c qId) >>= unsubPrev
-- like endSub, also removes the delivery subscription from the service client's subscriptions map,
-- otherwise the map retains entries for queues unassociated/deleted while the service stays connected.
CSAEndServiceSub qId -> atomically (endServiceQueueSub c qId) >>= unsubPrev
CSADecreaseSubs changedSubs -> do
atomically $ modifyTVar' totalServiceSubs $ subtractServiceSubs changedSubs
forM_ unsub_ $ \unsub -> atomically (swapTVar (clientSubs c) M.empty) >>= mapM_ unsub
where
unsubPrev :: Maybe sub -> IO ()
unsubPrev s_ = sequence_ (unsub_ <*> s_)
endSub :: Client s -> QueueId -> STM (Maybe sub)
endSub c qId = TM.lookupDelete qId (clientSubs c) >>= (removeWhenNoSubs c $>)
endServiceQueueSub :: Client s -> QueueId -> STM (Maybe sub)
endServiceQueueSub c qId = do
modifyTVar' (clientServiceSubs c) decrease
modifyTVar' totalServiceSubs decrease
endSub c qId
where
decrease = subtractServiceSubs (1, queueIdHash qId)
-- remove client from server's subscribed cients
removeWhenNoSubs c = do
noClientSubs <- null <$> readTVar (clientSubs c)
Expand Down
Loading