Conversation
ojarjur
left a comment
There was a problem hiding this comment.
Thanks for the fix!
There are some formatting issues that need to be addressed, but also please update the PR description to give background on what the issue is (connections leaking of the close message gets dropped) and how this PR addresses it.
Thanks!
The ba-proxy-agent currently experiences increasing memory consumption,
leading to daily or weekly restarts. Previous mitigation attempts were
insufficient, and the issue has been isolated to high memory usage on
the agent connection side.
This CL mitigates the issue by enforcing a timeout on idle connections.
Since the root cause remains elusive, forcefully closing unused
connections prevents memory accumulation from persistent links.
Implementation details:
* Introduced `lastActivityTime` property to agent connections, which
updates upon usage.
* Added a background routine to monitor connection activity.
* Configured the routine to explicitly close connections that remain
idle for more than 30 seconds.
| // The server messages channel has been closed. | ||
| return nil, fmt.Errorf("attempt to read a server message from a closed websocket connection") | ||
| } | ||
| conn.updateActivity() |
There was a problem hiding this comment.
This also needs to be called at the beginning of this method (i.e. on line 258).
Otherwise, a connection that the client is continuously polling on, but where the server has not yet responded, will be closed as inactive.
| return msgs, nil | ||
| } | ||
| } | ||
| case <-time.After(time.Second * 20): |
There was a problem hiding this comment.
The timeout here needs to be the minimum of 20 seconds and some amount less than the configured idle timeout (e.g. half of the idle timeout).
That way, a polling request from an actively connected client will timeout and be re-run before the activity tracking logic decides that the connection is idle.
There was a problem hiding this comment.
Alternatively, we could put in some sort of check to enforce that the idle timeout is greater than 20 seconds (e.g. setting a minimum of 30 seconds).
There was a problem hiding this comment.
Hi Omar,
are you suggesting to implement the above fix for case <-time.After(time.Second * 20): ?
and we should parameterized the timeout value?
There was a problem hiding this comment.
What I'm saying is that the timeout used here (on line 279) needs to be smaller than the timeout used for detecting idle connections.
The issue is that the client might have a connection open and be waiting for a message from the server.
If that happens, then the connection activity is only updated at the start of each polling request from the client.
However, the polling request will be held open as long as the timeout here on line 279.
So, if the idle timeout for connections is not longer than the timeout on line 279, then connection will appear to be idle even though there is a client actively polling it.
To prevent that, we need to do something to make sure that the timeout used on line 279 is shorter than the timeout used for idleness detection.
This can either setting a lower bound on the timeout for idle detection, or making the timeout on line 279 shorter in the case that the idle detection timeout is not longer than 20 seconds.
We also want some buffer between the two timeouts; i.e. the timeout on line 279 needs to be strictly shorter than the timeout used for idle detection.
Finally, we need to add a test case that exercises this scenario to verify that our change doesn't cause the connections to be prematurely pruned.
No description provided.