Discarded readReadyForQuery() EOF causes connection pool poisoning via stuck inProgress flag #1320

@m1ralx

Summary

A pooler-side ErrorResponse followed by an immediate connection close (no trailing ReadyForQuery) can permanently poison a pooled connection. After the trigger, every subsequent query on that connection fails with: pq: there is already a query being processed on this connection.

The connection is never evicted from the pool because database/sql never sees driver.ErrBadConn. This is the same end state as #1298, but it is reached through a different code path that the merge of #1299 (commit 6d77ced) does not close.

Root cause

PR #1272 added an inProgress atomic flag that is set at the start of query()/Exec() and only cleared when ReadyForQuery is received from the server. If a network error prevents ReadyForQuery from arriving, the flag stays stuck at true.
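The flag's lifecycle can be modelled in a few lines. This is only a sketch: fakeConn, begin, and readyForQuery are hypothetical names, not pq's internals, and the only details taken from this issue are the compare-and-swap guard and the error text.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errQueryInProgress reuses the error text quoted in this issue.
var errQueryInProgress = errors.New("pq: there is already a query being processed on this connection")

// fakeConn is a hypothetical stand-in for pq's conn; only the flag is modelled.
type fakeConn struct {
	inProgress atomic.Bool
}

// begin claims the connection for one query, the way a CompareAndSwap guard would.
func (c *fakeConn) begin() error {
	if !c.inProgress.CompareAndSwap(false, true) {
		return errQueryInProgress
	}
	return nil
}

// readyForQuery models receiving ReadyForQuery, the only event that clears the flag.
func (c *fakeConn) readyForQuery() {
	c.inProgress.Store(false)
}

func main() {
	c := &fakeConn{}
	fmt.Println(c.begin()) // the first query claims the flag
	// ReadyForQuery never arrives here, so readyForQuery is never called.
	fmt.Println(c.begin()) // every later query is rejected
}
```

In this model the flag is only recoverable through readyForQuery, which is why a lost ReadyForQuery leaves the connection unusable until something else discards it.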

Five sites in conn.go (readParseResponse, readStatementDescribeResponse, readPortalDescribeResponse, readBindResponse, postExecuteWorkaround) handle a mid-extended-protocol ErrorResponse by draining the trailing ReadyForQuery and discarding whatever it returns:

case proto.ErrorResponse:
    err := parseError(r, "")
    _ = cn.readReadyForQuery()
    return err

When the peer closes mid-stream, readReadyForQuery() returns io.EOF. The _ = drops it before handleError can classify it, so cn.err is never set, IsValid() returns true, and database/sql keeps handing out the broken connection. The CompareAndSwap guard rejects every subsequent query with errQueryInProgress — which is not driver.ErrBadConn, so (*DB).retry won't retry on a fresh connection either. The change merged via #1299 only addresses io.ErrUnexpectedEOF in handleError; on this path the EOF is dropped before handleError is reached.

How to reproduce

A reproducer is at: https://github.com/m1ralx/pq-bug-demo

It uses a TCP fault-injection proxy between a Go client and a real PostgreSQL 16 instance (via Docker). On a specific Parse, the proxy writes a hand-crafted ErrorResponse (severity ERROR, SQLSTATE 08P01) directly to the client and closes the connection before any ReadyForQuery reaches the client. This mirrors pgbouncer 1.15's disconnect_server(false, ...) -> send_pooler_error(client, false, ...) byte sequence.

git clone https://github.com/m1ralx/pq-bug-demo
cd pq-bug-demo
make up          # start PostgreSQL 16 via Docker
make test-buggy  # demonstrates the poisoning on v1.12.3
make test-fix    # passes with the proposed patch
make down

Real-world impact

We hit this in a staging environment with services routed through pgbouncer ≥1.15.

PR Fix

#1321
