Обсуждение: BUG #17948: libpq seems to misbehave in a pipelining corner case
The following bug has been logged on the website:
Bug reference: 17948
Logged by: Ivan Trofimov
Email address: i.trofimow@yandex.ru
PostgreSQL version: 15.3
Operating system: Ubuntu 20.04
Description:
As far as i understand, there is an invariant in libpq, that if
PQpipelineSync call finished successfully,
PQresultStatus will eventually return PGRES_PIPELINE_SYNC, or the connection
will be in CONNECTION_BAD state.
This is highlighted in several places in the docs:
1. "PGRES_PIPELINE_SYNC is reported exactly once for each PQpipelineSync at
the corresponding point in the pipeline."
2. "From the client's perspective, after PQresultStatus returns
PGRES_FATAL_ERROR, the pipeline is flagged as aborted.
PQresultStatus will report a PGRES_PIPELINE_ABORTED result for each
remaining queued operation in an aborted pipeline.
The result for PQpipelineSync is reported as PGRES_PIPELINE_SYNC to signal
the end of the aborted pipeline and
resumption of normal result processing. The client must process results with
PQgetResult during error recovery."
So i expect a code like this
if (!PQpipelineSync(conn)) exit(1);
while (PQstatus(conn) != CONNECTION_BAD) {
res = PQgetResult(conn);
if (PQresultStatus(res) == PGRES_PIPELINE_SYNC) {
break;
}
}
to eventually exit the loop.
However, if instead of expected "ReadyToQuery" response server sends an
error and closes the connection (
say, backend terminated by administrator or with a proxy in place and
upstream PG server being dead) this loop
gets stuck in busy-loop.
My understanding is that transitions in pipelining state-machine go like
this:
1. Initial PQgetResult call gets here
https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#L2081
2. parseInput gets an error the server send
https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-protocol3.c#L216
and switches the state into PGASYNC_READY
3. PQgetResult continues and here
https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#L2126
advances the queue, so the Sync entry is gone, and switches into
PGASYNC_PIPELINE_IDLE later.
4. Next PQgetResult call gets here
https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#LL2110C4-L2110C26,
and pqPipelineProcessQueue switches into PGASYNC_IDLE here
https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#L3084
Now we are stuck in the position where libpq considers the connection being
in CONNECTION_OK state
(because no reads over half-closed socket have been issues),
asyncStatus being PGASYNC_IDLE, pipeline being in PQ_PIPELINE_ABORTED state,
and the expected PGRES_PIPELINE_SYNC never came and never will (because
PGASYNC_IDLE, so PQgetResult returns NULL right away).
I am able to reproduce this behavior reliably via
https://pastebin.com/raw/4j3v3QzC,
linking against libpq5=15.3 and running the program against PostgreSQL
15.2.
Who is at fault here, is it libpq or me misunderstanding/misusing libpq?
On 2023-Aug-05, Трофимов Иван wrote: > We've managed to reproduce client <-> server de-synchronization triggered > by this, which we are seeing in production. > Libpq considers '< 2TDCEZ' a sufficient response to '> BDESS', when > according to specification one more 'Z' is expected. > > A bit more context and a MRE: > https://github.com/itrofimow/libpq_protocol_desync Does this patch fix the problems? I think the sync loss is because we're consuming elements from the command queue even when we shouldn't, because we don't know for an ERROR whether we should do so or not. What this patch does is a bit of a hack, I think, but it does solve your test case (and it doesn't break anything else in the tree) AFAICS. I may have to play some more with it before gaining confidence, though. Thanks -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Hay dos momentos en la vida de un hombre en los que no debería especular: cuando puede permitírselo y cuando no puede" (Mark Twain)
Вложения
Thank you for looking into this, much appreciated. Yes, I can confirm that the patch fixes both the busy-loop and client<->server desynchronization problem for me.
On 2023-Dec-02, Ivan Trofimov wrote: > Thank you for looking into this, much appreciated. > > Yes, I can confirm that the patch fixes both the busy-loop and > client<->server desynchronization problem for me. Great, thanks for the confirmation. Mulling it over while writing better comments for it, I realized that it could still be somewhat brittle, and it's also confusing that part of the logic is in PQgetResult and the rest in pqCommandQueueAdvance; so I moved it all to the latter, which also lets me document it more clearly in the comments. I addressed the brittleness by changing what to test: instead of just "it's an error, then do nothing if the queue has a SYNC", we can verify that we actually have a matching SYNC in both places. This seems much better. Patch v2 attached. I'll be pushing this to 14+ tomorrow, unless something ugly comes up. Thanks for reporting this problem! -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)