Обсуждение: BUG #17948: libpq seems to misbehave in a pipelining corner case
The following bug has been logged on the website: Bug reference: 17948 Logged by: Ivan Trofimov Email address: i.trofimow@yandex.ru PostgreSQL version: 15.3 Operating system: Ubuntu 20.04 Description: As far as i understand, there is an invariant in libpq, that if PQpipelineSync call finished successfully, PQresultStatus will eventually return PGRES_PIPELINE_SYNC, or the connection will be in CONNECTION_BAD state. This is highlighted in several places in the docs: 1. "PGRES_PIPELINE_SYNC is reported exactly once for each PQpipelineSync at the corresponding point in the pipeline." 2. "From the client's perspective, after PQresultStatus returns PGRES_FATAL_ERROR, the pipeline is flagged as aborted. PQresultStatus will report a PGRES_PIPELINE_ABORTED result for each remaining queued operation in an aborted pipeline. The result for PQpipelineSync is reported as PGRES_PIPELINE_SYNC to signal the end of the aborted pipeline and resumption of normal result processing. The client must process results with PQgetResult during error recovery." So i expect a code like this if (!PQpipelineSync(conn)) exit(1); while (PQstatus(conn) != CONNECTION_BAD) { res = PQgetResult(conn); if (PQresultStatus(res) == PGRES_PIPELINE_SYNC) { break; } } to eventually exit the loop. However, if instead of expected "ReadyToQuery" response server sends an error and closes the connection ( say, backend terminated by administrator or with a proxy in place and upstream PG server being dead) this loop gets stuck in busy-loop. My understanding is that transitions in pipelining state-machine go like this: 1. Initial PQgetResult call gets here https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#L2081 2. parseInput gets an error the server send https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-protocol3.c#L216 and switches the state into PGASYNC_READY 3. PQgetResult continues and here https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#L2126 advances the queue, so the Sync entry is gone, and switches into PGASYNC_PIPELINE_IDLE later. 4. Next PQgetResult call gets here https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#LL2110C4-L2110C26, and pqPipelineProcessQueue switches into PGASYNC_IDLE here https://github.com/postgres/postgres/blob/a817edbf6f302c376f5c0012d19a0474b6bdea88/src/interfaces/libpq/fe-exec.c#L3084 Now we are stuck in the position where libpq considers the connection being in CONNECTION_OK state (because no reads over half-closed socket have been issues), asyncStatus being PGASYNC_IDLE, pipeline being in PQ_PIPELINE_ABORTED state, and the expected PGRES_PIPELINE_SYNC never came and never will (because PGASYNC_IDLE, so PQgetResult returns NULL right away). I am able to reproduce this behavior reliably via https://pastebin.com/raw/4j3v3QzC, linking against libpq5=15.3 and running the program against PostgreSQL 15.2. Who is at fault here, is it libpq or me misunderstanding/misusing libpq?
On 2023-Aug-05, Трофимов Иван wrote: > We've managed to reproduce client <-> server de-synchronization triggered > by this, which we are seeing in production. > Libpq considers '< 2TDCEZ' a sufficient response to '> BDESS', when > according to specification one more 'Z' is expected. > > A bit more context and a MRE: > https://github.com/itrofimow/libpq_protocol_desync Does this patch fix the problems? I think the sync loss is because we're consuming elements from the command queue even when we shouldn't, because we don't know for an ERROR whether we should do so or not. What this patch does is a bit of a hack, I think, but it does solve your test case (and it doesn't break anything else in the tree) AFAICS. I may have to play some more with it before gaining confidence, though. Thanks -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Hay dos momentos en la vida de un hombre en los que no debería especular: cuando puede permitírselo y cuando no puede" (Mark Twain)
Вложения
Thank you for looking into this, much appreciated. Yes, I can confirm that the patch fixes both the busy-loop and client<->server desynchronization problem for me.
On 2023-Dec-02, Ivan Trofimov wrote: > Thank you for looking into this, much appreciated. > > Yes, I can confirm that the patch fixes both the busy-loop and > client<->server desynchronization problem for me. Great, thanks for the confirmation. Mulling it over while writing better comments for it, I realized that it could still be somewhat brittle, and it's also confusing that part of the logic is in PQgetResult and the rest in pqCommandQueueAdvance; so I moved it all to the latter, which also lets me document it more clearly in the comments. I addressed the brittleness by changing what to test: instead of just "it's an error, then do nothing if the queue has a SYNC", we can verify that we actually have a matching SYNC in both places. This seems much better. Patch v2 attached. I'll be pushing this to 14+ tomorrow, unless something ugly comes up. Thanks for reporting this problem! -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)