exactly what is COPY BOTH mode supposed to do in case of an error?
From | Robert Haas |
---|---|
Subject | exactly what is COPY BOTH mode supposed to do in case of an error? |
Date | |
Msg-id | CA+TgmobsPB=xO7+z3z=hOa0s438fs7r-hGvU35-bkwPARR3JsQ@mail.gmail.com |
Responses | Re: exactly what is COPY BOTH mode supposed to do in case of an error? (Simon Riggs <simon@2ndQuadrant.com>) |
List | pgsql-hackers |
It seems the backend and libpq don't agree. The backend makes no special provision to wait for a CopyDone message if an error occurs during copy-both. It simply sends an ErrorResponse and that's it. libpq, on the other hand, treats either CopyDone or ErrorResponse as a cue to transition to PGASYNC_COPY_IN (see pqGetCopyData3). And that's a problem, because in PGASYNC_COPY_IN mode, libpq is unwilling to process the ErrorResponse. getCopyDataMessage doesn't do anything with it, and if the user calls PQgetResult(), pqParseInput3() won't process it either, because it's only willing to do that when the status is PGASYNC_BUSY.

So the bottom line is that after the server throws an error, the server gets back to thinking that the connection is idle, while the client ends up thinking that we're still in copy-in mode. The client can try to recover by doing PQputCopyEnd(), which will get libpq back to an idle state, but if the extended-query protocol is in use, the Sync it sends will cause the server to send an extra ReadyForQuery that libpq isn't expecting. Hilarity ensues.

It's perhaps not coincidental that "48.2.5 COPY operations" is silent about what is supposed to happen when an error occurs in copy-both mode, though it does talk about both copy-in and copy-out. On copy-in, it says:

"In the event of a backend-detected error during copy-in mode (including receipt of a CopyFail message), the backend will issue an ErrorResponse message. If the COPY command was issued via an extended-query message, the backend will now discard frontend messages until a Sync message is received, then it will issue ReadyForQuery and return to normal processing."

And regarding copy-out, it says:

"In the event of a backend-detected error during copy-out mode, the backend will issue an ErrorResponse message and revert to normal processing. The frontend should treat receipt of ErrorResponse as terminating the copy-out mode."
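The disagreement between the backend and libpq described above can be made concrete with a small toy model. This is only a sketch: the enum and function names below are hypothetical and merely imitate the states named in the text; they are not the real definitions from libpq or the backend.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names imitate the states discussed above and are
 * not libpq's or the backend's real definitions. */
typedef enum { CLI_BUSY, CLI_COPY_IN, CLI_COPY_BOTH } ClientState;
typedef enum { SRV_IDLE, SRV_COPY_BOTH } ServerState;
typedef enum { MSG_COPY_DATA, MSG_COPY_DONE, MSG_ERROR_RESPONSE } ServerMsg;

/* Backend, as described: an error in copy-both ends the copy at once;
 * the server does not wait for a CopyDone from the client. */
ServerState server_after_error(ServerState s)
{
    assert(s == SRV_COPY_BOTH);   /* only meaningful while a copy is active */
    return SRV_IDLE;
}

/* Client, as described for pqGetCopyData3: CopyDone and ErrorResponse are
 * both treated as "the server's half of the copy is over". */
ClientState client_on_message(ClientState s, ServerMsg m)
{
    if (s == CLI_COPY_BOTH && (m == MSG_COPY_DONE || m == MSG_ERROR_RESPONSE))
        return CLI_COPY_IN;
    return s;
}

/* The pending error is only consumed while the client is in the busy state. */
bool client_will_parse_error(ClientState s)
{
    return s == CLI_BUSY;
}
```

In this model, feeding MSG_ERROR_RESPONSE to a client in CLI_COPY_BOTH leaves it in CLI_COPY_IN, where client_will_parse_error() is false, while server_after_error() has already returned SRV_IDLE. That is the mismatch in miniature: the two ends no longer agree on whether a copy is in progress.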
There's no corresponding statement about error-handling in the copy-both case. However, since the apparent intent is that an error message from the server trumps anything the client may have had in mind, it seems reasonable to decide that an ErrorResponse is intended to fully terminate copy-both mode (not just switch to copy-in) and initiate a skip-until-sync. That's what the server actually does, but libpq has other ideas.

One way to see the practical effect of this is to set up a streaming replication slave but modify the backup label to reference some future WAL location not yet written (I just changed the "0" before the slash to a "1"). This will cause the server to throw the following error:

    ereport(ERROR,
            (errmsg("requested starting point %X/%X is ahead of the WAL flush position of this server %X/%X",
                    (uint32) (cmd->startpoint >> 32),
                    (uint32) (cmd->startpoint),
                    (uint32) (FlushPtr >> 32),
                    (uint32) (FlushPtr))));

Doing this will cause the PQgetCopyData in libpqrcv_receive to return -1, so you might expect that the slave would get the following error:

    ereport(ERROR,
            (errmsg("could not receive data from WAL stream: %s",
                    PQerrorMessage(streamConn))));

But you don't, because PQgetResult() returns PGRES_COPY_IN. So the slave thinks that the master made a *normal* termination of streaming replication, due to a timeline change, and it prints this:

    LOG: replication terminated by primary server
    DETAIL: End of WAL reached on timeline 1 at 0/0

It then calls PQputCopyEnd to end a copy that the server thinks is no longer in progress and then invokes PQgetResult again. And now it gets the error message:

    FATAL: error reading result of streaming command: ERROR: requested starting point 1/5000000 is ahead of the WAL flush position of this server 0/6000348

This doesn't in practice matter very much, because either way, we're going to slam shut the server connection at this point.
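To make the WAL positions in those messages easier to read, here is a minimal sketch of how a 64-bit position maps to the `%X/%X` form used in the error text, and of what changing the "0" before the slash to a "1" does to the value. The helper names are hypothetical, and the strict "greater than" comparison is an assumption inferred from the wording "is ahead of", not taken from the server source.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print a 64-bit WAL position as HIGH/LOW in hex,
 * mirroring the (uint32) (x >> 32), (uint32) (x) pair in the message. */
void format_wal_position(uint64_t pos, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
             (uint32_t) (pos >> 32), (uint32_t) pos);
}

/* Changing the digit before the slash from 0 to 1 adds 2^32 to the position. */
uint64_t bump_high_word(uint64_t pos)
{
    return pos + (UINT64_C(1) << 32);
}

/* Assumed form of the check behind the error: the requested start lies
 * beyond what the server has flushed so far. */
int start_is_ahead_of_flush(uint64_t start, uint64_t flush)
{
    return start > flush;
}
```

With the values from the FATAL message, a requested start of 1/5000000 is ahead of a flush position of 0/6000348 because the high word dominates the 64-bit comparison, even though the low word of the start is smaller.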
But it's clearly messed up - the error is actually NOT in response to the CopyDone we sent, but rather in response to the START_STREAMING command that preceded it. libpq, however, refuses to receive the error message until after the unnecessary CopyDone has been sent.

I believe that the only other place where this coding pattern arises is in receivelog.c. That code has actually adopted the opposite convention from the backend: it thinks it needs to send CopyDone whether or not an error has occurred. This turns out not to matter particularly, because the server is just going to try to close the connection anyway. But if it did anything else after ending the copy, I suspect the wheels would come off.

I'm attaching a patch which adopts the position that the backend is right and libpq is wrong. The opposite approach is also possible, but I haven't tried to implement it. Or maybe there's a third way which is better still.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
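For illustration, the position that "the backend is right" can be written as a client-side transition rule in which an ErrorResponse ends copy-both entirely rather than dropping to copy-in. This is a toy sketch under that assumption, not the attached patch; all names are hypothetical and do not come from libpq.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not libpq's real definitions. */
typedef enum { CLI_BUSY, CLI_COPY_IN, CLI_COPY_BOTH } ClientState;
typedef enum { MSG_COPY_DATA, MSG_COPY_DONE, MSG_ERROR_RESPONSE } ServerMsg;

/* Backend-is-right rule for a client currently in copy-both:
 *   CopyDone      -> only the server's half ended, so drop to copy-in;
 *   ErrorResponse -> the whole copy is over, so go busy and let the
 *                    error be read as the result of the command. */
ClientState client_on_message_backend_rule(ClientState s, ServerMsg m)
{
    assert(s == CLI_COPY_BOTH);
    switch (m)
    {
        case MSG_COPY_DONE:
            return CLI_COPY_IN;
        case MSG_ERROR_RESPONSE:
            return CLI_BUSY;
        case MSG_COPY_DATA:
        default:
            return s;
    }
}

/* Under this rule a CopyDone is only owed while the client is in copy-in. */
bool client_should_send_copy_done(ClientState s)
{
    return s == CLI_COPY_IN;
}
```

Under this rule a client that receives an error never reaches copy-in, so it has no reason to send the unnecessary CopyDone before it can read the error.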