Thread: libpq copy error handling busted


libpq copy error handling busted

From: Andres Freund
Date:
Hi,

When libpq is used to COPY data to the server, it doesn't properly
handle errors.

An easy way to trigger the problem is to start pgbench -i with a
sufficiently large scale, and then just shut the server down. pgbench
will happily use 100% of the CPU trying to send data to the server, even
though libpq knows that the connection is broken.

It can't even be cancelled using ctrl-c anymore, because the cancel
request cannot be sent:

andres@awork3:~/src/postgresql$ pgbench -i -s 4000 -q
dropping old tables...
creating tables...
generating data (client-side)...
80889300 of 400000000 tuples (20%) done (elapsed 85.00 s, remaining 335.33 s)
^CCould not send cancel request: PQcancel() -- connect() failed: No such file or directory


This is partially an old problem, and partially got recently
worse. Before the below commit we detected terminated connections, but
we didn't handle copy failing.


The failure to handle terminated connections originates in:

commit 1f39a1c0641531e0462a4822f2dba904c5d4d699
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   2019-03-19 16:20:20 -0400

    Restructure libpq's handling of send failures.


The problem is basically that pqSendSome() returns *success* in all
failure cases. Both when a failure is already known:

+    /*
+     * If we already had a write failure, we will never again try to send data
+     * on that connection.  Even if the kernel would let us, we've probably
+     * lost message boundary sync with the server.  conn->write_failed
+     * therefore persists until the connection is reset, and we just discard
+     * all data presented to be written.
+     */
+    if (conn->write_failed)
+    {
+        /* conn->write_err_msg should be set up already */
+        conn->outCount = 0;
+        return 0;
+    }
+

and when initially "diagnosing" the failure:
            /* Anything except EAGAIN/EWOULDBLOCK/EINTR is trouble */
            switch (SOCK_ERRNO)
...
                    /* Discard queued data; no chance it'll ever be sent */
                    conn->outCount = 0;
                    return 0;

The idea of the above commit was:
    Instead, let's fix this in a more general fashion by eliminating
    pqHandleSendFailure altogether, and instead arranging to postpone
    all reports of send failures in libpq until after we've made an
    attempt to read and process server messages.  The send failure won't
    be reported at all if we find a server message or detect input EOF.

but that doesn't work here, because we never process the error
message. There's no code in pqParseInput3() to process server messages
while doing copy.


I'm honestly a bit baffled. How can we not have noticed that COPY FROM
STDIN doesn't handle errors before the input is exhausted? It's not just
pgbench, it's psql (and I assume pg_restore) etc. as well.

Greetings,

Andres Freund



Re: libpq copy error handling busted

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> When libpq is used to COPY data to the server, it doesn't properly
> handle errors.
> This is partially an old problem, and partially got recently
> worse. Before the below commit we detected terminated connections, but
> we didn't handle copy failing.

Yeah.  After poking at that for a little bit, there are at least three
problems:

* pqSendSome() is responsible not only for pushing out data, but for
calling pqReadData in any situation where it can't get rid of the data
promptly.  1f39a1c06 overlooked that requirement, and the upshot is
that we don't necessarily notice that the connection is broken (it's
pqReadData's job to detect that).  Putting a pqReadData call into
the early-exit path helps, but doesn't fix things completely.

* The more longstanding problem is that the PQputCopyData code path
doesn't have any mechanism for consuming an 'E' (error) message
once pqReadData has collected it.  AFAICS that's ancient.  (It does
not affect the behavior of this case if you use an immediate-mode
shutdown, because then the backend never issues an 'E' message;
but it does matter in a fast-mode shutdown.)  I think that the
idea was to let the client dump all its copy data and then report
the error message when PQendCopy is called, but as you say, that's
none too friendly when there's gigabytes of data involved.  I'm
not sure we can do anything about this without effectively changing
the client API for copy errors, though.
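
As a concrete illustration of that API shape, here is a toy mock (not libpq code; all of these names are made up) of the contract being described: the "put data" call keeps reporting success after the connection has died, and the stashed error only becomes visible when end-of-copy is reached:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mock of the client-visible COPY contract; none of
 * these names exist in libpq itself. */
typedef struct
{
    int     write_failed;
    char    err_msg[80];
} MockCopyConn;

/* Like PQputCopyData as described above: keeps returning "success"
 * (1) even after the connection failed, silently discarding data. */
static int
mock_put_copy_data(MockCopyConn *conn, const char *data)
{
    (void) conn;
    (void) data;
    return 1;                   /* the failure is not reported here */
}

/* The point at which a write failure is first noticed and recorded. */
static void
mock_note_write_failure(MockCopyConn *conn)
{
    conn->write_failed = 1;
    strcpy(conn->err_msg, "server closed the connection unexpectedly");
}

/* Like PQendcopy: only now does the stored error surface. */
static const char *
mock_end_copy(const MockCopyConn *conn)
{
    return conn->write_failed ? conn->err_msg : NULL;
}
```

The upshot is that a client can pump arbitrarily much data through the put call without ever learning the connection is gone.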

* As for control-C not getting out of it: there is

        if (CancelRequested)
            break;

in pgbench's loop, but this does nothing in this scenario because
fe-utils/cancel.c only sets that flag when it successfully sends a
Cancel ... which it certainly cannot if the postmaster is gone.
I suspect that this may be relatively recent breakage.  It doesn't look
too well thought out, in any case.  The places that are testing this
flag look like they'd rather not be bothered with the fine point of
whether the cancel request actually went anywhere.  (And aside from this
issue, I see no mechanism for that flag to become unset once it's set.
Current users of cancel.c probably don't care, but we'd have noticed if
we tried to make psql use it.)

            regards, tom lane



Re: libpq copy error handling busted

From: Tom Lane
Date:
I wrote:
> * pqSendSome() is responsible not only for pushing out data, but for
> calling pqReadData in any situation where it can't get rid of the data
> promptly.  1f39a1c06 overlooked that requirement, and the upshot is
> that we don't necessarily notice that the connection is broken (it's
> pqReadData's job to detect that).  Putting a pqReadData call into
> the early-exit path helps, but doesn't fix things completely.

Ah, it's better if I put the pqReadData call into *both* the paths
where 1f39a1c06 made pqSendSome give up.  The attached patch seems
to fix the issue for the "pgbench -i" scenario, with either fast-
or immediate-mode server stop.  I tried it with and without SSL too,
just to see.  Still, it's not clear to me whether this might worsen
any of the situations we discussed in the lead-up to 1f39a1c06 [1].
Thomas, are you in a position to redo any of that testing?

> * The more longstanding problem is that the PQputCopyData code path
> doesn't have any mechanism for consuming an 'E' (error) message
> once pqReadData has collected it.

At least with pgbench's approach (die immediately on PQputline failure)
this isn't very relevant once we apply the attached.  Perhaps we should
revisit this behavior anyway, but I'd be afraid to back-patch a change
of that nature.

> * As for control-C not getting out of it: there is
>         if (CancelRequested)
>             break;
> in pgbench's loop, but this does nothing in this scenario because
> fe-utils/cancel.c only sets that flag when it successfully sends a
> Cancel ... which it certainly cannot if the postmaster is gone.

I'll send a patch for this later.

            regards, tom lane

[1]
https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 9afa0533a6..9273984727 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -823,6 +823,10 @@ definitelyFailed:
  * Return 0 on success, -1 on failure and 1 when not all data could be sent
  * because the socket would block and the connection is non-blocking.
  *
+ * Note that this is also responsible for consuming data from the socket
+ * (putting it in conn->inBuffer) in any situation where we can't send
+ * all the specified data immediately.
+ *
  * Upon write failure, conn->write_failed is set and the error message is
  * saved in conn->write_err_msg, but we clear the output buffer and return
  * zero anyway; this is because callers should soldier on until it's possible
@@ -842,12 +846,20 @@ pqSendSome(PGconn *conn, int len)
      * on that connection.  Even if the kernel would let us, we've probably
      * lost message boundary sync with the server.  conn->write_failed
      * therefore persists until the connection is reset, and we just discard
-     * all data presented to be written.
+     * all data presented to be written.  However, as long as we still have a
+     * valid socket, we should continue to absorb data from the backend, so
+     * that we can collect any final error messages.
      */
     if (conn->write_failed)
     {
         /* conn->write_err_msg should be set up already */
         conn->outCount = 0;
+        /* Absorb input data if any, and detect socket closure */
+        if (conn->sock != PGINVALID_SOCKET)
+        {
+            if (pqReadData(conn) < 0)
+                return -1;
+        }
         return 0;
     }
 
@@ -917,6 +929,13 @@ pqSendSome(PGconn *conn, int len)
 
                     /* Discard queued data; no chance it'll ever be sent */
                     conn->outCount = 0;
+
+                    /* Absorb input data if any, and detect socket closure */
+                    if (conn->sock != PGINVALID_SOCKET)
+                    {
+                        if (pqReadData(conn) < 0)
+                            return -1;
+                    }
                     return 0;
             }
         }

Re: libpq copy error handling busted

From: Thomas Munro
Date:
On Thu, Jun 4, 2020 at 1:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > * pqSendSome() is responsible not only for pushing out data, but for
> > calling pqReadData in any situation where it can't get rid of the data
> > promptly.  1f39a1c06 overlooked that requirement, and the upshot is
> > that we don't necessarily notice that the connection is broken (it's
> > pqReadData's job to detect that).  Putting a pqReadData call into
> > the early-exit path helps, but doesn't fix things completely.
>
> Ah, it's better if I put the pqReadData call into *both* the paths
> where 1f39a1c06 made pqSendSome give up.  The attached patch seems
> to fix the issue for the "pgbench -i" scenario, with either fast-
> or immediate-mode server stop.  I tried it with and without SSL too,
> just to see.  Still, it's not clear to me whether this might worsen
> any of the situations we discussed in the lead-up to 1f39a1c06 [1].
> Thomas, are you in a position to redo any of that testing?

Yes, sure.  The testing consisted of running on a system with OpenSSL
1.1.1a (older versions didn't have the problem).  It originally showed
up on eelpout, a very underpowered build farm machine running Linux on
ARM64, but then later we worked out we could make it happen on a Mac
or any other Linux system if we had bad enough luck or if we added a
sleep in a particular spot.  We could do it with psql running in a
loop using a bad certificate from the testing setup stuff, as shown
here:

https://www.postgresql.org/message-id/CA%2BhUKGJafyTgpsYBgQGt1EX0O8UnL4VGHSc7J0KZyMH4_jPGBw%40mail.gmail.com

I don't have access to eelpout from where I am right now, but I'll try
that test now on the Debian 10 amd64 system I have here.  OpenSSL has
since moved on to 1.1.1d-0+deb10u3, but that should be fine, the
triggering change was the move to TLS1.3 so let me see what happens if
I do that with your patch applied...

> > * The more longstanding problem is that the PQputCopyData code path
> > doesn't have any mechanism for consuming an 'E' (error) message
> > once pqReadData has collected it.
>
> At least with pgbench's approach (die immediately on PQputline failure)
> this isn't very relevant once we apply the attached.  Perhaps we should
> revisit this behavior anyway, but I'd be afraid to back-patch a change
> of that nature.
>
> > * As for control-C not getting out of it: there is
> >               if (CancelRequested)
> >                       break;
> > in pgbench's loop, but this does nothing in this scenario because
> > fe-utils/cancel.c only sets that flag when it successfully sends a
> > Cancel ... which it certainly cannot if the postmaster is gone.
>
> I'll send a patch for this later.
>
>                         regards, tom lane
>
> [1]
https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com



Re: libpq copy error handling busted

From: Thomas Munro
Date:
On Thu, Jun 4, 2020 at 1:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jun 4, 2020 at 1:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Ah, it's better if I put the pqReadData call into *both* the paths
> > where 1f39a1c06 made pqSendSome give up.  The attached patch seems
> > to fix the issue for the "pgbench -i" scenario, with either fast-
> > or immediate-mode server stop.  I tried it with and without SSL too,
> > just to see.  Still, it's not clear to me whether this might worsen
> > any of the situations we discussed in the lead-up to 1f39a1c06 [1].
> > Thomas, are you in a position to redo any of that testing?

It seems to behave correctly in that scenario.

Here's what I tested.  First, I put this into pgdata/postgresql.conf:

ssl=on
ssl_ca_file='root+client_ca.crt'
ssl_cert_file='server-cn-only.crt'
ssl_key_file='server-cn-only.key'
ssl_crl_file='root+client.crl'
ssl_min_protocol_version='TLSv1.2'
ssl_max_protocol_version='TLSv1.1'
ssl_min_protocol_version='TLSv1.2'
ssl_max_protocol_version=''

I copied the named files from src/test/ssl/ssl/ into pgdata, and I ran
chmod 600 on the .key file.

I put this into pgdata/pg_hba.conf at the top:

hostssl all all 127.0.0.1/32 cert clientcert=verify-full

I made a copy of src/test/ssl/ssl/client-revoked.key and ran chmod 600 on it.

Now on unpatched master I get:

$ psql "host=127.0.0.1 port=5432 dbname=postgres user=tmunro
sslcert=src/test/ssl/ssl/client-revoked.crt sslkey=client-revoked.key
sslmode=require"
psql: error: could not connect to server: SSL error: sslv3 alert
certificate revoked

It's the same if I add in this sleep in fe-connect.c:

+sleep(1);
                                /*
                                 * Send the startup packet.
                                 *

If I revert 1f39a1c0641531e0462a4822f2dba904c5d4d699 "Restructure
libpq's handling of send failures.", I get the error that eelpout
showed intermittently:

psql: error: could not connect to server: server closed the connection
unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
could not send startup packet: Connection reset by peer

I go back to master, and apply your patch.  I get the expected error:

psql: error: could not connect to server: SSL error: sslv3 alert
certificate revoked



Re: libpq copy error handling busted

From: Thomas Munro
Date:
On Thu, Jun 4, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's what I tested.

In passing, I noticed that this:

$ psql ...
psql: error: could not connect to server: private key file
"src/test/ssl/ssl/client-revoked.key" has group or world access;
permissions should be u=rw (0600) or less

... leads to this nonsensical error message on the server:

2020-06-04 16:03:11.547 NZST [7765] LOG:  could not accept SSL
connection: Success



Re: libpq copy error handling busted

From: Oleksandr Shulgin
Date:
On Thu, Jun 4, 2020 at 5:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jun 4, 2020 at 1:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Thu, Jun 4, 2020 at 1:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Ah, it's better if I put the pqReadData call into *both* the paths
> > > where 1f39a1c06 made pqSendSome give up.  The attached patch seems
> > > to fix the issue for the "pgbench -i" scenario, with either fast-
> > > or immediate-mode server stop.  I tried it with and without SSL too,
> > > just to see.  Still, it's not clear to me whether this might worsen
> > > any of the situations we discussed in the lead-up to 1f39a1c06 [1].
> > > Thomas, are you in a position to redo any of that testing?
>
> It seems to behave correctly in that scenario.
>
> Here's what I tested.  First, I put this into pgdata/postgresql.conf:
>
> ssl=on
> ssl_ca_file='root+client_ca.crt'
> ssl_cert_file='server-cn-only.crt'
> ssl_key_file='server-cn-only.key'
> ssl_crl_file='root+client.crl'
> ssl_min_protocol_version='TLSv1.2'
> ssl_max_protocol_version='TLSv1.1'
> ssl_min_protocol_version='TLSv1.2'
> ssl_max_protocol_version=''
>
> I copied the named files from src/test/ssl/ssl/ into pgdata, and I ran
> chmod 600 on the .key file.
>
> I put this into pgdata/pg_hba.conf at the top:
>
> hostssl all all 127.0.0.1/32 cert clientcert=verify-full
>
> I made a copy of src/test/ssl/ssl/client-revoked.key and ran chmod 600 on it.

Would it be feasible to capture this in some sort of regression (TAP?) test?

--
Alex

Re: libpq copy error handling busted

From: Thomas Munro
Date:
On Thu, Jun 4, 2020 at 6:22 PM Oleksandr Shulgin
<oleksandr.shulgin@zalando.de> wrote:
> On Thu, Jun 4, 2020 at 5:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Here's what I tested.  First, I put this into pgdata/postgresql.conf:

> Would it be feasible to capture this in a sort of a regression (TAP?) test?

If I'm remembering correctly, it wouldn't work on Windows.  After
you've had an error sending to a socket, you can't receive (even if
there was something sent to you earlier).  At least that's how it
seemed from the experiments on that other thread.  The other problem
is that it requires inserting a sleep into the code...



Re: libpq copy error handling busted

From: Tom Lane
Date:
I wrote:
> * As for control-C not getting out of it: there is
>         if (CancelRequested)
>             break;
> in pgbench's loop, but this does nothing in this scenario because
> fe-utils/cancel.c only sets that flag when it successfully sends a
> Cancel ... which it certainly cannot if the postmaster is gone.
> I suspect that this may be relatively recent breakage.  It doesn't look
> too well thought out, in any case.  The places that are testing this
> flag look like they'd rather not be bothered with the fine point of
> whether the cancel request actually went anywhere.

On closer inspection, it seems that scripts_parallel.c does have a
dependency on the cancel request having been sent, because it insists
on collecting a query result off the active connection after detecting
CancelRequested.  This seems dangerously overoptimistic to me; it will
lock up if for any reason the server doesn't honor the cancel request.
It's also pointless, because all the calling apps are just going to close
their connections and exit(1) afterwards, so there's no use in trying to
resynchronize the connection state.  (Plus, disconnectDatabase will
issue cancels on any busy connections; which would be necessary anyway
in a parallel operation, since cancel.c could only have signaled one of
them.)  So the attached patch just removes the useless consumeQueryResult
call, and then simplifies select_loop's API a bit.

With that change, I don't see any place that wants the existing definition
of CancelRequested rather than the simpler meaning of "SIGINT was
received", so I just changed it to mean that.  We could certainly also
have a variable tracking whether a cancel request was sent, but I see
no point in one right now.

It's easiest to test this *without* the other patch -- just run the
pgbench scenario Andres demonstrated, and see whether control-C gets
pgbench to quit cleanly.

            regards, tom lane

diff --git a/src/bin/scripts/scripts_parallel.c b/src/bin/scripts/scripts_parallel.c
index 45c69b8d19..c3384d452a 100644
--- a/src/bin/scripts/scripts_parallel.c
+++ b/src/bin/scripts/scripts_parallel.c
@@ -28,7 +28,7 @@
 #include "scripts_parallel.h"

 static void init_slot(ParallelSlot *slot, PGconn *conn);
-static int    select_loop(int maxFd, fd_set *workerset, bool *aborting);
+static int    select_loop(int maxFd, fd_set *workerset);

 static void
 init_slot(ParallelSlot *slot, PGconn *conn)
@@ -41,23 +41,16 @@ init_slot(ParallelSlot *slot, PGconn *conn)
 /*
  * Loop on select() until a descriptor from the given set becomes readable.
  *
- * If we get a cancel request while we're waiting, we forego all further
- * processing and set the *aborting flag to true.  The return value must be
- * ignored in this case.  Otherwise, *aborting is set to false.
+ * Returns -1 on failure (including getting a cancel request).
  */
 static int
-select_loop(int maxFd, fd_set *workerset, bool *aborting)
+select_loop(int maxFd, fd_set *workerset)
 {
     int            i;
     fd_set        saveSet = *workerset;

     if (CancelRequested)
-    {
-        *aborting = true;
         return -1;
-    }
-    else
-        *aborting = false;

     for (;;)
     {
@@ -90,7 +83,7 @@ select_loop(int maxFd, fd_set *workerset, bool *aborting)
         if (i < 0 && errno == EINTR)
             continue;            /* ignore this */
         if (i < 0 || CancelRequested)
-            *aborting = true;    /* but not this */
+            return -1;            /* but not this */
         if (i == 0)
             continue;            /* timeout (Win32 only) */
         break;
@@ -135,7 +128,6 @@ ParallelSlotsGetIdle(ParallelSlot *slots, int numslots)
     {
         fd_set        slotset;
         int            maxFd = 0;
-        bool        aborting;

         /* We must reconstruct the fd_set for each call to select_loop */
         FD_ZERO(&slotset);
@@ -157,19 +149,12 @@ ParallelSlotsGetIdle(ParallelSlot *slots, int numslots)
         }

         SetCancelConn(slots->connection);
-        i = select_loop(maxFd, &slotset, &aborting);
+        i = select_loop(maxFd, &slotset);
         ResetCancelConn();

-        if (aborting)
-        {
-            /*
-             * We set the cancel-receiving connection to the one in the zeroth
-             * slot above, so fetch the error from there.
-             */
-            consumeQueryResult(slots->connection);
+        /* failure? */
+        if (i < 0)
             return NULL;
-        }
-        Assert(i != 0);

         for (i = 0; i < numslots; i++)
         {
diff --git a/src/fe_utils/cancel.c b/src/fe_utils/cancel.c
index eb4056a9a6..51fb67d384 100644
--- a/src/fe_utils/cancel.c
+++ b/src/fe_utils/cancel.c
@@ -43,11 +43,11 @@
 static PGcancel *volatile cancelConn = NULL;

 /*
- * CancelRequested tracks if a cancellation request has completed after
- * a signal interruption.  Note that if cancelConn is not set, in short
- * if SetCancelConn() was never called or if ResetCancelConn() freed
- * the cancellation object, then CancelRequested is switched to true after
- * all cancellation attempts.
+ * CancelRequested is set when we receive SIGINT (or local equivalent).
+ * There is no provision in this module for resetting it; but applications
+ * might choose to clear it after successfully recovering from a cancel.
+ * Note that there is no guarantee that we successfully sent a Cancel request,
+ * or that the request will have any effect if we did send it.
  */
 volatile sig_atomic_t CancelRequested = false;

@@ -148,6 +148,8 @@ handle_sigint(SIGNAL_ARGS)
     int            save_errno = errno;
     char        errbuf[256];

+    CancelRequested = true;
+
     if (cancel_callback != NULL)
         cancel_callback();

@@ -156,7 +158,6 @@ handle_sigint(SIGNAL_ARGS)
     {
         if (PQcancel(cancelConn, errbuf, sizeof(errbuf)))
         {
-            CancelRequested = true;
             write_stderr(_("Cancel request sent\n"));
         }
         else
@@ -165,8 +166,6 @@ handle_sigint(SIGNAL_ARGS)
             write_stderr(errbuf);
         }
     }
-    else
-        CancelRequested = true;

     errno = save_errno;            /* just in case the write changed it */
 }
@@ -193,6 +192,8 @@ consoleHandler(DWORD dwCtrlType)
     if (dwCtrlType == CTRL_C_EVENT ||
         dwCtrlType == CTRL_BREAK_EVENT)
     {
+        CancelRequested = true;
+
         if (cancel_callback != NULL)
             cancel_callback();

@@ -203,7 +204,6 @@ consoleHandler(DWORD dwCtrlType)
             if (PQcancel(cancelConn, errbuf, sizeof(errbuf)))
             {
                 write_stderr(_("Cancel request sent\n"));
-                CancelRequested = true;
             }
             else
             {
@@ -211,8 +211,6 @@ consoleHandler(DWORD dwCtrlType)
                 write_stderr(errbuf);
             }
         }
-        else
-            CancelRequested = true;

         LeaveCriticalSection(&cancelConnLock);


Re: libpq copy error handling busted

From: Andres Freund
Date:
Hi,

On 2020-06-03 18:41:28 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > When libpq is used to COPY data to the server, it doesn't properly
> > handle errors.
> > This is partially an old problem, and partially got recently
> > worse. Before the below commit we detected terminated connections, but
> > we didn't handle copy failing.
> 
> Yeah.  After poking at that for a little bit, there are at least three
> problems:
> 
> * pqSendSome() is responsible not only for pushing out data, but for
> calling pqReadData in any situation where it can't get rid of the data
> promptly.  1f39a1c06 overlooked that requirement, and the upshot is
> that we don't necessarily notice that the connection is broken (it's
> pqReadData's job to detect that).  Putting a pqReadData call into
> the early-exit path helps, but doesn't fix things completely.

Is that fully necessary? Couldn't we handle at least the case I had by
looking at write_failed in additional places?

It might still be the right thing to continue to call pqReadData() from
pqSendSome(), don't get me wrong.


> * The more longstanding problem is that the PQputCopyData code path
> doesn't have any mechanism for consuming an 'E' (error) message
> once pqReadData has collected it.  AFAICS that's ancient.

Yea, I looked back quite a bit, and it looked that way for a long
time. I thought for a moment that it might be related to the copy-both
introduction, but it wasn't.


> I think that the idea was to let the client dump all its copy data and
> then report the error message when PQendCopy is called, but as you
> say, that's none too friendly when there's gigabytes of data involved.
> I'm not sure we can do anything about this without effectively
> changing the client API for copy errors, though.

Hm. Why would it *really* be an API change? Until recently connection
failures etc were returned from PQputCopyData(), and it is documented
that way:

/*
 * PQputCopyData - send some data to the backend during COPY IN or COPY BOTH
 *
 * Returns 1 if successful, 0 if data could not be sent (only possible
 * in nonblock mode), or -1 if an error occurs.
 */
int
PQputCopyData(PGconn *conn, const char *buffer, int nbytes)

So consuming 'E' when in copy mode doesn't seem like a crazy change to
me. Probably a bit too big to backpatch though. But given that this
hasn't been complained about much in however many years...

Greetings,

Andres Freund



Re: libpq copy error handling busted

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2020-06-03 18:41:28 -0400, Tom Lane wrote:
>> * pqSendSome() is responsible not only for pushing out data, but for
>> calling pqReadData in any situation where it can't get rid of the data
>> promptly.  1f39a1c06 overlooked that requirement, and the upshot is
>> that we don't necessarily notice that the connection is broken (it's
>> pqReadData's job to detect that).  Putting a pqReadData call into
>> the early-exit path helps, but doesn't fix things completely.

> Is that fully necessary? Couldn't we handle at least the case I had by
> looking at write_failed in additional places?

No doubt there's more than one way to do it, but I like fixing this in
pqSendSome; that's adding symmetry not warts.  It's already the case
that pqSendSome must absorb input when it's transiently unable to send
(that has to be true to avoid livelock when TCP buffers are full in both
directions).  So making it absorb input when it's permanently unable
to send seems consistent with that.  Also, fixing this at outer levels
would make it likely that we're not covering as many cases; which was
essentially the point of 1f39a1c06.
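
That symmetric-buffers hazard is easy to demonstrate outside libpq. The sketch below (plain sockets, no libpq; assumes a POSIX system with MSG_DONTWAIT) fills both directions of a socketpair until writes would block, then shows that absorbing pending input is what lets the peer make progress again:

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Write zero-filled chunks until the kernel would block.
 * Returns 1 if at least some data was sent before blocking. */
static int
fill_until_blocked(int fd)
{
    char        buf[4096] = {0};
    long        total = 0;

    for (;;)
    {
        ssize_t     n = send(fd, buf, sizeof(buf), MSG_DONTWAIT);

        if (n < 0)
        {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return total > 0;
            return -1;          /* unexpected error */
        }
        total += n;
    }
}

/* Absorb up to 64kB of pending input; returns 1 if anything was read. */
static int
drain_some(int fd)
{
    char        buf[65536];

    return recv(fd, buf, sizeof(buf), MSG_DONTWAIT) > 0;
}
```

Once both directions are full, a writer that never reads is permanently stuck; one read on the peer frees buffer space and the stalled side can send again.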

>> I think that the idea was to let the client dump all its copy data and
>> then report the error message when PQendCopy is called, but as you
>> say, that's none too friendly when there's gigabytes of data involved.
>> I'm not sure we can do anything about this without effectively
>> changing the client API for copy errors, though.

> Hm. Why would it *really* be an API change?

It'd still conform to the letter of the documentation, sure, but it'd
nonetheless be a user-visible behavioral change.

It strikes me that we could instead have the COPY code path "peek"
to see if an 'E' message is waiting in the inBuffer, without actually
consuming it, and start failing PQputCopyData calls if so.  That would
be less of a behavioral change than consuming the message, in the sense
that the error would still be available to be reported when PQendcopy is
called.
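
In protocol-3 terms the peek could be as simple as checking the message-type byte at the current read position of the input buffer. A toy version (hypothetical names chosen to echo libpq's buffer fields, but this is only a sketch; real code would also verify that the complete message has arrived before trusting the byte):

```c
#include <assert.h>

/* Mock of libpq-style input buffer bookkeeping: inBuffer holds raw
 * bytes, inCursor is the next unread position, inEnd is one past the
 * last valid byte.  Not libpq's actual structures. */
typedef struct
{
    const char *inBuffer;
    int         inCursor;
    int         inEnd;
} MockInput;

/* Return 1 if the next pending server message is an ErrorResponse
 * ('E'), without consuming anything from the buffer. */
static int
copy_error_pending(const MockInput *in)
{
    return in->inCursor < in->inEnd && in->inBuffer[in->inCursor] == 'E';
}
```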

On the other hand, that approach assumes that the application will
indeed call PQendcopy to see what's up, rather than just printing some
unhelpful "copy failed" message and going belly up.  pgbench is, um, a
counterexample.  If we suppose that pgbench is representative of the
state of the art in applications, then we'd be better off consuming the
error message and reporting it via the notice mechanism.  Which would
effectively mean that the user gets to see it and the application
doesn't.  On the whole I don't like that, but if we do it the first way
then there might be a lot of apps that need upgrades to handle COPY
failures nicely.  (And on the third hand, those apps *already* need
upgrades to handle COPY failures nicely, plus right now you have to
wait till the end of the COPY.  So anything would be an improvement.)

> But given that this
> hasn't been complained about much in however many years...

Yeah, it's kind of hard to summon the will to break things when
there aren't user complaints.  You can bet that somebody will
complain if we change this, in either direction.

            regards, tom lane