Обсуждение: OOM in libpq and infinite loop with getCopyStart()

Поиск

Список

Период

Сортировка

OOM in libpq and infinite loop with getCopyStart()

От

Michael Paquier

Дата:

01 марта 2016 г., 08:38:28

Hi all,

One year and a half ago Heikki has reported that libpq could run on an
infinite loop in a couple of code paths should an OOM happen in
PQExec():
http://www.postgresql.org/message-id/547480DE.4040408@vmware.com
The code paths are:
- BIND messages for getRowDescriptions(), or 'T' messages, fixed per 7b96bf44
- Code paths handling some calls to PQmakeEmptyPGresult(), fixed per 414bef3
- Start message for COPY command, not fixed yet.
The two first problems have been addressed, not the last one. And as
the last thread became long and confusing here is a revival of the
topic on a new thread, with a fresh patch, and a fresh study of the
problem submitted for this CF because it would be nice to get things
fixed, or improved in some cases, but see below...

Here is a summary of what this patch does and improves in some cases:
1) COPY IN and COPY OUT, without the patch, those two remain stuck in
an infinite loop... With the patch an error "out of memory" is
reported to the caller.
I had a look as well at some community utilities, like pgloader and
pg_bulkload, though none of them are using getCopyStart(). Do other
people have ideas to things to look at? pgadmin for example?

Now in the case of COPY_BOTH. let's take a couple of examples:
2) pg_receivexlog:
In the case of pg_receivexlog, the patch allows libpq to throw
correctly an error back to the caller, then pg_receivexlog will try to
send START_REPLICATION again at its next loop:
pg_receivexlog: could not send replication command
"START_REPLICATION": out of memory
Without the patch pg_receivexlog would just ignore the error, and at
the next iteration of ParseInput() the message obtained from server
would be treated because the message cursor does not move on in this
case.
3) pg_recvlogical
Similarly pg_recvlogical fails with the following error with the patch attached:
pg_recvlogical: could not send replication command "START_REPLICATION
SLOT "toto" LOGICAL 0/0": out of memory
And after the failure it loops back to the next attempt, and is able
to perform its streaming stuff. Without the patch, similarly to
pg_receivexlog, COPY_BOTH would take the code path of ParseInput()
once again and it would treat the message, ignoring any OOM problems
on the way.
4) libpqwalreceiver: upon noticing the OOM error with the patch
attached, a standby would fail after sending START_REPLICATION, log
the OOM error properly and attempt to restart later a new WAL receiver
process to try again. Without this patch, the WAL receiver would still
fail to start, and log the following, unhelpful message:
2016-03-01 14:07:00 JST [84058]: [22-1] db=,user=,app=,client= LOG:
invalid record length at 0/3000970
That's not really informational for the user :( Note though that in
this case the WAL receiver does not remain in an infinite loop, and
that it is able to restart properly because it fails with this
"invalid record length error", and then start streaming at its next
attempt when the WAL receiver is started again.

So, that's how things are standing. I still see that it would be a
gain in terms of failure visibility to log properly this OOM failure
in all cases. Now it is true that if there are some applications that
may expect libpq to *not* return an error and treat the COPY start
message at the next loop of ParseInput(), though by looking at what is
in-core things can be handled properly.

Thoughts? I have registered that in the CF app, and a patch is attached.
--
Michael

Вложения

0001-Fix-OOM-error-handling-in-COPY-protocol-of-libpq.patch

Re: OOM in libpq and infinite loop with getCopyStart()

От

Aleksander Alekseev

Дата:

03 марта 2016 г., 16:21:46

Hello, Michael

I didn't checked your patch in detail yet but here is a thought I would
like to share.

In my experience usually it takes number of rewrites before patch will
be accepted. To make sure that after every rewrite your patch still
solves an issue you described you should probably provide a few test
programs too. If it's possible to make these programs part of regression
tests suite it would be just great.

Without such test programs I personally find it difficult to verify
that your patch fixes something. If such examples are provided your
patch would be much more likely to accepted. "See, this program crashes
everything. Now we apply a patch and everything works! Cool, heh?"

Best regards,
Aleksander

Re: OOM in libpq and infinite loop with getCopyStart()

От

Michael Paquier

Дата:

03 марта 2016 г., 16:45:53

On Thu, Mar 3, 2016 at 10:18 PM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
> I didn't checked your patch in detail yet but here is a thought I would
> like to share.
>
> In my experience usually it takes number of rewrites before patch will
> be accepted. To make sure that after every rewrite your patch still
> solves an issue you described you should probably provide a few test
> programs too. If it's possible to make these programs part of regression
> tests suite it would be just great.

Yes that's usually the case, this patch is not at its first version.

> Without such test programs I personally find it difficult to verify
> that your patch fixes something. If such examples are provided your
> patch would be much more likely to accepted. "See, this program crashes
> everything. Now we apply a patch and everything works! Cool, heh?"

The easiest way to perform tests with this patch is to take a debugger
and enforce the malloc'd pointers to NULL in the code paths. That's
wild but efficient, and that's actually for we did for the other two
libpq patches treating those OOM failures. In the case of the standby
code path what I did was to put a sleep in the WAL receiver code and
then trigger the OOM manually, leading to the errors listed upthread.
Regarding automated tests, it is largely feasible to have tests on
Linux at least by using for example LD_PRELOAD or glibc-specific
malloc hooks, though those would be platform-restricted. Here is for
example one for COPY IN/OUT, that this patch also passed (there has
been already some discussion for this patch).
http://www.postgresql.org/message-id/55FC501A.5000409@iki.fi
Applying it for recovery is doable as well by manipulating the
recovery protocol with libpq though it is aimed really at covering a
narrow case.
-- 
Michael

Re: OOM in libpq and infinite loop with getCopyStart()

От

Aleksander Alekseev

Дата:

03 марта 2016 г., 19:02:05

Hello, Michael

> The easiest way to perform tests with this patch is to take a debugger
> and enforce the malloc'd pointers to NULL in the code paths.

I see. Still I don't think it's an excuse to not provide clear steps to
reproduce an issue. As I see it anyone should be able to easily check
your patch locally without having deep understanding of how libpq is
implemented or reading thread which contains 48 e-mails.

For instance you cold provide a step-by-step instruction like:

1. run gdb --args some-prog arg1 arg2 ...
2. b some_file.c:123
3. c
4. ...
5. we expect ... but actually see ...

Or you cold automate these steps using gdb batch file:

gdb --batch --command=gdb.script --args some-prog arg1 arg2

... where gdb.script is:

b some_file.c:123
c
... etc ...

Naturally all of this is optional. But it will simplify both
understanding and reviewing of this path. Thus chances that this patch
will be accepted will be increased and time required for this will be
reduced significantly.

Best regards,
Aleksander

Re: OOM in libpq and infinite loop with getCopyStart()

От

Michael Paquier

Дата:

04 марта 2016 г., 08:16:48

On Fri, Mar 4, 2016 at 12:59 AM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
>> The easiest way to perform tests with this patch is to take a debugger
>> and enforce the malloc'd pointers to NULL in the code paths.
>
> I see. Still I don't think it's an excuse to not provide clear steps to
> reproduce an issue. As I see it anyone should be able to easily check
> your patch locally without having deep understanding of how libpq is
> implemented or reading thread which contains 48 e-mails.

OK, here they are if that helps. Following those steps and having some
knowledge of lldb or gdb is enough. The key point is that getCopyStart
is the routine to breakpoint and make fail.
A) psql and COPY.
1) gdb --args psql -c 'COPY (SELECT 1) TO stdout'
2) Then take a breakpoint at getCopyStart()
3) run
4) Enforce the return result of PQmakeEmptyPGresult to NULL:
p result = 0
5) Enjoy the infinite loop with HEAD, and the error with the patch

B) pg_receivexlog:
1) mkdir data
gdb --args pg_receivexlog --verbose -D data/
2) Take a breakpoint at getCopyStart
3) run
4) enforce result = 0 after the call to PQmakeEmptyPGresult
5) And enjoy what has been reported

C) pg_recvlogical, similar to B.
1) Create a replication slot on a server that accepts them
select pg_create_logical_replication_slot('foo', 'test_decoding');
2) Same breakpoint as B) with this start command or similar (depends
on where the logical slot has been created):
pg_recvlogical --start --slot foo -f - -d postgres

D) triggering standby problem:
1) Patch upstream, as follows by adding a sleep in
libpqrcv_startstreaming of libpqwalreceiver.c. This is a simple method
to have the time to take a breakpoint on the WAL receiver process that
has been started, with streaming that has not begun yet.
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index f670957..c7fccf1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -188,6 +188,9 @@ libpqrcv_startstreaming(TimeLineID tli, XLogRecPtr
startpoint, char *slotname)
    char        cmd[256];
    PGresult   *res;

+   /* ZZZzzz... */
+   pg_usleep(10000L);
+
    /* Start streaming from the point requested by startup process */
    if (slotname != NULL)
        snprintf(cmd, sizeof(cmd),
Trying to design a test taking into account the context of a WAL
receiver does not seem worth it to me with a simple method like this
one...
2) Start the standby and attach debugger on the WAL receiver process,
send SIGSTOP to it, whatever...
3) breakpoint at getCopyStart
4) Then move on with the same process as previously, and enjoy the errors.
--
Michael

Вложения

sleep-wal-receiver.patch

Re: OOM in libpq and infinite loop with getCopyStart()

От

Aleksander Alekseev

Дата:

09 марта 2016 г., 17:56:55

Hello, Michael

Thanks a lot for steps to reproduce you provided.

I tested your path on Ubuntu Linux 14.04 LTS (GCC 4.8.4) and FreeBSD
10.2 RELEASE (Clang 3.4.1). In both cases patch applies cleanly, there
are no warnings during compilation and all regression tests pass. A few
files are not properly pgindent'ed though:

```
diff --git a/src/interfaces/libpq/fe-exec.c
b/src/interfaces/libpq/fe-exec.c index c99f193..2769719 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -2031,7 +2031,7 @@ PQexecFinish(PGconn *conn)           conn->status == CONNECTION_BAD)           break;       else
if((conn->asyncStatus == PGASYNC_COPY_IN ||
 
-                 conn->asyncStatus == PGASYNC_COPY_OUT  ||
+                 conn->asyncStatus == PGASYNC_COPY_OUT ||                 conn->asyncStatus == PGASYNC_COPY_BOTH) &&
            result->resultStatus == PGRES_FATAL_ERROR)           break;
 
diff --git a/src/interfaces/libpq/fe-protocol3.c
b/src/interfaces/libpq/fe-protocol3.c index 21a1d9b..280ca16 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -49,9 +49,9 @@ static int    getParamDescriptions(PGconn *conn, int
msgLength); static int getAnotherTuple(PGconn *conn, int msgLength);static int getParameterStatus(PGconn *conn);static
intgetNotify(PGconn *conn);
 
-static int getCopyStart(PGconn *conn,
-                        ExecStatusType copytype,
-                        int msgLength);
+static int getCopyStart(PGconn *conn,
+            ExecStatusType copytype,
+            int msgLength);static int getReadyForQuery(PGconn *conn);static void reportErrorPosition(PQExpBuffer msg,
constchar *query,                   int loc, int encoding);
 
```

Indeed your patch solves issues you described. Still here is something
that concerns me:

```
$ gdb --args pg_receivexlog --verbose -D ./temp_data/

(gdb) b getCopyStart
Function "getCopyStart" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (getCopyStart) pending.
(gdb) r
Starting program: /usr/local/pgsql/bin/pg_receivexlog --verbose
-D ./temp_data/ [Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1". pg_receivexlog: starting log
streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610220, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398        const char
*errmsg = NULL; (gdb) n
1400        result = PQmakeEmptyPGresult(conn, copytype);
(gdb) 
1401        if (!result)
(gdb) p result = 0
$1 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION":
out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
again pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398        const char
*errmsg = NULL; (gdb) n
1400        result = PQmakeEmptyPGresult(conn, copytype);
(gdb) n
1401        if (!result)
(gdb) p result = 0
$2 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION":
out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
again pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398        const char
*errmsg = NULL;
```

Granted this behaviour is a bit better then the current one. But
basically it's the same infinite loop only with pauses and warnings. I
wonder if this is a behaviour we really want. For instance wouldn't it
be better just to terminate an application in out-of-memory case? "Let
it crash" as Erlang programmers say.

Best regards,
Aleksander

Re: OOM in libpq and infinite loop with getCopyStart()

От

Alvaro Herrera

Дата:

09 марта 2016 г., 18:18:05

Aleksander Alekseev wrote:

> pg_receivexlog: could not send replication command "START_REPLICATION":
> out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
> again pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)
> 
> Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
> msgLength=3) at fe-protocol3.c:1398 1398        const char
> *errmsg = NULL;
> ```
> 
> Granted this behaviour is a bit better then the current one. But
> basically it's the same infinite loop only with pauses and warnings. I
> wonder if this is a behaviour we really want. For instance wouldn't it
> be better just to terminate an application in out-of-memory case? "Let
> it crash" as Erlang programmers say.

Hmm.  It would be useful to retry in the case that there is a chance
that the program releases memory and can continue later.  But if it will
only stay there doing nothing other than retrying, then that obviously
will not happen.  One situation where this might help is if the overall
*system* is short on memory and we expect that situation to resolve
itself after a while -- after all, if the system is so loaded that it
can't allocate a few more bytes for the COPY message, then odds are that
other things are also crashing and eventually enough memory will be
released that pg_receivexlog can continue.

On the other hand, if the system is so loaded, perhaps it's better to
"let it crash" and have it restart later -- presumably once the admin
notices the problem and restarts it manually after cleaning up the mess.

If all programs are well behaved and nothing crashes when OOM but they
all retry instead, then everything will continue to retry infinitely and
make no progress.  That cannot be good.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: OOM in libpq and infinite loop with getCopyStart()

От

Michael Paquier

Дата:

10 марта 2016 г., 10:22:15

On Thu, Mar 10, 2016 at 12:12 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Aleksander Alekseev wrote:
>> pg_receivexlog: could not send replication command "START_REPLICATION":
>> out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
>> again pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)
>>
>> Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
>> msgLength=3) at fe-protocol3.c:1398 1398              const char
>> *errmsg = NULL;
>> ```
>>
>> Granted this behaviour is a bit better then the current one. But
>> basically it's the same infinite loop only with pauses and warnings. I
>> wonder if this is a behaviour we really want. For instance wouldn't it
>> be better just to terminate an application in out-of-memory case? "Let
>> it crash" as Erlang programmers say.
>
> Hmm.  It would be useful to retry in the case that there is a chance
> that the program releases memory and can continue later.  But if it will
> only stay there doing nothing other than retrying, then that obviously
> will not happen.  One situation where this might help is if the overall
> *system* is short on memory and we expect that situation to resolve
> itself after a while -- after all, if the system is so loaded that it
> can't allocate a few more bytes for the COPY message, then odds are that
> other things are also crashing and eventually enough memory will be
> released that pg_receivexlog can continue.

Yep, that's my assumption regarding that, at some point the system may
succeed, and I don't think that we should break the current behaviors
of pg_receivexlog and pg_recvlogical regarding that in the
back-branches. Now, note that without the patch we actually have the
same problem. Say if OOMs happen continuously in getCopyStart, with
COPY_BOTH libpq would attempt to read the next message continuously
and would keep failing. Except that in this case the caller has no
idea what is happening as things keep running in libpq itself.

> On the other hand, if the system is so loaded, perhaps it's better to
> "let it crash" and have it restart later -- presumably once the admin
> notices the problem and restarts it manually after cleaning up the mess.
>
> If all programs are well behaved and nothing crashes when OOM but they
> all retry instead, then everything will continue to retry infinitely and
> make no progress.  That cannot be good.

That's something we could take care of in those client utilities I
think with a new option like --maximum-retries or similar, but anyway
I think that's a different discussion. The patch I am proposing here
allows a client application to be made aware of OOM errors that
happen. If we don't do something about that first, something like
--maximum-retries would be useless for COPY_BOTH as the client will
never be made aware of the OOM that happened in libpq and would keep
looping inside libpq itself until some memory is freed.
-- 
Michael

Re: OOM in libpq and infinite loop with getCopyStart()

От

Robert Haas

Дата:

21 марта 2016 г., 19:44:05

On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Thoughts? I have registered that in the CF app, and a patch is attached.

It is very difficult to believe that this is a good idea:

--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)        if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
   PQresultStatus(lastResult) == PGRES_COPY_OUT ||            PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
 
+            PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||            PQstatus(streamConn) == CONNECTION_BAD)
     break;    }
 

I mean, why would it be a good idea to blindly skip over fatal errors?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: OOM in libpq and infinite loop with getCopyStart()

От

Amit Kapila

Дата:

22 марта 2016 г., 06:54:54

On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > Thoughts? I have registered that in the CF app, and a patch is attached.
>
> It is very difficult to believe that this is a good idea:
>
> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
> if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
> PQresultStatus(lastResult) == PGRES_COPY_OUT ||
> PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> + PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
> PQstatus(streamConn) == CONNECTION_BAD)
> break;
> }
>
> I mean, why would it be a good idea to blindly skip over fatal errors?
>

I think it is not about skipping the FATAL error, rather to stop trying to get further results on FATAL error. This is to ensure that OOM error is reported rather that ignored. There has been discussion about this previously as well [1].

[1] - http://www.postgresql.org/message-id/CAB7nPqT6gKj6iS9VTPth_h6Sz5Jo-177s6QJN_jrW66wyCjJ=w@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: OOM in libpq and infinite loop with getCopyStart()

От

Tom Lane

Дата:

22 марта 2016 г., 07:16:11

Amit Kapila <amit.kapila16@gmail.com> writes:
> On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It is very difficult to believe that this is a good idea:
>> 
>> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>>          if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>>              PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>>              PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
>> +            PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>>              PQstatus(streamConn) == CONNECTION_BAD)
>>             break;
>> 
>> I mean, why would it be a good idea to blindly skip over fatal errors?

> I think it is not about skipping the FATAL error, rather to stop trying to
> get further results on FATAL error.

If the code already includes "lost the connection" as a case to break on,
I'm not quite sure why "got a query error" is not.  Maybe that PQstatus
check is broken too, but it doesn't seem like this patch makes it more so.
        regards, tom lane

Re: OOM in libpq and infinite loop with getCopyStart()

От

Amit Kapila

Дата:

22 марта 2016 г., 08:19:34

On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> It is very difficult to believe that this is a good idea:
> >>
> >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
> >> if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
> >> PQresultStatus(lastResult) == PGRES_COPY_OUT ||
> >> PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> >> + PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
> >> PQstatus(streamConn) == CONNECTION_BAD)
> >> break;
> >>
> >> I mean, why would it be a good idea to blindly skip over fatal errors?
>
> > I think it is not about skipping the FATAL error, rather to stop trying to
> > get further results on FATAL error.
>
> If the code already includes "lost the connection" as a case to break on,
> I'm not quite sure why "got a query error" is not.
>

This error check is exactly same as PQexecFinish() and there some explanation is given in comments which hints towards the reason for continuing on error, basically both the functions PQexecFinish() and libpqrcv_PQexec() returns the last result if there are many and PQexecFinish() concatenates the errors as well in some cases. Do we see any use in continuing to get result after getting PGRES_FATAL_ERROR error?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: OOM in libpq and infinite loop with getCopyStart()

От

Michael Paquier

Дата:

22 марта 2016 г., 10:37:12

On Tue, Mar 22, 2016 at 2:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Amit Kapila <amit.kapila16@gmail.com> writes:
>> > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas@gmail.com>
>> > wrote:
>> >> It is very difficult to believe that this is a good idea:
>> >>
>> >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>> >>          if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>> >>              PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>> >>              PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
>> >> +            PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>> >>              PQstatus(streamConn) == CONNECTION_BAD)
>> >>             break;
>> >>
>> >> I mean, why would it be a good idea to blindly skip over fatal errors?
>>
>> > I think it is not about skipping the FATAL error, rather to stop trying
>> > to
>> > get further results on FATAL error.
>>
>> If the code already includes "lost the connection" as a case to break on,
>> I'm not quite sure why "got a query error" is not.
>>
>
> This error check is exactly same as PQexecFinish() and there some
> explanation is given in comments which hints towards the reason for
> continuing on error, basically both the functions PQexecFinish() and
> libpqrcv_PQexec() returns the last result if there are many and
> PQexecFinish() concatenates the errors as well in some cases.  Do we see any
> use in continuing to get result after getting PGRES_FATAL_ERROR error?

I spent quite a couple of hours looking at that, and I still fail to
see how it would be an advantage to stack errors. We reduce the error
message visibility for frontend clients, and this does not prevent
clients to actually see error messages generated by a backend, take a
WAL sender. And actually depending on the message type received we may
end up by simply ignoring those error messages and have libpq discard
them. So it seems like an oversight of 03a571a4.

One could always say that this change breaks the case where multiple
error messages are sent in a row from a client with COPY_* (IN, OUT
and BOTH), because we'd get back the last error message, while with
this change we let client know the first one, though that would mean
that client is missing something when using COPY_BOTH.

Also...
@@ -2023,6 +2030,11 @@ PQexecFinish(PGconn *conn)           result->resultStatus == PGRES_COPY_BOTH ||
conn->status== CONNECTION_BAD)           break;
 
+       else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
+                 conn->asyncStatus == PGASYNC_COPY_OUT  ||
+                 conn->asyncStatus == PGASYNC_COPY_BOTH) &&
+                result->resultStatus == PGRES_FATAL_ERROR)
+           break;
The reason behind this check in PQexecFinish is that we need to make
the difference between an error where there is not enough data, which
means that we want to retry again the message fetching through
parseInput3(), and the case where an actual OOM happened, where we
want to exit immediately and let the caller know that there has been a
frontend error. Now the other reason why we went on with
PGRES_FATAL_ERROR here is that it was discussed that it was an
overkill to assign a special status to PGASYNC to detect a
frontend-side failure, because we already treat other frontend-side
errors with PGRES_FATAL_ERROR and assign an error message to them (see
for example when an allocation fails for 'C', introduced in another
patch of the OOM failure series), and we still need to know that we
are in PGASYNC_COPY_* mode in this code path as well because it means
that the start message has been processed, but we had a failure in
deparsing it so we bypassed it, and need to fail immediately. So using
PGREC_FATAL_ERROR simplifies the code paths creating an error stack.

Hope this brings some light in.
-- 
Michael

Re: OOM in libpq and infinite loop with getCopyStart()

От

David Steele

Дата:

29 марта 2016 г., 18:01:52

Hi Amit,

On 3/22/16 3:37 AM, Michael Paquier wrote:

> Hope this brings some light in.

Do you know when you'll have time to respond to Michael's last email? 
I've marked this patch "waiting on author" in the meantime.

Thanks,
-- 
-David
david@pgmasters.net

Re: OOM in libpq and infinite loop with getCopyStart()

От

Michael Paquier

Дата:

30 марта 2016 г., 03:21:12

On Wed, Mar 30, 2016 at 12:01 AM, David Steele <david@pgmasters.net> wrote:
> Hi Amit,
>
> On 3/22/16 3:37 AM, Michael Paquier wrote:
>
>> Hope this brings some light in.
>
>
> Do you know when you'll have time to respond to Michael's last email? I've
> marked this patch "waiting on author" in the meantime.

Shouldn't it be "needs review" instead? I am marked as the author of this patch.
-- 
Michael

Re: OOM in libpq and infinite loop with getCopyStart()

От

David Steele

Дата:

30 марта 2016 г., 03:23:48

On 3/29/16 8:21 PM, Michael Paquier wrote:
> On Wed, Mar 30, 2016 at 12:01 AM, David Steele <david@pgmasters.net> wrote:
>> Hi Amit,
>>
>> On 3/22/16 3:37 AM, Michael Paquier wrote:
>>
>>> Hope this brings some light in.
>>
>>
>> Do you know when you'll have time to respond to Michael's last email? I've
>> marked this patch "waiting on author" in the meantime.
> 
> Shouldn't it be "needs review" instead? I am marked as the author of this patch.

Whoops!  I got the author and reviewer backwards.  Set back to "needs
review".

-- 
-David
david@pgmasters.net

Re: OOM in libpq and infinite loop with getCopyStart()

От

Amit Kapila

Дата:

30 марта 2016 г., 07:55:55

On Tue, Mar 29, 2016 at 8:31 PM, David Steele <david@pgmasters.net> wrote:

Hi Amit,

On 3/22/16 3:37 AM, Michael Paquier wrote:

Hope this brings some light in.

Do you know when you'll have time to respond to Michael's last email? I've marked this patch "waiting on author" in the meantime.

I think this patch needs to be looked upon by committer now. I have done review and added some code in this patch as well long back, just see the e-mail [1], patch is just same as it was in October 2015. I think myself and Michael are in agreement that this patch solves the reported problem. There is one similar problem [2] reported by Heikki which I think can be fixed separately.

I think the main reason of moving this thread to hackers from bugs is to gain some more traction which I see that it achieves its purpose to some extent, but I find that more or less we are at same situation as we were back in October.

Let me know if you think anything more from myside can help in moving patch.

[1] - http://www.postgresql.org/message-id/CAB7nPqTNrw8LR1HD7ZLnOJjC1BP1Evw_EMAdGOrV+s-sBCrKtw@mail.gmail.com

[2] - http://www.postgresql.org/message-id/566EF84F.1030206@iki.fi

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: OOM in libpq and infinite loop with getCopyStart()

От

Aleksander Alekseev

Дата:

30 марта 2016 г., 12:46:00

> I think this patch needs to be looked upon by committer now.  I have
> done review and added some code in this patch as well long back, just
> see the e-mail [1], patch is just same as it was in October 2015.  I
> think myself and Michael are in agreement that this patch solves the
> reported problem. There is one similar problem [2] reported by Heikki
> which I think can be fixed separately.
> [...]
> Let me know if you think anything more from myside can help in moving
> patch.

+1, same here. Lets see whats committer's opinion.


-- 
Best regards,
Aleksander Alekseev
http://eax.me/

Re: OOM in libpq and infinite loop with getCopyStart()

От

Tom Lane

Дата:

01 апреля 2016 г., 04:06:34

Aleksander Alekseev <a.alekseev@postgrespro.ru> writes:
> +1, same here. Lets see whats committer's opinion.

I fooled around with this patch quite a bit but could not bring myself
to pull the trigger, because I think it fundamentally breaks applications
that follow the "repeat PQgetResult until NULL" rule.  The only reason
that psql manages not to fail is that it doesn't do that, it just calls
PQexec; and you've hacked PQexecFinish so that it falls out without
pumping PQgetResult till dry.  But that's not a good solution, it's just
a hack that makes the behavior unprincipled and unlike direct use of
PQgetResult.  The key problem is that, assuming that the changes in
getCopyStart() succeed in returning a PGRES_FATAL_ERROR PGresult, the
application may just follow the rule of doing nothing with it unless it's
the last non-null result from PQgetResult.  And it won't be, because
you've switched libpq's asyncStatus into one or another COPY status, which
will cause PQgetResult to continually create and return PGRES_COPY_XXX
results, which is what it's supposed to do in that situation (cf the last
step in getCopyResult).

Now, this will accidentally fail to fail if PQgetResult's attempt to
manufacture a PGRES_COPY_XXX result fails and returns null, which is
certainly possible if we're up against an OOM situation.  But what if
it doesn't fail --- which is also possible, because a PGRES_COPY_XXX
with no attached data will not need as much space as a PGRES_FATAL_ERROR
with attached error message.  The app probably throws away the
PGRES_FATAL_ERROR and tries to use the PGRES_COPY_XXX result, which
is almost okay, except that it will lack the copy format information
which will be fatal if the application is relying on that.

So AFAICT, this patch kinda sorta works for psql but it is not going
to make things better for other apps.

The other problem is that I don't have a lot of faith in the theory
that getCopyStart is going to be able to make an error PGresult when
it couldn't make a COPY PGresult.  The existing message-receipt routines
that this is modeled on are dealing with PGresults that are expected to
grow large, so throwing away the accumulated PGresult is highly likely
to free enough memory to let us allocate an error PGresult.  Not so
here.

I have to run, but the bottom line is I don't feel comfortable with
this.  It's trying to fix what's a very corner-case problem (since
COPY PGresults aren't large) but it's introducing new corner case
behaviors of its own.
        regards, tom lane

Re: OOM in libpq and infinite loop with getCopyStart()

От

Tom Lane

Дата:

01 апреля 2016 г., 06:48:48

I thought about this patch a bit more...

When I first looked at the patch, I didn't believe that it worked at
all: it failed to return a PGRES_COPY_XXX result to the application,
and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX?
Wouldn't things be hopelessly confused?  I tried it out and saw that
indeed it seemed to work in psql, and after tracing through that found
that psql has no idea what's going on, but when psql issues its next
command PQexecStart manages to get us out of the copy state (barring
more OOM failures, anyway).  That seems a bit accidental, though,
and for sure it wasn't documented in the patch.  In any case, having
libpq in that state would have side-effects on how it responds to
application actions, which is the core of my complaint about
PQgetResult behavior.  Those side effects only make sense if the app
knows (or should have known) that the connection is in COPY state.

I think it might be possible to get saner behavior if we invent a
set of new asyncStatus values PGASYNC_COPY_XXX_FAILED, having the
semantics that we got the relevant CopyStart message from the backend
but were unable to report it to the application.  PQexecStart would
treat these the same as the corresponding normal PGASYNC_COPY_XXX
states and do what's needful to get out of the copy mode.  But in
PQgetResult, we would *not* treat those states as a reason to return a
PGRES_COPY_XXX result to the application.  Probably we'd just return
NULL, with the implication being "we're done so far as you're concerned,
please give a new command".  I might be underthinking it though.

Anyway, the short of my review is that we need more clarity of thought
about what state libpq is in after a failure like this, and what that
state looks like to the application, and how that should affect how
libpq reacts to application calls.

I'll set this back to Waiting on Author ...
        regards, tom lane

Re: OOM in libpq and infinite loop with getCopyStart()

От

Michael Paquier

Дата:

01 апреля 2016 г., 08:05:11

On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I thought about this patch a bit more...
>
> When I first looked at the patch, I didn't believe that it worked at
> all: it failed to return a PGRES_COPY_XXX result to the application,
> and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX?
> Wouldn't things be hopelessly confused?  I tried it out and saw that
> indeed it seemed to work in psql, and after tracing through that found
> that psql has no idea what's going on, but when psql issues its next
> command PQexecStart manages to get us out of the copy state (barring
> more OOM failures, anyway).  That seems a bit accidental, though,
> and for sure it wasn't documented in the patch.

From the patch, that's mentioned:
+    * Stop if we are in copy mode and error has occurred, the pending results
+    * will be discarded during next execution in PQexecStart.

> In any case, having
> libpq in that state would have side-effects on how it responds to
> application actions, which is the core of my complaint about
> PQgetResult behavior.  Those side effects only make sense if the app
> knows (or should have known) that the connection is in COPY state.

OK. So we'd want to avoid relying on subsequent PQ* calls and return
into a clean state once the OOM shows up.

> I think it might be possible to get saner behavior if we invent a
> set of new asyncStatus values PGASYNC_COPY_XXX_FAILED, having the
> semantics that we got the relevant CopyStart message from the backend
> but were unable to report it to the application.  PQexecStart would
> treat these the same as the corresponding normal PGASYNC_COPY_XXX
> states and do what's needful to get out of the copy mode.  But in
> PQgetResult, we would *not* treat those states as a reason to return a
> PGRES_COPY_XXX result to the application.  Probably we'd just return
> NULL, with the implication being "we're done so far as you're concerned,
> please give a new command".  I might be underthinking it though.

I raised something like that a couple of months back, based on the
fact that PGASYNC_* are flags internal to libpq on frontend:
http://www.postgresql.org/message-id/CAB7nPqTaeDwpRCqAk-cZZxwWaPdEYr1gUg0vwvG7RhzhmAJy3Q@mail.gmail.com
In this case what was debated is PGASYNC_FATAL... Something that is
quite different with your proposal is that we return NULL or something
else, and it did not discard any pending data from the client. For
applications running PQgetResult in a loop that would work.

> Anyway, the short of my review is that we need more clarity of thought
> about what state libpq is in after a failure like this, and what that
> state looks like to the application, and how that should affect how
> libpq reacts to application calls.

Hm. I would have thought that returning to the caller a result
generated by PQmakeEmptyPGresult(PGRES_FATAL_ERROR) with an error
string marked "out of memory" would be the best thing to do instead of
NULL... If getCopyStart() complains because of a lack of data, we loop
back into pqParseInput3 to try again. If an OOM happens, we still loop
into pqParseInput3 but all the pending data received from client needs
to be discarded so as we don't rely on the next calls of PQexecStart
to do the cleanup, once all the data is received we get out of the
loop and generate the result with PGRES_FATAL_ERROR. Does that match
what you have in mind?
-- 
Michael

Re: OOM in libpq and infinite loop with getCopyStart()

От

Amit Kapila

Дата:

01 апреля 2016 г., 10:05:51

On Fri, Apr 1, 2016 at 10:34 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I thought about this patch a bit more...
> >
> > When I first looked at the patch, I didn't believe that it worked at
> > all: it failed to return a PGRES_COPY_XXX result to the application,
> > and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX?
> > Wouldn't things be hopelessly confused? I tried it out and saw that
> > indeed it seemed to work in psql, and after tracing through that found
> > that psql has no idea what's going on, but when psql issues its next
> > command PQexecStart manages to get us out of the copy state (barring
> > more OOM failures, anyway). That seems a bit accidental, though,
> > and for sure it wasn't documented in the patch.
>
> From the patch, that's mentioned:
> + * Stop if we are in copy mode and error has occurred, the pending results
> + * will be discarded during next execution in PQexecStart.
>

Yeah, and same is mentioned in PQexecStart function as well which indicates that above assumption is right.

>
> > Anyway, the short of my review is that we need more clarity of thought
> > about what state libpq is in after a failure like this, and what that
> > state looks like to the application, and how that should affect how
> > libpq reacts to application calls.
>

Very valid point. So, if we see with patch, I think libpq will be in PGASYNC_COPY_XXX state after such a failure and next time when it tries to again execute statement, it will end copy mode and then allow to proceed from there. This design is required for other purposes like silently discarding left over result. Now, I think the other option(if possible) seems to be that we can put libpq in PGASYNC_IDLE, if we get such an error as we do at some other places in case of error as below and return OOM failure as we do in patch.

PQgetResult()

{

if (flushResult ||

pqWait(TRUE, FALSE, conn) ||

pqReadData(conn) < 0)

{

* conn->errorMessage has been set by pqWait or pqReadData. We

* want to append it to any already-received error message.

pqSaveErrorResult(conn);

conn->asyncStatus = PGASYNC_IDLE;

return pqPrepareAsyncResult(conn);

}

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: OOM in libpq and infinite loop with getCopyStart()

От

Tom Lane

Дата:

01 апреля 2016 г., 18:03:10

Michael Paquier <michael.paquier@gmail.com> writes:
> On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Anyway, the short of my review is that we need more clarity of thought
>> about what state libpq is in after a failure like this, and what that
>> state looks like to the application, and how that should affect how
>> libpq reacts to application calls.

> Hm. I would have thought that returning to the caller a result
> generated by PQmakeEmptyPGresult(PGRES_FATAL_ERROR) with an error
> string marked "out of memory" would be the best thing to do instead of
> NULL...

Sure, we're trying to do that, but the question is what happens afterward.
An app that is following the documented usage pattern for PQgetResult
will call it again, expecting to get a NULL, but as the patch stands it
might get a PGRES_COPY_XXX instead.  What drove me to decide the patch
wasn't committable was realizing that Robert was right upthread: the
change you propose in libpqwalreceiver.c is not a bug fix.  That code
is doing exactly what it's supposed to do per the libpq.sgml docs, namely
iterate till it gets a NULL.[1]  If we have to change it, that means
(a) we've changed the API spec for PQgetResult and (b) every other
existing caller of PQgetResult would need to change similarly, or else
risk corner-case bugs about as bad as the one we're trying to fix.

So the core of my complaint is that we need to fix things so that, whether
or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
better consider the behavior when we cannot), we need to leave libpq
in a state where subsequent calls to PQgetResult will return NULL.

> If getCopyStart() complains because of a lack of data, we loop
> back into pqParseInput3 to try again. If an OOM happens, we still loop
> into pqParseInput3 but all the pending data received from client needs
> to be discarded so as we don't rely on the next calls of PQexecStart
> to do the cleanup, once all the data is received we get out of the
> loop and generate the result with PGRES_FATAL_ERROR. Does that match
> what you have in mind?

If we could, that would be OK, but (a) you're only considering the
COPY OUT case not the COPY IN case, and (b) a persistent OOM condition
could very well prevent us from ever completing the copy.  For example,
some COPY DATA messages might be too big for our current buffer, but
we won't be able to enlarge the buffer.  I'm inclined to think that
reporting the OOM to the client is the highest-priority goal, and
that attempting to clean up during the next PQexec/PQsendQuery is a
reasonable design.  If we fail to do that, the client won't really
understand that it's a cleanup failure and not a failure to issue the
new query, but it's not clear why it needs to know the difference.
        regards, tom lane

[1] Well, almost.  Really, after dealing with a PGRES_COPY_XXX result
it ought to go back to pumping PQgetResult.  It doesn't, but that's
an expected client behavior, and cleaning up after that is exactly what
PQexecStart is intended for.

Re: OOM in libpq and infinite loop with getCopyStart()

От

Tom Lane

Дата:

01 апреля 2016 г., 18:07:15

Amit Kapila <amit.kapila16@gmail.com> writes:
> Very valid point.  So, if we see with patch, I think libpq will be
> in PGASYNC_COPY_XXX state after such a failure and next time when it tries
> to again execute statement, it will end copy mode and then allow to proceed
> from there.   This design is required for other purposes like silently
> discarding left over result.  Now, I think the other option(if possible)
> seems to be that we can put libpq in PGASYNC_IDLE, if we get such an error
> as we do at some other places in case of error as below and return OOM
> failure as we do in patch.

If we transition to PGASYNC_IDLE then the next PQexecStart will not
realize that it needs to do something to get out of the COPY state;
but it does, since the backend thinks that we're doing COPY.  That was
the reasoning behind my proposal to invent new PGASYNC states.
Obviously there's more than one way to represent this new state, eg
you could have a separate error flag instead --- but it is a new state.
        regards, tom lane

Re: OOM in libpq and infinite loop with getCopyStart()

От

Tom Lane

Дата:

01 апреля 2016 г., 18:30:28

I wrote:
> So the core of my complaint is that we need to fix things so that, whether
> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
> better consider the behavior when we cannot), ...

BTW, the real Achilles' heel of any attempt to ensure sane behavior at
the OOM limit is this possibility of being unable to create a PGresult
with which to inform the client that we failed.

I wonder if we could make things better by keeping around an emergency
backup PGresult struct.  Something along these lines:

1. Add a field "PGresult *emergency_result" to PGconn.

2. At the very beginning of any PGresult-returning libpq function, check
to see if we have an emergency_result, and if not make one, ensuring
there's room in it for a reasonable-size error message; or maybe even
preload it with "out of memory" if we assume that's the only condition
it'll ever be used for.  If malloc fails at this point, just return NULL
without doing anything or changing any libpq state.  (Since a NULL result
is documented as possibly caused by OOM, this isn't violating any API.)

3. Subsequent operations never touch the emergency_result unless we're
up against an OOM, but it can be used to return a failure indication
to the client so long as we leave libpq in a state where additional
calls to PQgetResult would return NULL.

Basically this shifts the point where an unreportable OOM could happen
from somewhere in the depths of libpq to the very start of an operation,
where we're presumably in a clean state and OOM failure doesn't leave
us with a mess we can't clean up.
        regards, tom lane

Re: OOM in libpq and infinite loop with getCopyStart()

От

Michael Paquier

Дата:

06 апреля 2016 г., 09:09:23

On Sat, Apr 2, 2016 at 12:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> So the core of my complaint is that we need to fix things so that, whether
>> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
>> better consider the behavior when we cannot), ...
>
> BTW, the real Achilles' heel of any attempt to ensure sane behavior at
> the OOM limit is this possibility of being unable to create a PGresult
> with which to inform the client that we failed.
>
> I wonder if we could make things better by keeping around an emergency
> backup PGresult struct.  Something along these lines:
>
> 1. Add a field "PGresult *emergency_result" to PGconn.
>
> 2. At the very beginning of any PGresult-returning libpq function, check
> to see if we have an emergency_result, and if not make one, ensuring
> there's room in it for a reasonable-size error message; or maybe even
> preload it with "out of memory" if we assume that's the only condition
> it'll ever be used for.  If malloc fails at this point, just return NULL
> without doing anything or changing any libpq state.  (Since a NULL result
> is documented as possibly caused by OOM, this isn't violating any API.)
>
> 3. Subsequent operations never touch the emergency_result unless we're
> up against an OOM, but it can be used to return a failure indication
> to the client so long as we leave libpq in a state where additional
> calls to PQgetResult would return NULL.
>
> Basically this shifts the point where an unreportable OOM could happen
> from somewhere in the depths of libpq to the very start of an operation,
> where we're presumably in a clean state and OOM failure doesn't leave
> us with a mess we can't clean up.

I have moved this patch to next CF for the time being. As that's a
legit bug and not a feature, that should be fine to pursue work on
this item even if this CF ends.
-- 
Michael

Re: OOM in libpq and infinite loop with getCopyStart()

От

Michael Paquier

Дата:

18 апреля 2016 г., 10:48:38

On Wed, Apr 6, 2016 at 3:09 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Apr 2, 2016 at 12:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wrote:
>>> So the core of my complaint is that we need to fix things so that, whether
>>> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
>>> better consider the behavior when we cannot), ...
>>
>> BTW, the real Achilles' heel of any attempt to ensure sane behavior at
>> the OOM limit is this possibility of being unable to create a PGresult
>> with which to inform the client that we failed.
>>
>> I wonder if we could make things better by keeping around an emergency
>> backup PGresult struct.  Something along these lines:
>>
>> 1. Add a field "PGresult *emergency_result" to PGconn.
>>
>> 2. At the very beginning of any PGresult-returning libpq function, check
>> to see if we have an emergency_result, and if not make one, ensuring
>> there's room in it for a reasonable-size error message; or maybe even
>> preload it with "out of memory" if we assume that's the only condition
>> it'll ever be used for.  If malloc fails at this point, just return NULL
>> without doing anything or changing any libpq state.  (Since a NULL result
>> is documented as possibly caused by OOM, this isn't violating any API.)
>>
>> 3. Subsequent operations never touch the emergency_result unless we're
>> up against an OOM, but it can be used to return a failure indication
>> to the client so long as we leave libpq in a state where additional
>> calls to PQgetResult would return NULL.
>>
>> Basically this shifts the point where an unreportable OOM could happen
>> from somewhere in the depths of libpq to the very start of an operation,
>> where we're presumably in a clean state and OOM failure doesn't leave
>> us with a mess we can't clean up.

Sorry for the late reply here. I am looking back at this thing more seriously.

So, if I understand that correctly. As the emergency result is
pre-allocated, this leverages any future OOM errors because we could
always fallback to this error in case of problems, so this would
indeed help. And as this is independent of the rest of the status of
pg_conn, any subsequent calls of any PQ* routines returning a PGresult
would remain in the same state.

On top of this emergency code path, don't you think that we need as
well a flag that is switched to 'on' once an OOM error is faced? In
consequence, on a code path where an OOM happens, this flag is
activated. Then any subsequent calls of routines returning a PGresult
checks for this flag, resets it, and returns the emergency result. At
the end, it seems to me that the code paths where we check if the
emergency flag is activated or not are the beginning of PQgetResult
and PQexecFinish. In the case of PQexecFinish(), pending results would
be cleared the next time PQexecStart is called. For PQgetResult, this
gives the possibility to the application to report the OOM properly.
However, successive calls to PQgetResult() would still make the
application wait undefinitely for more data if it doesn't catch the
error.

Another thing that I think needs to be patched is libpqrcv_PQexec by
the way, so as we check if any errors are caught on the way when
calling multiple times PQgetResult or this code path could remain
stuck with an infinite wait. As a result, it seems to me that this
offers no real way to fix completely applications doing something like
that:
PQsendQuery(conn);
for (;;)
{   while (PQisBusy(conn))   {       //wait here for some event   }   result = PQgetResult(conn);   if (!result)
break;
}
The issue is that we'd wait for a NULL result to be received from
PQgetResult, however even with this patch asyncStatus remaining set to
PGASYNC_BUSY would make libpq waiting forever with pqWait for data
that will never show up. We could have a new status for asyncStatus to
be able to skip properly the loops where asyncStatus == PGASYNC_BUSY
and make PQisBusy smarter but this would actually break the semantics
of calling successively PQgetResult() if I am following correctly the
last posts of this thread.

Another, perhaps, better idea is to add some more extra logic to
pg_conn as follows:
bool    is_emergency;
PGresult *emergency_result;
bool    is_emergency_consumed;
So as when an OOM shows up, is_emergency is set to true. Then a first
call of PQgetResult returns the PGresult with the OOM in
emergency_result, setting is_emergency_consumed to true and switching
is_emergency back to false. Then a second call to PQgetResult enforces
the consumption of any results remaining without waiting for them, at
the end returning NULL to the caller, resetting is_emergency_consumed
to false. That's different compared to the extra failure
PGASYNC_COPY_FAILED in the fact that it does not break PQisBusy().

Thoughts?
-- 
Michael

Re: OOM in libpq and infinite loop with getCopyStart()

От

Michael Paquier

Дата:

19 апреля 2016 г., 10:19:44

On Mon, Apr 18, 2016 at 4:48 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Another, perhaps, better idea is to add some more extra logic to
> pg_conn as follows:
> bool    is_emergency;
> PGresult *emergency_result;
> bool    is_emergency_consumed;
> So as when an OOM shows up, is_emergency is set to true. Then a first
> call of PQgetResult returns the PGresult with the OOM in
> emergency_result, setting is_emergency_consumed to true and switching
> is_emergency back to false. Then a second call to PQgetResult enforces
> the consumption of any results remaining without waiting for them, at
> the end returning NULL to the caller, resetting is_emergency_consumed
> to false. That's different compared to the extra failure
> PGASYNC_COPY_FAILED in the fact that it does not break PQisBusy().

So, in terms of code, it gives more or less the attached. I have
coupled the emergency_result with an enum flag emergency_status that
has 3 states:
- NONE, which is the default
- ENABLE, which is when an OOM or other error has been found in libpq.
Making the next call of PQgetResult return the emergency_result
- CONSUMED, set after PQgetResult is consuming emergency_result, to
return to caller NULL so as it can get out of any PQgetResult loop
expecting a result at the end of. Once NULL is returned, the flag is
switched back to NONE.
This works in the case of getCopyStart because the OOM failures are
happening before any messages are sent to server.

The checks for the emergency result are done in PQgetResult, so as any
upper-level routine gets the message. Tom, others, what do you think?
--
Michael

Вложения

libpq-oom-copy-v2.patch

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: OOM in libpq and infinite loop with getCopyStart()

Вложения

Вложения

Вложения