Обсуждение: Cryptic error message in low-memory conditions

Поиск
Список
Период
Сортировка

Cryptic error message in low-memory conditions

От
Daniel Farina
Дата:
Hello list,

This is something that I've only recently somewhat pinned down to a cause...

Some Postgres servers will error out for a while with the following
error message:

"expected authentication request from server, but received c"

If one uses Their Favorite Search Engine, this message is scattered
around the internet, all in reference to Postgres, I think, but none
of the top results seem to have any lucid responses or cause listed.
We've seen this reproduce -- sometimes for minutes at a time -- and
after catching one in the act I am reasonably confident that one
common cause of this is systems that are low on memory, which I
confirmed by looking at postgres logs and matching them up against our
monitoring system.

Critical statistics first: the systems run Linux with overcommit off,
so malloc returns NULL now and again. There is no OOM killer activity.SSL is the transport, and SQL role password
authenticationis in use.There is no swap.
 

Here's an example of the various kinds of failure one can get from
connecting to a system that is low on memory:

2011-08-26 16:03:06 | INFO "psql? failed with exception #<PGError:
FATAL:  out of memory
DETAIL:  Failed on request of size 488.
>"
2011-08-26 16:02:27 | INFO "psql? failed with exception #<PGError:
expected authentication request from server, but received c
>"
2011-08-26 16:01:51 | INFO "psql? failed with exception #<PGError:
expected authentication request from server, but received c
>"
2011-08-26 16:01:15 | INFO "psql? failed with exception #<PGError:
expected authentication request from server, but received c
>"
2011-08-26 16:00:39 | INFO "psql? failed with exception #<PGError:
expected authentication request from server, but received c
>"
2011-08-26 16:00:01 | INFO "psql? failed with exception #<PGError:
expected authentication request from server, but received c
>"
2011-08-26 15:59:25 | INFO "psql? failed with exception #<PGError:
expected authentication request from server, but received c
>"
2011-08-26 15:58:48 | INFO "psql? failed with exception #<PGError:
expected authentication request from server, but received c
>"
2011-08-26 15:58:12 | INFO "psql? failed with exception #<PGError:
FATAL:  out of memory
>"

On the backend side, one can see that often there is a failure to
fork, which is basically expected in this condition.  Various
statements will be reporting OOM also.

The commonality of an error message that does not say anything about
being out of memory is representative of the norm, and nominally one
does not get any express indication that the system is out of memory,
but otherwise responsive.  This puts someone doing monitoring (like
us) in a tricky position: the utilizer of the database is free to use
their memory -- that's what it's for -- but the problem is we cannot
determine that the server is basically online, if fully utilized.
This defeats the ever-common "authenticate and run SELECT 1;"  basic
monitoring style frequently used to determine the most basic levels of
uptime.

Should the 'out of memory' conditions were delivered most of the time
we could act differently, but for now we basically have to assume that
postgres is offline and poke around.  It's also interesting to note
that the systems are basically responsive (ssh can always seem to
fork, as I'm poking around tools like 'ls' et al seem to be fine), and
sometimes the load average isn't even extreme -- a leaky application
with too many connections can cause this, so it's not like every tiny
last scrap of memory has been consumed.

-- 
fdr


Re: Cryptic error message in low-memory conditions

От
Tom Lane
Дата:
Daniel Farina <daniel@heroku.com> writes:
> Some Postgres servers will error out for a while with the following
> error message:
> "expected authentication request from server, but received c"
> [ and this seems to be triggered by fork failures in the server ]

<spock>Fascinating.</spock>

I poked at this for awhile by the expedient of hot-wiring postmaster.c
to always fail the fork request:

*** src/backend/postmaster/postmaster.c.orig    Tue Aug 23 17:13:09 2011
--- src/backend/postmaster/postmaster.c    Fri Aug 26 19:08:25 2011
***************
*** 3182,3188 **** #ifdef EXEC_BACKEND     pid = backend_forkexec(port); #else                            /*
!EXEC_BACKEND*/
 
!     pid = fork_process();     if (pid == 0)                /* child */     {         free(bn);
--- 3182,3189 ---- #ifdef EXEC_BACKEND     pid = backend_forkexec(port); #else                            /*
!EXEC_BACKEND*/
 
!     pid = -1; // fork_process();
!     errno = ENOMEM;     if (pid == 0)                /* child */     {         free(bn);

The basic case did what I was expecting:

$ psql "dbname=regression"
psql: could not fork new process for connection: Cannot allocate memory

which was unsurprising since I'm quite sure that code path got tested
in basically this fashion, back in the dark ages.  But sure enough,
over a TCP connection with SSL support turned on, it fails as described.
strace shows the postmaster is sending what it's supposed to.
I eventually figured out the problem: libpq sees the "E" response,
thinks it must be talking to a postmaster too old to understand
NEGOTIATE_SSL_CODE, and closes the socket and tries again.  But *it
forgets to clear out its input buffer*, so the body of the error message
is still there waiting to be read, and the next connection sees the "c"
as the first byte of the first postmaster response.  The fix for this is

*** src/interfaces/libpq/fe-connect.c.orig    Thu Jul 28 10:39:57 2011
--- src/interfaces/libpq/fe-connect.c    Fri Aug 26 18:52:18 2011
***************
*** 2064,2069 ****
--- 2064,2072 ----                         conn->allow_ssl_try = false;                         /* Assume it ain't
gonnahandle protocol 3, either */                         conn->pversion = PG_PROTOCOL(2, 0);
 
+                         /* Discard any unread/unsent data */
+                         conn->inStart = conn->inCursor = conn->inEnd = 0;
+                         conn->outCount = 0;                         /* Must drop the old connection */
        closesocket(conn->sock);                         conn->sock = -1;
 

and similarly in the other places in fe-connect.c where we're abandoning
an open connection (not sure every one of those has to have it, but it
seems a good safety practice).

That gets us to the right place in the default SSL mode:

$ psql "host=localhost sslmode=prefer dbname=regression"
psql: could not fork new process for connection: Cannot allocate memory

but we're still not there if you're trying to force SSL:

$ psql "host=localhost sslmode=require dbname=regression"
psql: server does not support SSL, but SSL was required

The reason for this is that that same bit of code supposes that any
"E" response must mean that the postmaster didn't recognize
NEGOTIATE_SSL_CODE.  It doesn't (and of course shouldn't) pay any
attention to the actual textual error message.

Now, seeing as how NEGOTIATE_SSL_CODE has been understood by every build
since PG 7.0, I believe that this is dead code and we could remove it;
it seems exceedingly unlikely that any modern build of libpq will ever
be used to talk to a server that responds to that with "E".

In fact it's worse than just delivering a misleading error message in
the "require" case, because if you're not doing "require" then what
happens next is that the code forces the protocol level down to 2 and
tries again, supposing that it must be dealing with an ancient server.
In the normal case where you are talking to a server under load rather
than a deliberately-broken one, it's entirely possible that the second
connection attempt succeeds.  And what you've got then is that the
connection is operating in protocol 2 rather than what the user probably
expected, disabling assorted functionality that he may well be depending
on.  I don't recall having seen reports that could match that syndrome,
but then again Daniel's complaint is a new one on me too.

So I'm thinking we'd be well advised to eliminate the assumption that an
"E" response could be fixed by downgrading to protocol 2, and instead
just make this code report the error message it got from the postmaster.
That's more than a one-liner so I don't have a patch for it yet.

Lastly, I noticed that if I tried this repeatedly on a Unix socket,
I sometimes got

psql: server closed the connection unexpectedly       This probably means the server terminated abnormally       before
orwhile processing the request.
 
could not send startup packet: Broken pipe

rather than the expected results.  I think what is happening here is a
race condition, such that if the postmaster closes the socket without
having read the startup packet, the client might not have actually
gotten to its send() yet, and then it will get EPIPE from send() before
it gets to the point of reading the error response.  I tried to fix this
by having report_fork_failure_to_client eat any pending data before
responding:

*** src/backend/postmaster/postmaster.c.orig    Tue Aug 23 17:13:09 2011
--- src/backend/postmaster/postmaster.c    Fri Aug 26 19:08:25 2011
***************
*** 3262,3276 ****     char        buffer[1000];     int            rc;      /* Format the error message packet (always
V2protocol) */     snprintf(buffer, sizeof(buffer), "E%s%s\n",              _("could not fork new process for
connection:"),              strerror(errnum)); 
 
-     /* Set port to non-blocking.  Don't do send() if this fails */
-     if (!pg_set_noblock(port->sock))
-         return;
-      /* We'll retry after EINTR, but ignore all other failures */     do     {
--- 3262,3291 ----     char        buffer[1000];     int            rc; 
+     /* Set port to non-blocking.  Don't do anything if this fails */
+     if (!pg_set_noblock(port->sock))
+         return;
+ 
+     /*
+      * Try to read the startup packet, if it's available which it likely is.
+      * The reason for doing this is that when the client is connected via Unix
+      * socket, if we close the socket without having read data it already
+      * sent, many kernels will report EPIPE instead of letting the client
+      * read whatever we sent.  That will likely result in an unhelpful
+      * "server closed the connection unexpectedly" report instead of what
+      * we are trying to tell them.  So read and discard any already-available
+      * data.  We already set nonblock mode, so we won't block here.
+      */
+     do
+     {
+         rc = recv(port->sock, buffer, sizeof(buffer), 0);
+     } while (rc > 0 || (rc < 0 && errno == EINTR));
+      /* Format the error message packet (always V2 protocol) */     snprintf(buffer, sizeof(buffer), "E%s%s\n",
      _("could not fork new process for connection: "),              strerror(errnum));      /* We'll retry after
EINTR,but ignore all other failures */     do     {
 


This seems to get rid of the problem on my Fedora box, but my old HPUX
box still does it some of the time.  I'm inclined to recommend applying
this patch anyway --- the only alternative I can see is to try to
rejigger the libpq code to temporarily ignore EPIPE during startup,
which seems both messy and fragile.

Comments?
        regards, tom lane


Re: Cryptic error message in low-memory conditions

От
Magnus Hagander
Дата:
On Sat, Aug 27, 2011 at 01:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Farina <daniel@heroku.com> writes:
>> Some Postgres servers will error out for a while with the following
>> error message:
>> "expected authentication request from server, but received c"
>> [ and this seems to be triggered by fork failures in the server ]
>
> <spock>Fascinating.</spock>

Agreed.

<snip>

> The reason for this is that that same bit of code supposes that any
> "E" response must mean that the postmaster didn't recognize
> NEGOTIATE_SSL_CODE.  It doesn't (and of course shouldn't) pay any
> attention to the actual textual error message.
>
> Now, seeing as how NEGOTIATE_SSL_CODE has been understood by every build
> since PG 7.0, I believe that this is dead code and we could remove it;
> it seems exceedingly unlikely that any modern build of libpq will ever
> be used to talk to a server that responds to that with "E".

What will be the result if you do use the modern libpq against that?

Anyway - that's 5 *unsupported* versions back. So even if people do
use that, I say they have to downgrade libpq as well ;)

+1 for removing it.

In fact, when do we reach the point that we can remove all the support
for the v2 protocol completely? (this would obviously not be as a
bugfix, but perhaps in 9.2)? Is there any particular reason we need to
support both anymore? At least in the client?


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Cryptic error message in low-memory conditions

От
Tom Lane
Дата:
Magnus Hagander <magnus@hagander.net> writes:
> On Sat, Aug 27, 2011 at 01:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, seeing as how NEGOTIATE_SSL_CODE has been understood by every build
>> since PG 7.0, I believe that this is dead code and we could remove it;
>> it seems exceedingly unlikely that any modern build of libpq will ever
>> be used to talk to a server that responds to that with "E".

> What will be the result if you do use the modern libpq against that?

I'll check it after I write the patch, but what I'd expect to happen is
that libpq would fail the connection and report the server's error
message, which would be something like "unrecognized protocol version
number".  Anybody who did complain of this could be told to use
sslmode=disable when talking to the ancient server.

> Anyway - that's 5 *unsupported* versions back.

More to the point, it's been a very very long time since we've heard of
anyone using a server older than 7.2.  (And there's a good reason for
that, which is that 7.2 was the first version that wouldn't go belly-up
at 4 billion transactions.)

> In fact, when do we reach the point that we can remove all the support
> for the v2 protocol completely? (this would obviously not be as a
> bugfix, but perhaps in 9.2)? Is there any particular reason we need to
> support both anymore? At least in the client?

Fair question.  We *have* still heard of people using 7.2/7.3, I think.
Another point here is that there are JDBC people intentionally forcing
protocol version 2 as a means of controlling prepared-statement plan
lifespan.  I hope that the auto-replan code that I intend to get into
9.2 will provide a better answer for those folks, but removing the
workaround at the same time might be a tad premature.

So my feeling is "not quite yet, maybe in a couple more years".
        regards, tom lane


Re: Cryptic error message in low-memory conditions

От
Tom Lane
Дата:
I wrote:
> Lastly, I noticed that if I tried this repeatedly on a Unix socket,
> I sometimes got

> psql: server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> could not send startup packet: Broken pipe

> rather than the expected results.  I think what is happening here is a
> race condition, such that if the postmaster closes the socket without
> having read the startup packet, the client might not have actually
> gotten to its send() yet, and then it will get EPIPE from send() before
> it gets to the point of reading the error response.  I tried to fix this
> by having report_fork_failure_to_client eat any pending data before
> responding:

I've applied patches for the other two issues, but I'm having second
thoughts about trying to hack around this one.  The proposed patch
doesn't really eliminate the problem, and in any case the message is
not totally off base: the server did close the connection unexpectedly.
It'd be nicer if users didn't have to look in the server log to find out
why, but we can't guarantee that.

However, I've developed a second concern about
report_fork_failure_to_client, which is its habit of sending the fork
failure message formatted according to 2.0 protocol.  This causes libpq
(and possibly other clients) to suppose that it's talking to a pre-7.4
server and try again in 2.0 protocol.  So if the fork failure is
transient, you have the problem of being unexpectedly and silently
downgraded to 2.0 protocol.

We could fix that by changing the function to send the message in 3.0
protocol always --- it would take more code but it's certainly doable.
The trouble with that is that a pre-7.4 libpq would see the error
message as garbage; and I'm not sure how pleasantly the JDBC driver
handles it either, if it is trying to use 2.0 protocol.

A more long-range point about it is that the next time we make a
protocol version bump that affects the format of error messages,
the problem comes right back.  It'd be better if the message somehow
indicated that the server hadn't made any attempt to match the client
protocol version.  I guess if we went up to 3.0 protocol, we could
include a SQLSTATE value in the message and libpq could test that before
making assumptions.

Thoughts?
        regards, tom lane


Re: Cryptic error message in low-memory conditions

От
Tom Lane
Дата:
I wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Sat, Aug 27, 2011 at 01:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Now, seeing as how NEGOTIATE_SSL_CODE has been understood by every build
>>> since PG 7.0, I believe that this is dead code and we could remove it;
>>> it seems exceedingly unlikely that any modern build of libpq will ever
>>> be used to talk to a server that responds to that with "E".

>> What will be the result if you do use the modern libpq against that?

> I'll check it after I write the patch, but what I'd expect to happen is
> that libpq would fail the connection and report the server's error
> message, which would be something like "unrecognized protocol version
> number".  Anybody who did complain of this could be told to use
> sslmode=disable when talking to the ancient server.

Just for the archives' sake, what happens with the committed patch is
either a successful non-SSL connection:

$ psql "dbname=template1 sslmode=prefer host=localhost"
NOTICE:  Unrecognized variable client_encoding
psql (9.2devel, server 6.5.3)
WARNING: psql version 9.2, server version 6.5.        Some psql features might not work.
Type "help" for help.

or if you tried to force SSL usage, you get this:

$ psql "dbname=template1 sslmode=require host=localhost"
psql: Unsupported frontend protocol.
Unsupported frontend protocol.$

The reason for the repeated message is that libpq tries twice and
appends the error messages to its buffer both times.  I didn't think
this was important enough to try to fix; and anyway I seem to recall
that it's intentional that we append the messages from multiple
connection attempts.

BTW, this response starting with "U", together with the buffer flush bug,
seems to explain some of the old reports in the archives, such as
http://archives.postgresql.org/pgsql-hackers/2005-09/msg01106.php
        regards, tom lane


Re: Cryptic error message in low-memory conditions

От
Daniel Farina
Дата:
On Sat, Aug 27, 2011 at 1:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A more long-range point about it is that the next time we make a
> protocol version bump that affects the format of error messages,
> the problem comes right back.  It'd be better if the message somehow
> indicated that the server hadn't made any attempt to match the client
> protocol version.  I guess if we went up to 3.0 protocol, we could
> include a SQLSTATE value in the message and libpq could test that before
> making assumptions.

Hmm.  What do you think about the way ssh handles support and
negotiation of cipher methods between client and server: in that case,
I believe both client and server advertise what ciphers they are
willing to accept.  Changing the format of that negotiation procedure
would be a pain, but hopefully that could be gotten pretty much right
the first time.  Right now, my understanding is libpq sends a version
to the server, but doesn't really receive a list of supported formats
in response, from this thread it seems what happens is if an old-style
message is returned then it's presumed the server can't handle the new
versions, which is not necessarily true.

The server seems to written with the sensible assumption that it
should use the oldest sensible message format to get the idea across
to libpq of as many versions as possible, but libpq is taking a
byproduct of the *way* the message/error is sent to learn something
about the server's supported versions, and using that to affect more
behavior.

So perhaps one solution for the distant future is to respond with
protocol versions so that libpq need not rely on guessing based on a
particular message style it receives.

Alternatively, could it make sense to just always report fork failure
in the newest libpq version and broadcast failures in each protocol
version's format, one after another?  This presumes that old libpqs
are forward-compatible to skip un-understood messages forever, though,
which is saying a lot about the future...and that seems to be
affecting ancient libpqs, as per your message.

Please correct me; my understandings of the of the startup process
have been hastily acquired from the documentation.

--
fdr


Re: Cryptic error message in low-memory conditions

От
Tom Lane
Дата:
Daniel Farina <daniel@heroku.com> writes:
> On Sat, Aug 27, 2011 at 1:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A more long-range point about it is that the next time we make a
>> protocol version bump that affects the format of error messages,
>> the problem comes right back. �It'd be better if the message somehow
>> indicated that the server hadn't made any attempt to match the client
>> protocol version. �I guess if we went up to 3.0 protocol, we could
>> include a SQLSTATE value in the message and libpq could test that before
>> making assumptions.

> Hmm.  What do you think about the way ssh handles support and
> negotiation of cipher methods between client and server: in that case,
> I believe both client and server advertise what ciphers they are
> willing to accept.

The difficulty here is that, having failed to fork off a subprocess from
the postmaster, we really can't do much of anything beyond blasting out
a predetermined failure message.  We can *not* have the postmaster wait
to see what protocol the client asked for, or we risk denial-of-service
problems from malfunctioning or malicious clients.  So there's basically
no solution to be found by altering what it is that the client sends.

> Alternatively, could it make sense to just always report fork failure
> in the newest libpq version and broadcast failures in each protocol
> version's format, one after another?  This presumes that old libpqs
> are forward-compatible to skip un-understood messages forever, though,
> which is saying a lot about the future...and that seems to be
> affecting ancient libpqs, as per your message.

Yeah.  We could possibly hope that the current format of error messages
is sufficiently general that it'll never be obsoleted, or at least will
always be a valid subset of future formats.
        regards, tom lane


Re: Cryptic error message in low-memory conditions

От
Daniel Farina
Дата:
On Sat, Aug 27, 2011 at 3:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Farina <daniel@heroku.com> writes:
>> On Sat, Aug 27, 2011 at 1:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> A more long-range point about it is that the next time we make a
>>> protocol version bump that affects the format of error messages,
>>> the problem comes right back.  It'd be better if the message somehow
>>> indicated that the server hadn't made any attempt to match the client
>>> protocol version.  I guess if we went up to 3.0 protocol, we could
>>> include a SQLSTATE value in the message and libpq could test that before
>>> making assumptions.
>
>> Hmm.  What do you think about the way ssh handles support and
>> negotiation of cipher methods between client and server: in that case,
>> I believe both client and server advertise what ciphers they are
>> willing to accept.
>
> The difficulty here is that, having failed to fork off a subprocess from
> the postmaster, we really can't do much of anything beyond blasting out
> a predetermined failure message.  We can *not* have the postmaster wait
> to see what protocol the client asked for, or we risk denial-of-service
> problems from malfunctioning or malicious clients.  So there's basically
> no solution to be found by altering what it is that the client sends.

Hmm. Well, for this requirement I don't think it's necessary to have
the postmaster wait, luckily: the missing information on the client
side is "what protocols does this postmaster support", and it's
sniffing that out of the format a received message.  Instead, the
postmaster could send something as simple as a quite small static
string burned in at compile-time that advertises to the receiving
libpq what is supported, that way it doesn't have to guess.  Other
than a tiny amount of outbound traffic per bogus connection attempt
(which I think is spent anyway to say "authentication denied" or in
some cases "out of memory" for example) I think this avoids a problem
with malicious clients.

Such a versions-supported string probably has a smaller complexity
that is vulnerable to change vs. the error protocol, and is explicit:
libpq guessworking the supported messages on the server based on
traffic sniffing seems pretty tortured.

SSL might throw a small wrench into everything, though: does one want
to send the supported protocol information when the channel is
susceptible to MITM attacks, and *then* move to a secure channel?
This seems like a pretty slim vulnerability (the ability for an
attacker to poke at the protocol versions supported string, which
presumably would give them all sorts of other nasty power already) as
long as all other communications are SSLified.  One could probably get
around this by sending the version information twice, once before and
once after SSL negotiation so the client has an opportunity to spot
icky things going on.

(Embellishment: if one has policy in pg_hba.conf as to what protocols
are supported it may need to be a dynamic string influenced by the
conf on startup or whatever, but I think this is a comparatively small
detail.)
--
fdr