Discussion: Cryptic error message in low-memory conditions
Hello list,

This is something that I've only recently somewhat pinned down to a cause.
Some Postgres servers will error out for a while with the following error
message:

    "expected authentication request from server, but received c"

If one uses Their Favorite Search Engine, this message is scattered around
the internet -- all in reference to Postgres, I think -- but none of the
top results seem to have any lucid responses or cause listed.

We've seen this reproduce -- sometimes for minutes at a time -- and after
catching one in the act I am reasonably confident that one common cause is
systems that are low on memory, which I confirmed by looking at postgres
logs and matching them up against our monitoring system.

Critical statistics first: the systems run Linux with overcommit off, so
malloc returns NULL now and again.  There is no OOM killer activity.  SSL
is the transport, and SQL role password authentication is in use.  There
is no swap.

Here's an example of the various kinds of failure one can get from
connecting to a system that is low on memory:

2011-08-26 16:03:06 | INFO "psql? failed with exception #<PGError: FATAL: out of memory DETAIL: Failed on request of size 488. >"
2011-08-26 16:02:27 | INFO "psql? failed with exception #<PGError: expected authentication request from server, but received c >"
2011-08-26 16:01:51 | INFO "psql? failed with exception #<PGError: expected authentication request from server, but received c >"
2011-08-26 16:01:15 | INFO "psql? failed with exception #<PGError: expected authentication request from server, but received c >"
2011-08-26 16:00:39 | INFO "psql? failed with exception #<PGError: expected authentication request from server, but received c >"
2011-08-26 16:00:01 | INFO "psql? failed with exception #<PGError: expected authentication request from server, but received c >"
2011-08-26 15:59:25 | INFO "psql? failed with exception #<PGError: expected authentication request from server, but received c >"
2011-08-26 15:58:48 | INFO "psql? failed with exception #<PGError: expected authentication request from server, but received c >"
2011-08-26 15:58:12 | INFO "psql? failed with exception #<PGError: FATAL: out of memory >"

On the backend side, one can see that often there is a failure to fork,
which is basically expected in this condition.  Various statements will be
reporting OOM also.  The error message that says nothing about being out
of memory is representative of the norm: nominally one does not get any
express indication that the system is out of memory but otherwise
responsive.

This puts someone doing monitoring (like us) in a tricky position: the
user of the database is free to use their memory -- that's what it's
for -- but the problem is we cannot determine that the server is basically
online, if fully utilized.  This defeats the ever-common "authenticate and
run SELECT 1;" basic monitoring style frequently used to determine the
most basic levels of uptime.  If the 'out of memory' condition were
delivered most of the time we could act differently, but for now we
basically have to assume that postgres is offline and poke around.

It's also interesting to note that the systems are basically responsive
(ssh can always seem to fork, and as I'm poking around, tools like 'ls'
et al seem to be fine), and sometimes the load average isn't even
extreme -- a leaky application with too many connections can cause this,
so it's not like every last scrap of memory has been consumed.

--
fdr
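A probe that wants to treat these OOM-flavored failures as "degraded" rather than "down" can only go on the client-side error text. The sketch below is purely hypothetical (the `classify_error` helper and the `probe_status` verdicts are invented for illustration, not anything the monitoring system above actually ran); it just shows that the fork-failure and backend OOM messages are distinguishable by substring, while the cryptic "received c" message is not:

```c
#include <string.h>

/* Hypothetical verdicts for a "connect and SELECT 1" health probe. */
typedef enum { PROBE_OK, PROBE_DEGRADED, PROBE_DOWN } probe_status;

/*
 * Classify a libpq-reported error string.  Messages that mention memory
 * pressure (the backend's "out of memory" FATAL, or the postmaster's
 * fork-failure strerror text) suggest the server is alive but the host
 * is out of memory; anything else is treated as a hard failure, which
 * is exactly the problem with "expected authentication request from
 * server, but received c" -- it carries no OOM hint at all.
 */
probe_status classify_error(const char *msg)
{
    if (msg == NULL || *msg == '\0')
        return PROBE_OK;
    if (strstr(msg, "out of memory") != NULL ||
        strstr(msg, "Cannot allocate memory") != NULL)
        return PROBE_DEGRADED;
    return PROBE_DOWN;
}
```

Under this scheme the cryptic message still gets misfiled as a hard outage, which is the crux of the complaint.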
Daniel Farina <daniel@heroku.com> writes:
> Some Postgres servers will error out for a while with the following
> error message:
> "expected authentication request from server, but received c"
> [ and this seems to be triggered by fork failures in the server ]

<spock>Fascinating.</spock>

I poked at this for awhile by the expedient of hot-wiring postmaster.c
to always fail the fork request:

*** src/backend/postmaster/postmaster.c.orig	Tue Aug 23 17:13:09 2011
--- src/backend/postmaster/postmaster.c	Fri Aug 26 19:08:25 2011
***************
*** 3182,3188 ****
  #ifdef EXEC_BACKEND
  	pid = backend_forkexec(port);
  #else				/* !EXEC_BACKEND */
! 	pid = fork_process();
  	if (pid == 0)		/* child */
  	{
  		free(bn);
--- 3182,3189 ----
  #ifdef EXEC_BACKEND
  	pid = backend_forkexec(port);
  #else				/* !EXEC_BACKEND */
! 	pid = -1;		// fork_process();
! 	errno = ENOMEM;
  	if (pid == 0)		/* child */
  	{
  		free(bn);

The basic case did what I was expecting:

$ psql "dbname=regression"
psql: could not fork new process for connection: Cannot allocate memory

which was unsurprising, since I'm quite sure that code path got tested in
basically this fashion, back in the dark ages.  But sure enough, over a
TCP connection with SSL support turned on, it fails as described.  strace
shows the postmaster is sending what it's supposed to.

I eventually figured out the problem: libpq sees the "E" response, thinks
it must be talking to a postmaster too old to understand
NEGOTIATE_SSL_CODE, and closes the socket and tries again.  But *it
forgets to clear out its input buffer*, so the body of the error message
is still there waiting to be read, and the next connection attempt sees
the "c" as the first byte of the first postmaster response.
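The failure mode is easy to reproduce in miniature. The sketch below is a toy model, not libpq's actual code -- `toy_conn` and its helpers are invented names that borrow only the inStart/inEnd bookkeeping idea. It simulates a client that reads one byte of the postmaster's "E" response, abandons the connection without resetting its buffer, and then misreads the leftover message body on the next attempt:

```c
#include <string.h>

/* A toy input buffer modeled loosely on libpq's connection bookkeeping. */
typedef struct
{
    char    buf[256];
    int     inStart;            /* first unconsumed byte */
    int     inEnd;              /* end of valid data */
} toy_conn;

/* "Server" pushes bytes into the client's input buffer. */
void toy_recv(toy_conn *conn, const char *data)
{
    size_t  len = strlen(data);

    memcpy(conn->buf + conn->inEnd, data, len);
    conn->inEnd += (int) len;
}

/* Client consumes exactly one byte (the message-type byte). */
char toy_read_byte(toy_conn *conn)
{
    return conn->buf[conn->inStart++];
}

/* The buggy "reconnect": drops the socket but forgets the buffer. */
void reconnect_buggy(toy_conn *conn)
{
    (void) conn;                /* oops: inStart/inEnd left alone */
}

/* The fixed "reconnect": discard unread data before retrying. */
void reconnect_fixed(toy_conn *conn)
{
    conn->inStart = conn->inEnd = 0;
}
```

With the postmaster's reply "Ecould not fork new process for connection: ...", the client reads the 'E', reconnects buggily, and the next "first byte from the server" it sees is the stray 'c' of "could" -- exactly the letter in the reported error.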
The fix for this is

*** src/interfaces/libpq/fe-connect.c.orig	Thu Jul 28 10:39:57 2011
--- src/interfaces/libpq/fe-connect.c	Fri Aug 26 18:52:18 2011
***************
*** 2064,2069 ****
--- 2064,2072 ----
  			conn->allow_ssl_try = false;
  			/* Assume it ain't gonna handle protocol 3, either */
  			conn->pversion = PG_PROTOCOL(2, 0);
+ 			/* Discard any unread/unsent data */
+ 			conn->inStart = conn->inCursor = conn->inEnd = 0;
+ 			conn->outCount = 0;
  			/* Must drop the old connection */
  			closesocket(conn->sock);
  			conn->sock = -1;

and similarly in the other places in fe-connect.c where we're abandoning
an open connection (not sure every one of those has to have it, but it
seems a good safety practice).

That gets us to the right place in the default SSL mode:

$ psql "host=localhost sslmode=prefer dbname=regression"
psql: could not fork new process for connection: Cannot allocate memory

but we're still not there if you're trying to force SSL:

$ psql "host=localhost sslmode=require dbname=regression"
psql: server does not support SSL, but SSL was required

The reason for this is that that same bit of code supposes that any "E"
response must mean that the postmaster didn't recognize
NEGOTIATE_SSL_CODE.  It doesn't (and of course shouldn't) pay any
attention to the actual textual error message.

Now, seeing as how NEGOTIATE_SSL_CODE has been understood by every build
since PG 7.0, I believe that this is dead code and we could remove it;
it seems exceedingly unlikely that any modern build of libpq will ever be
used to talk to a server that responds to that with "E".  In fact it's
worse than just delivering a misleading error message in the "require"
case, because if you're not doing "require" then what happens next is
that the code forces the protocol level down to 2 and tries again,
supposing that it must be dealing with an ancient server.
In the normal case where you are talking to a server under load rather
than a deliberately-broken one, it's entirely possible that the second
connection attempt succeeds.  And what you've got then is a connection
operating in protocol 2 rather than what the user probably expected,
disabling assorted functionality that he may well be depending on.  I
don't recall having seen reports that could match that syndrome, but then
again Daniel's complaint is a new one on me too.

So I'm thinking we'd be well advised to eliminate the assumption that an
"E" response could be fixed by downgrading to protocol 2, and instead
just make this code report the error message it got from the postmaster.
That's more than a one-liner so I don't have a patch for it yet.

Lastly, I noticed that if I tried this repeatedly on a Unix socket, I
sometimes got

psql: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
could not send startup packet: Broken pipe

rather than the expected results.  I think what is happening here is a
race condition, such that if the postmaster closes the socket without
having read the startup packet, the client might not have actually gotten
to its send() yet, and then it will get EPIPE from send() before it gets
to the point of reading the error response.  I tried to fix this by
having report_fork_failure_to_client eat any pending data before
responding:

*** src/backend/postmaster/postmaster.c.orig	Tue Aug 23 17:13:09 2011
--- src/backend/postmaster/postmaster.c	Fri Aug 26 19:08:25 2011
***************
*** 3262,3276 ****
  	char		buffer[1000];
  	int			rc;
  
  	/* Format the error message packet (always V2 protocol) */
  	snprintf(buffer, sizeof(buffer), "E%s%s\n",
  			 _("could not fork new process for connection: "),
  			 strerror(errnum));
  
- 	/* Set port to non-blocking.  Don't do send() if this fails */
- 	if (!pg_set_noblock(port->sock))
- 		return;
- 
  	/* We'll retry after EINTR, but ignore all other failures */
  	do
  	{
--- 3262,3291 ----
  	char		buffer[1000];
  	int			rc;
  
+ 	/* Set port to non-blocking.  Don't do anything if this fails */
+ 	if (!pg_set_noblock(port->sock))
+ 		return;
+ 
+ 	/*
+ 	 * Try to read the startup packet, if it's available (which it likely
+ 	 * is).  The reason for doing this is that when the client is connected
+ 	 * via Unix socket, if we close the socket without having read data it
+ 	 * already sent, many kernels will report EPIPE instead of letting the
+ 	 * client read whatever we sent.  That will likely result in an
+ 	 * unhelpful "server closed the connection unexpectedly" report instead
+ 	 * of what we are trying to tell them.  So read and discard any
+ 	 * already-available data.  We already set nonblock mode, so we won't
+ 	 * block here.
+ 	 */
+ 	do
+ 	{
+ 		rc = recv(port->sock, buffer, sizeof(buffer), 0);
+ 	} while (rc > 0 || (rc < 0 && errno == EINTR));
+ 
  	/* Format the error message packet (always V2 protocol) */
  	snprintf(buffer, sizeof(buffer), "E%s%s\n",
  			 _("could not fork new process for connection: "),
  			 strerror(errnum));
  
  	/* We'll retry after EINTR, but ignore all other failures */
  	do
  	{

This seems to get rid of the problem on my Fedora box, but my old HPUX
box still does it some of the time.  I'm inclined to recommend applying
this patch anyway --- the only alternative I can see is to try to
rejigger the libpq code to temporarily ignore EPIPE during startup, which
seems both messy and fragile.

Comments?

			regards, tom lane
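The Unix-socket race described above can be seen in isolation with a socketpair: if the "server" end closes before reading what the client wrote, the client's next send() fails with EPIPE instead of the client getting to read any error response. A minimal, Linux-flavored demonstration (this is an illustrative sketch, not postmaster code; MSG_NOSIGNAL is Linux-specific, and on other platforms one would ignore SIGPIPE instead):

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns the errno seen by the client's second send() after the server
 * end closed without reading anything: EPIPE on Linux, which libpq-using
 * clients surface as "could not send startup packet: Broken pipe".
 */
int demo_fork_failure_race(void)
{
    int     fds[2];
    char    byte = 'x';

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    /* Client writes its "startup packet" into the socket buffer. */
    (void) send(fds[1], &byte, 1, MSG_NOSIGNAL);

    /* "Postmaster" closes without ever reading it (fork failed). */
    close(fds[0]);

    /* Client's next send() now fails instead of reaching a read(). */
    if (send(fds[1], &byte, 1, MSG_NOSIGNAL) < 0)
    {
        int     err = errno;

        close(fds[1]);
        return err;
    }
    close(fds[1]);
    return 0;
}
```

Draining the pending data before closing, as the patch does, removes the unread bytes that provoke this on most kernels, which matches the Fedora-vs-HPUX observation that the fix narrows but does not fully close the race.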
On Sat, Aug 27, 2011 at 01:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Farina <daniel@heroku.com> writes:
>> Some Postgres servers will error out for a while with the following
>> error message:
>> "expected authentication request from server, but received c"
>> [ and this seems to be triggered by fork failures in the server ]
>
> <spock>Fascinating.</spock>

Agreed.

<snip>

> The reason for this is that that same bit of code supposes that any
> "E" response must mean that the postmaster didn't recognize
> NEGOTIATE_SSL_CODE.  It doesn't (and of course shouldn't) pay any
> attention to the actual textual error message.
>
> Now, seeing as how NEGOTIATE_SSL_CODE has been understood by every build
> since PG 7.0, I believe that this is dead code and we could remove it;
> it seems exceedingly unlikely that any modern build of libpq will ever
> be used to talk to a server that responds to that with "E".

What will be the result if you do use the modern libpq against that?

Anyway - that's 5 *unsupported* versions back.  So even if people do use
that, I say they have to downgrade libpq as well ;)

+1 for removing it.

In fact, when do we reach the point that we can remove all the support
for the v2 protocol completely?  (This would obviously not be as a
bugfix, but perhaps in 9.2.)  Is there any particular reason we need to
support both anymore?  At least in the client?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes:
> On Sat, Aug 27, 2011 at 01:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, seeing as how NEGOTIATE_SSL_CODE has been understood by every build
>> since PG 7.0, I believe that this is dead code and we could remove it;
>> it seems exceedingly unlikely that any modern build of libpq will ever
>> be used to talk to a server that responds to that with "E".

> What will be the result if you do use the modern libpq against that?

I'll check it after I write the patch, but what I'd expect to happen is
that libpq would fail the connection and report the server's error
message, which would be something like "unrecognized protocol version
number".  Anybody who did complain of this could be told to use
sslmode=disable when talking to the ancient server.

> Anyway - that's 5 *unsupported* versions back.

More to the point, it's been a very very long time since we've heard of
anyone using a server older than 7.2.  (And there's a good reason for
that, which is that 7.2 was the first version that wouldn't go belly-up
at 4 billion transactions.)

> In fact, when do we reach the point that we can remove all the support
> for the v2 protocol completely? (this would obviously not be as a
> bugfix, but perhaps in 9.2)? Is there any particular reason we need to
> support both anymore? At least in the client?

Fair question.  We *have* still heard of people using 7.2/7.3, I think.
Another point here is that there are JDBC people intentionally forcing
protocol version 2 as a means of controlling prepared-statement plan
lifespan.  I hope that the auto-replan code that I intend to get into
9.2 will provide a better answer for those folks, but removing the
workaround at the same time might be a tad premature.  So my feeling is
"not quite yet, maybe in a couple more years".

			regards, tom lane
I wrote:
> Lastly, I noticed that if I tried this repeatedly on a Unix socket,
> I sometimes got
> psql: server closed the connection unexpectedly
> 	This probably means the server terminated abnormally
> 	before or while processing the request.
> could not send startup packet: Broken pipe
> rather than the expected results.  I think what is happening here is a
> race condition, such that if the postmaster closes the socket without
> having read the startup packet, the client might not have actually
> gotten to its send() yet, and then it will get EPIPE from send() before
> it gets to the point of reading the error response.  I tried to fix this
> by having report_fork_failure_to_client eat any pending data before
> responding:

I've applied patches for the other two issues, but I'm having second
thoughts about trying to hack around this one.  The proposed patch
doesn't really eliminate the problem, and in any case the message is not
totally off base: the server did close the connection unexpectedly.  It'd
be nicer if users didn't have to look in the server log to find out why,
but we can't guarantee that.

However, I've developed a second concern about
report_fork_failure_to_client, which is its habit of sending the fork
failure message formatted according to 2.0 protocol.  This causes libpq
(and possibly other clients) to suppose that it's talking to a pre-7.4
server and try again in 2.0 protocol.  So if the fork failure is
transient, you have the problem of being unexpectedly and silently
downgraded to 2.0 protocol.

We could fix that by changing the function to send the message in 3.0
protocol always --- it would take more code but it's certainly doable.
The trouble with that is that a pre-7.4 libpq would see the error message
as garbage; and I'm not sure how pleasantly the JDBC driver handles it
either, if it is trying to use 2.0 protocol.

A more long-range point about it is that the next time we make a protocol
version bump that affects the format of error messages, the problem comes
right back.  It'd be better if the message somehow indicated that the
server hadn't made any attempt to match the client protocol version.  I
guess if we went up to 3.0 protocol, we could include a SQLSTATE value in
the message and libpq could test that before making assumptions.

Thoughts?

			regards, tom lane
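The v3 framing such a check would rely on: an ErrorResponse body is a sequence of (one-byte field code, NUL-terminated string) pairs ending with a zero byte, where field code 'C' carries the SQLSTATE. A sketch of the client-side test being suggested -- the helper names are invented, and SQLSTATE 53200 (out_of_memory) is one value a fork-failure message could plausibly carry:

```c
#include <string.h>

/*
 * Pull the SQLSTATE ('C') field out of the body of a v3 ErrorResponse.
 * "body" points just past the length word: one-byte field codes, each
 * followed by a NUL-terminated value, terminated by a lone '\0'.
 * Returns NULL if no SQLSTATE field is present (e.g. a v2-style error),
 * in which case a client cannot draw any protocol conclusions.
 */
const char *v3_error_sqlstate(const char *body)
{
    while (*body != '\0')
    {
        char    code = *body++;

        if (code == 'C')
            return body;                /* value follows the code byte */
        body += strlen(body) + 1;       /* skip this field's value */
    }
    return NULL;
}

/* A client could then distinguish "server is OOM" from "server is ancient". */
int is_out_of_memory(const char *body)
{
    const char *state = v3_error_sqlstate(body);

    return state != NULL && strcmp(state, "53200") == 0;
}
```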
I wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Sat, Aug 27, 2011 at 01:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Now, seeing as how NEGOTIATE_SSL_CODE has been understood by every build
>>> since PG 7.0, I believe that this is dead code and we could remove it;
>>> it seems exceedingly unlikely that any modern build of libpq will ever
>>> be used to talk to a server that responds to that with "E".

>> What will be the result if you do use the modern libpq against that?

> I'll check it after I write the patch, but what I'd expect to happen is
> that libpq would fail the connection and report the server's error
> message, which would be something like "unrecognized protocol version
> number".  Anybody who did complain of this could be told to use
> sslmode=disable when talking to the ancient server.

Just for the archives' sake, what happens with the committed patch is
either a successful non-SSL connection:

$ psql "dbname=template1 sslmode=prefer host=localhost"
NOTICE:  Unrecognized variable client_encoding
psql (9.2devel, server 6.5.3)
WARNING: psql version 9.2, server version 6.5.
         Some psql features might not work.
Type "help" for help.

or, if you tried to force SSL usage, you get this:

$ psql "dbname=template1 sslmode=require host=localhost"
psql: Unsupported frontend protocol.
Unsupported frontend protocol.
$

The reason for the repeated message is that libpq tries twice and appends
the error messages to its buffer both times.  I didn't think this was
important enough to try to fix; and anyway I seem to recall that it's
intentional that we append the messages from multiple connection
attempts.

BTW, this response starting with "U", together with the buffer flush bug,
seems to explain some of the old reports in the archives, such as
http://archives.postgresql.org/pgsql-hackers/2005-09/msg01106.php

			regards, tom lane
On Sat, Aug 27, 2011 at 1:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A more long-range point about it is that the next time we make a
> protocol version bump that affects the format of error messages,
> the problem comes right back.  It'd be better if the message somehow
> indicated that the server hadn't made any attempt to match the client
> protocol version.  I guess if we went up to 3.0 protocol, we could
> include a SQLSTATE value in the message and libpq could test that before
> making assumptions.

Hmm.  What do you think about the way ssh handles support and negotiation
of cipher methods between client and server?  In that case, I believe
both client and server advertise what ciphers they are willing to accept.
Changing the format of that negotiation procedure would be a pain, but
hopefully that could be gotten pretty much right the first time.

Right now, my understanding is that libpq sends a version to the server
but doesn't really receive a list of supported versions in response.
From this thread, it seems that if an old-style message is returned then
it's presumed the server can't handle the new versions, which is not
necessarily true.  The server seems to be written with the sensible
assumption that it should use the oldest sensible message format to get
the idea across to as many libpq versions as possible, but libpq is
taking a byproduct of the *way* the message/error is sent to learn
something about the server's supported versions, and using that to
affect more behavior.

So perhaps one solution for the distant future is to respond with
protocol versions, so that libpq need not rely on guessing based on a
particular message style it receives.

Alternatively, could it make sense to just always report fork failure by
broadcasting the failure in each protocol version's format, one after
another?  This presumes that old libpqs are forward-compatible enough to
skip un-understood messages forever, though, which is saying a lot about
the future... and that already seems to be a problem for ancient libpqs,
as per your message.

Please correct me; my understanding of the startup process has been
hastily acquired from the documentation.

--
fdr
Daniel Farina <daniel@heroku.com> writes:
> On Sat, Aug 27, 2011 at 1:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A more long-range point about it is that the next time we make a
>> protocol version bump that affects the format of error messages,
>> the problem comes right back.  It'd be better if the message somehow
>> indicated that the server hadn't made any attempt to match the client
>> protocol version.  I guess if we went up to 3.0 protocol, we could
>> include a SQLSTATE value in the message and libpq could test that before
>> making assumptions.

> Hmm. What do you think about the way ssh handles support and
> negotiation of cipher methods between client and server: in that case,
> I believe both client and server advertise what ciphers they are
> willing to accept.

The difficulty here is that, having failed to fork off a subprocess from
the postmaster, we really can't do much of anything beyond blasting out a
predetermined failure message.  We can *not* have the postmaster wait to
see what protocol the client asked for, or we risk denial-of-service
problems from malfunctioning or malicious clients.  So there's basically
no solution to be found by altering what it is that the client sends.

> Alternatively, could it make sense to just always report fork failure
> in the newest libpq version and broadcast failures in each protocol
> version's format, one after another?  This presumes that old libpqs
> are forward-compatible to skip un-understood messages forever, though,
> which is saying a lot about the future... and that seems to be
> affecting ancient libpqs, as per your message.

Yeah.  We could possibly hope that the current format of error messages
is sufficiently general that it'll never be obsoleted, or at least will
always be a valid subset of future formats.

			regards, tom lane
On Sat, Aug 27, 2011 at 3:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Farina <daniel@heroku.com> writes:
>> Hmm. What do you think about the way ssh handles support and
>> negotiation of cipher methods between client and server: in that case,
>> I believe both client and server advertise what ciphers they are
>> willing to accept.
>
> The difficulty here is that, having failed to fork off a subprocess from
> the postmaster, we really can't do much of anything beyond blasting out
> a predetermined failure message.  We can *not* have the postmaster wait
> to see what protocol the client asked for, or we risk denial-of-service
> problems from malfunctioning or malicious clients.  So there's basically
> no solution to be found by altering what it is that the client sends.

Hmm.  Well, for this requirement I don't think it's necessary to have the
postmaster wait, luckily: the missing information on the client side is
"what protocols does this postmaster support", and right now it is
sniffing that out of the format of a received message.  Instead, the
postmaster could send something as simple as a quite small static string,
burned in at compile time, that advertises to the receiving libpq what is
supported; that way it doesn't have to guess.

Other than a tiny amount of outbound traffic per bogus connection attempt
(which I think is spent anyway to say "authentication denied" or, in some
cases, "out of memory"), I think this avoids the problem with malicious
clients.  Such a versions-supported string probably has a smaller surface
that is vulnerable to change vs. the error protocol, and is explicit:
libpq guessing at the messages the server supports based on traffic
sniffing seems pretty tortured.

SSL might throw a small wrench into everything, though: does one want to
send the supported-protocol information while the channel is susceptible
to MITM attacks, and *then* move to a secure channel?  This seems like a
pretty slim vulnerability (the ability for an attacker to poke at the
protocol-versions-supported string would presumably come with all sorts
of other nasty power already) as long as all other communications are
SSLified.  One could probably get around this by sending the version
information twice, once before and once after SSL negotiation, so the
client has an opportunity to spot icky things going on.

(Embellishment: if one has policy in pg_hba.conf as to what protocols are
supported, it may need to be a dynamic string influenced by the conf at
startup or whatever, but I think this is a comparatively small detail.)

--
fdr
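The advertisement proposed above could be as small as a fixed message built at compile time. The sketch below is entirely hypothetical -- this wire format, `PROTO_ADVERT`, and `server_supports` are invented for illustration and are not part of any real Postgres protocol -- but it shows how little the postmaster would need to do before (or instead of) fork(), and how a client could check for support without guessing:

```c
#include <string.h>

/*
 * Hypothetical "protocols supported" banner: a 'v' tag followed by a
 * comma-separated version list.  Because it is a compile-time constant,
 * the postmaster could emit it even when it cannot fork or allocate.
 */
#define PROTO_ADVERT "v2.0,3.0"

/*
 * Client-side check: does the server's advertisement mention the version
 * we want?  A naive substring match is enough for a sketch; a real
 * implementation would parse the list properly.  Anything that doesn't
 * start with 'v' (e.g. an "E" error byte from a pre-advertisement
 * server) yields "unknown", and the client falls back to today's
 * guesswork.
 */
int server_supports(const char *advert, const char *version)
{
    if (advert == NULL || advert[0] != 'v')
        return 0;
    return strstr(advert + 1, version) != NULL;
}
```

This keeps the DoS property intact: the postmaster still reads nothing from the client before answering.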