Re: Rare SSL failures on eelpout

From Thomas Munro
Subject Re: Rare SSL failures on eelpout
Date
Msg-id CA+hUKGJ55XHi0ptsJQjMU=LK4kDegF9koG=AXEzvaE=wMCRFSw@mail.gmail.com
In reply to Re: Rare SSL failures on eelpout  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Rare SSL failures on eelpout
List pgsql-hackers
On Tue, Mar 5, 2019 at 10:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> >> That suggests that we could perhaps handle ECONNRESET both at startup
> >> packet send time (for certificate rejection, eelpout's case) and at
> >> initial query send (for idle timeout, bug #15598's case) by attempting
> >> to read.  Does that make sense?
>
> > Hmm ... it definitely makes sense that we shouldn't assume that a *write*
> > failure means there is nothing left to *read*.
>
> After staring at the code for awhile, I am thinking that there may be
> a bug of that ilk, but if so it's down inside OpenSSL.  Perhaps it's
> specific to the OpenSSL version you're using on eelpout?  There is
> not anything we could do differently in libpq, AFAICS, because it's
> OpenSSL's responsibility to read any data that might be available.
>
> I also looked into the idea that we're doing something wrong on the
> server side, allowing the final error message to not get flushed out.
> A plausible theory there is that SSL_shutdown is returning a WANT_READ
> or WANT_WRITE error and we should retry it ... but that doesn't square
> with your observation upthread that it's returning SSL_ERROR_SSL.
>
> It's all very confusing, but I think there's a nontrivial chance
> that this is an OpenSSL bug, especially since we haven't been able
> to replicate it elsewhere.

Hmm.  Yes, it is strange that we haven't seen it elsewhere, but
remember that very few animals are running the ssl tests; also it
requires particular timing to hit.

OK, here's something.  I can reproduce it quite easily on this
machine, and I can fix it like this:

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index f29202db5f..e9c137f1bd 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2705,6 +2705,7 @@ keep_going:    /* We will come back to here until there is
                                                           libpq_gettext("could not send startup packet: %s\n"),
                                                           SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
                                         free(startpacket);
+                                       pqHandleSendFailure(conn);
                                         goto error_return;
                                 }

If I add some printf debugging in there, I can see that block being
reached every hundred or so times I try to connect with a revoked
certificate, and with that extra call to pqHandleSendFailure() in
there the error comes out as it should:

psql: SSL error: sslv3 alert certificate revoked
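
To illustrate the general idea outside of libpq (that a failed *write*
does not mean there is nothing left to *read*), here is a minimal
standalone sketch.  It is not the patch and not libpq code: the
function names drain_pending_input() and simulate_rejected_startup()
are made up for this example, and it uses a local AF_UNIX socketpair
instead of a real SSL connection, so the send fails with EPIPE rather
than the ECONNRESET seen over TCP.  The "server" end queues an error
message and closes; the "client" end then fails to send, but can still
read the message that was already queued for it.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of libpq): after a failed send, collect
 * whatever the peer queued before it went away.  Never blocks.  Returns
 * the number of bytes stored in buf, which is always NUL-terminated.
 */
size_t
drain_pending_input(int fd, char *buf, size_t buflen)
{
    size_t      used = 0;

    assert(buf != NULL && buflen > 0);

    while (used < buflen - 1)
    {
        ssize_t     n = recv(fd, buf + used, buflen - 1 - used, MSG_DONTWAIT);

        if (n > 0)
            used += (size_t) n;
        else if (n < 0 && errno == EINTR)
            continue;           /* interrupted, just retry */
        else
            break;              /* EOF, EAGAIN, or a hard error */
    }
    buf[used] = '\0';
    return used;
}

/*
 * Simulate the race locally.  Returns 0 if the send failed (any pending
 * message is then in msg), 1 if the send unexpectedly succeeded, and -1
 * if the socketpair could not be set up.
 */
int
simulate_rejected_startup(char *msg, size_t msglen)
{
    const char *alert = "certificate revoked";
    const char *startup = "startup packet";
    int         sv[2];
    ssize_t     sent;

    assert(msg != NULL && msglen > 0);
    msg[0] = '\0';

    /* Process-wide: get EPIPE from send() instead of dying on SIGPIPE. */
    signal(SIGPIPE, SIG_IGN);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* The "server" reports its error and closes before we send anything. */
    if (write(sv[1], alert, strlen(alert)) != (ssize_t) strlen(alert))
    {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    close(sv[1]);

    /* The "client" now tries to send; the peer is gone, so this fails. */
    sent = send(sv[0], startup, strlen(startup), 0);
    if (sent >= 0)
    {
        close(sv[0]);
        return 1;
    }

    /* The write failed, but the peer's message is still readable. */
    drain_pending_input(sv[0], msg, msglen);
    close(sv[0]);
    return 0;
}
```

Calling simulate_rejected_startup() with a small buffer should leave
the peer's message in that buffer even though the send failed, which is
the same shape as reading the pending SSL alert after the startup packet
send fails.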

Now I'm confused because we already have handling like that in
PQsendQuery(), so I can't explain bug #15598.

-- 
Thomas Munro
https://enterprisedb.com