Обсуждение: [HACKERS] jacana hung after failing to acquire random number

Поиск
Список
Период
Сортировка

[HACKERS] jacana hung after failing to acquire random number

От
Andrew Dunstan
Дата:
jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make 
check". After starting the autovacuum launcher there are 120 messages on 
its log about "Could not acquire random number". Then nothing.


So I suspect the problem here is commit 
fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in 
detail.


Shouldn't we want the postmaster to shut down if it's not going to go 
further? Note that frogmouth, also mingw, which builds with openssl, 
doesn't have this issue.


cheers


andrew






Re: [HACKERS] jacana hung after failing to acquire random number

От
Michael Paquier
Дата:
On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
> jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make
> check". After starting the autovacuum launcher there are 120 messages on its
> log about "Could not acquire random number". Then nothing.
>
>
> So I suspect the problem here is commit
> fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
> detail.
>
>
> Shouldn't we want the postmaster to shut down if it's not going to go
> further? Note that frogmouth, also mingw, which builds with openssl, doesn't
> have this issue.

Did you unlock it in some way at the end? Here is the shape of the
report for others:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15
And here is of course the interesting bit:
2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG:  could not acquire
random number
2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG:  could not acquire
random number
2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG:  could not acquire
random number

I am not seeing any problems with MSVC without openssl, so that's a
problem proper to MinGW. I am getting to wonder if it is actually a
good idea to cache the crypt context and then re-use it. Using a new
context all the time is definitely not performance-wise though. A
second difference are the missing CRYPT_MACHINE_KEYSET and
CRYP_NEWKEYSET. So, with something like the patch attached, do you see
improvements? What this patch does is to use a different context at
each call, and with the key container flags to allow proper access to
it (as there are winxp-only limitations here). I have tried to compile
with MinGW in my environment but my gcc compiler keep crashing, so I
cannot reproduce directly the problem I am afraid.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] jacana hung after failing to acquire random number

От
Heikki Linnakangas
Дата:
On 12/12/2016 05:58 AM, Michael Paquier wrote:
> On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>
>> jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make
>> check". After starting the autovacuum launcher there are 120 messages on its
>> log about "Could not acquire random number". Then nothing.
>>
>>
>> So I suspect the problem here is commit
>> fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
>> detail.
>>
>>
>> Shouldn't we want the postmaster to shut down if it's not going to go
>> further? Note that frogmouth, also mingw, which builds with openssl, doesn't
>> have this issue.
>
> Did you unlock it in some way at the end? Here is the shape of the
> report for others:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15
> And here is of course the interesting bit:
> 2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG:  could not acquire
> random number
> 2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG:  could not acquire
> random number
> 2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG:  could not acquire
> random number
>
> I am not seeing any problems with MSVC without openssl, so that's a
> problem proper to MinGW. I am getting to wonder if it is actually a
> good idea to cache the crypt context and then re-use it. Using a new
> context all the time is definitely not performance-wise though.

Actually, looking at the config.log on jacana, it's trying to use 
/dev/urandom:

configure:15028: checking for /dev/urandom
configure:15041: result: yes
configure:15054: checking which random number source to use
configure:15073: result: /dev/urandom

And looking closer at configure.in, I can see why:
  elif test "$PORTNAME" = x"win32" ; then    USE_WIN32_RANDOM=1

That test is broken. It looks like the x"$VAR" = x"constant" idiom, but 
the left side of the comparison doesn't have the 'x'. Oops.

Fixed that, let's see if it made jacana happy again.

This makes me wonder if we should work a bit harder to get a good error 
message, if acquiring a random number fails for any reason. This needs 
to work in the frontend as well backend, but we could still have an 
elog(LOG, ...) there, inside an #ifndef FRONTEND block.

- Heikki




Re: [HACKERS] jacana hung after failing to acquire random number

От
Michael Paquier
Дата:
On Mon, Dec 12, 2016 at 4:32 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> That test is broken. It looks like the x"$VAR" = x"constant" idiom, but the left side of the comparison doesn't have
the'x'. Oops.
 

Good catch.

> This makes me wonder if we should work a bit harder to get a good error
> message, if acquiring a random number fails for any reason. This needs to
> work in the frontend as well backend, but we could still have an elog(LOG,
> ...) there, inside an #ifndef FRONTEND block.

Yeah, this has been itching me as well. We could allocate an error
string in a psprintf()'d string and let the callers of
pg_strong_backend() use it as they are responsible for the error
handling. What do you think?
-- 
Michael



Re: [HACKERS] jacana hung after failing to acquire random number

От
Andrew Dunstan
Дата:

On 12/12/2016 02:32 AM, Heikki Linnakangas wrote:
> On 12/12/2016 05:58 AM, Michael Paquier wrote:
>> On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan <andrew@dunslane.net> 
>> wrote:
>>>
>>> jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make
>>> check". After starting the autovacuum launcher there are 120 
>>> messages on its
>>> log about "Could not acquire random number". Then nothing.
>>>
>>>
>>> So I suspect the problem here is commit
>>> fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
>>> detail.
>>>
>>>
>>> Shouldn't we want the postmaster to shut down if it's not going to go
>>> further? Note that frogmouth, also mingw, which builds with openssl, 
>>> doesn't
>>> have this issue.
>>
>> Did you unlock it in some way at the end? Here is the shape of the
>> report for others:
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15 
>>
>> And here is of course the interesting bit:
>> 2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG:  could not acquire
>> random number
>> 2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG:  could not acquire
>> random number
>> 2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG:  could not acquire
>> random number
>>
>> I am not seeing any problems with MSVC without openssl, so that's a
>> problem proper to MinGW. I am getting to wonder if it is actually a
>> good idea to cache the crypt context and then re-use it. Using a new
>> context all the time is definitely not performance-wise though.
>
> Actually, looking at the config.log on jacana, it's trying to use 
> /dev/urandom:
>
> configure:15028: checking for /dev/urandom
> configure:15041: result: yes
> configure:15054: checking which random number source to use
> configure:15073: result: /dev/urandom
>
> And looking closer at configure.in, I can see why:
>
>   elif test "$PORTNAME" = x"win32" ; then
>     USE_WIN32_RANDOM=1
>
> That test is broken. It looks like the x"$VAR" = x"constant" idiom, 
> but the left side of the comparison doesn't have the 'x'. Oops.
>
> Fixed that, let's see if it made jacana happy again.
>
> This makes me wonder if we should work a bit harder to get a good 
> error message, if acquiring a random number fails for any reason. This 
> needs to work in the frontend as well backend, but we could still have 
> an elog(LOG, ...) there, inside an #ifndef FRONTEND block.


I see you have now improved the messages in postmaster.c, which is good.

However, the bigger problem (ISTM) is that when this failed I had a 
system which was running but where every connection immediately failed:
   ============== creating temporary instance            ==============   ============== initializing database system
       ==============   ============== starting postmaster                    ==============
 
   pg_regress: postmaster did not respond within 120 seconds   Examine
c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/log/postmaster.logfor the reason   make: ***
[check]Error 2
 

Should one or more of these errors be fatal? Or should we at least get 
pg_regress to try to shut down the postmaster if it can't connect after 
120 seconds?


[In answer to Michael's question above, I forcibly shut down the 
postmaster by hand. Otherwise it would still be running, and we would 
not have got the report on the buildfarm server.]

cheers

andrew




Re: [HACKERS] jacana hung after failing to acquire random number

От
Heikki Linnakangas
Дата:
On 12/12/2016 03:40 PM, Andrew Dunstan wrote:
>
>
> On 12/12/2016 02:32 AM, Heikki Linnakangas wrote:
>> On 12/12/2016 05:58 AM, Michael Paquier wrote:
>>> On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan <andrew@dunslane.net>
>>> wrote:
>>>>
>>>> jascana (mingw, 64 bit compiler, no openssl) is currently hung on "make
>>>> check". After starting the autovacuum launcher there are 120
>>>> messages on its
>>>> log about "Could not acquire random number". Then nothing.
>>>>
>>>>
>>>> So I suspect the problem here is commit
>>>> fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
>>>> detail.
>>>>
>>>>
>>>> Shouldn't we want the postmaster to shut down if it's not going to go
>>>> further? Note that frogmouth, also mingw, which builds with openssl,
>>>> doesn't
>>>> have this issue.
>>>
>>> Did you unlock it in some way at the end? Here is the shape of the
>>> report for others:
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15
>>>
>>> And here is of course the interesting bit:
>>> 2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG:  could not acquire
>>> random number
>>> 2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG:  could not acquire
>>> random number
>>> 2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG:  could not acquire
>>> random number
>>>
>>> I am not seeing any problems with MSVC without openssl, so that's a
>>> problem proper to MinGW. I am getting to wonder if it is actually a
>>> good idea to cache the crypt context and then re-use it. Using a new
>>> context all the time is definitely not performance-wise though.
>>
>> Actually, looking at the config.log on jacana, it's trying to use
>> /dev/urandom:
>>
>> configure:15028: checking for /dev/urandom
>> configure:15041: result: yes
>> configure:15054: checking which random number source to use
>> configure:15073: result: /dev/urandom
>>
>> And looking closer at configure.in, I can see why:
>>
>>   elif test "$PORTNAME" = x"win32" ; then
>>     USE_WIN32_RANDOM=1
>>
>> That test is broken. It looks like the x"$VAR" = x"constant" idiom,
>> but the left side of the comparison doesn't have the 'x'. Oops.
>>
>> Fixed that, let's see if it made jacana happy again.
>>
>> This makes me wonder if we should work a bit harder to get a good
>> error message, if acquiring a random number fails for any reason. This
>> needs to work in the frontend as well backend, but we could still have
>> an elog(LOG, ...) there, inside an #ifndef FRONTEND block.
>
>
> I see you have now improved the messages in postmaster.c, which is good.

Well, I only wordsmithed them a bit, it still doesn't give much clue on 
why it failed. We should add more details to it.

> However, the bigger problem (ISTM) is that when this failed I had a
> system which was running but where every connection immediately failed:
>
>     ============== creating temporary instance            ==============
>     ============== initializing database system           ==============
>     ============== starting postmaster                    ==============
>
>     pg_regress: postmaster did not respond within 120 seconds
>     Examine c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/log/postmaster.log for the
reason
>     make: *** [check] Error 2
>
> Should one or more of these errors be fatal? Or should we at least get
> pg_regress to try to shut down the postmaster if it can't connect after
> 120 seconds?

Making it fatal, i.e. bringing down the server, doesn't seem like an 
improvement. If the failure is transient, you don't want to kill the 
whole server, when one connection attempt fails.

It would be nice to fail earlier if it's permanently failing, though. 
Like, if someone does "rm /dev/urandom". Perhaps we should perform one 
pg_strong_random() call at postmaster startup, and if that fails, refuse 
to start up.

- Heikki




Re: [HACKERS] jacana hung after failing to acquire random number

От
Heikki Linnakangas
Дата:
On 12/12/2016 03:40 PM, Andrew Dunstan wrote:
> Or should we at least get pg_regress to try to shut down the
> postmaster if it can't connect after 120 seconds?

I think that makes a lot of sense, independently of this random stuff.

- Heikki



Re: [HACKERS] jacana hung after failing to acquire random number

От
Tom Lane
Дата:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 12/12/2016 03:40 PM, Andrew Dunstan wrote:
>> Should one or more of these errors be fatal? Or should we at least get
>> pg_regress to try to shut down the postmaster if it can't connect after
>> 120 seconds?

> Making it fatal, i.e. bringing down the server, doesn't seem like an 
> improvement. If the failure is transient, you don't want to kill the 
> whole server, when one connection attempt fails.

> It would be nice to fail earlier if it's permanently failing, though. 
> Like, if someone does "rm /dev/urandom". Perhaps we should perform one 
> pg_strong_random() call at postmaster startup, and if that fails, refuse 
> to start up.

That's sort of contradictory.  If you're worried about transient failures,
allowing a single failed try to cause postmaster startup failure isn't the
way to make things more robust.  Giving up after a bunch of failed tries
over a very short interval isn't much better.

I'm not sure how hard we need to work here.  The case at hand seems
to be one of simply not having gotten the bugs out of the initial
implementation, so maybe we shouldn't read too much into it.

I do agree that the buildfarm needs to be more robust against broken
postmasters, because finding bugs is its raison d' etre.  But I'm not
convinced that it's a good idea to have the postmaster itself conclude
that there's something wrong with its configured random-number source.
        regards, tom lane



Re: [HACKERS] jacana hung after failing to acquire random number

От
Andrew Dunstan
Дата:

On 12/12/2016 09:02 AM, Heikki Linnakangas wrote:
> On 12/12/2016 03:40 PM, Andrew Dunstan wrote:
>> Or should we at least get pg_regress to try to shut down the
>> postmaster if it can't connect after 120 seconds?
>
> I think that makes a lot of sense, independently of this random stuff.
>
>


I will add it to my long TODO list.

cheers

andrew