PG 10: could not generate random cancel key

From: Dean Rasheed
Subject: PG 10: could not generate random cancel key
Message-ID: CAEZATCXMtxbzSAvyKKk5uCRf9pNt4UV+F_5v=gLfJUuPxU4Ytg@mail.gmail.com
Replies: Re: PG 10: could not generate random cancel key  (Michael Paquier <michael@paquier.xyz>)
         Re: PG 10: could not generate random cancel key  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

Last week I upgraded 15 servers from various pre-10 versions to 10.4.
At first everything looked OK, but then (around 4 days later) one of
them failed with this in the logs:

2018-07-14 01:53:35.840 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:37.233 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:37.245 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:38.553 BST  LOG:  could not generate random cancel key
2018-07-14 01:53:38.581 BST  LOG:  could not generate random cancel key
2018-07-14 01:54:43.851 BST  WARNING:  worker took too long to start; canceled
2018-07-14 01:54:43.862 BST  LOG:  could not generate random cancel key
2018-07-14 01:55:09.861 BST  LOG:  could not generate random cancel key
2018-07-14 01:55:09.874 BST  LOG:  could not generate random cancel key
...

After that it would not accept any new connections until I restarted
the postmaster a few hours later. Since then, it has been OK.

It was built using --with-openssl, with strong random support enabled,
so it was OpenSSL's RAND_bytes() that failed for some reason. I
attempted to reproduce it with a small C program directly calling
RAND_bytes(), but it refused to fail, even if I disabled haveged and
ran my tests in an @reboot cron job. So this failure is evidently
quite rare, but the documentation for RAND_bytes() says it *can* fail
(returning 0) if it isn't seeded with enough entropy, in which case
more must be added, which we're not doing.

In addition, once it does fail, repeated calls to RAND_bytes() will
continue to fail if it isn't seeded with more data -- hence the
inability to start any new backends until after a postmaster restart,
which is not a very friendly failure mode.

The OpenSSL documentation suggests that we should use RAND_status()
[1] to check that the generator has been seeded with enough data:

    RAND_status() indicates whether or not the CSPRNG has been sufficiently
    seeded. If not, functions such as RAND_bytes(3) will fail.

and if not, RAND_poll() can be used to fix that:

    RAND_poll() uses the system's capabilities to seed the CSPRNG using
    random input obtained from polling various trusted entropy sources. The
    default choice of the entropy source can be modified at build time using
    the --with-rand-seed configure option, see also the NOTES section. A
    summary of the configure options can be displayed with the OpenSSL
    version(1) command.

Looking for precedents elsewhere, I found [2] which does exactly that,
although I'm slightly dubious about the need for the for-loop there. I
also found a thread [3], which recommends simply doing

if (RAND_status() == 0)
    RAND_poll();

which seems preferable. Attached is a patch to do this in pg_strong_random().
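
For illustration only (this is not the attached patch), here is a rough sketch of the control flow being proposed. To keep it runnable without linking against libcrypto, it uses hypothetical stand-ins named sim_rand_status(), sim_rand_poll() and sim_rand_bytes() that only model the return conventions of RAND_status(), RAND_poll() and RAND_bytes() (1 on success, 0 on failure). The wrapper names strong_random_unpatched() and strong_random_patched() are likewise made up for this sketch, and the assumption that one poll always finds enough entropy is a simplification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-ins for OpenSSL's RAND_status(), RAND_poll() and RAND_bytes().
 * These are NOT the real functions: they only model the return
 * conventions (1 = success / seeded, 0 = failure / not seeded) so that
 * the control flow can be exercised without libcrypto.
 */
static int sim_seeded = 0;      /* 1 once the simulated CSPRNG is seeded */
static int sim_poll_calls = 0;  /* number of calls to sim_rand_poll() */

static int sim_rand_status(void)
{
    return sim_seeded;
}

static int sim_rand_poll(void)
{
    sim_poll_calls++;
    sim_seeded = 1;             /* assumption: polling finds enough entropy */
    return 1;
}

static int sim_rand_bytes(unsigned char *buf, int num)
{
    if (!sim_seeded)
        return 0;               /* keeps failing until reseeded */
    memset(buf, 0xA5, (size_t) num);    /* placeholder bytes, not random */
    return 1;
}

/* Behaviour described above: no attempt is made to reseed on failure. */
static bool strong_random_unpatched(unsigned char *buf, size_t len)
{
    return sim_rand_bytes(buf, (int) len) == 1;
}

/* Proposed behaviour: check the seed state first and poll if needed. */
static bool strong_random_patched(unsigned char *buf, size_t len)
{
    if (sim_rand_status() == 0)
        (void) sim_rand_poll();
    return sim_rand_bytes(buf, (int) len) == 1;
}
```

With the generator unseeded, every call to strong_random_unpatched() fails, which matches the stuck state seen in the logs. The first call to strong_random_patched() polls once and succeeds, and later calls skip the poll. In real code the sim_* calls would be the corresponding functions from <openssl/rand.h>.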

Thoughts?

Regards,
Dean


[1] https://www.openssl.org/docs/man1.1.1/man3/RAND_status.html
[2] https://github.com/nodejs/node/blob/master/src/node_crypto.cc
[3] https://github.com/openssl/openssl/issues/4148
