Re: BUG #15346: Replica fails to start after the crash

From: Alexander Kukushkin
Subject: Re: BUG #15346: Replica fails to start after the crash
Date:
Msg-id: CAFh8B==_6cY1f7rF4oxK+wjpWKaYzs96Q6Tn5=4QRWJiVnjmDg@mail.gmail.com
In response to: Re: BUG #15346: Replica fails to start after the crash  (Michael Paquier <michael@paquier.xyz>)
Responses: Re: BUG #15346: Replica fails to start after the crash  (Michael Paquier <michael@paquier.xyz>)
List: pgsql-bugs
Hi,

2018-08-29 14:10 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> Yeah, that's the pinpoint.  Do you know by chance what was the content
> of the control file for each standby you have upgraded to 9.6.10 before
> starting them with the new binaries?  You mentioned a cluster of three

No, I don't. Right after the upgrade they started normally and have
been working for a few days. I believe the controlfile was overwritten
a few hundred times before the instance crashed.

> nodes, so I guess that you have two standbys, and that one of them did
> not see the symptoms discussed here, while the other saw them.  Do you

The other node didn't crash and is still working.

> still have the logs of the recovery just after starting the other
> standby with 9.4.10 which did not see the symptom?  All your standbys

I don't think it is really related to the minor upgrade. After the
upgrade the whole cluster was running for about 3 days.
Every day it generates about 2000 WAL segments; the total volume of
daily WAL is very close to the size of the cluster, which is 38GB.


> are using the background worker which would cause the btree deletion
> code to be scanned, right?

Well, any open connection to the database will produce the same
result. In our case we are using Patroni for automatic failover, which
connects immediately after postgres has started and keeps this
connection permanently open. The background worker just happened to be
faster than anything else.

> I am trying to work on a reproducer with a bgworker starting once
> recovery has been reached, without success yet.  Does your cluster
> generate some XLOG_PARAMETER_CHANGE records?  In some cases, 9.4.8 could
> have updated minRecoveryPoint to go backward, which is something that
> 8d68ee6 has been working on addressing.

No, it doesn't.

>
> Did you also try to use local WAL segments up where AB3/56BF3B68 is
> applied, and also have a restore_command so as extra WAL segment fetches
> from the archive would happen?

If there are no connections open, it applies the necessary amount of WAL
segments (with the help of restore_command, of course) and reaches
real consistency. After that, it is possible to connect and the
startup process doesn't abort anymore.


BTW, I think we should return InvalidTransactionId from
btree_xlog_delete_get_latestRemovedXid if the index page we read from
disk is newer than the xlog record we are currently processing. Please
see the attached patch.
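
To illustrate the idea only (this is not the attached patch), here is a
minimal self-contained sketch. The typedefs are simplified stand-ins for
the PostgreSQL ones, and the helper names page_is_newer_than_record and
guarded_latest_removed_xid are hypothetical; they do not exist in the
source tree. It assumes "newer" means the page LSN is strictly greater
than the LSN of the record being replayed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's types; not the real headers. */
typedef uint64_t XLogRecPtr;     /* WAL position (LSN) */
typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)

/*
 * Hypothetical helper: true when the page read from disk was last
 * modified by a WAL record later than the one being replayed.
 */
bool
page_is_newer_than_record(XLogRecPtr page_lsn, XLogRecPtr record_lsn)
{
    return page_lsn > record_lsn;
}

/*
 * Hypothetical wrapper around the result of the latestRemovedXid
 * computation. If the index page is already newer than the record, its
 * contents no longer match what the record describes, so report
 * InvalidTransactionId instead of a value derived from that page.
 */
TransactionId
guarded_latest_removed_xid(XLogRecPtr page_lsn,
                           XLogRecPtr record_lsn,
                           TransactionId computed_xid)
{
    if (page_is_newer_than_record(page_lsn, record_lsn))
        return InvalidTransactionId;

    return computed_xid;
}
```

With such a guard, a page that is ahead of the record being replayed
would yield InvalidTransactionId rather than a value computed from
contents the record does not describe.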

--
Alexander Kukushkin
