Re: Failed assertion on standby while shutdown

From Maxim Orlov
Subject Re: Failed assertion on standby while shutdown
Date
Msg-id 14d8bf0615489b5c92818974f56c373e@postgrespro.ru
In reply to Re: Failed assertion on standby while shutdown  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: Failed assertion on standby while shutdown  (Fujii Masao <masao.fujii@oss.nttdata.com>)
List pgsql-hackers
On 2021-03-22 16:40, Fujii Masao wrote:
> On 2021/03/20 2:25, Maxim Orlov wrote:
>> Hi, hackers!
>> 
>> Recently, I was doing some experiments with primary/standby instance
>> interaction. Under certain conditions I got, and was able to reproduce,
>> a crash on a failed assertion.
>> 
>> The scenario is the following:
>> 1. start primary server
>> 2. start standby server by pg_basebackup -P -R -X stream -c fast -p5432 -D data
>> 3. apply some load to the primary server by pgbench -p5432 -i -s 150 postgres
>> 4. kill primary server (with kill -9) and keep it down
>> 5. stop standby server by pg_ctl
>> 6. run standby server
>> 
>> Then any standby server termination will result in a failed assertion.
>> 
>> The log with a backtrace follows:
>> 
>> 2021-03-19 18:54:25.352 MSK [3508443] LOG:  received fast shutdown request
>> 2021-03-19 18:54:25.379 MSK [3508443] LOG:  aborting any active transactions
>> TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))", File: "/home/ziva/projects/pgpro/build-secondary/../postgrespro/src/backend/storage/lmgr/proc.c", Line: 592, PID: 3508452)
>> postgres: walreceiver (ExceptionalCondition+0xd0)[0x555555d0526f]
>> postgres: walreceiver (InitAuxiliaryProcess+0x31c)[0x555555b43e31]
>> postgres: walreceiver (AuxiliaryProcessMain+0x54f)[0x55555574ae32]
>> postgres: walreceiver (+0x530bff)[0x555555a84bff]
>> postgres: walreceiver (+0x531044)[0x555555a85044]
>> postgres: walreceiver (+0x530959)[0x555555a84959]
>> /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ffff7a303c0]
>> /lib/x86_64-linux-gnu/libc.so.6(__select+0x1a)[0x7ffff72a40da]
>> postgres: walreceiver (+0x52bea4)[0x555555a7fea4]
>> postgres: walreceiver (PostmasterMain+0x129f)[0x555555a7f7c1]
>> postgres: walreceiver (+0x41ff1f)[0x555555973f1f]
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7ffff71b30b3]
>> postgres: walreceiver (_start+0x2e)[0x55555561abfe]
>> 
>> After a brief investigation I found out that I can get this assert 
>> with 100% probability if I insert a sleep for about 5 sec into 
>> InitAuxiliaryProcess(void) in src/backend/storage/lmgr/proc.c:
>> 
>> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
>> index 897045ee272..b5f365f426d 100644
>> --- a/src/backend/storage/lmgr/proc.c
>> +++ b/src/backend/storage/lmgr/proc.c
>> @@ -525,7 +525,7 @@ InitAuxiliaryProcess(void)
>> 
>>          if (MyProc != NULL)
>>                  elog(ERROR, "you already exist");
>> -
>> +       pg_usleep(5000000L);
>>          /*
>>           * We use the ProcStructLock to protect assignment and releasing of
>>           * AuxiliaryProcs entries.
> 
> Thanks for the report! I could reproduce this issue by adding that sleep
> into InitAuxiliaryProcess().
> 
>> Perhaps this behaviour appears when the machine hosting the instances
>> is under significant unrelated load, which delays the startup of the
>> database instances.
>> 
>> Configuration for a primary server is default with "wal_level = logical"
>> 
>> Configuration for a standby server is default with "wal_level = logical"
>> and "primary_conninfo = 'port=5432'"
>> 
>> I'm puzzled by this behavior. I'm pretty sure this is not how it should
>> work. Any ideas how this can be fixed?
> 
> ISTM that the cause of this issue is that the startup process exits
> without releasing the locks that it was holding when shutdown is
> requested. To address this issue, IMO the startup process should
> call ShutdownRecoveryTransactionEnvironment() at its exit.
> Attached is the POC patch that changes the startup process that way.
> 
> I've not tested the patch enough yet.
> 
> Regards,

Thank you for the reply! As far as I understand, this is really the case.
I've tested your patch a bit. The annoying failed assertion is gone now.
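
A toy model of the failure mode discussed above may make the mechanism
easier to see. It is only a sketch and not PostgreSQL code: ToyProc,
toy_init_process, toy_exit_process and toy_scenario are made-up names, and
it assumes (beyond what this thread confirms) that the exiting process
leaves lock entries attached to a slot that the next auxiliary process then
reuses. With release_first set, the exit path cleans up before giving up
the slot, which is the role the POC patch gives to
ShutdownRecoveryTransactionEnvironment().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 2

/* Hypothetical stand-in for a process slot: it only counts the lock
 * entries that are still attached to the slot. */
typedef struct ToyProc
{
    bool in_use;
    int  held_locks;
} ToyProc;

static ToyProc slots[NUM_SLOTS];

/* Claim the first free slot. *slot_was_clean reports whether the slot had
 * no leftover locks, i.e. the condition the real assertion checks. */
static ToyProc *toy_init_process(bool *slot_was_clean)
{
    for (size_t i = 0; i < NUM_SLOTS; i++)
    {
        if (!slots[i].in_use)
        {
            *slot_was_clean = (slots[i].held_locks == 0);
            slots[i].in_use = true;
            return &slots[i];
        }
    }
    return NULL;
}

static void toy_acquire_lock(ToyProc *p)      { p->held_locks++; }
static void toy_release_all_locks(ToyProc *p) { p->held_locks = 0; }

/* Exit path. release_first models running a cleanup routine before the
 * slot is handed back. */
static void toy_exit_process(ToyProc *p, bool release_first)
{
    if (release_first)
        toy_release_all_locks(p);
    p->in_use = false;
}

/* One process takes a lock and exits, then a second process reuses its
 * slot. Returns true if the second process found the slot clean. */
static bool toy_scenario(bool release_first)
{
    bool clean = false;

    for (size_t i = 0; i < NUM_SLOTS; i++)
    {
        slots[i].in_use = false;
        slots[i].held_locks = 0;
    }

    ToyProc *first = toy_init_process(&clean);
    assert(first != NULL);
    toy_acquire_lock(first);
    toy_exit_process(first, release_first);

    ToyProc *second = toy_init_process(&clean);
    assert(second != NULL);
    toy_exit_process(second, true);

    return clean;
}
```

In this model toy_scenario(false) returns false (the reused slot still
carries a lock, which is where the real code would hit the assertion) and
toy_scenario(true) returns true.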

I think I should test more and report the results later.

Should we add this patch to the CF?

---
Best regards,
Maxim Orlov.