Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records

Поиск
Список
Период
Сортировка
От Dmitriy Kuzmin
Тема Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records
Дата
Msg-id CAHLDt=_ts0A7Agn=hCpUh+RCFkxd+G6uuT=kcTfqFtGur0dp=A@mail.gmail.com
обсуждение исходный текст
Ответы Re: Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records  (David Rowley <dgrowleyml@gmail.com>)
Список pgsql-bugs
Greetings!

One of our clients experienced a crash of startup process with an error "invalid memory alloc request size 1073741824" on a hot standby, which ended in replica reinit.

According to logs, startup process crashed while trying to replay "Standby/LOCK" record with a huge list of locks(see attached replicalog_tail.tgz):

FATAL:  XX000: invalid memory alloc request size 1073741824
CONTEXT:  WAL redo at 7/327F9248 for Standby/LOCK: xid 1638575 db 7550635 rel 8500880 xid 1638575 db 7550635 rel 10324499...
LOCATION:  repalloc, mcxt.c:1075
BACKTRACE:  
        postgres: startup recovering 000000010000000700000033(repalloc+0x61) [0x8d7611]
        postgres: startup recovering 000000010000000700000033() [0x691c29]
        postgres: startup recovering 000000010000000700000033() [0x691c74]
        postgres: startup recovering 000000010000000700000033(lappend+0x16) [0x691e76]
        postgres: startup recovering 000000010000000700000033(StandbyAcquireAccessExclusiveLock+0xdd) [0x7786bd]
        postgres: startup recovering 000000010000000700000033(standby_redo+0x5d) [0x7789ed]
        postgres: startup recovering 000000010000000700000033(StartupXLOG+0x1055) [0x51d7c5]
        postgres: startup recovering 000000010000000700000033(StartupProcessMain+0xcd) [0x71d65d]
        postgres: startup recovering 000000010000000700000033(AuxiliaryProcessMain+0x40c) [0x52c7cc]
        postgres: startup recovering 000000010000000700000033() [0x71a62e]
        postgres: startup recovering 000000010000000700000033(PostmasterMain+0xe74) [0x71cf74]
        postgres: startup recovering 000000010000000700000033(main+0x70d) [0x4891ad]
        /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f0be2db3555]
LOG:  00000: startup process (PID 2650) exited with exit code 1

Looks like startup process at some point hits the MaxAllocSize limit(memutils.h), which forbids allocation of more than 1gb-1 bytes.

Judging by pg_waldump output, there was long running transaction on primary, that sequentially locked and modified a lot of tables. Right before the crash there was about 85k exclusively locked tables.

Trying to reproduce the issue, I found out that the problem is not so much in the number of locks, but in the duration of the transaction. Replaying "Standby/LOCK" records, the startup process eventually crashes with the mentioned error if a long transaction holds a large number of locks long enough.

I managed to reproduce the situation on 13.7, 14.4, 15beta3 and master using the following steps:
1) get primary and replica with the following settings: max_locks_per_transaction = '10000' and max_connections = '1000'
2) create 950k tables
3) lock them in AccessExclusive mode in transaction and leave it in "idle in transaction state"
4) make some side activity with pgbench (pgbench -R 100 -P 5 -T 7200 -c 1)
In about 20-30 minutes startup process crashes with the same error.

As far as I understand, there is fixed amount of AccessExclusive locks in this scenario. 950k exclusive locks acquired by "long running" transaction and no additional exclusive locks held by pgbench. But startup consumes more and more memory while replaying records, that contain exacly the same list of locks. Could it be a memory leak? If not, is there any way to improve this behavior?

If you're going to reproduce it, get primary and replica with enough RAM and simultaneously run on primary:
$ pgbench -i && pgbench -R 100 -P 5 -T 7200 -c 1 in one terminal
$ psql -f 950k_locks.sql in another terminal
and observe startup memory usage and replica's logs.

Best regards,
Dmitry Kuzmin
Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Chanderprabh Jain
Дата:
Сообщение: Re: BUG #15333: pg_dump error on large table -- "pg_dump: could not stat file...iso-8859-1 error"
Следующее
От: PG Bug reporting form
Дата:
Сообщение: BUG #17607: Server process crashes when PLpgSQL function raises error in subtransaction