Diagnose memory leak in logical replication?

From: Raphaël Enrici
Subject: Diagnose memory leak in logical replication?
Msg-id: 086c8efd4338c40ec51f312bf160a824@yadle.net
List: pgsql-general
Dear all,

Hope everyone is doing well.

While playing with wal2json and Debezium we ran into an issue where walsender
processes eat more and more memory (several hundred megabytes per hour) until
we restart them by stopping and restarting their consumers.
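
For reference, a query along these lines (using the standard
pg_replication_slots view) shows whether each slot keeps advancing, and how
much WAL it retains, while the walsender memory grows:

SELECT slot_name, plugin, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;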

We first suspected something wrong in wal2json[1], but after more testing
today we see the exact same situation with the pgoutput plugin.

The environment is Debian buster with packages from apt.postgresql.org 
(currently using 12.2-2.pgdg100+1):

postgresql-12 - 12.2-2.pgdg100+1
postgresql-12-wal2json - 2.2-1.pgdg100+1
postgresql-client-12 - 12.2-2.pgdg100+1
postgresql-client-common - 213.pgdg100+1
postgresql-common - 213.pgdg100+1

The steps to create the slot and consume it with wal2json are the following:

pg_recvlogical -d test_db -U test_u -p 5433 -W --slot test_slot --create-slot -P wal2json
pg_recvlogical -d test_db -U test_u -p 5433 -W --slot test_slot --start -f /dev/null

The steps to create the slot and consume it with the pgoutput plugin are the following:
In PostgreSQL:
CREATE PUBLICATION dbz_minimal_publication FOR TABLE public.xxx;

And then from the shell:
pg_recvlogical -d test_db --slot pgoutput_minimal_test_slot --create-slot -P pgoutput
pg_recvlogical -d test_db -U test_u -p 5433 -W --slot pgoutput_minimal_test_slot --start -h localhost -o proto_version=1 -o publication_names='dbz_minimal_publication' -f /dev/null
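
For reference, the tables actually included in the publication can be
double-checked with something like:

SELECT * FROM pg_publication_tables WHERE pubname = 'dbz_minimal_publication';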


A few minutes after creating these slots (Thu Nov  5 12:55:45 2020):
Every 5.0s: ps eo user,pid,vsz,rss $(pgrep -f walsender)
USER       PID    VSZ   RSS
postgres  3080 11672352 916664   <- walsender using the wal2json plugin
postgres  5740 11533412 760332   <- walsender using the pgoutput plugin
...
postgres 14333 13011864 2246328  <- other walsenders using the wal2json plugin
postgres 14383 13011864 2244788
postgres 14384 13011864 2246316

The same thing about an hour later (Thu Nov  5 14:01:55 2020):
Every 5.0s: ps eo user,pid,vsz,rss $(pgrep -f walsender)
USER       PID    VSZ   RSS
postgres  3080 12153832 1400920  <- walsender using the wal2json plugin
postgres  5740 12014716 1245524  <- walsender using the pgoutput plugin
...
postgres 14333 13493032 2728520
postgres 14383 13493032 2727012
postgres 14384 13493032 2728672
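
To keep a record of the growth over time, a small loop along these lines
(interval and log file chosen arbitrarily here) can be left running:

while sleep 300; do
    echo "== $(date)"
    ps -o pid,vsz,rss,cmd -p "$(pgrep -f walsender | paste -sd, -)"
done >> walsender_memory.log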

Per the wal2json author's suggestion, and based on other posts on this list,
we attached gdb to one of the walsender processes a few weeks ago and ran the
following:
p (long)MemoryContextStats((long)TopMemoryContext)
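
For completeness, the whole gdb session is roughly the following;
MemoryContextStats() writes its report to the stderr of the attached process,
which normally ends up in the server log:

gdb -p <walsender_pid>
(gdb) p (long)MemoryContextStats((long)TopMemoryContext)
(gdb) detach
(gdb) quit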

It seems we have a lot of ReorderBufferToastHash contexts under the Logical
decoding context:
TopMemoryContext: 435000 total in 9 blocks; 29864 free (17 chunks); 405136 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1456 free (0 chunks); 6736 used
  wal2json output context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
    Replication command context: 82104 total in 5 blocks; 25848 free (12 chunks); 56256 used
      Logical decoding context: 344176 total in 5 blocks; 26672 free (12 chunks); 317504 used
        snapshot builder context: 524288 total in 7 blocks; 172928 free (24 chunks); 351360 used
        ReorderBuffer: 32768 total in 3 blocks; 28048 free (26 chunks); 4720 used
          ReorderBufferToastHash: 524288 total in 7 blocks; 28784 free (22 chunks); 495504 used
          ReorderBufferToastHash: 524288 total in 7 blocks; 28784 free (22 chunks); 495504 used
          ReorderBufferToastHash: 524288 total in 7 blocks; 28784 free (22 chunks); 495504 used
          ReorderBufferToastHash: 524288 total in 7 blocks; 28784 free (22 chunks); 495504 used
...
          ReorderBufferByXid: 16384 total in 2 blocks; 3544 free (2 chunks); 12840 used
          Tuples: 2080374888 total in 248 blocks (1259603 chunks); 241400 free (265927 chunks); 2080133488 used
          TXN: 648 total in 0 blocks; 0 free (0 chunks); 648 used
          Change: 79881960 total in 9751 blocks; 74080 free (926 chunks); 79807880 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used

The full output can be found on GitHub[2].

Please note that all of this seems to be related to the actual DB activity;
simply running a test with, for example, pgbench in a loop does not reproduce
this "leak".
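
In case it helps reproduce the problem, the ReorderBufferToastHash and Tuples
contexts above suggest transactions touching large TOASTed values; a purely
hypothetical workload in that direction (table name and sizes invented here,
not taken from our schema) would be something like:

CREATE TABLE toast_test (id int PRIMARY KEY, payload text);
BEGIN;
INSERT INTO toast_test
    SELECT g, repeat(md5(g::text), 2000)  -- ~64 kB per row, large enough to be TOASTed
    FROM generate_series(1, 1000) g;
UPDATE toast_test SET payload = payload || 'x';
COMMIT;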

Could someone point me in the right direction to identify the reason for this
behaviour, so that we can try to fix it?

Thanks,
Raph
[1] https://github.com/eulerto/wal2json/issues/180
[2] https://github.com/eulerto/wal2json/files/5292603/memory.txt




