Thread: BUG #16659: postgresql leaks memory or do not limit its usage


BUG #16659: postgresql leaks memory or do not limit its usage

From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16659
Logged by:          Alexander Zubkov
Email address:      zubkov318@gmail.com
PostgreSQL version: 13.0
Operating system:   postgres:13-alpine in docker on Linux 5.4.64
Description:

Hello,

I migrated a database from postgresql 10.14 in a container to
postgres:13-alpine in docker. After the migration I see that the postgresql
processes altogether eat memory without limit until the OOM killer comes. At
first it ate all available memory on the server a couple of times, then I
limited it to 8GB so that the OOM killer is applied to it earlier. The memory
usage graph looks like a chainsaw:
https://i.ibb.co/LP1jgCw/Screenshot-2020-10-07-Processes-Grafana.png
The usage is bigger than 8G now because there are unrelated postgresql
databases in other containers and they are all summed on this graph.
The version reported by the server is:
 PostgreSQL 13.0 on x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0)
9.3.0, 64-bit
The settings are mostly default. I have only changed shared_buffers (usage
was 20GB+ before the change). These are all the settings from postgresql.conf:

listen_addresses = '*'
max_connections = 100
shared_buffers = 1GB
dynamic_shared_memory_type = posix
max_wal_size = 1GB
min_wal_size = 80MB
log_timezone = 'UTC'
datestyle = 'iso, mdy'
timezone = 'UTC'
lc_messages = 'en_US.utf8'
lc_monetary = 'en_US.utf8'
lc_numeric = 'en_US.utf8'
lc_time = 'en_US.utf8'
default_text_search_config = 'pg_catalog.english'

The old postgres had a similar config, only without max_wal_size and
min_wal_size.
The database was migrated with: pg_dump -h old ... | psql -h new ...
I have not tried other postgresql versions yet. Maybe I'll try downgrading to
some 12.* version soon.

And this is how it looks in the log when the OOM killer comes:

pg_1  | 2020-10-07 11:29:23.170 UTC [1] LOG:  server process (PID 3031) was
terminated by signal 9: Killed
pg_1  | 2020-10-07 11:29:23.170 UTC [1] DETAIL:  Failed process was running:
COMMIT
pg_1  | 2020-10-07 11:29:23.170 UTC [1] LOG:  terminating any other active
server processes
pg_1  | 2020-10-07 11:29:23.170 UTC [3032] WARNING:  terminating connection
because of crash of another server process
pg_1  | 2020-10-07 11:29:23.170 UTC [3032] DETAIL:  The postmaster has
commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted
shared memory.
pg_1  | 2020-10-07 11:29:23.170 UTC [3032] HINT:  In a moment you should be
able to reconnect to the database and repeat your command.
pg_1  | [the same WARNING/DETAIL/HINT messages are repeated for PIDs 3030,
3029, 3028, 3026, 3027, 3023, 3024, 3020 and 3025]
pg_1  | 2020-10-07 11:29:23.308 UTC [1] LOG:  all server processes
terminated; reinitializing
pg_1  | 2020-10-07 11:29:23.781 UTC [3192] LOG:  database system was
interrupted; last known up at 2020-10-07 11:27:07 UTC
pg_1  | 2020-10-07 11:29:24.288 UTC [3192] LOG:  database system was not
properly shut down; automatic recovery in progress
pg_1  | 2020-10-07 11:29:24.360 UTC [3192] LOG:  redo starts at 6/454FA380
pg_1  | 2020-10-07 11:29:24.403 UTC [3192] LOG:  invalid record length at
6/45848B40: wanted 24, got 0
pg_1  | 2020-10-07 11:29:24.403 UTC [3192] LOG:  redo done at 6/45848B18
pg_1  | 2020-10-07 11:29:27.824 UTC [1] LOG:  database system is ready to
accept connections

What additional information can I provide to debug it?


Re: BUG #16659: postgresql leaks memory or do not limit its usage

From:
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> I migrated a database from postgresql 10.14 in a container to
> postgres:13-alpine in docker. After a migration I see that postgresql
> processes alltogether eat memory without limits until OOM comes.

Not much to go on here.  If you have debug symbols available, maybe
you could attach gdb to one of the bloated processes and do

call MemoryContextStats(TopMemoryContext)

which'd produce a memory usage report on the server's stderr.
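
For illustration, a minimal sketch of doing that from a shell on the host,
assuming debug symbols are available and with 12345 standing in for whichever
backend is bloated (both the PID and the one-shot batch invocation are
assumptions, not part of the report):

# attach to the bloated backend and dump its memory contexts;
# the report goes to that backend's stderr, i.e. the container log here
gdb -p 12345 -batch -ex 'call MemoryContextStats(TopMemoryContext)'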

Another idea is to start the server under a smaller ulimit,
in hopes of getting ENOMEM failures before reaching the point
of triggering OOM kills.  That would also result in memory
usage reports, so you could get one even if it's a non-debug
build.
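
A rough sketch of that approach, assuming the server is started by hand and
using an arbitrary 4 GB limit (both the limit and the data directory path are
assumptions):

# cap the per-process address space so allocations fail with ENOMEM,
# which makes the backend log an out-of-memory error plus a context dump
ulimit -v 4194304                        # in kilobytes, i.e. roughly 4 GB
pg_ctl -D /var/lib/postgresql/data start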

Otherwise, we'll have to ask for a reproducible test case ...

            regards, tom lane



Re: BUG #16659: postgresql leaks memory or do not limit its usage

From:
Alexander Zubkov
Date:
Hello,

No, I do not think there are debug symbols in the docker images. So far
I tried setting vm.overcommit_memory to 2 so that postgres would receive an
error while allocating memory, but it was too stressful on production and
the error came much earlier, while postgres was still using little memory.
So I do not think it shows anything relevant, but the logs are attached
anyway.
I have also made several database migrations (dump/restore): 13-alpine
-> 12-alpine -> 12 -> 10. All variants except version 10 were "eating"
memory. Today I also migrated to version 11 and it also looks fine. I
do not want to experiment on production further, so I am wondering whether
there is some sort of postgres proxy that can mirror queries to a test
database, or save and replay them, so that I could continue testing
apart from the production service.
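
For reference, a sketch of the overcommit change that was tried, run on the
docker host (the ratio value here is only an example, not what was used):

# make the kernel refuse allocations instead of overcommitting, so postgres
# gets ENOMEM rather than being killed by the OOM killer later
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80    # % of RAM (plus swap) that may be committed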

On Wed, Oct 7, 2020 at 7:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> PG Bug reporting form <noreply@postgresql.org> writes:
> > I migrated a database from postgresql 10.14 in a container to
> > postgres:13-alpine in docker. After a migration I see that postgresql
> > processes alltogether eat memory without limits until OOM comes.
>
> Not much to go on here.  If you have debug symbols available, maybe
> you could attach gdb to one of the bloated processes and do
>
> call MemoryContextStats(TopMemoryContext)
>
> which'd produce a memory usage report on the server's stderr.
>
> Another idea is to start the server under a smaller ulimit,
> in hopes of getting ENOMEM failures before reaching the point
> of triggering OOM kills.  That would also result in memory
> usage reports, so you could get one even if it's a non-debug
> build.
>
> Otherwise, we'll have to ask for a reproducible test case ...
>
>                         regards, tom lane

Attachments