Discussion: BUG #16707: Memory leak


BUG #16707: Memory leak

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16707
Logged by:          Kurt Roeckx
Email address:      kurt@roeckx.be
PostgreSQL version: 12.4
Operating system:   Debian
Description:

Hi,

It seems that since I switched from PostgreSQL 9.5 to 12, I started to have
a memory leak. This results in PostgreSQL dying because it can no longer
allocate memory, in which case it restarts and most of the time everything
then just recovers, which is why I'm rather late in noticing the problem.

I migrated early April to 12.2-2.pgdg100+1, and I'm currently running
12.4-1.pgdg100+1.

I'm running 2 databases on that server, but I only seem to have issues
with 1 of them; only the processes that connect to that database seem to
have a memory leak.

The only database I have a problem with is the one used by synapse
(https://github.com/matrix-org/synapse).

My configuration has:
shared_buffers = 4096MB
work_mem = 64MB
maintenance_work_mem = 64MB

The VSZ started around 4.3 GB and is now at 5.4 GB, of which 5.1 GB is RSS. The
VSZ and RSS grow over time. There are 10 such processes, all connections to
the synapse database.
All other processes, including a connection to the other database, still have a
VSZ of 4.3 GB. The stats collector is the exception, with a VSZ of only 64 MB.

I set log_statement to "all" for a short while to see the queries that
were happening. The only type of query I see that's different from what all
the other databases do is that it uses "BEGIN ISOLATION LEVEL REPEATABLE
READ", while all the others just use "BEGIN". Not sure if that's related or
not.


Kurt


Re: BUG #16707: Memory leak

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> It seems that since I switched from PostgreSQL 9.5 to 12, I started to have
> a memory leak.

Could you maybe collect a memory map from one of the bloated processes?
It'd go something like

* attach to process with gdb

* "call MemoryContextStats(TopMemoryContext)"

* quit gdb

* look in postmaster log for memory dump

You probably need to have debug symbols installed for step 2 to work.
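
A minimal session might look like the following sketch (the PID and the
debug-symbol package name are placeholders, not values from this report):

$ sudo apt install postgresql-12-dbgsym   # or whichever -dbg/-dbgsym package matches the build
$ sudo gdb -p 12345
(gdb) call MemoryContextStats(TopMemoryContext)
(gdb) detach
(gdb) quit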

            regards, tom lane



Re: BUG #16707: Memory leak

From
Kurt Roeckx
Date:
On Mon, Nov 09, 2020 at 04:34:33PM -0500, Tom Lane wrote:
> * look in postmaster log for memory dump

TopMemoryContext: 154592 total in 8 blocks; 40296 free (89 chunks); 114296 used
  RI compare cache: 16384 total in 2 blocks; 6664 free (3 chunks); 9720 used
  RI query cache: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  RI constraint cache: 40888 total in 2 blocks; 2624 free (0 chunks); 38264 used
  CFuncHash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  Tsearch dictionary cache: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
  Tsearch parser cache: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
  Tsearch configuration cache: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
  Sequence values: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
  Btree proof lookup cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 65536 total in 4 blocks; 16664 free (14 chunks); 48872 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
  Relcache by OID: 32768 total in 3 blocks; 11536 free (6 chunks); 21232 used
  CacheMemoryContext: 2340096 total in 18 blocks; 191496 free (53 chunks); 2148600 used
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_push_actions_staging_id
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: ev_extrem_id
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: ev_extrem_room
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: event_forward_extremities_event_id_room_id_key
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: event_push_actions_stream_ordering
    index info: 3072 total in 2 blocks; 1136 free (1 chunks); 1936 used: event_push_actions_rm_tokens
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: event_push_actions_room_id_user_id
    index info: 3072 total in 2 blocks; 1104 free (1 chunks); 1968 used: event_id_user_id_profile_tag_uniqueness
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: room_stats_historical_end_ts
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: room_stats_historical_pkey
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: application_services_txns_id
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: application_services_txns_as_id_txn_id_key
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: event_push_summary_user_rm
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: presence_stream_user_id
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: presence_stream_id
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: device_inbox_stream_id
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: device_inbox_user_stream_id
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: user_directory_stream_pos_lock_key
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: stream_positions_idx
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: stats_incremental_position_lock_key
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: room_stats_current_pkey
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: application_services_state_pkey
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_statistic_relid_att_inh_index
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: appservice_stream_position_lock_key
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: federation_stream_position_instance
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: users_in_public_rooms_r_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: users_in_public_rooms_u_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_directory_user_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_directory_room_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_directory_search_user_idx
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: user_directory_search_fts_idx
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: user_ips_user_token_ip_unique_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_ips_last_seen_only
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: user_ips_last_seen
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_signature_stream_idx
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: local_current_membership_room_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: local_current_membership_idx
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: server_signature_keys_server_name_key_id_key
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_stats_current_pkey
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: device_uniqueness
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_lists_remote_cache_unique_id
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: threepid_validation_token_session_id
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: threepid_validation_token_pkey
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: destinations_pkey
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: stream_ordering_to_exterm_rm_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: stream_ordering_to_exterm_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: received_transactions_ts
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: received_transactions_transaction_id_origin_key
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: e2e_cross_signing_keys_idx
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: server_keys_json_uniqueness
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: monthly_active_users_time_stamp
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: monthly_active_users_users
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: current_state_events_room_id_type_state_key_key
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: current_state_events_event_id_key
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: event_push_summary_stream_ordering_lock_key
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_70339161_index
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_federation_inbox_sender_id
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: account_data_stream_id
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: account_data_uniqueness
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: group_attestations_renewals_v_idx
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: group_attestations_renewals_u_idx
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: group_attestations_renewals_g_idx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: blocked_rooms_idx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: ui_auth_sessions_session_id_key
    index info: 1024 total in 1 blocks; 64 free (0 chunks); 960 used: room_aliases_id
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: room_aliases_room_alias_key
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: ex_outlier_stream_pkey
    index info: 3072 total in 2 blocks; 1136 free (1 chunks); 1936 used: e2e_room_keys_with_version_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: room_account_data_stream_id
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: room_account_data_uniqueness
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_daily_visits_ts_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: user_daily_visits_uts_idx
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: private_user_data_max_stream_id_lock_key
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_lists_outbound_last_success_unique_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: device_federation_outbox_id
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_federation_outbox_destination_id
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: device_lists_outbound_pokes_stream
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_lists_outbound_pokes_user
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: device_lists_outbound_pokes_id
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: e2e_cross_signing_signatures2_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_stats_historical_end_ts
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: user_stats_historical_pkey
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_15165859_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: event_txn_id_ts
    index info: 3072 total in 2 blocks; 1136 free (1 chunks); 1936 used: event_txn_id_txn_id
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_txn_id_event_id
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: remote_profile_cache_time
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: remote_profile_cache_user_id
    TS dictionary: 1024 total in 1 blocks; 696 free (0 chunks); 328 used: simple
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_search_event_id_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_search_ev_ridx
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: event_search_fts_idx
    TS dictionary: 8192 total in 4 blocks; 2864 free (9 chunks); 5328 used: english_stem
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: room_tag_uniqueness
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: profiles_user_id_key
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: e2e_device_keys_json_uniqueness
    index info: 3072 total in 2 blocks; 984 free (1 chunks); 2088 used: remote_media_repository_thumbn_media_origin_id_width_height_met
    index info: 3072 total in 2 blocks; 1048 free (1 chunks); 2024 used: remote_media_cache_thumbnails_media_origin_media_id_thumbna_key
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: e2e_room_keys_versions_idx
    150 more child contexts containing 287744 total in 275 blocks; 108096 free (155 chunks); 179648 used
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 32768 total in 3 blocks; 19048 free (15 chunks); 13720 used
  LOCALLOCK hash: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used

Which was for this process:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres 10000  2.6 16.3 5547172 5374656 ?     Ss   Nov08  54:10 postgres: synapse synapse [local] idle



Kurt




Re: BUG #16707: Memory leak

From
Tom Lane
Date:
Kurt Roeckx <kurt@roeckx.be> writes:
> On Mon, Nov 09, 2020 at 04:34:33PM -0500, Tom Lane wrote:
>> * look in postmaster log for memory dump

> ...
> Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used

> Which was for this process:
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> postgres 10000  2.6 16.3 5547172 5374656 ?     Ss   Nov08  54:10 postgres: synapse synapse [local] idle

Hm.  It would seem that whatever you're leaking was not allocated via
palloc.  Have you got any extensions loaded into that backend?

It's also worth noting that if you've got 4GB of shared buffers,
a total process vsize of 5.3GB doesn't seem all that far out of
line.  I'm not quite convinced that you have a leak at all,
as opposed to processes gradually touching more and more of the
shared buffer arena.

            regards, tom lane



Re: BUG #16707: Memory leak

From
Andres Freund
Date:
Hi,

On 2020-11-09 17:20:37 -0500, Tom Lane wrote:
> Kurt Roeckx <kurt@roeckx.be> writes:
> > Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used
> 
> > Which was for this process:
> > USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> > postgres 10000  2.6 16.3 5547172 5374656 ?     Ss   Nov08  54:10 postgres: synapse synapse [local] idle
> 
> Hm.  It would seem that whatever you're leaking was not allocated via
> palloc.  Have you got any extensions loaded into that backend?
> 
> It's also worth noting that if you've got 4GB of shared buffers,
> a total process vsize of 5.3GB doesn't seem all that far out of
> line.  I'm not quite convinced that you have a leak at all,
> as opposed to processes gradually touching more and more of the
> shared buffer arena.

As this is on a halfway recent linux, I suggest doing something like

$ grep ^Rss /proc/$pid/status
RssAnon:        6664 kB
RssFile:       69512 kB
RssShmem:       15788 kB

RssAnon is the process's own allocations and some other small stuff, RssFile is
memory-mapped files, and RssShmem is shared memory (but 0 when huge pages are used).
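
(To watch how the anonymous portion grows over time, a loop like the
following could be used; the PID and interval are just illustrative:)

$ while sleep 600; do date; grep ^RssAnon /proc/12345/status; done >> rssanon.log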

Greetings,

Andres Freund



Re: BUG #16707: Memory leak

From
Kurt Roeckx
Date:
On Mon, Nov 09, 2020 at 05:20:37PM -0500, Tom Lane wrote:
> Kurt Roeckx <kurt@roeckx.be> writes:
> > On Mon, Nov 09, 2020 at 04:34:33PM -0500, Tom Lane wrote:
> >> * look in postmaster log for memory dump
> 
> > ...
> > Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used
> 
> > Which was for this process:
> > USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> > postgres 10000  2.6 16.3 5547172 5374656 ?     Ss   Nov08  54:10 postgres: synapse synapse [local] idle
> 
> Hm.  It would seem that whatever you're leaking was not allocated via
> palloc.  Have you got any extensions loaded into that backend?

synapse=> \dx
                 List of installed extensions
  Name   | Version |   Schema   |         Description
---------+---------+------------+------------------------------
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
(1 row)

> It's also worth noting that if you've got 4GB of shared buffers,
> a total process vsize of 5.3GB doesn't seem all that far out of
> line.  I'm not quite convinced that you have a leak at all,
> as opposed to processes gradually touching more and more of the
> shared buffer arena.

Top says the shared size is now 4.1 GB, and has said so for a
while. On the other hand, the virtual size keeps growing. The
virtual size shouldn't change much over time. The resident size
will grow until the shared buffer is fully used, then should stay
around that value.

ps now says:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres 10000  2.2 16.6 5632572 5468232 ?     Ss   Nov08  61:09 postgres: synapse synapse [local] idle

At some point all 32 GB of RAM will get used, it starts to swap
out, and then malloc starts to fail. (I've set Linux to not
allow overcommit (vm.overcommit_memory = 2); otherwise it would be
the OOM killer doing its thing.)

This is what another process looks like:
postgres  8966  0.2  3.8 4387236 1276104 ?     Ss   Nov08   7:10 postgres: kurt certs [local] idle

top says that that process has 1.2 GB of shared RAM, which also
matches its resident size.


Kurt




Re: BUG #16707: Memory leak

From
Kurt Roeckx
Date:
On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:
> As this is on a halfway recent linux, I suggest doing something like
> 
> $ grep ^Rss /proc/$pid/status
> RssAnon:        6664 kB
> RssFile:       69512 kB
> RssShmem:       15788 kB

RssAnon:         1197064 kB
RssFile:           27420 kB
RssShmem:        4248052 kB


Kurt




Re: BUG #16707: Memory leak

From
Kurt Roeckx
Date:
On Tue, Nov 10, 2020 at 09:11:20AM +0100, Kurt Roeckx wrote:
> On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:
> > As this is on a halfway recent linux, I suggest doing something like
> > 
> > $ grep ^Rss /proc/$pid/status
> > RssAnon:        6664 kB
> > RssFile:       69512 kB
> > RssShmem:       15788 kB
> 
> RssAnon:         1197064 kB
> RssFile:           27420 kB
> RssShmem:        4248052 kB

Maybe this is useful:
$ grep kB /proc/10000/status
VmPeak:  5654956 kB
VmSize:  5637004 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   5479104 kB
VmRSS:   5472668 kB
RssAnon:         1197196 kB
RssFile:           27420 kB
RssShmem:        4248052 kB
VmData:  1192724 kB
VmStk:       132 kB
VmExe:      5388 kB
VmLib:     68200 kB
VmPTE:     10932 kB
VmSwap:        0 kB
HugetlbPages:          0 kB


Kurt




Re: BUG #16707: Memory leak

From
Kurt Roeckx
Date:
On Tue, Nov 10, 2020 at 09:19:34AM +0100, Kurt Roeckx wrote:
> On Tue, Nov 10, 2020 at 09:11:20AM +0100, Kurt Roeckx wrote:
> > On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:
> > > As this is on a halfway recent linux, I suggest doing something like
> > > 
> > > $ grep ^Rss /proc/$pid/status
> > > RssAnon:        6664 kB
> > > RssFile:       69512 kB
> > > RssShmem:       15788 kB
> > 
> > RssAnon:         1197064 kB
> > RssFile:           27420 kB
> > RssShmem:        4248052 kB
> 
> Maybe this is useful:
> $ grep kB /proc/10000/status
> VmPeak:  5654956 kB
> VmSize:  5637004 kB
> VmLck:         0 kB
> VmPin:         0 kB
> VmHWM:   5479104 kB
> VmRSS:   5472668 kB
> RssAnon:         1197196 kB
> RssFile:           27420 kB
> RssShmem:        4248052 kB
> VmData:  1192724 kB
> VmStk:       132 kB
> VmExe:      5388 kB
> VmLib:     68200 kB
> VmPTE:     10932 kB
> VmSwap:        0 kB
> HugetlbPages:          0 kB

The same thing, about 9 hours later:
VmPeak:  5882112 kB
VmSize:  5857276 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   5722668 kB
VmRSS:   5697940 kB
RssAnon:         1414988 kB
RssFile:           27472 kB
RssShmem:        4255480 kB
VmData:  1412996 kB
VmStk:       132 kB
VmExe:      5388 kB
VmLib:     68200 kB
VmPTE:     11372 kB
VmSwap:        0 kB
HugetlbPages:          0 kB



Kurt




Re: BUG #16707: Memory leak

From
Andres Freund
Date:
Hi,

On 2020-11-10 09:11:20 +0100, Kurt Roeckx wrote:
> On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:
> > As this is on a halfway recent linux, I suggest doing something like
> > 
> > $ grep ^Rss /proc/$pid/status
> > RssAnon:        6664 kB
> > RssFile:       69512 kB
> > RssShmem:       15788 kB
> 
> RssAnon:         1197064 kB
> RssFile:           27420 kB
> RssShmem:        4248052 kB

Ok, so it's actual allocations that are the problem. What kind of
queries is this workload running?

There's one known (slow) memory leak in the JIT code / LLVM. Could you
check if the issue vanishes if you disable JIT (jit = 0)?
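
(For reference, jit can also be changed without editing postgresql.conf; a
sketch, assuming superuser access, with the database name taken from this
report:)

synapse=# ALTER SYSTEM SET jit = 0;
synapse=# SELECT pg_reload_conf();

or, to test in a single session first:

synapse=> SET jit = 0;
synapse=> SHOW jit;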

Otherwise it might be useful to collect stack traces for memory
allocations. You could try something like 'heaptrack' or add a perf
probe on malloc, and do a perf profile.

E.g. something like
perf probe -x /lib/x86_64-linux-gnu/libc.so.6 -a malloc
perf record -e probe_libc:malloc --call-graph dwarf -p $pid_of_problematic_process
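
(The recorded call graphs could then be summarized with the standard perf
tooling, for example:)

perf report -g
perf script | head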

Regards,

Andres



Re: BUG #16707: Memory leak

From
Kurt Roeckx
Date:
I took a copy of /proc/$pid/smaps about 2 hours apart, and a diff
shows:
-5604495f1000-5604886fd000 rw-p 00000000 00:00 0 [heap]
-Size:            1033264 kB
+5604495f1000-56048cc1c000 rw-p 00000000 00:00 0 [heap]
+Size:            1104044 kB
 KernelPageSize:        4 kB
 MMUPageSize:           4 kB
-Rss:             1030296 kB
-Pss:             1030296 kB
+Rss:             1101288 kB
+Pss:             1101288 kB
 Shared_Clean:          0 kB
 Shared_Dirty:          0 kB
 Private_Clean:         0 kB
-Private_Dirty:   1030296 kB
-Referenced:       998992 kB
-Anonymous:       1030296 kB
+Private_Dirty:   1101288 kB
+Referenced:      1067220 kB
+Anonymous:       1101288 kB
 LazyFree:              0 kB
 AnonHugePages:         0 kB
 ShmemPmdMapped:        0 kB

So the heap grew by 70 MB.

The new results from status:
VmPeak:  5928312 kB
VmSize:  5928056 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   5772336 kB
VmRSS:   5772084 kB
RssAnon:         1485980 kB
RssFile:           27472 kB
RssShmem:        4258632 kB
VmData:  1483776 kB
VmStk:       132 kB
VmExe:      5388 kB
VmLib:     68200 kB
VmPTE:     11512 kB
VmSwap:        0 kB
HugetlbPages:          0 kB

Both seem to say that the heap grew by 70 MB in about 2 hours.
With 10 processes growing at this rate, that's about 8 GB a day.
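
(For anyone wanting to repeat the comparison, a sketch; the PID, file names
and interval are placeholders:)

$ cp /proc/12345/smaps smaps.before
$ sleep 7200
$ cp /proc/12345/smaps smaps.after
$ diff -u smaps.before smaps.after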


Kurt




Re: BUG #16707: Memory leak

From
Kurt Roeckx
Date:
On Tue, Nov 10, 2020 at 11:35:17AM -0800, Andres Freund wrote:
> Hi,
> 
> On 2020-11-10 09:11:20 +0100, Kurt Roeckx wrote:
> > On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:
> > > As this is on a halfway recent linux, I suggest doing something like
> > > 
> > > $ grep ^Rss /proc/$pid/status
> > > RssAnon:        6664 kB
> > > RssFile:       69512 kB
> > > RssShmem:       15788 kB
> > 
> > RssAnon:         1197064 kB
> > RssFile:           27420 kB
> > RssShmem:        4248052 kB
> 
> Ok, so it's actual allocations that are the problem. What kind of
> queries is this workload running?

I really have no idea, but I'll try to get an idea if the jit
thing doesn't work.

> There's one known (slow) memory leak in the JIT code / LLVM. Could you
> check if the issue vanishes if you disable JIT (jit = 0)?

I've just restarted it with jit = 0.

> Otherwise it might be useful to collect stack traces for memory
> allocations. You could try something like 'heaptrack' or add a perf
> probe on malloc, and do a perf profile.
> 
> E.g. something like
> perf probe -x /lib/x86_64-linux-gnu/libc.so.6 -a malloc
> perf record -e probe_libc:malloc --call-graph dwarf -p $pid_of_problematic_process

Let's first see what happens with jit disabled. If I still see it,
I'll try that.


Kurt




Re: BUG #16707: Memory leak

From
Kurt Roeckx
Date:
On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:
> > There's one known (slow) memory leak in the JIT code / LLVM. Could you
> > check if the issue vanishes if you disable JIT (jit = 0)?
> 
> I've just restarted it with jit = 0.

It's been about 3 hours since the restart, and it looks much
better, it seems to be solved.


Kurt




Re: BUG #16707: Memory leak

From
Andres Freund
Date:
Hi,

On 2020-11-10 23:45:16 +0100, Kurt Roeckx wrote:
> On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:
> > > There's one known (slow) memory leak in the JIT code / LLVM. Could you
> > > check if the issue vanishes if you disable JIT (jit = 0)?
> > 
> > I've just restarted it with jit = 0.
> 
> It's been about 3 hours since the restart, and it looks much
> better, it seems to be solved.

Hm, darn. Any chance you could check if the leak is present if you turn
on jit again, but disable inlining with jit_inline_above_cost=-1? If
that still fixes the leak I think I know the issue / have a reproducer
already...
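
(In postgresql.conf that would correspond to something like the following; the
active values can be checked with SHOW jit_inline_above_cost:)

jit = on
jit_inline_above_cost = -1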

- Andres



Re: BUG #16707: Memory leak

From
Kurt Roeckx
Date:
On Tue, Nov 10, 2020 at 05:38:08PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2020-11-10 23:45:16 +0100, Kurt Roeckx wrote:
> > On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:
> > > > There's one known (slow) memory leak in the JIT code / LLVM. Could you
> > > > check if the issue vanishes if you disable JIT (jit = 0)?
> > > 
> > > I've just restarted it with jit = 0.
> > 
> > It's been about 3 hours since the restart, and it looks much
> > better, it seems to be solved.
> 
> Hm, darn. Any chance you could check if the leak is present if you turn
> on jit again, but disable inlining with jit_inline_above_cost=-1? If
> that still fixes the leak I think I know the issue / have a reproducer
> already...

So after about 14 hours, it shows:
$ grep kB /proc/32117/status
VmPeak:  4412436 kB
VmSize:  4389988 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   4261196 kB
VmRSS:   4239936 kB
RssAnon:            6112 kB
RssFile:            7340 kB
RssShmem:        4226484 kB
VmData:     5412 kB
VmStk:       132 kB
VmExe:      5388 kB
VmLib:     13032 kB
VmPTE:      8532 kB
VmSwap:        0 kB
HugetlbPages:          0 kB

The munin graph is also very flat now, so with jit=0, it seems
fixed.

I've just restarted it with
jit=1
jit_inline_above_cost=-1

I will let you know.


Kurt




Re: BUG #16707: Memory leak

From
Kurt Roeckx
Date:
On Wed, Nov 11, 2020 at 10:49:18AM +0100, Kurt Roeckx wrote:
> > Hm, darn. Any chance you could check if the leak is present if you turn
> > on jit again, but disable inlining with jit_inline_above_cost=-1? If
> > that still fixes the leak I think I know the issue / have a reproducer
> > already...
> I've just restarted it with
> jit=1
> jit_inline_above_cost=-1

So after more than 2 hours, I see no leak with those settings.


Kurt




Re: BUG #16707: Memory leak

From
"laurent.mignon@acsone.eu"
Date:
Hi,

We also encounter this issue on our PostgreSQL 12 database. Disabling JIT
inlining seems to drastically reduce the memory leak, but after 4 days
without a restart (compared to the 3 restarts/day before the change) it
seems that we still have a leak. The RAM is full and the swap is almost full.
I hope that a fix will land quickly since this issue has a huge impact on
our business.

Regards



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html



Re: BUG #16707: Memory leak

From
Andres Freund
Date:
Hi,


On 2020-11-19 09:45:50 -0700, laurent.mignon@acsone.eu wrote:
> We also encounter this issue on our PostgreSQL 12 database. Disabling JIT
> inlining seems to drastically reduce the memory leak, but after 4 days
> without a restart (compared to the 3 restarts/day before the change) it
> seems that we still have a leak. The RAM is full and the swap is almost
> full.

In that case it is a separate issue from Kurt's. Could you follow the
instructions from
https://www.postgresql.org/message-id/1999972.1604957673%40sss.pgh.pa.us
and
https://www.postgresql.org/message-id/20201110043127.sfkyvvjqy7x3er5k%40alap3.anarazel.de
providing the output here?

> I hope that a fix will land quickly since this issue has a huge impact on
> our business.

It's unfortunately a nontrivial issue - I have a prototype for a fix,
but it's complicated and not yet complete. And unfortunately I've since
injured my hand and am now typing very slowly...


Btw, I'll be more likely to see your reply promptly if you CC me in your
reply.

Greetings,

Andres Freund



Re: BUG #16707: Memory leak

From
"Andres Freund"
Date:
Hi,

On Wed, Apr 7, 2021, at 09:28, Jaime Casanova wrote:
> On Tue, Nov 10, 2020 at 05:38:08PM -0800, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-11-10 23:45:16 +0100, Kurt Roeckx wrote:
> > > On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:
> > > > > There's one known (slow) memory leak in the JIT code / LLVM. Could you
> > > > > check if the issue vanishes if you disable JIT (jit = 0)?
> > > > 
> > > > I've just restarted it with jit = 0.
> > > 
> > > It's been about 3 hours since the restart, and it looks much
> > > better, it seems to be solved.
> > 
> > Hm, darn. Any chance you could check if the leak is present if you turn
> > on jit again, but disable inlining with jit_inline_above_cost=-1? If
> > that still fixes the leak I think I know the issue / have a reproducer
> > already...
> > 
> 
> I was bitten by this too while testing something. I thought it could have been
> a problem caused by the support for LLVM 12, but given that this report
> is for pg12, it seems this is older than that.

I hope to fix it soon - I have two different mostly working fixes. I'll get back to it once the dust around the freeze
settles.

Andres