Discussion: BUG #16707: Memory leak
The following bug has been logged on the website:

Bug reference:      16707
Logged by:          Kurt Roeckx
Email address:      kurt@roeckx.be
PostgreSQL version: 12.4
Operating system:   Debian

Description:

Hi,

It seems that since I switched from PostgreSQL 9.5 to 12, I started to have a
memory leak. This results in PostgreSQL dying because it can no longer
allocate memory, in which case it restarts and most of the time everything
then just recovers, which is why I'm rather late in noticing the problem. I
migrated early April to 12.2-2.pgdg100+1, and I'm currently running
12.4-1.pgdg100+1.

I'm running 2 databases on that server, but I only seem to have issues with 1
of them; only the processes that connect to that database seem to have a
memory leak. The only database I have a problem with is the one used by
synapse (https://github.com/matrix-org/synapse).

My configuration has:
shared_buffers = 4096MB
work_mem = 64MB
maintenance_work_mem = 64MB

The VSZ started around 4.3 GB and is now at 5.4 GB, of which 5.1 GB is RSS.
The VSZ and RSS grow over time. There are 10 such processes, all connections
to the synapse database. All other processes, including a connection to the
other database, still have a VSZ of 4.3 GB. The stats collector is the
exception, with only 64 MB VSZ.

I've put log_statement to "all" for a short while to see the queries that were
happening. The only type of query I see that's different from what all the
other databases do is that it uses "BEGIN ISOLATION LEVEL REPEATABLE READ",
while all the others just use "BEGIN". Not sure if that's related or not.

Kurt
PG Bug reporting form <noreply@postgresql.org> writes:
> It seems that since I switched from PostgreSQL 9.5 to 12, I started to have
> a memory leak.

Could you maybe collect a memory map from one of the bloated processes?
It'd go something like

* attach to process with gdb
* "call MemoryContextStats(TopMemoryContext)"
* quit gdb
* look in postmaster log for memory dump

You probably need to have debug symbols installed for step 2 to work.

			regards, tom lane
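[A minimal sketch of that gdb session, assuming debug symbols are installed
and $PID is one of the bloated backends ($PID is a placeholder, not a value
from the thread):]

    $ gdb -p $PID
    (gdb) call MemoryContextStats(TopMemoryContext)
    (gdb) detach
    (gdb) quit

MemoryContextStats() writes its per-context report to the backend's stderr,
which is why the result shows up in the postmaster log, as described above.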
On Mon, Nov 09, 2020 at 04:34:33PM -0500, Tom Lane wrote:
> * look in postmaster log for memory dump

TopMemoryContext: 154592 total in 8 blocks; 40296 free (89 chunks); 114296 used
  RI compare cache: 16384 total in 2 blocks; 6664 free (3 chunks); 9720 used
  RI query cache: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  RI constraint cache: 40888 total in 2 blocks; 2624 free (0 chunks); 38264 used
  CFuncHash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  Tsearch dictionary cache: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
  Tsearch parser cache: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
  Tsearch configuration cache: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
  Sequence values: 8192 total in 1 blocks; 1584 free (0 chunks); 6608 used
  Btree proof lookup cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  smgr relation table: 65536 total in 4 blocks; 16664 free (14 chunks); 48872 used
  TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used
  Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
  Relcache by OID: 32768 total in 3 blocks; 11536 free (6 chunks); 21232 used
  CacheMemoryContext: 2340096 total in 18 blocks; 191496 free (53 chunks); 2148600 used
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_push_actions_staging_id
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: ev_extrem_id
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: ev_extrem_room
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: event_forward_extremities_event_id_room_id_key
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: event_push_actions_stream_ordering
    index info: 3072 total in 2 blocks; 1136 free (1 chunks); 1936 used: event_push_actions_rm_tokens
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: event_push_actions_room_id_user_id
    index info: 3072 total in 2 blocks; 1104 free (1 chunks); 1968 used: event_id_user_id_profile_tag_uniqueness
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: room_stats_historical_end_ts
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: room_stats_historical_pkey
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: application_services_txns_id
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: application_services_txns_as_id_txn_id_key
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: event_push_summary_user_rm
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: presence_stream_user_id
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: presence_stream_id
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: device_inbox_stream_id
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: device_inbox_user_stream_id
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: user_directory_stream_pos_lock_key
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: stream_positions_idx
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: stats_incremental_position_lock_key
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: room_stats_current_pkey
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: application_services_state_pkey
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: pg_statistic_relid_att_inh_index
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: appservice_stream_position_lock_key
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: federation_stream_position_instance
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: users_in_public_rooms_r_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: users_in_public_rooms_u_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_directory_user_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_directory_room_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_directory_search_user_idx
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: user_directory_search_fts_idx
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: user_ips_user_token_ip_unique_index
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_ips_last_seen_only
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: user_ips_last_seen
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_signature_stream_idx
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: local_current_membership_room_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: local_current_membership_idx
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: server_signature_keys_server_name_key_id_key
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_stats_current_pkey
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: device_uniqueness
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_lists_remote_cache_unique_id
    index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: threepid_validation_token_session_id
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: threepid_validation_token_pkey
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: destinations_pkey
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: stream_ordering_to_exterm_rm_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: stream_ordering_to_exterm_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: received_transactions_ts
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: received_transactions_transaction_id_origin_key
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: e2e_cross_signing_keys_idx
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: server_keys_json_uniqueness
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: monthly_active_users_time_stamp
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: monthly_active_users_users
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: current_state_events_room_id_type_state_key_key
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: current_state_events_event_id_key
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: event_push_summary_stream_ordering_lock_key
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_70339161_index
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_federation_inbox_sender_id
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: account_data_stream_id
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: account_data_uniqueness
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: group_attestations_renewals_v_idx
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: group_attestations_renewals_u_idx
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: group_attestations_renewals_g_idx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: blocked_rooms_idx
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: ui_auth_sessions_session_id_key
    index info: 1024 total in 1 blocks; 64 free (0 chunks); 960 used: room_aliases_id
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: room_aliases_room_alias_key
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: ex_outlier_stream_pkey
    index info: 3072 total in 2 blocks; 1136 free (1 chunks); 1936 used: e2e_room_keys_with_version_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: room_account_data_stream_id
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: room_account_data_uniqueness
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_daily_visits_ts_idx
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: user_daily_visits_uts_idx
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: private_user_data_max_stream_id_lock_key
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_lists_outbound_last_success_unique_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: device_federation_outbox_id
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_federation_outbox_destination_id
    index info: 2048 total in 2 blocks; 920 free (0 chunks); 1128 used: device_lists_outbound_pokes_stream
    index info: 2048 total in 2 blocks; 648 free (2 chunks); 1400 used: device_lists_outbound_pokes_user
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: device_lists_outbound_pokes_id
    index info: 2048 total in 2 blocks; 592 free (1 chunks); 1456 used: e2e_cross_signing_signatures2_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: user_stats_historical_end_ts
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: user_stats_historical_pkey
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_15165859_index
    index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: event_txn_id_ts
    index info: 3072 total in 2 blocks; 1136 free (1 chunks); 1936 used: event_txn_id_txn_id
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_txn_id_event_id
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: remote_profile_cache_time
    index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: remote_profile_cache_user_id
    TS dictionary: 1024 total in 1 blocks; 696 free (0 chunks); 328 used: simple
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_search_event_id_idx
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: event_search_ev_ridx
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: event_search_fts_idx
    TS dictionary: 8192 total in 4 blocks; 2864 free (9 chunks); 5328 used: english_stem
    index info: 2048 total in 2 blocks; 624 free (1 chunks); 1424 used: room_tag_uniqueness
    index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: profiles_user_id_key
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: e2e_device_keys_json_uniqueness
    index info: 3072 total in 2 blocks; 984 free (1 chunks); 2088 used: remote_media_repository_thumbn_media_origin_id_width_height_met
    index info: 3072 total in 2 blocks; 1048 free (1 chunks); 2024 used: remote_media_cache_thumbnails_media_origin_media_id_thumbna_key
    index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: e2e_room_keys_versions_idx
    150 more child contexts containing 287744 total in 275 blocks; 108096 free (155 chunks); 179648 used
  WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
  PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
  MdSmgr: 32768 total in 3 blocks; 19048 free (15 chunks); 13720 used
  LOCALLOCK hash: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used

Which was for this process:

USER       PID %CPU %MEM     VSZ     RSS TTY STAT START  TIME COMMAND
postgres 10000  2.6 16.3 5547172 5374656 ?   Ss   Nov08 54:10 postgres: synapse synapse [local] idle

Kurt
Kurt Roeckx <kurt@roeckx.be> writes:
> On Mon, Nov 09, 2020 at 04:34:33PM -0500, Tom Lane wrote:
>> * look in postmaster log for memory dump

> ...
> Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used

> Which was for this process:
> USER       PID %CPU %MEM     VSZ     RSS TTY STAT START  TIME COMMAND
> postgres 10000  2.6 16.3 5547172 5374656 ?   Ss   Nov08 54:10 postgres: synapse synapse [local] idle

Hm.  It would seem that whatever you're leaking was not allocated via
palloc.  Have you got any extensions loaded into that backend?

It's also worth noting that if you've got 4GB of shared buffers,
a total process vsize of 5.3GB doesn't seem all that far out of
line.  I'm not quite convinced that you have a leak at all,
as opposed to processes gradually touching more and more of the
shared buffer arena.

			regards, tom lane
Hi,

On 2020-11-09 17:20:37 -0500, Tom Lane wrote:
> Kurt Roeckx <kurt@roeckx.be> writes:
> > Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used
>
> > Which was for this process:
> > USER       PID %CPU %MEM     VSZ     RSS TTY STAT START  TIME COMMAND
> > postgres 10000  2.6 16.3 5547172 5374656 ?   Ss   Nov08 54:10 postgres: synapse synapse [local] idle
>
> Hm.  It would seem that whatever you're leaking was not allocated via
> palloc.  Have you got any extensions loaded into that backend?
>
> It's also worth noting that if you've got 4GB of shared buffers,
> a total process vsize of 5.3GB doesn't seem all that far out of
> line.  I'm not quite convinced that you have a leak at all,
> as opposed to processes gradually touching more and more of the
> shared buffer arena.

As this is on a halfway recent linux, I suggest doing something like

$ grep ^Rss /proc/$pid/status
RssAnon:            6664 kB
RssFile:           69512 kB
RssShmem:          15788 kB

Anon are allocations and some other small stuff, RssFile is memory mapped
files, shmem is shared memory (but 0 when huge pages are used).

Greetings,

Andres Freund
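[A small sketch for watching whether the anonymous portion keeps climbing,
assuming $PID is the suspect backend; the 60-second interval is arbitrary:]

    $ while sleep 60; do date '+%H:%M:%S'; grep ^RssAnon /proc/$PID/status; done

A steadily rising RssAnon with a flat RssShmem points at real allocations
rather than the backend merely touching more of the shared-buffer arena.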
On Mon, Nov 09, 2020 at 05:20:37PM -0500, Tom Lane wrote:
> Kurt Roeckx <kurt@roeckx.be> writes:
> > On Mon, Nov 09, 2020 at 04:34:33PM -0500, Tom Lane wrote:
> >> * look in postmaster log for memory dump
>
> > ...
> > Grand total: 3575000 bytes in 533 blocks; 596232 free (450 chunks); 2978768 used
>
> > Which was for this process:
> > USER       PID %CPU %MEM     VSZ     RSS TTY STAT START  TIME COMMAND
> > postgres 10000  2.6 16.3 5547172 5374656 ?   Ss   Nov08 54:10 postgres: synapse synapse [local] idle
>
> Hm.  It would seem that whatever you're leaking was not allocated via
> palloc.  Have you got any extensions loaded into that backend?

synapse=> \dx
                 List of installed extensions
  Name   | Version |   Schema   |         Description
---------+---------+------------+------------------------------
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
(1 row)

> It's also worth noting that if you've got 4GB of shared buffers,
> a total process vsize of 5.3GB doesn't seem all that far out of
> line.  I'm not quite convinced that you have a leak at all,
> as opposed to processes gradually touching more and more of the
> shared buffer arena.

Top says the shared size is now 4.1 GB, and has said so for a while. On the
other hand, the virtual size keeps growing. The virtual size shouldn't change
much over time; the resident size will grow until the shared buffers are
fully used, and should then stay around that value.

ps now says:
USER       PID %CPU %MEM     VSZ     RSS TTY STAT START  TIME COMMAND
postgres 10000  2.2 16.6 5632572 5468232 ?   Ss   Nov08 61:09 postgres: synapse synapse [local] idle

At some point all 32 GB of RAM will get used, it starts to swap out, and then
malloc starts to fail. (I've set Linux to not allow overcommit
(vm.overcommit_memory = 2), otherwise it would be the OOM killer doing its
thing.)

This is what another process looks like:
postgres  8966  0.2  3.8 4387236 1276104 ?   Ss   Nov08  7:10 postgres: kurt certs [local] idle

top says that that process has 1.2 GB shared RAM, which also matches its
resident size.

Kurt
On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:
> As this is on a halfway recent linux, I suggest doing something like
>
> $ grep ^Rss /proc/$pid/status
> RssAnon:            6664 kB
> RssFile:           69512 kB
> RssShmem:          15788 kB

RssAnon:         1197064 kB
RssFile:           27420 kB
RssShmem:        4248052 kB

Kurt
On Tue, Nov 10, 2020 at 09:11:20AM +0100, Kurt Roeckx wrote:
> On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:
> > As this is on a halfway recent linux, I suggest doing something like
> >
> > $ grep ^Rss /proc/$pid/status
> > RssAnon:            6664 kB
> > RssFile:           69512 kB
> > RssShmem:          15788 kB
>
> RssAnon:         1197064 kB
> RssFile:           27420 kB
> RssShmem:        4248052 kB

Maybe this is useful:

$ grep kB /proc/10000/status
VmPeak:       5654956 kB
VmSize:       5637004 kB
VmLck:              0 kB
VmPin:              0 kB
VmHWM:        5479104 kB
VmRSS:        5472668 kB
RssAnon:      1197196 kB
RssFile:        27420 kB
RssShmem:     4248052 kB
VmData:       1192724 kB
VmStk:            132 kB
VmExe:           5388 kB
VmLib:          68200 kB
VmPTE:          10932 kB
VmSwap:             0 kB
HugetlbPages:       0 kB

Kurt
On Tue, Nov 10, 2020 at 09:19:34AM +0100, Kurt Roeckx wrote:
> On Tue, Nov 10, 2020 at 09:11:20AM +0100, Kurt Roeckx wrote:
> > On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:
> > > As this is on a halfway recent linux, I suggest doing something like
> > >
> > > $ grep ^Rss /proc/$pid/status
> > > RssAnon:            6664 kB
> > > RssFile:           69512 kB
> > > RssShmem:          15788 kB
> >
> > RssAnon:         1197064 kB
> > RssFile:           27420 kB
> > RssShmem:        4248052 kB
>
> Maybe this is useful:
> $ grep kB /proc/10000/status
> VmPeak:       5654956 kB
> VmSize:       5637004 kB
> VmHWM:        5479104 kB
> VmRSS:        5472668 kB
> RssAnon:      1197196 kB
> RssFile:        27420 kB
> RssShmem:     4248052 kB
> VmData:       1192724 kB
> [...]

The same thing, about 9 hours later:

VmPeak:       5882112 kB
VmSize:       5857276 kB
VmLck:              0 kB
VmPin:              0 kB
VmHWM:        5722668 kB
VmRSS:        5697940 kB
RssAnon:      1414988 kB
RssFile:        27472 kB
RssShmem:     4255480 kB
VmData:       1412996 kB
VmStk:            132 kB
VmExe:           5388 kB
VmLib:          68200 kB
VmPTE:          11372 kB
VmSwap:             0 kB
HugetlbPages:       0 kB

Kurt
Hi,

On 2020-11-10 09:11:20 +0100, Kurt Roeckx wrote:
> On Mon, Nov 09, 2020 at 08:31:27PM -0800, Andres Freund wrote:
> > As this is on a halfway recent linux, I suggest doing something like
> >
> > $ grep ^Rss /proc/$pid/status
> > RssAnon:            6664 kB
> > RssFile:           69512 kB
> > RssShmem:          15788 kB
>
> RssAnon:         1197064 kB
> RssFile:           27420 kB
> RssShmem:        4248052 kB

Ok, so it's actual allocations that are the problem. What kind of queries
is this workload running?

There's one known (slow) memory leak in the JIT code / LLVM. Could you
check if the issue vanishes if you disable JIT (jit = 0)?

Otherwise it might be useful to collect stack traces for memory
allocations. You could try something like 'heaptrack' or add a perf probe
on malloc, and do a perf profile.

E.g. something like

perf probe -x /lib/x86_64-linux-gnu/libc.so.6 -a malloc
perf record -e probe_libc:malloc --call-graph dwarf -p $pid_of_problematic_process

Regards,

Andres
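[As a follow-up sketch to the perf approach above, assuming the probe was
created exactly as shown and perf wrote its default perf.data file, the
recorded allocation stacks could be inspected and the probe removed like so:]

    $ perf report                          # browse the call graphs leading to malloc
    $ perf probe --del probe_libc:malloc   # clean up the probe when done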
I took a copy of /proc/$pid/smaps about 2 hours apart, and a diff shows:

-5604495f1000-5604886fd000 rw-p 00000000 00:00 0       [heap]
-Size:            1033264 kB
+5604495f1000-56048cc1c000 rw-p 00000000 00:00 0       [heap]
+Size:            1104044 kB
 KernelPageSize:        4 kB
 MMUPageSize:           4 kB
-Rss:             1030296 kB
-Pss:             1030296 kB
+Rss:             1101288 kB
+Pss:             1101288 kB
 Shared_Clean:          0 kB
 Shared_Dirty:          0 kB
 Private_Clean:         0 kB
-Private_Dirty:   1030296 kB
-Referenced:       998992 kB
-Anonymous:       1030296 kB
+Private_Dirty:   1101288 kB
+Referenced:      1067220 kB
+Anonymous:       1101288 kB
 LazyFree:              0 kB
 AnonHugePages:         0 kB
 ShmemPmdMapped:        0 kB

So the heap grew by 70 MB. The new results from status:

VmPeak:       5928312 kB
VmSize:       5928056 kB
VmLck:              0 kB
VmPin:              0 kB
VmHWM:        5772336 kB
VmRSS:        5772084 kB
RssAnon:      1485980 kB
RssFile:        27472 kB
RssShmem:     4258632 kB
VmData:       1483776 kB
VmStk:            132 kB
VmExe:           5388 kB
VmLib:          68200 kB
VmPTE:          11512 kB
VmSwap:             0 kB
HugetlbPages:       0 kB

Both seem to say that the heap grew by about 70 MB in about 2 hours. With 10
processes growing at this rate, that is about 8 GB a day.

Kurt
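[For reference, a sketch of how such a comparison can be captured, assuming
$PID is the backend and the two-hour gap simply mirrors what was done above:]

    $ cat /proc/$PID/smaps > smaps.before
    $ sleep 7200                                 # wait ~2 hours
    $ cat /proc/$PID/smaps > smaps.after
    $ diff -u smaps.before smaps.after | less    # look for the [heap] mapping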
On Tue, Nov 10, 2020 at 11:35:17AM -0800, Andres Freund wrote:
> Ok, so it's actual allocations that are the problem. What kind of
> queries is this workload running?

I really have no idea, but I'll try to get an idea if the jit thing
doesn't work.

> There's one known (slow) memory leak in the JIT code / LLVM. Could you
> check if the issue vanishes if you disable JIT (jit = 0)?

I've just restarted it with jit = 0.

> Otherwise it might be useful to collect stack traces for memory
> allocations. You could try something like 'heaptrack' or add a perf
> probe on malloc, and do a perf profile.
>
> E.g. something like
>
> perf probe -x /lib/x86_64-linux-gnu/libc.so.6 -a malloc
> perf record -e probe_libc:malloc --call-graph dwarf -p $pid_of_problematic_process

Let's first see what happens with jit disabled. If I still see it, I'll
try that.

Kurt
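[One way that change can be made, as a sketch: it assumes a superuser
connection, and jit could equally be set in postgresql.conf next to the
settings quoted in the original report:]

    $ psql -c "ALTER SYSTEM SET jit = 0;"
    $ psql -c "SELECT pg_reload_conf();"   # running backends apply the new value at their next command
    $ psql -d synapse -c "SHOW jit;"       # verify in a session against the synapse database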
On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:
> > There's one known (slow) memory leak in the JIT code / LLVM. Could you
> > check if the issue vanishes if you disable JIT (jit = 0)?
>
> I've just restarted it with jit = 0.

It's been about 3 hours since the restart, and it looks much
better, it seems to be solved.

Kurt
Hi,

On 2020-11-10 23:45:16 +0100, Kurt Roeckx wrote:
> On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:
> > > There's one known (slow) memory leak in the JIT code / LLVM. Could you
> > > check if the issue vanishes if you disable JIT (jit = 0)?
> >
> > I've just restarted it with jit = 0.
>
> It's been about 3 hours since the restart, and it looks much
> better, it seems to be solved.

Hm, darn. Any chance you could check if the leak is present if you turn
on jit again, but disable inlining with jit_inline_above_cost=-1? If
that still fixes the leak I think I know the issue / have a reproducer
already...

- Andres
On Tue, Nov 10, 2020 at 05:38:08PM -0800, Andres Freund wrote:
> Hm, darn. Any chance you could check if the leak is present if you turn
> on jit again, but disable inlining with jit_inline_above_cost=-1? If
> that still fixes the leak I think I know the issue / have a reproducer
> already...

So after about 14 hours, it shows:

$ grep kB /proc/32117/status
VmPeak:       4412436 kB
VmSize:       4389988 kB
VmLck:              0 kB
VmPin:              0 kB
VmHWM:        4261196 kB
VmRSS:        4239936 kB
RssAnon:         6112 kB
RssFile:         7340 kB
RssShmem:     4226484 kB
VmData:          5412 kB
VmStk:            132 kB
VmExe:           5388 kB
VmLib:          13032 kB
VmPTE:           8532 kB
VmSwap:             0 kB
HugetlbPages:       0 kB

The munin graph is also very flat now, so with jit=0, it seems fixed.

I've just restarted it with
jit=1
jit_inline_above_cost=-1

I will let you know.

Kurt
On Wed, Nov 11, 2020 at 10:49:18AM +0100, Kurt Roeckx wrote:
> > Hm, darn. Any chance you could check if the leak is present if you turn
> > on jit again, but disable inlining with jit_inline_above_cost=-1? If
> > that still fixes the leak I think I know the issue / have a reproducer
> > already...
>
> I've just restarted it with
> jit=1
> jit_inline_above_cost=-1

So after more than 2 hours, I see no leak with those settings.

Kurt
Hi,

We are also encountering this issue on our PostgreSQL 12 database. Disabling
JIT inlining seems to drastically reduce the memory leak, but after 4 days
without a restart (compared to the 3 restarts per day before the change) it
seems that we still have a leak. The RAM is full and the swap is almost full.

I hope that a fix will land quickly, since this issue has a huge impact on
our business.

Regards
Hi,

On 2020-11-19 09:45:50 -0700, laurent.mignon@acsone.eu wrote:
> We are also encountering this issue on our PostgreSQL 12 database. Disabling
> JIT inlining seems to drastically reduce the memory leak, but after 4 days
> without a restart (compared to the 3 restarts per day before the change) it
> seems that we still have a leak. The RAM is full and the swap is almost
> full.

In that case it is a separate issue from Kurt's. Could you follow the
instructions from
https://www.postgresql.org/message-id/1999972.1604957673%40sss.pgh.pa.us
and
https://www.postgresql.org/message-id/20201110043127.sfkyvvjqy7x3er5k%40alap3.anarazel.de
providing the output here?

> I hope that a fix will land quickly, since this issue has a huge impact on
> our business.

It's unfortunately a nontrivial issue - I have a prototype for a fix, but
it's complicated and not yet complete. And unfortunately I've since injured
my hand and am now typing very slowly...

Btw, I'll be more likely to see your reply promptly if you CC me in your
reply.

Greetings,

Andres Freund
Hi,

On Wed, Apr 7, 2021, at 09:28, Jaime Casanova wrote:
> On Tue, Nov 10, 2020 at 05:38:08PM -0800, Andres Freund wrote:
> > On 2020-11-10 23:45:16 +0100, Kurt Roeckx wrote:
> > > On Tue, Nov 10, 2020 at 08:50:39PM +0100, Kurt Roeckx wrote:
> > > > > There's one known (slow) memory leak in the JIT code / LLVM. Could you
> > > > > check if the issue vanishes if you disable JIT (jit = 0)?
> > > >
> > > > I've just restarted it with jit = 0.
> > >
> > > It's been about 3 hours since the restart, and it looks much
> > > better, it seems to be solved.
> >
> > Hm, darn. Any chance you could check if the leak is present if you turn
> > on jit again, but disable inlining with jit_inline_above_cost=-1? If
> > that still fixes the leak I think I know the issue / have a reproducer
> > already...
>
> I was bit by this too while testing something. I thought it could have been
> a problem caused by the support of llvm 12? but given that this report
> is for pg12, it seems this is older than that.

I hope to fix it soon - I have two different mostly working fixes. I'll get
back to it once the dust around the freeze settles.

Andres