Discussion: PostgreSQL host swapping hard with abundant free memory
Greetings,

We have been having some pretty painful site outages related to heavy swapping, even though the system may appear to have as much as 10GB of free memory. It seems that as soon as the system hits ~22GB (of 32GB) of memory usage it starts to swap; as soon as we go below ~22GB, swap is released.

During the worst outages we see:

- heavy swapping (10-15GB)
- heavy disk IO (latencies up to 600ms)
- heavy CPU load: ~4.0 per core (load average over 100 on 24 cores)
- available memory: 6-8GB

Here's the host information; it's a dedicated physical host, not a VM:

- Ubuntu 14.04.1 LTS
- Linux db004 3.13.0-34-generic #60-Ubuntu SMP Wed Aug 13 15:45:27 UTC 2014 x86_64 GNU/Linux
- 24 cores, 32GB RAM, 32GB swapfile
- RAID 10 with SSDs

Postgres version:

- postgresql-9.3.5
- postgis-2.1.4 extension

Here's our /etc/sysctl.conf:

    kernel.shmmax = 16850395136
    kernel.shmall = 4113866
    vm.overcommit_memory = 2
    vm.overcommit_ratio = 50
    vm.swappiness = 0
    vm.dirty_ratio = 10              ## maximum 3.2GB dirty cache size
    vm.dirty_background_ratio = 5    ## ratio at which disk begins to flush cache (1.6GB)

We are setting our shared_buffers, work_mem and maintenance_work_mem very high, which I suspect is the reason for the swapping. The huge mystery to me is that during heavy swap there appears to be ~8-10GB of free memory in cache. We wondered whether that unused memory in cache was being consumed by our memory settings in postgresql.conf:

    shared_buffers = 8GB            ## 25% of system memory (32GB)
    maintenance_work_mem = 8GB      ## autovacuuming on, 3 workers (defaults)
    work_mem = 256MB                ## we have as many as 170 connections

We average around 170 connections and have been moving everything to pgbouncer to reduce this count. A rough estimate of the worst-case cost with work_mem set to 256MB:

    170 x 256MB = 43520MB (~42.5GB)

Is it possible that the high work_mem setting is causing the connections to hold on to the extra available memory? This is our assumption, so we plan on dialing down work_mem and maintenance_work_mem to sane values.
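Rather than theorizing, one way to check what the backends actually hold (a sketch assuming procps-style `ps`; note that each backend's RSS includes the shared-buffer pages it has touched, so the sum overstates real usage):

```shell
# Sum the resident set size (RSS) of every postgres process, in MB.
# Caveat: RSS counts touched shared-buffer pages once per backend,
# so shared memory is double-counted across backends.
ps -C postgres -o rss= | awk '{ sum += $1 } END { printf "%d MB\n", sum/1024 }'
```

If the total (minus the double-counted shared buffers) is far below 22GB, the pressure is coming from somewhere other than work_mem.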
We have also ordered more RAM (increasing from 32GB to 96GB), but I would like to understand what is happening. Here are the values we are going to change:

    maintenance_work_mem = 1GB
    work_mem = 64MB

One last thing: we disabled THP (transparent huge pages) because we were seeing compaction stalls:

    [db003.prod:~] root% egrep 'compact_(fail|stall)' /proc/vmstat
    compact_stall 34682729
    compact_fail 32915396

Here are our competing theories of why we are swapping as much as 10-15GB with 6-10GB of free memory:

1.) maintenance_work_mem and work_mem are set too high, causing postgres to allocate too much memory
2.) The 6-10GB of free memory observed during swapping is postgresql's shared buffer
3.) Linux filesystem caching is tuned incorrectly, maybe SSD related?
4.) This is a NUMA-related issue similar to the MySQL swap insanity issue:
    http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
    http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html
    http://www.postgresql.org/message-id/CAGTBQpacrSDcN10rTwRbH+AGm2_y0Qao6CJDoyvEp504iFbdrw@mail.gmail.com

Note, zone_reclaim_mode appears to be disabled on our database host:

    root% cat /proc/sys/vm/zone_reclaim_mode
    0

Hoping someone here can help us sort this out, because it's a huge mystery for us. We are going to add more RAM, but I'm trying to understand what is happening.

Thank you kindly.

Christian Gough
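For what it's worth, the back-of-envelope worst case under the old settings adds up to well over physical RAM (pure arithmetic; actual usage would be far lower, since not every backend allocates a full work_mem at once):

```shell
# Worst-case allocation with the old settings (all numbers in MB):
shared_buffers=8192          # 8GB, allocated once (shared)
maintenance_work_mem=8192    # 8GB, per autovacuum worker
autovacuum_workers=3
work_mem=256                 # per sort/hash operation, per backend
connections=170

total=$(( shared_buffers + maintenance_work_mem * autovacuum_workers \
        + work_mem * connections ))
echo "${total} MB"           # 76288 MB, i.e. ~74.5GB on a 32GB host
```

Even this understates the ceiling, since a single backend can run more than one work_mem-sized operation at a time.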
On 4/17/15 1:25 PM, Christian Gough wrote:

> Here are our competing theories of why we are swapping as much as 10-15GB with 6GB-10GB of free memory:

How are you determining that you're swapping? How are you measuring 'free memory'?

> 1.) maintenance_work_mem and work_mem are set too high causing postgres to allocate too much memory

There's not much that uses maintenance_work_mem: (AUTO)VACUUM, CREATE INDEX... I think that's it. Also, I believe it's internally limited to 1GB.

As for work_mem, it's hard to say. As your math shows, if *every* connection suddenly allocated a full work_mem then you'd be in trouble, but that would be a pretty extreme situation. Even though a backend can have multiple work_mem-consuming operations in use at once, so it could theoretically use several times work_mem, in reality I think it's very rare for every backend to use even a fraction of work_mem. But there's no need to theorize here; work_mem will be reported as allocated to each backend and will not be considered free memory by the OS.

> 2.) The 6GB - 10GB of free memory observed during swapping is postgresql's shared buffer

If anything, I'd expect the OS to avoid swapping a shared memory segment (if not flat-out refuse to swap it). Shared memory certainly shouldn't be counted as free either. The only tricky bit here is that many OSes report shared memory as part of the memory footprint *for each backend*. So if you have 170 backends and 8GB of shared buffers, it could look like you're using ~1.36TB of memory (which you're obviously not).

> 3.) Linux filesystem caching is tuned incorrectly, maybe SSD related?

Possibly, though I don't think it'd be SSD-related...

> 4.) This is a NUMA related issue similar to the Mysql Swap Insanity issue:

Since you have zone_reclaim_mode turned off, I don't think so.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
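As an aside on measuring 'free memory': on a 3.13 kernel, /proc/meminfo has no MemAvailable field (it arrived in 3.14), so reclaimable page cache has to be added back by hand. A rough sketch, assuming the standard /proc/meminfo layout:

```shell
# Kernels before 3.14 lack MemAvailable, so approximate it as
# MemFree + Buffers + Cached (all /proc/meminfo values are in kB).
awk '/^(MemFree|Buffers|Cached):/ { sum += $2 }
     END { printf "approx available: %.1f GB\n", sum/1024/1024 }' /proc/meminfo

# Swap actually in use right now:
awk '/^SwapTotal:/ { t = $2 } /^SwapFree:/ { f = $2 }
     END { printf "swap used: %.1f GB\n", (t - f)/1024/1024 }' /proc/meminfo
```

The "approx available" number is what `free` shows as free + buffers/cache; if the box swaps while that number is large, the pages being evicted are ones the kernel considers unreclaimable (anonymous or locked), not cache.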