Обсуждение: CPU load spikes when CentOS tries to reclaim 'cached' memory

Поиск
Список
Период
Сортировка

CPU load spikes when CentOS tries to reclaim 'cached' memory

От
vlasmarias
Дата:
For the past few days, we've been seeing unexpected extremely high CPU spikes
in our system. We observed the following: the 'free' memory would go down to
lower than 300 MB; at that point, 'cached' slowly starts to go down, and
then CPU starts to go way up.

It's almost as if the OS was not releasing 'cached' memory fast enough for
Postgres. Is that analysis correct? Is there a way to fix this?

Here's the session:

 04:58:37 load average: 2.37, free: 532, cached: 22852
 04:58:57 load average: 1.91, free: 451, cached: 22859
 04:59:17 load average: 1.82, free: 469, cached: 22866
 04:59:57 load average: 1.57, free: 387, cached: 22884
 05:00:17 load average: 3.03, free: 574, cached: 22632
 05:00:37 load average: 5.51, free: 268, cached: 22267
 05:00:58 load average: 21.44, free: 237, cached: 21704
 05:01:18 load average: 42.98, free: 243, cached: 21061
 05:01:38 load average: 63.38, free: 266, cached: 20410
 05:01:58 load average: 78.69, free: 315, cached: 20135
 05:02:19 load average: 89.82, free: 214, cached: 20034
 05:02:39 load average: 99.06, free: 253, cached: 19873
 05:02:59 load average: 105.60, free: 390, cached: 19497
 05:03:20 load average: 110.10, free: 394; cached: 19290

Here are the pertinent machine and OS and Postgres details:
RAM: 32 gigs
CPU: 24 cores;  Intel(R) Xeon(R) CPU X7460  @ 2.66GHz
RAID 10
vm.swappiness=0
 PostgreSQL 9.1.11 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-3), 64-bit
 Linux ps2db 2.6.32-431.11.2.el6.x86_64 #1 SMP Tue Mar 25 19:59:55 UTC 2014
x86_64 x86_64 x86_64 GNU/Linux
postgres=# SELECT name, current_setting(name), source
postgres-#   FROM pg_settings
postgres-#   WHERE source NOT IN ('default', 'override');
             name             |        current_setting        |
source
------------------------------+-------------------------------+----------------------
 application_name             | psql                          | client
 archive_command              | /bin/true                     |
configuration file
 archive_mode                 | on                            |
configuration file
 autovacuum_analyze_threshold | 50                            |
configuration file
 autovacuum_freeze_max_age    | 800000000                     |
configuration file
 autovacuum_naptime           | 5min                          |
configuration file
 autovacuum_vacuum_threshold  | 50                            |
configuration file
 bytea_output                 | escape                        |
configuration file
 checkpoint_completion_target | 0.7                           |
configuration file
 checkpoint_segments          | 128                           |
configuration file
 checkpoint_timeout           | 15min                         |
configuration file
 checkpoint_warning           | 30s                           |
configuration file
 client_encoding              | UTF8                          | client
 constraint_exclusion         | partition                     |
configuration file
 cpu_index_tuple_cost         | 0.005                         |
configuration file
 cpu_operator_cost            | 0.0025                        |
configuration file
 cpu_tuple_cost               | 0.01                          |
configuration file
 custom_variable_classes      | pg_stat_statements            |
configuration file
 DateStyle                    | ISO, MDY                      |
configuration file
 default_statistics_target    | 100                           |
configuration file
 default_text_search_config   | pg_catalog.english            |
configuration file
 effective_cache_size         | 16GB                          |
configuration file
 effective_io_concurrency     | 1                             |
configuration file
 enable_material              | off                           |
configuration file
 escape_string_warning        | on                            |
configuration file
 hot_standby                  | on                            |
configuration file
 lc_messages                  | C                             |
configuration file
 lc_monetary                  | en_US.UTF-8                   |
configuration file
 lc_numeric                   | en_US.UTF-8                   |
configuration file
 lc_time                      | en_US.UTF-8                   |
configuration file
 listen_addresses             | *                             |
configuration file
 log_autovacuum_min_duration  | 0                             |
configuration file
 log_checkpoints              | on                            |
configuration file
 log_connections              | on                            |
configuration file
 log_destination              | csvlog                        |
configuration file
 log_directory                | pg_log                        |
configuration file
 log_disconnections           | on                            |
configuration file
 log_filename                 | postgresql.log.ps2db.%H       |
configuration file
 log_line_prefix              | %t [%d] [%u] [%p]: [%l-1] %h  |
configuration file
 log_lock_waits               | on                            |
configuration file
 log_min_duration_statement   | 0                             |
configuration file
 log_rotation_age             | 1h                            |
configuration file
 log_temp_files               | 0                             |
configuration file
 log_timezone                 | Canada/Pacific                | environment
variable
 log_truncate_on_rotation     | on                            |
configuration file
 logging_collector            | on                            |
configuration file
 maintenance_work_mem         | 1GB                           |
configuration file
 max_connections              | 500                           |
configuration file
 max_locks_per_transaction    | 512                           |
configuration file
 max_stack_depth              | 2MB                           | environment
variable
 max_standby_streaming_delay  | 90min                         |
configuration file
 max_wal_senders              | 6                             |
configuration file
 pg_stat_statements.max       | 10000                         |
configuration file
 pg_stat_statements.track     | all                           |
configuration file
 port                         | 5432                          |
configuration file
 random_page_cost             | 4                             |
configuration file
 shared_buffers               | 6GB                           |
configuration file
 shared_preload_libraries     | pg_stat_statements            |
configuration file
 standard_conforming_strings  | off                           |
configuration file
 stats_temp_directory         | /ram_postgres_stats           |
configuration file
 temp_buffers                 | 16MB                          |
configuration file
 TimeZone                     | Canada/Pacific                | environment
variable
 wal_keep_segments            | 64                            |
configuration file
 wal_level                    | hot_standby                   |
configuration file
 work_mem                     | 8MB                           |
configuration file
(65 rows)





--
View this message in context:
http://postgresql.1045698.n5.nabble.com/CPU-load-spikes-when-CentOS-tries-to-reclaim-cached-memory-tp5806122.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: CPU load spikes when CentOS tries to reclaim 'cached' memory

От
Jeff Janes
Дата:

On Wed, Jun 4, 2014 at 5:27 PM, vlasmarias <vlasmarias@contigo.com> wrote:
For the past few days, we've been seeing unexpected extremely high CPU spikes
in our system. We observed the following: the 'free' memory would go down to
lower than 300 MB; at that point, 'cached' slowly starts to go down, and
then CPU starts to go way up.

It's almost as if the OS was not releasing 'cached' memory fast enough for
Postgres. Is that analysis correct? Is there a way to fix this?

This sounds like a kernel problem, probably either the zone reclaim issue, or the transparent huge pages issue.

I don't know the exact details off the top of my head, but both have been discussed a lot on both this list and the pgsql-hackers list.


 

Here's the session:

 04:58:37 load average: 2.37, free: 532, cached: 22852
 04:58:57 load average: 1.91, free: 451, cached: 22859
 04:59:17 load average: 1.82, free: 469, cached: 22866
 04:59:57 load average: 1.57, free: 387, cached: 22884

What tool is that?  I'm not familiar with this output format.


 
 max_connections              | 500                

While this is probably fundamentally a kernel problem, you are not doing yourself any favors by allowing 500 connections to a machine with 24 cores.  High numbers of connections can trigger poor kernel behavior.

Cheers,

Jeff

Re: CPU load spikes when CentOS tries to reclaim 'cached' memory

От
Merlin Moncure
Дата:
On Thu, Jun 5, 2014 at 10:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> This sounds like a kernel problem, probably either the zone reclaim issue,
> or the transparent huge pages issue.

I at first thought maybe same, but I don't think THP was introduced
until 2.6.38...OP is running 2.6.32-431.11.2.el6.x86_6.  Maybe it's
NUMA related, but would not be idiomatic of NUMA issues as I
understand them (poor memory utilization/high IO utilization).  Would
be a very cheap/easy thing to try though.

Is this server virtualized?

merlin


Re: CPU load spikes when CentOS tries to reclaim 'cached' memory

От
Deron
Дата:
We saw very similar issues with a CentOS server with 40 cores (32 virtualized) when moving from a physical server to a virtual server (I think it had 128GB RAM).   Never had the problem on a physical server.  We checked the same things as noted here, but never found a bug.   We really thought it had something to do with NUMA zone reclaim, but could never prove that.    In our case it was all kernel time in the guest, all CPUs at 100%.   Sometimes it would last for a few seconds or minutes.  Sometimes we would go days without a problem, and then it would completely tank.  

If you figure out what is going on, I would like to know  (especially if it is virtualized).

Deron



On Thu, Jun 5, 2014 at 12:23 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Thu, Jun 5, 2014 at 10:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> This sounds like a kernel problem, probably either the zone reclaim issue,
> or the transparent huge pages issue.

I at first thought maybe same, but I don't think THP was introduced
until 2.6.38...OP is running 2.6.32-431.11.2.el6.x86_6.  Maybe it's
NUMA related, but would not be idiomatic of NUMA issues as I
understand them (poor memory utilization/high IO utilization).  Would
be a very cheap/easy thing to try though.

Is this server virtualized?

merlin


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: CPU load spikes when CentOS tries to reclaim 'cached' memory

От
Merlin Moncure
Дата:
On Thu, Jun 5, 2014 at 2:47 PM, Deron <fecastle@gmail.com> wrote:
> We saw very similar issues with a CentOS server with 40 cores (32
> virtualized) when moving from a physical server to a virtual server (I think
> it had 128GB RAM).   Never had the problem on a physical server.  We checked
> the same things as noted here, but never found a bug.   We really thought it
> had something to do with NUMA zone reclaim, but could never prove that.
> In our case it was all kernel time in the guest, all CPUs at 100%.
> Sometimes it would last for a few seconds or minutes.  Sometimes we would go
> days without a problem, and then it would completely tank.
>
> If you figure out what is going on, I would like to know  (especially if it
> is virtualized).

There is a class of problems in virutalized enviroment that come from
over-aggressive reclaiming of memory from the guest to the host.  When
the guest tries to access the 'unpinned' memory it will manifest as
high latency memory reads and show up as high user time.  That may or
may not be the case here.

What we'd need from the OP to get a better diagnosis is:
*) top/sar output showing if the load average is due to high user,sys, or iowait
*) is/isnot virtualized as noted above
*) captured 'perf' snapshot during slowdown, particularly if we are
seeing high user space loads.  For example, we could be looking at
high spinlock activity (which seems unlikely given how the problem is
described but is something to rule out for sure).

merlin


Re: CPU load spikes when CentOS tries to reclaim 'cached' memory

От
Vincent Lasmarias
Дата:
Thanks for the informative responses and suggestions. My responses below:

* Sorry for the double post. I posted the original message using my gmail
account and got a "is not a member of any of the restrict_post groups"
response and when I didn't see it for a day, I ended up wondering if it was
due to my use of a gmail account - so I tried using my company email account
instead to post an updated version of the original post.

* Our system is not virtualized.

* Jeff, the output format of the load and free/cached memory did not come
from a tool but came from my script. My script does OEuptime; free ­m¹, and
then another script massages the data to only grab date, 1-minute load
average, free, and cached.

* For the 'top' outputs, I don't have the actual 'top' output, but I have
the following:

2014-05-30 00:01:01 procs -----------memory---------- ---swap-- -----io----
--system-- -----cpu-----
2014-05-30 00:01:01  r  b   swpd   free   buff  cache   si   so    bi    bo
in   cs us sy id wa st
(...)
2014-05-30 04:59:52  1  0      0 621976  92340 23431340    0    0     0
16756 3538 3723  7  2 91  0  0
2014-05-30 04:59:53  1  0      0 671896  92340 23431364    0    0     0
236 1933  825  2  1 97  0  0
2014-05-30 04:59:54  2  0      0 542964  92340 23433596    0    0  2148
300 3751 1394  6  1 92  0  0
2014-05-30 04:59:55  0  0      0 566140  92340 23433616    0    0     0
192 3485 1465  6  1 94  0  0
2014-05-30 04:59:56  2  0      0 614348  92340 23433760    0    0     0
424 3238 4278  4  1 95  0  0
2014-05-30 04:59:57  4  0      0 408812  92340 23433792    0    0     8
944 6249 12512 12  2 86  0  0
2014-05-30 04:59:58  3  0      0 471716  92356 23434012    0    0     0
440 9028 4164 13  1 86  0  0
2014-05-30 04:59:59  4  0      0 380988  92356 23434428    0    0     0
248 10009 10967 15  3 83  0  0
2014-05-30 05:00:00  6  0      0 462052  92356 23434904    0    0     0
960 7260 9242 12  2 85  0  0
2014-05-30 05:00:01  5  0      0 409796  92360 23435224    0    0    96
1860 11475 95765 18  7 75  0  0
2014-05-30 05:00:02 10  0      0 221464  92360 23433868    0    0   428
10800 13933 128264 23  9 67  0  0
2014-05-30 05:00:03 12  0      0 231444  91956 23355644    0    0     0
1480 26651 10817 10 35 54  0  0
2014-05-30 05:00:04 11  0      0 385672  91508 23254120    0    0     0
3096 30849 44776 22 28 50  0  0
2014-05-30 05:00:05  9  0      0 408932  91508 23270216    0    0     0
1996 21925 24978 12 26 63  0  0
2014-05-30 05:00:06 10  0      0 373992  91508 23270580    0    0     0
2160 25778 5994 11 31 58  0  0
2014-05-30 05:00:07  5  0      0 457900  91508 23270688    0    0     0
6080 25185 11705 14 21 65  0  0
2014-05-30 05:00:08  0  0      0 658300  91508 23270804    0    0     0
2089 5989 5849 11  2 88  0  0
2014-05-30 05:00:09  1  0      0 789972  91508 23270928    0    0     0
2508 2346 2550  2  1 97  0  0
2014-05-30 05:00:10  0  0      0 845736  91508 23274260    0    0    12
1728 2109 1494  2  1 97  0  0
2014-05-30 05:00:11  3  0      0 686352  91516 23274644    0    0     8
2100 4039 5288  6  2 92  0  0
2014-05-30 05:00:12 11  1      0 447636  91516 23274808    0    0   520
1436 10299 50523 24  7 68  1  0
2014-05-30 05:00:13 13  0      0 352380  91516 23276220    0    0  1060
816 18283 18682 15 36 48  1  0
2014-05-30 05:00:14 12  0      0 356720  88880 23179276    0    0   704
868 16193 140313 36 12 51  1  0
2014-05-30 05:00:15  5  0      0 513784  88880 23173344    0    0  2248
748 12350 21178 30  6 62  2  0
2014-05-30 05:00:16  2  0      0 623020  88884 23175808    0    0  1568
500 5841 4999 12  2 86  1  0
2014-05-30 05:00:17  5  0      0 590488  88884 23175844    0    0    24
584 6573 4905 14  2 84  0  0
2014-05-30 05:00:18  3  0      0 632408  88884 23176116    0    0     0
496 6846 4358 14  2 84  0  0
2014-05-30 05:00:19  5  0      0 596716  88884 23176948    0    0   656
668 7135 5262 14  3 83  0  0
2014-05-30 05:00:20  6  0      0 558692  88884 23179964    0    0  2816
1012 8566 7742 17  4 79  0  0
2014-05-30 05:00:21  7  1      0 476580  88888 23181200    0    0  1272
968 11240 14308 23  6 71  1  0
2014-05-30 05:00:22  8  0      0 695396  88888 23183028    0    0   728
1128 9751 7121 22  4 74  1  0
2014-05-30 05:00:23  9  0      0 536084  88888 23199080    0    0   392
1024 12523 22269 26  6 68  0  0
2014-05-30 05:00:24 13  0      0 416296  88888 23200416    0    0    40
1000 16319 61822 29 21 51  0  0
2014-05-30 05:00:25 14  0      0 386904  88888 23200704    0    0    24
816 20850 4424 16 38 46  0  0
2014-05-30 05:00:26 17  0      0 334688  88896 23201028    0    0    24
1000 26758 16934 24 46 30  0  0
2014-05-30 05:00:27 18  0      0 307304  88896 23193928    0    0     0
1068 27051 67778 21 46 33  0  0
2014-05-30 05:00:28 20  1      0 295560  88896 23162456    0    0     0
860 31012 27787 15 67 18  0  0
2014-05-30 05:00:29 22  1      0 281272  88896 23153312    0    0    16
928 28899 2857  9 78 13  0  0
2014-05-30 05:00:30 26  0      0 400804  87976 22979324    0    0     0
1536 37689 4368  9 88  3  0  0
2014-05-30 05:00:31 27  0      0 395588  87976 22979412    0    0     0
1564 29195 4305  8 92  0  0  0
2014-05-30 05:00:32 25  0      0 353176  87976 22979592    0    0    24
9404 29845 15845 13 85  2  0  0
2014-05-30 05:00:33 28  0      0 318680  87976 22979776    0    0     0
1588 28097 3372  9 91  0  0  0
2014-05-30 05:00:34 27  0      0 304676  87352 22969136    0    0     0
1480 29387 4330 10 90  0  0  0
2014-05-30 05:00:35 27  0      0 337960  79784 22900220    0    0    48
924 38334 8253 13 86  1  0  0
2014-05-30 05:00:36 29  0      0 297308  79788 22898608    0    0     0
952 30865 3067 10 90  0  0  0
2014-05-30 05:00:37 33  0      0 282612  79624 22801728    0    0     8
17169 29197 6855 13 87  0  0  0
2014-05-30 05:00:38 32  0      0 224140  79624 22717680    0    0     0
908 29640 6579 14 86  0  0  0
2014-05-30 05:00:39 29  0      0 304456  79624 22661712    0    0     0
696 31838 3563 10 90  0  0  0
2014-05-30 05:00:40 32  0      0 322080  79624 22657328    0    0     0
396 29206 2274  4 96  0  0  0
2014-05-30 05:00:41 33  0      0 309184  79624 22649172    0    0     0
660 27084 2385  4 96  0  0  0
2014-05-30 05:00:42 34  0      0 285348  79624 22646216    0    0     0
4316 26418 2571  3 97  0  0  0
2014-05-30 05:00:43 37  0      0 273732  79624 22632920    0    0     0
740 26728 2663  3 97  0  0  0
2014-05-30 05:00:44 38  0      0 276304  79624 22632640    0    0     0
464 26376 2593  3 97  0  0  0
2014-05-30 05:00:45 41  0      0 257480  79624 22624828    0    0     0
448 26505 3516  4 96  0  0  0
2014-05-30 05:00:46 32  0      0 305108  79608 22559216    0    0     0
304 29610 6162  8 92  0  0  0
2014-05-30 05:00:47 46  0      0 310504  79592 22487756    0    0     0
360 30559 10591 12 88  0  0  0
2014-05-30 05:00:48 47  0      0 286236  79596 22478168    0    0     0
312 25733 4250  5 95  0  0  0
2014-05-30 05:00:49 48  0      0 300284  79596 22477684    0    0     0
356 26464 4690  5 95  0  0  0
2014-05-30 05:00:50 52  0      0 276896  79588 22449560    0    0   124
280 26238 4132  4 96  0  0  0
2014-05-30 05:00:51 60  0      0 234836  79568 22406932    0    0     4
344 26446 5055  5 95  0  0  0
2014-05-30 05:00:52 62  0      0 247304  79564 22351916    0    0    12
452 26807 4694  3 97  0  0  0
2014-05-30 05:00:53 63  0      0 231892  79564 22347368    0    0     0
308 25378 4376  3 97  0  0  0
2014-05-30 05:00:54 67  0      0 236056  79564 22309368    0    0     0
156 25737 4022  3 97  0  0  0
2014-05-30 05:00:55 66  0      0 232984  79564 22286336    0    0     0
216 25393 3874  2 98  0  0  0
2014-05-30 05:00:56 67  0      0 240720  79560 22267736    0    0     0
588 25944 4678  2 98  0  0  0
2014-05-30 05:00:57 70  0      0 242836  79540 22232068    0    0     0
16800 26058 4607  3 97  0  0  0
2014-05-30 05:00:58 72  0      0 234944  79548 22224948    0    0     0
608 25589 4687  2 98  0  0  0
2014-05-30 05:00:59 73  0      0 236064  79536 22173496    0    0     0
188 25747 4530  3 97  0  0  0
2014-05-30 05:01:00 77  0      0 232708  79524 22135168    0    0     0
304 25546 5247  3 97  0  0  0
2014-05-30 05:01:01 72  0      0 269328  79528 22117528    0    0    24
396 27545 8488  5 95  0  0  0
2014-05-30 05:01:02 83  0      0 220892  79496 22043024    0    0     0
3280 28665 9805  7 94  0  0  0
2014-05-30 05:01:03 86  0      0 224004  79488 21995400    0    0     0
440 26338 6090  3 97  0  0  0
2014-05-30 05:01:04 90  0      0 249684  79476 21932468    0    0     0
408 26341 5834  3 97  0  0  0
2014-05-30 05:01:05 91  0      0 257336  79464 21883060    0    0     0
380 26272 5717  2 98  0  0  0
2014-05-30 05:01:06 98  0      0 242896  79468 21878016    0    0     0
608 25628 5648  2 98  0  0  0
2014-05-30 05:01:07 94  0      0 237276  79468 21876148    0    0     0
908 25041 5883  3 98  0  0  0
2014-05-30 05:01:08 99  0      0 225832  79488 21858572    0    0    24
504 25271 5913  3 97  0  0  0
2014-05-30 05:01:09 94  0      0 245796  79460 21812404    0    0     0
264 25106 6189  3 97  0  0  0
2014-05-30 05:01:10 94  0      0 246268  79460 21811144    0    0     0
188 25087 5989  2 98  0  0  0
2014-05-30 05:01:11 94  0      0 232900  79456 21775152    0    0     0
168 24965 5949  3 97  0  0  0
2014-05-30 05:01:12 100  0      0 227900  79428 21737032    0    0    32
276 25560 6798  4 96  0  0  0
2014-05-30 05:01:13 98  0      0 253644  79396 21713908    0    0     0
552 27440 9052  5 95  0  0  0
2014-05-30 05:01:14 104  0      0 269648  79376 21634336    0    0     0
540 26100 6633  3 97  0  0  0
2014-05-30 05:01:15 104  0      0 259436  79376 21622164    0    0     0
368 25094 6417  2 98  0  0  0
2014-05-30 05:01:16 105  0      0 262596  79372 21616292    0    0     0
200 24995 6276  2 98  0  0  0
2014-05-30 05:01:17 109  0      0 232172  79360 21583800    0    0     0
388 25112 6570  3 97  0  0  0
2014-05-30 05:01:18 109  0      0 231628  79372 21566604    0    0     0
364 25221 6644  2 98  0  0  0
2014-05-30 05:01:19 110  0      0 223920  79372 21532992    0    0     0
340 25383 6874  3 97  0  0  0
2014-05-30 05:01:20 111  0      0 223028  79368 21501868    0    0     0
288 25369 6465  2 98  0  0  0
2014-05-30 05:01:21 113  0      0 211676  79352 21434584    0    0     0
240 24939 6607  4 96  0  0  0
2014-05-30 05:01:22 114  0      0 211020  79300 21327264    0    0     0
308 25390 7239  5 95  0  0  0
2014-05-30 05:01:23 110  0      0 213524  79256 21215612    0    0    40
336 25494 7878  7 93  0  0  0
2014-05-30 05:01:24 114  0      0 222748  79220 21107976    0    0     0
276 25257 7032  7 93  0  0  0
2014-05-30 05:01:25 115  0      0 262004  79160 21012468    0    0     0
300 25986 6746  8 92  0  0  0
(...)

* For the OEperf¹ outputs, I¹m still waiting for a time where I can gather
it. Unfortunately, this is on a Production system, and we had to quickly
look for a workaround ­ the workaround we have found is: when the OEfree¹
memory drops below a certain threshold, we run: /bin/sync && /bin/echo 3 >
/proc/sys/vm/drop_caches . Since we¹ve been doing that, we haven¹t had the
high load issue.

* I forgot to mention that the debugging info I posted came from our slave
server (the master and slave have the same specs, but different OS version ­
master has 2.6.18-371.3.1.el5 #1 SMP Thu Dec 5 12:47:02 EST 2013). We
actually first saw the issue on our master server and not the slave server.
To get the master going, we edited our code to move our heaviest queries so
that they hit only the slave. After we did that, the slave started having
the high load issues, but with one difference - on the slave, 'cached' was
high, but as can be seen in the master debugging session below, on the
master, cached was low (corresponds to what we had for shared_buffers), and
when 'free' dropped, it looks like the OS managed to grab the memory from
'used' (do note that we will be adding more memory to our servers shortly):

09:24 load average: 8.76 used: 32166 free: 31851 shared: 314 buffers: 0
cached: 8827
09:25 load average: 6.30 used: 31840 free: 325 shared: 0 buffers: 13 cached:
8851
09:26 load average: 95.83 used: 17883 free: 14282 shared:  0 buffers: 2
cached: 8563
(...)
10:00 load average: 4.45 used: 31945 free: 220 shared: 0 buffers: 15 cached:
8862
10:01 load average: 4.56 used: 31983 free: 182 shared: 0 buffers: 15 cached:
8844
10:02 load average: 5.34 used: 31983 free: 183 shared: 0 buffers: 14 cached:
8832
10:03 load average: 6.57 used: 31987 free: 179 shared: 0 buffers: 9 cached:
8709
10:04 load average: 71.21 used: 18095 free: 14071 shared: 0 buffers: 1
cached: 8556
(...)
10:40 load average: 5.38 used: 31970 free: 196 shared: 0 buffers: 12 cached:
8929
10:41 load average: 6.10 used: 31889 free: 276 shared: 0 buffers: 13 cached:
8804
10:42 load average: 6.94 used: 31984 free: 182 shared: 0 buffers: 2 cached:
8761
10:43 load average: 13.90 used: 31777 free: 389 shared: 0 buffers: 2 cached:
8555
10:44 load average: 54.30 used: 18894 free: 13272 shared: 0 buffers: 4
cached: 8592
(...)
11:21 load average: 5.54 used: 31985 free: 181 shared: 0 buffers: 11 cached:
8764
11:22 load average: 5.15 used: 31768 free: 397 shared: 0 buffers: 10 cached:
8721
11:23 load average: 5.62 used: 31901 free: 265 shared: 0 buffers: 11 cached:
8742
11:24 load average: 4.80 used: 31969 free: 196 shared: 0 buffers: 9 cached:
8675
11:25 load average: 53.74 used: 18644 free: 13522 shared: 0 buffers: 1
cached: 8578

More detailed output from our master server:

2014-05-29 00:01:01 procs -----------memory---------- ---swap-- -----io----
--system-- -----cpu------
2014-05-29 00:01:01  r  b   swpd   free   buff  cache   si   so    bi    bo
in   cs us sy id wa st
(...)
2014-05-29 09:24:41 10  0   7044 364616  14048 9055060    0    0     4  1908
7115 11287 29  1 70  0  0
2014-05-29 09:24:42  7  0   7044 360160  14072 9057648    0    0  2064   796
7512 11746 34  2 64  0  0
2014-05-29 09:24:43  2  0   7044 339576  14076 9057644    0    0    12   824
7143 11633 25  1 74  0  0
2014-05-29 09:24:44  3  0   7044 333120  14080 9058096    0    0     4   696
5762 7558 17  1 82  0  0
2014-05-29 09:24:45  2  0   7044 332500  14080 9058096    0    0     0   592
5138 6043 13  1 86  0  0
2014-05-29 09:24:46  9  0   7044 322008  14092 9058500    0    0    12  1800
6545 10885 22  2 76  0  0
2014-05-29 09:24:47  6  0   7044 316428  14124 9058468    0    0    24   900
7371 12690 34  2 64  0  0
2014-05-29 09:24:48  8  0   7044 309608  14128 9058984    0    0    28   920
7088 10422 23  1 76  0  0
2014-05-29 09:24:49  1  0   7044 309236  14128 9058984    0    0     8   856
6898 10685 24  1 75  0  0
2014-05-29 09:24:50  4  0   7044 295140  14128 9059392    0    0    16   904
6977 11955 25  2 73  0  0
2014-05-29 09:24:51  3  0   7044 294700  14148 9059372    0    0    20  2176
6471 9320 19  1 80  0  0
2014-05-29 09:24:52  1  0   7044 293392  14172 9059816    0    0    72   804
6836 9151 18  1 81  0  0
2014-05-29 09:24:53  7  0   7044 297460  14176 9059812    0    0    12   760
5914 8573 22  2 76  0  0
2014-05-29 09:24:54  2  0   7044 305768  14180 9060228    0    0     4   680
6204 9062 21  1 78  0  0
2014-05-29 09:24:55  4  0   7044 299824  14188 9062396    0    0  2072   800
6533 9405 21  1 78  0  0
2014-05-29 09:24:56  7  0   7044 352600  14216 9062884    0    0   212  1612
7084 10758 28  2 70  0  0
2014-05-29 09:24:57 10  0   7044 350244  14232 9062868    0    0    16   844
7873 14913 45  4 52  0  0
2014-05-29 09:24:58  6  0   7044 348260  14236 9063268    0    0     4   616
6029 8907 36  2 62  0  0
2014-05-29 09:24:59  6  0   7044 335488  14248 9063256    0    0    28   840
7111 10687 27  1 72  0  0
2014-05-29 09:25:00  9  0   7044 343672  14248 9063612    0    0    16  1016
7768 13125 29  2 70  0  0
2014-05-29 09:25:01  8  0   7044 341564  14284 9063576    0    0     4  2156
6498 8663 18  1 82  0  0
2014-05-29 09:25:02  8  0   7044 333756  14336 9064004    0    0    12   952
5848 7911 15  1 84  0  0
2014-05-29 09:25:03  8  1   7044 277940  14352 9063988    0    0    40   784
6502 11541 23  2 75  0  0
2014-05-29 09:25:04  7  0   7044 192320  14376 9064564    0    0    84   896
7373 18005 33  3 63  0  0
2014-05-29 09:25:05  2  0   7044 311492  14376 9064564    0    0     0   984
6858 10804 24  1 75  0  0
2014-05-29 09:25:06 12  0   7044 305456  14380 9064980    0    0    12  3200
6969 10565 25  2 73  0  0
2014-05-29 09:25:07  5  0   7044 303596  14420 9064940    0    0    40  1348
7110 10512 28  1 71  0  0
2014-05-29 09:25:08  4  0   7044 299124  14420 9065316    0    0     0  1040
7804 12488 26  1 72  0  0
2014-05-29 09:25:09  5  0   7044 281336  14428 9067484    0    0  2056   720
6217 9241 23  2 75  0  0
2014-05-29 09:25:10  8  0   7044 269892  14436 9067916    0    0    32  1000
7188 10796 24  1 75  0  0
2014-05-29 09:25:11  7  0   7044 267784  14436 9067916    0    0     8  1960
7967 12888 31  1 68  0  0
2014-05-29 09:25:12  9  0   7044 232936  14460 9068380    0    0    24  1112
6582 10024 23  1 76  0  0
2014-05-29 09:25:13  4  0   7044 231076  14504 9068640    0    0   100   912
6939 12254 26  2 71  0  0
2014-05-29 09:25:14 10  0   7044 189040  14588 9068788    0    0   188  1080
6915 14296 32  4 64  0  0
2014-05-29 09:25:15  3  0   7044 185320  14596 9068780    0    0    16   448
3805 5329 26  2 73  0  0
2014-05-29 09:25:16  3  0   7044 181352  14604 9069276    0    0     8  1540
4239 5568 20  1 79  0  0
2014-05-29 09:25:17  3  0   7044 179616  14632 9069248    0    0    20   436
3957 4246 12  0 87  0  0
2014-05-29 09:25:18 14  0   7044 177756  14640 9069488    0    0     8   616
4961 6912 11  1 88  0  0
2014-05-29 09:25:19 31  0   7044 177316  14656 9069472    0    0    24  1992
11500 36077 75  4 21  0  0
2014-05-29 09:25:20  6  0   7044 165436  14524 9055920    0    0    28  1280
8124 25478 47  4 50  0  0
2014-05-29 09:25:21  3  0   7044 169144  14504 9054308    0    0    24  1696
6682 8512 20  1 79  0  0
2014-05-29 09:25:22  9  0   7044 173172  14524 9056504    0    0  2060   780
7241 10328 27  3 70  0  0
2014-05-29 09:25:23 10  0   7044 162756  14528 9056500    0    0    12   744
6599 10688 38  3 59  0  0
2014-05-29 09:25:24 12  0   7044 173760  14260 9044720    0    0    96  1196
6781 12568 44  9 47  0  0
2014-05-29 09:25:25 30  0   7044 168648  14040 9030188    0    0    12   664
6627 12749 55  7 38  0  0
2014-05-29 09:25:26 14  1   7044 176392  13932 9024712    0    0    16  1864
7724 17989 76  6 17  0  0
2014-05-29 09:25:27 19  0   7288 154564  13468 9003416    0    4     8  1836
7381 16152 54 17 28  0  0
2014-05-29 09:25:28 36  1   7288 154860  12084 8973284    0    0   588   736
6623 13916 42 42 16  0  0
2014-05-29 09:25:30 40  0   7288 154372  11492 8957252    0    0   580   896
5525 11853 47 49  4  0  0
2014-05-29 09:25:32 58  1   7288 154556  10724 8915168    0    0   328  2960
6951 8010 33 65  2  0  0
2014-05-29 09:25:34 44  1   7288 172180  10636 8901096    0    0   264  1860
4392 5677 23 76  1  0  0
2014-05-29 09:25:35 36  1   7288 154224  10272 8892344    0    0   772   664
4171 7064 40 58  2  0  0
2014-05-29 09:25:36 34  0   7288 154108   9984 8885292    0    0   808   224
2222 3110 17 81  2  0  0
2014-05-29 09:25:37 20  0   7320 154612   9844 8881288    0   24     0  1544
1423 1115  9 81 10  0  0
2014-05-29 09:25:38 23  0   7320 154860   9420 8874228    0    8   120   876
1206 3054  1 80 19  0  0
2014-05-29 09:25:39 23  1   7320 166132   9244 8872204    0    0   164   112
2078 2616 13 82  4  0  0
2014-05-29 09:25:40 46  1   7320 154612   8908 8861160    0    0   448    80
1716 10859 10 73 16  1  0
2014-05-29 09:25:41 42  0   7448 154312   7804 8848784    0  244    36  2124
2149 3477 11 89  0  0  0
2014-05-29 09:25:42 36  0   7448 154416   6576 8834556    0    0    92  1104
3077 6462 26 74  0  0  0
2014-05-29 09:25:43 64  2   7448 154644   5492 8821736    0    0   140   332
2175 3466 11 78 10  0  0
2014-05-29 09:25:44 49  0   7448 154364   4608 8811244    0    0   128    80
1939 2115 10 90  0  0  0
2014-05-29 09:25:49 34  0   7448 155792   2932 8783364    0    0  1624  2112
11642 12972 15 84  1  0  0
2014-05-29 09:25:50 27  0   7448 164700   2744 8781504    0  128   228   240
1585 2297  9 89  2  0  0
2014-05-29 09:25:51 29  0   7692 154540   2516 8778032    0    0     0   244
1842 5052 10 86  4  0  0
2014-05-29 09:25:55 35  0   7812 160476   1784 8766552    0  332   384   648
6206 5487  6 92  2  0  0
2014-05-29 09:25:56 30  1   7812 160380   1696 8765552    0    0     4     0
1000  412  0 89 11  0  0
2014-05-29 09:25:57 27  1   7812 154352   1900 8768940    0    0  5184   496
1197  946  0 89  9  1  0
2014-05-29 09:25:58 19  0   7816 154136   1764 8767872    0    0     0     0
1021  297  0 83 16  0  0
2014-05-29 09:25:59 19  0   8024 154564   1724 8765880    0    0    44  1888
1116 1024  0 80 20  0  0
2014-05-29 09:26:03 42  1  10676 154140   1332 8760968    0   32   932   344
4871 10038  1 82 17  0  0
2014-05-29 09:26:04 39  0  10676 154284   1332 8758704    0    0    12    68
1194  560  1 99  0  0  0
2014-05-29 09:26:18 67  1  10312 11751464   1388 8764704  480 2756  9836
5932 26274 32175  6 94  0  0  0
2014-05-29 09:26:19 82  1  10312 13208352   1464 8766044    0  120  1632
772 3121 8808 17 83  0  0  0
2014-05-29 09:26:20 54  1  10228 14452424   1600 8766600   64    0  1628
404 3528 7679 15 84  0  0  0
2014-05-29 09:26:21 12  2  10212 14493080   2060 8768192    0   32  2176
1664 5697 38657 24 57 17  2  0
2014-05-29 09:26:22 14  1  10212 14195536   3056 8772540    0    0  2720
624 7115 93905 38 10 48  3  0
2014-05-29 09:26:23 24  1  10212 14144780   3576 8775768    0    0  3184
900 7854 104139 36  6 54  3  0



Re: CPU load spikes when CentOS tries to reclaim 'cached' memory

От
Steven Crandell
Дата:
I saw similar behavior a while back on a new PG 9.1 instance (also coincidentally with high max_connections) running Centos 2.6.32 (also not virtualized).
In particular the bit about free memory seeming to drop faster than the system could reclaim cache caught my eye.

After lots of research focused on NUMA as well as bumping the box up to 80 cores and 512GB RAM for a DB that was then only something like 650GB(?) I found and experimented with increasing vm.min_free_kbytes.
This was intended to give the kernel enough free memory to gobble up in a hurry when it needed it, without having to stop what it was doing to actively reclaim cached memory for a waiting process.
When the kernel was getting into this active cache reclamation business I saw performance troughs that correlated with CPU spikes.  
Loads in the hundreds, free memory nearly exhausted, the whole bit.

Here's a little more info about vm.min_free_kbytes:

In my case, increasing vm.min_free_kbytes helped a bit but wasn't the game changer I expected it to be.

Ultimately we tracked all of these problems back to a poor hibernate connection pooler implementation.  If memory serves, the pooler code was getting sideways at GC and then looping which in turn was sending connection requests and statement binds in large bursts.
Actually proving the connection pooler to be the problem was a torturous process however, as all symptoms of this problem seemed to express themselves on the DB.

Once the pooler was fixed the symptoms on the DB server vanished and haven't been back even as the application load has increased.

I'm happy to share more details if any of that seems to line up with your situation.








On Thu, Jun 5, 2014 at 4:57 PM, Vincent Lasmarias <vlasmarias@contigo.com> wrote:
Thanks for the informative responses and suggestions. My responses below:

* Sorry for the double post. I posted the original message using my gmail
account and got a "is not a member of any of the restrict_post groups"
response and when I didn't see it for a day, I ended up wondering if it was
due to my use of a gmail account - so I tried using my company email account
instead to post an updated version of the original post.

* Our system is not virtualized.

* Jeff, the output format of the load and free/cached memory did not come
from a tool but came from my script. My script does OEuptime; free ­m¹, and
then another script massages the data to only grab date, 1-minute load
average, free, and cached.

* For the 'top' outputs, I don't have the actual 'top' output, but I have
the following:

2014-05-30 00:01:01 procs -----------memory---------- ---swap-- -----io----
--system-- -----cpu-----
2014-05-30 00:01:01  r  b   swpd   free   buff  cache   si   so    bi    bo
in   cs us sy id wa st
(...)
2014-05-30 04:59:52  1  0      0 621976  92340 23431340    0    0     0
16756 3538 3723  7  2 91  0  0
2014-05-30 04:59:53  1  0      0 671896  92340 23431364    0    0     0
236 1933  825  2  1 97  0  0
2014-05-30 04:59:54  2  0      0 542964  92340 23433596    0    0  2148
300 3751 1394  6  1 92  0  0
2014-05-30 04:59:55  0  0      0 566140  92340 23433616    0    0     0
192 3485 1465  6  1 94  0  0
2014-05-30 04:59:56  2  0      0 614348  92340 23433760    0    0     0
424 3238 4278  4  1 95  0  0
2014-05-30 04:59:57  4  0      0 408812  92340 23433792    0    0     8
944 6249 12512 12  2 86  0  0
2014-05-30 04:59:58  3  0      0 471716  92356 23434012    0    0     0
440 9028 4164 13  1 86  0  0
2014-05-30 04:59:59  4  0      0 380988  92356 23434428    0    0     0
248 10009 10967 15  3 83  0  0
2014-05-30 05:00:00  6  0      0 462052  92356 23434904    0    0     0
960 7260 9242 12  2 85  0  0
2014-05-30 05:00:01  5  0      0 409796  92360 23435224    0    0    96
1860 11475 95765 18  7 75  0  0
2014-05-30 05:00:02 10  0      0 221464  92360 23433868    0    0   428
10800 13933 128264 23  9 67  0  0
2014-05-30 05:00:03 12  0      0 231444  91956 23355644    0    0     0
1480 26651 10817 10 35 54  0  0
2014-05-30 05:00:04 11  0      0 385672  91508 23254120    0    0     0
3096 30849 44776 22 28 50  0  0
2014-05-30 05:00:05  9  0      0 408932  91508 23270216    0    0     0
1996 21925 24978 12 26 63  0  0
2014-05-30 05:00:06 10  0      0 373992  91508 23270580    0    0     0
2160 25778 5994 11 31 58  0  0
2014-05-30 05:00:07  5  0      0 457900  91508 23270688    0    0     0
6080 25185 11705 14 21 65  0  0
2014-05-30 05:00:08  0  0      0 658300  91508 23270804    0    0     0
2089 5989 5849 11  2 88  0  0
2014-05-30 05:00:09  1  0      0 789972  91508 23270928    0    0     0
2508 2346 2550  2  1 97  0  0
2014-05-30 05:00:10  0  0      0 845736  91508 23274260    0    0    12
1728 2109 1494  2  1 97  0  0
2014-05-30 05:00:11  3  0      0 686352  91516 23274644    0    0     8
2100 4039 5288  6  2 92  0  0
2014-05-30 05:00:12 11  1      0 447636  91516 23274808    0    0   520
1436 10299 50523 24  7 68  1  0
2014-05-30 05:00:13 13  0      0 352380  91516 23276220    0    0  1060
816 18283 18682 15 36 48  1  0
2014-05-30 05:00:14 12  0      0 356720  88880 23179276    0    0   704
868 16193 140313 36 12 51  1  0
2014-05-30 05:00:15  5  0      0 513784  88880 23173344    0    0  2248
748 12350 21178 30  6 62  2  0
2014-05-30 05:00:16  2  0      0 623020  88884 23175808    0    0  1568
500 5841 4999 12  2 86  1  0
2014-05-30 05:00:17  5  0      0 590488  88884 23175844    0    0    24
584 6573 4905 14  2 84  0  0
2014-05-30 05:00:18  3  0      0 632408  88884 23176116    0    0     0
496 6846 4358 14  2 84  0  0
2014-05-30 05:00:19  5  0      0 596716  88884 23176948    0    0   656
668 7135 5262 14  3 83  0  0
2014-05-30 05:00:20  6  0      0 558692  88884 23179964    0    0  2816
1012 8566 7742 17  4 79  0  0
2014-05-30 05:00:21  7  1      0 476580  88888 23181200    0    0  1272
968 11240 14308 23  6 71  1  0
2014-05-30 05:00:22  8  0      0 695396  88888 23183028    0    0   728
1128 9751 7121 22  4 74  1  0
2014-05-30 05:00:23  9  0      0 536084  88888 23199080    0    0   392
1024 12523 22269 26  6 68  0  0
2014-05-30 05:00:24 13  0      0 416296  88888 23200416    0    0    40
1000 16319 61822 29 21 51  0  0
2014-05-30 05:00:25 14  0      0 386904  88888 23200704    0    0    24
816 20850 4424 16 38 46  0  0
2014-05-30 05:00:26 17  0      0 334688  88896 23201028    0    0    24
1000 26758 16934 24 46 30  0  0
2014-05-30 05:00:27 18  0      0 307304  88896 23193928    0    0     0
1068 27051 67778 21 46 33  0  0
2014-05-30 05:00:28 20  1      0 295560  88896 23162456    0    0     0
860 31012 27787 15 67 18  0  0
2014-05-30 05:00:29 22  1      0 281272  88896 23153312    0    0    16
928 28899 2857  9 78 13  0  0
2014-05-30 05:00:30 26  0      0 400804  87976 22979324    0    0     0
1536 37689 4368  9 88  3  0  0
2014-05-30 05:00:31 27  0      0 395588  87976 22979412    0    0     0
1564 29195 4305  8 92  0  0  0
2014-05-30 05:00:32 25  0      0 353176  87976 22979592    0    0    24
9404 29845 15845 13 85  2  0  0
2014-05-30 05:00:33 28  0      0 318680  87976 22979776    0    0     0
1588 28097 3372  9 91  0  0  0
2014-05-30 05:00:34 27  0      0 304676  87352 22969136    0    0     0
1480 29387 4330 10 90  0  0  0
2014-05-30 05:00:35 27  0      0 337960  79784 22900220    0    0    48
924 38334 8253 13 86  1  0  0
2014-05-30 05:00:36 29  0      0 297308  79788 22898608    0    0     0
952 30865 3067 10 90  0  0  0
2014-05-30 05:00:37 33  0      0 282612  79624 22801728    0    0     8
17169 29197 6855 13 87  0  0  0
2014-05-30 05:00:38 32  0      0 224140  79624 22717680    0    0     0
908 29640 6579 14 86  0  0  0
2014-05-30 05:00:39 29  0      0 304456  79624 22661712    0    0     0
696 31838 3563 10 90  0  0  0
2014-05-30 05:00:40 32  0      0 322080  79624 22657328    0    0     0
396 29206 2274  4 96  0  0  0
2014-05-30 05:00:41 33  0      0 309184  79624 22649172    0    0     0
660 27084 2385  4 96  0  0  0
2014-05-30 05:00:42 34  0      0 285348  79624 22646216    0    0     0
4316 26418 2571  3 97  0  0  0
2014-05-30 05:00:43 37  0      0 273732  79624 22632920    0    0     0
740 26728 2663  3 97  0  0  0
2014-05-30 05:00:44 38  0      0 276304  79624 22632640    0    0     0
464 26376 2593  3 97  0  0  0
2014-05-30 05:00:45 41  0      0 257480  79624 22624828    0    0     0
448 26505 3516  4 96  0  0  0
2014-05-30 05:00:46 32  0      0 305108  79608 22559216    0    0     0
304 29610 6162  8 92  0  0  0
2014-05-30 05:00:47 46  0      0 310504  79592 22487756    0    0     0
360 30559 10591 12 88  0  0  0
2014-05-30 05:00:48 47  0      0 286236  79596 22478168    0    0     0
312 25733 4250  5 95  0  0  0
2014-05-30 05:00:49 48  0      0 300284  79596 22477684    0    0     0
356 26464 4690  5 95  0  0  0
2014-05-30 05:00:50 52  0      0 276896  79588 22449560    0    0   124
280 26238 4132  4 96  0  0  0
2014-05-30 05:00:51 60  0      0 234836  79568 22406932    0    0     4
344 26446 5055  5 95  0  0  0
2014-05-30 05:00:52 62  0      0 247304  79564 22351916    0    0    12
452 26807 4694  3 97  0  0  0
2014-05-30 05:00:53 63  0      0 231892  79564 22347368    0    0     0
308 25378 4376  3 97  0  0  0
2014-05-30 05:00:54 67  0      0 236056  79564 22309368    0    0     0
156 25737 4022  3 97  0  0  0
2014-05-30 05:00:55 66  0      0 232984  79564 22286336    0    0     0
216 25393 3874  2 98  0  0  0
2014-05-30 05:00:56 67  0      0 240720  79560 22267736    0    0     0
588 25944 4678  2 98  0  0  0
2014-05-30 05:00:57 70  0      0 242836  79540 22232068    0    0     0
16800 26058 4607  3 97  0  0  0
2014-05-30 05:00:58 72  0      0 234944  79548 22224948    0    0     0
608 25589 4687  2 98  0  0  0
2014-05-30 05:00:59 73  0      0 236064  79536 22173496    0    0     0
188 25747 4530  3 97  0  0  0
2014-05-30 05:01:00 77  0      0 232708  79524 22135168    0    0     0
304 25546 5247  3 97  0  0  0
2014-05-30 05:01:01 72  0      0 269328  79528 22117528    0    0    24
396 27545 8488  5 95  0  0  0
2014-05-30 05:01:02 83  0      0 220892  79496 22043024    0    0     0
3280 28665 9805  7 94  0  0  0
2014-05-30 05:01:03 86  0      0 224004  79488 21995400    0    0     0
440 26338 6090  3 97  0  0  0
2014-05-30 05:01:04 90  0      0 249684  79476 21932468    0    0     0
408 26341 5834  3 97  0  0  0
2014-05-30 05:01:05 91  0      0 257336  79464 21883060    0    0     0
380 26272 5717  2 98  0  0  0
2014-05-30 05:01:06 98  0      0 242896  79468 21878016    0    0     0
608 25628 5648  2 98  0  0  0
2014-05-30 05:01:07 94  0      0 237276  79468 21876148    0    0     0
908 25041 5883  3 98  0  0  0
2014-05-30 05:01:08 99  0      0 225832  79488 21858572    0    0    24
504 25271 5913  3 97  0  0  0
2014-05-30 05:01:09 94  0      0 245796  79460 21812404    0    0     0
264 25106 6189  3 97  0  0  0
2014-05-30 05:01:10 94  0      0 246268  79460 21811144    0    0     0
188 25087 5989  2 98  0  0  0
2014-05-30 05:01:11 94  0      0 232900  79456 21775152    0    0     0
168 24965 5949  3 97  0  0  0
2014-05-30 05:01:12 100  0      0 227900  79428 21737032    0    0    32
276 25560 6798  4 96  0  0  0
2014-05-30 05:01:13 98  0      0 253644  79396 21713908    0    0     0
552 27440 9052  5 95  0  0  0
2014-05-30 05:01:14 104  0      0 269648  79376 21634336    0    0     0
540 26100 6633  3 97  0  0  0
2014-05-30 05:01:15 104  0      0 259436  79376 21622164    0    0     0
368 25094 6417  2 98  0  0  0
2014-05-30 05:01:16 105  0      0 262596  79372 21616292    0    0     0
200 24995 6276  2 98  0  0  0
2014-05-30 05:01:17 109  0      0 232172  79360 21583800    0    0     0
388 25112 6570  3 97  0  0  0
2014-05-30 05:01:18 109  0      0 231628  79372 21566604    0    0     0
364 25221 6644  2 98  0  0  0
2014-05-30 05:01:19 110  0      0 223920  79372 21532992    0    0     0
340 25383 6874  3 97  0  0  0
2014-05-30 05:01:20 111  0      0 223028  79368 21501868    0    0     0
288 25369 6465  2 98  0  0  0
2014-05-30 05:01:21 113  0      0 211676  79352 21434584    0    0     0
240 24939 6607  4 96  0  0  0
2014-05-30 05:01:22 114  0      0 211020  79300 21327264    0    0     0
308 25390 7239  5 95  0  0  0
2014-05-30 05:01:23 110  0      0 213524  79256 21215612    0    0    40
336 25494 7878  7 93  0  0  0
2014-05-30 05:01:24 114  0      0 222748  79220 21107976    0    0     0
276 25257 7032  7 93  0  0  0
2014-05-30 05:01:25 115  0      0 262004  79160 21012468    0    0     0
300 25986 6746  8 92  0  0  0
(...)

* For the OEperf¹ outputs, I¹m still waiting for a time where I can gather
it. Unfortunately, this is on a Production system, and we had to quickly
look for a workaround ­ the workaround we have found is: when the OEfree¹
memory drops below a certain threshold, we run: /bin/sync && /bin/echo 3 >
/proc/sys/vm/drop_caches . Since we¹ve been doing that, we haven¹t had the
high load issue.

* I forgot to mention that the debugging info I posted came from our slave
server (the master and slave have the same specs, but different OS version ­
master has 2.6.18-371.3.1.el5 #1 SMP Thu Dec 5 12:47:02 EST 2013). We
actually first saw the issue on our master server and not the slave server.
To get the master going, we edited our code to move our heaviest queries so
that they hit only the slave. After we did that, the slave started having
the high load issues, but with one difference - on the slave, 'cached' was
high, but as can be seen in the master debugging session below, on the
master, cached was low (corresponds to what we had for shared_buffers), and
when 'free' dropped, it looks like the OS managed to grab the memory from
'used' (do note that we will be adding more memory to our servers shortly):

09:24 load average: 8.76 used: 32166 free: 31851 shared: 314 buffers: 0
cached: 8827
09:25 load average: 6.30 used: 31840 free: 325 shared: 0 buffers: 13 cached:
8851
09:26 load average: 95.83 used: 17883 free: 14282 shared:  0 buffers: 2
cached: 8563
(...)
10:00 load average: 4.45 used: 31945 free: 220 shared: 0 buffers: 15 cached:
8862
10:01 load average: 4.56 used: 31983 free: 182 shared: 0 buffers: 15 cached:
8844
10:02 load average: 5.34 used: 31983 free: 183 shared: 0 buffers: 14 cached:
8832
10:03 load average: 6.57 used: 31987 free: 179 shared: 0 buffers: 9 cached:
8709
10:04 load average: 71.21 used: 18095 free: 14071 shared: 0 buffers: 1
cached: 8556
(...)
10:40 load average: 5.38 used: 31970 free: 196 shared: 0 buffers: 12 cached:
8929
10:41 load average: 6.10 used: 31889 free: 276 shared: 0 buffers: 13 cached:
8804
10:42 load average: 6.94 used: 31984 free: 182 shared: 0 buffers: 2 cached:
8761
10:43 load average: 13.90 used: 31777 free: 389 shared: 0 buffers: 2 cached:
8555
10:44 load average: 54.30 used: 18894 free: 13272 shared: 0 buffers: 4
cached: 8592
(...)
11:21 load average: 5.54 used: 31985 free: 181 shared: 0 buffers: 11 cached:
8764
11:22 load average: 5.15 used: 31768 free: 397 shared: 0 buffers: 10 cached:
8721
11:23 load average: 5.62 used: 31901 free: 265 shared: 0 buffers: 11 cached:
8742
11:24 load average: 4.80 used: 31969 free: 196 shared: 0 buffers: 9 cached:
8675
11:25 load average: 53.74 used: 18644 free: 13522 shared: 0 buffers: 1
cached: 8578

More detailed output from our master server:

2014-05-29 00:01:01 procs -----------memory---------- ---swap-- -----io----
--system-- -----cpu------
2014-05-29 00:01:01  r  b   swpd   free   buff  cache   si   so    bi    bo
in   cs us sy id wa st
(...)
2014-05-29 09:24:41 10  0   7044 364616  14048 9055060    0    0     4  1908
7115 11287 29  1 70  0  0
2014-05-29 09:24:42  7  0   7044 360160  14072 9057648    0    0  2064   796
7512 11746 34  2 64  0  0
2014-05-29 09:24:43  2  0   7044 339576  14076 9057644    0    0    12   824
7143 11633 25  1 74  0  0
2014-05-29 09:24:44  3  0   7044 333120  14080 9058096    0    0     4   696
5762 7558 17  1 82  0  0
2014-05-29 09:24:45  2  0   7044 332500  14080 9058096    0    0     0   592
5138 6043 13  1 86  0  0
2014-05-29 09:24:46  9  0   7044 322008  14092 9058500    0    0    12  1800
6545 10885 22  2 76  0  0
2014-05-29 09:24:47  6  0   7044 316428  14124 9058468    0    0    24   900
7371 12690 34  2 64  0  0
2014-05-29 09:24:48  8  0   7044 309608  14128 9058984    0    0    28   920
7088 10422 23  1 76  0  0
2014-05-29 09:24:49  1  0   7044 309236  14128 9058984    0    0     8   856
6898 10685 24  1 75  0  0
2014-05-29 09:24:50  4  0   7044 295140  14128 9059392    0    0    16   904
6977 11955 25  2 73  0  0
2014-05-29 09:24:51  3  0   7044 294700  14148 9059372    0    0    20  2176
6471 9320 19  1 80  0  0
2014-05-29 09:24:52  1  0   7044 293392  14172 9059816    0    0    72   804
6836 9151 18  1 81  0  0
2014-05-29 09:24:53  7  0   7044 297460  14176 9059812    0    0    12   760
5914 8573 22  2 76  0  0
2014-05-29 09:24:54  2  0   7044 305768  14180 9060228    0    0     4   680
6204 9062 21  1 78  0  0
2014-05-29 09:24:55  4  0   7044 299824  14188 9062396    0    0  2072   800
6533 9405 21  1 78  0  0
2014-05-29 09:24:56  7  0   7044 352600  14216 9062884    0    0   212  1612
7084 10758 28  2 70  0  0
2014-05-29 09:24:57 10  0   7044 350244  14232 9062868    0    0    16   844
7873 14913 45  4 52  0  0
2014-05-29 09:24:58  6  0   7044 348260  14236 9063268    0    0     4   616
6029 8907 36  2 62  0  0
2014-05-29 09:24:59  6  0   7044 335488  14248 9063256    0    0    28   840
7111 10687 27  1 72  0  0
2014-05-29 09:25:00  9  0   7044 343672  14248 9063612    0    0    16  1016
7768 13125 29  2 70  0  0
2014-05-29 09:25:01  8  0   7044 341564  14284 9063576    0    0     4  2156
6498 8663 18  1 82  0  0
2014-05-29 09:25:02  8  0   7044 333756  14336 9064004    0    0    12   952
5848 7911 15  1 84  0  0
2014-05-29 09:25:03  8  1   7044 277940  14352 9063988    0    0    40   784
6502 11541 23  2 75  0  0
2014-05-29 09:25:04  7  0   7044 192320  14376 9064564    0    0    84   896
7373 18005 33  3 63  0  0
2014-05-29 09:25:05  2  0   7044 311492  14376 9064564    0    0     0   984
6858 10804 24  1 75  0  0
2014-05-29 09:25:06 12  0   7044 305456  14380 9064980    0    0    12  3200
6969 10565 25  2 73  0  0
2014-05-29 09:25:07  5  0   7044 303596  14420 9064940    0    0    40  1348
7110 10512 28  1 71  0  0
2014-05-29 09:25:08  4  0   7044 299124  14420 9065316    0    0     0  1040
7804 12488 26  1 72  0  0
2014-05-29 09:25:09  5  0   7044 281336  14428 9067484    0    0  2056   720
6217 9241 23  2 75  0  0
2014-05-29 09:25:10  8  0   7044 269892  14436 9067916    0    0    32  1000
7188 10796 24  1 75  0  0
2014-05-29 09:25:11  7  0   7044 267784  14436 9067916    0    0     8  1960
7967 12888 31  1 68  0  0
2014-05-29 09:25:12  9  0   7044 232936  14460 9068380    0    0    24  1112
6582 10024 23  1 76  0  0
2014-05-29 09:25:13  4  0   7044 231076  14504 9068640    0    0   100   912
6939 12254 26  2 71  0  0
2014-05-29 09:25:14 10  0   7044 189040  14588 9068788    0    0   188  1080
6915 14296 32  4 64  0  0
2014-05-29 09:25:15  3  0   7044 185320  14596 9068780    0    0    16   448
3805 5329 26  2 73  0  0
2014-05-29 09:25:16  3  0   7044 181352  14604 9069276    0    0     8  1540
4239 5568 20  1 79  0  0
2014-05-29 09:25:17  3  0   7044 179616  14632 9069248    0    0    20   436
3957 4246 12  0 87  0  0
2014-05-29 09:25:18 14  0   7044 177756  14640 9069488    0    0     8   616
4961 6912 11  1 88  0  0
2014-05-29 09:25:19 31  0   7044 177316  14656 9069472    0    0    24  1992
11500 36077 75  4 21  0  0
2014-05-29 09:25:20  6  0   7044 165436  14524 9055920    0    0    28  1280
8124 25478 47  4 50  0  0
2014-05-29 09:25:21  3  0   7044 169144  14504 9054308    0    0    24  1696
6682 8512 20  1 79  0  0
2014-05-29 09:25:22  9  0   7044 173172  14524 9056504    0    0  2060   780
7241 10328 27  3 70  0  0
2014-05-29 09:25:23 10  0   7044 162756  14528 9056500    0    0    12   744
6599 10688 38  3 59  0  0
2014-05-29 09:25:24 12  0   7044 173760  14260 9044720    0    0    96  1196
6781 12568 44  9 47  0  0
2014-05-29 09:25:25 30  0   7044 168648  14040 9030188    0    0    12   664
6627 12749 55  7 38  0  0
2014-05-29 09:25:26 14  1   7044 176392  13932 9024712    0    0    16  1864
7724 17989 76  6 17  0  0
2014-05-29 09:25:27 19  0   7288 154564  13468 9003416    0    4     8  1836
7381 16152 54 17 28  0  0
2014-05-29 09:25:28 36  1   7288 154860  12084 8973284    0    0   588   736
6623 13916 42 42 16  0  0
2014-05-29 09:25:30 40  0   7288 154372  11492 8957252    0    0   580   896
5525 11853 47 49  4  0  0
2014-05-29 09:25:32 58  1   7288 154556  10724 8915168    0    0   328  2960
6951 8010 33 65  2  0  0
2014-05-29 09:25:34 44  1   7288 172180  10636 8901096    0    0   264  1860
4392 5677 23 76  1  0  0
2014-05-29 09:25:35 36  1   7288 154224  10272 8892344    0    0   772   664
4171 7064 40 58  2  0  0
2014-05-29 09:25:36 34  0   7288 154108   9984 8885292    0    0   808   224
2222 3110 17 81  2  0  0
2014-05-29 09:25:37 20  0   7320 154612   9844 8881288    0   24     0  1544
1423 1115  9 81 10  0  0
2014-05-29 09:25:38 23  0   7320 154860   9420 8874228    0    8   120   876
1206 3054  1 80 19  0  0
2014-05-29 09:25:39 23  1   7320 166132   9244 8872204    0    0   164   112
2078 2616 13 82  4  0  0
2014-05-29 09:25:40 46  1   7320 154612   8908 8861160    0    0   448    80
1716 10859 10 73 16  1  0
2014-05-29 09:25:41 42  0   7448 154312   7804 8848784    0  244    36  2124
2149 3477 11 89  0  0  0
2014-05-29 09:25:42 36  0   7448 154416   6576 8834556    0    0    92  1104
3077 6462 26 74  0  0  0
2014-05-29 09:25:43 64  2   7448 154644   5492 8821736    0    0   140   332
2175 3466 11 78 10  0  0
2014-05-29 09:25:44 49  0   7448 154364   4608 8811244    0    0   128    80
1939 2115 10 90  0  0  0
2014-05-29 09:25:49 34  0   7448 155792   2932 8783364    0    0  1624  2112
11642 12972 15 84  1  0  0
2014-05-29 09:25:50 27  0   7448 164700   2744 8781504    0  128   228   240
1585 2297  9 89  2  0  0
2014-05-29 09:25:51 29  0   7692 154540   2516 8778032    0    0     0   244
1842 5052 10 86  4  0  0
2014-05-29 09:25:55 35  0   7812 160476   1784 8766552    0  332   384   648
6206 5487  6 92  2  0  0
2014-05-29 09:25:56 30  1   7812 160380   1696 8765552    0    0     4     0
1000  412  0 89 11  0  0
2014-05-29 09:25:57 27  1   7812 154352   1900 8768940    0    0  5184   496
1197  946  0 89  9  1  0
2014-05-29 09:25:58 19  0   7816 154136   1764 8767872    0    0     0     0
1021  297  0 83 16  0  0
2014-05-29 09:25:59 19  0   8024 154564   1724 8765880    0    0    44  1888
1116 1024  0 80 20  0  0
2014-05-29 09:26:03 42  1  10676 154140   1332 8760968    0   32   932   344
4871 10038  1 82 17  0  0
2014-05-29 09:26:04 39  0  10676 154284   1332 8758704    0    0    12    68
1194  560  1 99  0  0  0
2014-05-29 09:26:18 67  1  10312 11751464   1388 8764704  480 2756  9836
5932 26274 32175  6 94  0  0  0
2014-05-29 09:26:19 82  1  10312 13208352   1464 8766044    0  120  1632
772 3121 8808 17 83  0  0  0
2014-05-29 09:26:20 54  1  10228 14452424   1600 8766600   64    0  1628
404 3528 7679 15 84  0  0  0
2014-05-29 09:26:21 12  2  10212 14493080   2060 8768192    0   32  2176
1664 5697 38657 24 57 17  2  0
2014-05-29 09:26:22 14  1  10212 14195536   3056 8772540    0    0  2720
624 7115 93905 38 10 48  3  0
2014-05-29 09:26:23 24  1  10212 14144780   3576 8775768    0    0  3184
900 7854 104139 36  6 54  3  0



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: CPU load spikes when CentOS tries to reclaim 'cached' memory

От
Merlin Moncure
Дата:
On Thu, Jun 5, 2014 at 6:57 PM, Vincent Lasmarias
<vlasmarias@contigo.com> wrote:
> Thanks for the informative responses and suggestions. My responses below:
>
> * Sorry for the double post. I posted the original message using my gmail
> account and got a "is not a member of any of the restrict_post groups"
> response and when I didn't see it for a day, I ended up wondering if it was
> due to my use of a gmail account - so I tried using my company email account
> instead to post an updated version of the original post.
>
> * Our system is not virtualized.
>
> * Jeff, the output format of the load and free/cached memory did not come
> from a tool but came from my script. My script does OEuptime; free ­m¹, and
> then another script massages the data to only grab date, 1-minute load
> average, free, and cached.
>
> * For the 'top' outputs, I don't have the actual 'top' output, but I have
> the following:

All right; top is pretty clearly indicating a problem in the kernel.
Maybe this is numa or something else.  If you can reproduce this on
the 'newer' kernel, it might be useful to try and get a 'perf' going;
it was introduced in 2.6.31.  This might fail for one reason or
another though but if you could get it rigged it might give some clues
as to where in the kernel you're getting bound up.

Long term, looking into upgrading the O/S might be a smart move.

merlin