Discussion: heavy load, high CPU utilization
Dear all,

First of all, congratulations on your great work here; from time to time I've found many answers to my problems. Unfortunately, for this specific problem I didn't find much relevant information, so I would ask for your guidance with the following situation.

We have a dedicated server (8.4.4, Red Hat) with 24 CPUs and 36 GB of RAM. I would say that the traffic on the server is huge and the CPU utilization is pretty high too (avg ~75%, except during the nights when it is much lower). I am trying to tune the server a little to handle this problem. The incoming data in the database is about 30-40 GB/day.

At first, checkpoint_segments was set to 50, checkpoint_timeout to 15 min and checkpoint_completion_target to 0.5 sec. I noticed that the utilization of the server was higher when it was close to making a checkpoint, and since full_page_writes is ON, I changed the parameters mentioned above (after reading a lot of stuff online) to:

checkpoint_segments -> 250
checkpoint_timeout -> 40min
checkpoint_completion_target -> 0.8

but the CPU utilization is not significantly lower. Another parameter I will certainly change is wal_buffers, which is now set at 64KB; I plan to make it 16MB. Can this parameter cause a significant percentage of the problem? Are there any suggestions for what I can do to tune the server better? I can provide any information you find relevant about the configuration of the server, the OS, the storage, etc.

Thank you in advance

--
View this message in context: http://postgresql.1045698.n5.nabble.com/heavy-load-high-cpu-itilization-tp4631760p4631760.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
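In 8.4 a checkpoint starts when either checkpoint_timeout elapses or checkpoint_segments WAL segments (16 MB each by default) have been filled since the last one, whichever comes first. The rough Python sketch below shows which trigger would dominate for the old and new settings. It assumes a hypothetical, evenly spread WAL rate of 40 GB/day. The thread never gives the real WAL volume, which is usually larger than the raw incoming data when full_page_writes is on, and peak rates exceed the daily average.

```python
# Rough estimate of which 8.4 checkpoint trigger fires first.
# ASSUMPTION: WAL is written at a steady average rate (hypothetical figure below);
# real WAL volume with full_page_writes=on is usually larger, and bursty.

WAL_SEGMENT_MB = 16  # default WAL segment size


def minutes_to_fill(checkpoint_segments: int, wal_mb_per_minute: float) -> float:
    """Minutes needed to write checkpoint_segments worth of WAL."""
    return checkpoint_segments * WAL_SEGMENT_MB / wal_mb_per_minute


def first_trigger(checkpoint_segments: int, checkpoint_timeout_min: float,
                  wal_mb_per_minute: float) -> tuple[str, float]:
    """Return (trigger name, minutes until it fires), whichever comes first."""
    by_segments = minutes_to_fill(checkpoint_segments, wal_mb_per_minute)
    if by_segments < checkpoint_timeout_min:
        return "checkpoint_segments", by_segments
    return "checkpoint_timeout", checkpoint_timeout_min


# HYPOTHETICAL average: 40 GB of WAL per day, spread evenly (~28.4 MB/min).
wal_rate = 40 * 1024 / (24 * 60)

for label, segments, timeout in [("old", 50, 15), ("new", 250, 40)]:
    trigger, minutes = first_trigger(segments, timeout, wal_rate)
    print(f"{label}: {trigger} after ~{minutes:.1f} min")
```

Under these assumptions both configurations are timeout-driven, because 50 segments would take about 28 min to fill at the average rate. At a hypothetical peak of four times the average, 50 segments would fill in about 7 min. In that case the old configuration could have forced checkpoints more often than every 15 min at busy times.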
On Mon, Jul 25, 2011 at 12:00 PM, Filippos <filippos.kal@gmail.com> wrote:
> Dear all,
>
> First of all, congratulations on your great work here; from time to time
> I've found many answers to my problems. Unfortunately, for this specific
> problem I didn't find much relevant information, so I would ask for your
> guidance with the following situation.
>
> We have a dedicated server (8.4.4, Red Hat) with 24 CPUs and 36 GB of RAM.

There are known data-eating bugs in 8.4.4; you should upgrade to the latest 8.4.x release as soon as possible.

> I would say that the traffic on the server is huge and the CPU utilization is
> pretty high too (avg ~75%, except during the nights when it is much lower).
> I am trying to tune the server a little to handle this problem. The
> incoming data in the database is about 30-40 GB/day.

So you're either CPU or IO bound. We need to see which. Look at these two pages to get started:

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
http://wiki.postgresql.org/wiki/SlowQueryQuestions

> At first, checkpoint_segments was set to 50, checkpoint_timeout to
> 15 min and checkpoint_completion_target to 0.5 sec.

checkpoint_completion_target is not in seconds. It is a fraction (0.0 to 1.0) of the time between checkpoints, and it sets how early each checkpoint should be complete. A checkpoint completion target of 1.0 means the bg writer should write the checkpoint's dirty buffers out to the data files just fast enough to finish right as the next checkpoint is due. The more aggressive this is, the more of the data will already have been flushed to disk when the timeout occurs. However, this comes at the expense of more IO overall, as multiple updates to the same block result in multiple writes instead of just one.
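As a rough illustration of the fraction Scott describes: when checkpoints are driven by checkpoint_timeout, a checkpoint's writes are paced to finish after about checkpoint_completion_target times checkpoint_timeout. The minimal Python sketch below applies this to the old and new settings from the original post. As a simplification, it assumes the timeout, not checkpoint_segments, triggers each checkpoint.

```python
# Approximate time over which one checkpoint's writes are spread.
# ASSUMPTION: checkpoints are triggered by checkpoint_timeout, so the
# interval between checkpoints equals checkpoint_timeout.

def write_window_minutes(checkpoint_timeout_min: float, completion_target: float) -> float:
    """Minutes the checkpoint aims to take, as a fraction of the interval."""
    if not 0.0 <= completion_target <= 1.0:
        raise ValueError("checkpoint_completion_target is a fraction between 0 and 1")
    return checkpoint_timeout_min * completion_target


old = write_window_minutes(15, 0.5)   # 15 min interval, target 0.5
new = write_window_minutes(40, 0.8)   # 40 min interval, target 0.8
print(f"old: writes spread over ~{old} min, new: ~{new} min")
```

Read as seconds, 0.5 would suggest a near-instant checkpoint. It actually asks for the writes to take about half the interval, roughly 7.5 min with the original settings and about 32 min with the new ones.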
> I noticed that the utilization of the server was higher when it was close to
> making a checkpoint, and since full_page_writes is ON, I changed the
> parameters mentioned above (after reading a lot of stuff online) to:
>
> checkpoint_segments -> 250
> checkpoint_timeout -> 40min
> checkpoint_completion_target -> 0.8
>
> but the CPU utilization is not significantly lower. Another parameter I will
> certainly change is wal_buffers, which is now set at 64KB; I plan to
> make it 16MB. Can this parameter cause a significant percentage of the
> problem?

Most of the work done by checkpointing / background writing is IO intensive, not CPU intensive.

> Are there any suggestions for what I can do to tune the server better? I can
> provide any information you find relevant about the configuration of the
> server, the OS, the storage, etc.

First you need to identify the problem more accurately. Tools like iostat, vmstat, top, and so forth can help you figure out whether you're IO bound or CPU bound. It's also possible you've got a thundering herd issue, where too many processes are all vying for the limited number of cores at the same time. If you've got more than 30k to 50k context switches per second in vmstat, it's likely you've got too many things trying to run at once.
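One way to get the context-switch rate without vmstat: on Linux, /proc/stat carries a cumulative `ctxt` counter, so two samples taken a known time apart give a per-second rate. Below is a minimal Python sketch. The 30,000/s threshold is simply the low end of the 30k to 50k range quoted above, and the helper names are hypothetical.

```python
import time


def read_ctxt(stat_text: str) -> int:
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def ctxt_rate(before: str, after: str, seconds: float) -> float:
    """Context switches per second between two /proc/stat snapshots."""
    return (read_ctxt(after) - read_ctxt(before)) / seconds


def sample(seconds: float = 1.0) -> float:
    """Linux only: take two live /proc/stat snapshots about `seconds` apart."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(seconds)
    with open("/proc/stat") as f:
        after = f.read()
    return ctxt_rate(before, after, seconds)


def too_many_runnable(rate: float, threshold: float = 30_000) -> bool:
    """Heuristic from the thread: above ~30k-50k cs/s, suspect a thundering herd."""
    return rate > threshold
```

For example, `too_many_runnable(sample(30))` samples for 30 seconds and applies the same rule of thumb. The interval is the nominal sleep time, so the rate is approximate.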
Thanks a lot for your answer. I will provide some stats, so if you could help me figure out the source of the problem, that would be great.

- top -c
Tasks: 1220 total, 49 running, 1171 sleeping, 0 stopped, 0 zombie
Cpu(s): 84.1%us, 2.8%sy, 0.0%ni, 12.3%id, 0.1%wa, 0.1%hi, 0.6%si, 0.0%st
Mem: 98846996k total, 98632044k used, 214952k free, 134320k buffers
Swap: 50331640k total, 116312k used, 50215328k free, 89445208k cached

- SELECT count(procpid) FROM pg_stat_activity -> 422
- SELECT count(procpid) FROM pg_stat_activity WHERE (NOW() - query_start) > INTERVAL '1 MINUTES' AND current_query = '<IDLE>' -> 108
- SELECT count(procpid) FROM pg_stat_activity WHERE (NOW() - query_start) > INTERVAL '5 MINUTES' AND current_query = '<IDLE>' -> 45

- vmstat -n 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff    cache   si   so    bi    bo    in    cs us sy id wa st
41  1 116300 347008 134176 89608912    0    0   143   210     0     0 11  1 88  0  0
20  0 116300 423556 134116 89581840    0    0  8336  3038 11118 21139 81  5 13  0  0
24  0 116300 412904 134108 89546840    0    0  8488  9025 10621 22921 81  4 15  0  0
23  0 116300 409388 134084 89513728    0    0  8320   548 11386 20226 82  4 14  0  0
34  0 116300 403688 134088 89509520    0    0  6336     0  9552 20994 83  3 14  0  0
22  1 116300 337972 134104 89518624    0    0  8792    28  8980 20455 83  4 13  0  0
37  0 116300 303956 134116 89528720    0    0  8440   536  9644 20492 84  3 13  0  0
17  1 116300 293212 134112 89532816    0    0  5864  8240  9527 19771 85  3 12  0  0
14  0 116300 282168 134116 89540720    0    0  7772   752 10141 21780 84  3 13  0  0
44  0 116300 278684 134100 89536080    0    0  7352   555  9856 21539 85  2 13  0  0

- vmstat -s
98846992 total memory
98685392 used memory
40342200 active memory
52644588 inactive memory
161604 free memory
129960 buffer memory
89421936 swap cache
50331640 total swap
116300 used swap
50215340 free swap
2258553017 non-nice user cpu ticks
1125281 nice user cpu ticks
146638389 system cpu ticks
17789847697 idle cpu ticks
83090716 IO-wait cpu ticks
5045742 IRQ cpu ticks
38895985 softirq cpu ticks
0 stolen cpu ticks
29142450583 pages paged in
42731005078 pages paged out
39784 pages swapped in
3395187 pages swapped out
1338370564 interrupts
1176640487 CPU context switches
1305704895 boot time
24471946 forks

(after 30 sec)
- vmstat -s
98846992 total memory
98367312 used memory
39959952 active memory
52957104 inactive memory
479684 free memory
129720 buffer memory
89410640 swap cache
50331640 total swap
116296 used swap
50215344 free swap
2258645091 non-nice user cpu ticks
1125282 nice user cpu ticks
146640181 system cpu ticks
17789863186 idle cpu ticks
83090856 IO-wait cpu ticks
5045855 IRQ cpu ticks
38896749 softirq cpu ticks
0 stolen cpu ticks
29142861271 pages paged in
42731249289 pages paged out
39784 pages swapped in
3395187 pages swapped out
1338808821 interrupts
1177463384 CPU context switches
1305704895 boot time
24472003 forks

From the above: context switches/s = (1177463384 - 1176640487)/30 = 27429

Thanks in advance for any advice.
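Filippos's context-switch arithmetic can be extended to the CPU tick counters in the same two vmstat -s snapshots. The short Python sketch below uses the posted counters. It assumes the snapshots were exactly 30 s apart, as stated, and treats the tick deltas as a breakdown of where CPU time went in that window.

```python
# Counter values copied from the two `vmstat -s` snapshots above (30 s apart).
before = {
    "user": 2258553017, "nice": 1125281, "system": 146638389,
    "idle": 17789847697, "iowait": 83090716, "irq": 5045742,
    "softirq": 38895985, "stolen": 0, "ctxt": 1176640487,
}
after = {
    "user": 2258645091, "nice": 1125282, "system": 146640181,
    "idle": 17789863186, "iowait": 83090856, "irq": 5045855,
    "softirq": 38896749, "stolen": 0, "ctxt": 1177463384,
}
INTERVAL_S = 30
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "stolen"]


def deltas(a: dict, b: dict) -> dict:
    """Per-counter increase between two snapshots."""
    return {k: b[k] - a[k] for k in a}


def cpu_shares(d: dict) -> dict:
    """Percentage of CPU ticks spent in each state during the interval."""
    total = sum(d[k] for k in CPU_FIELDS)
    return {k: 100.0 * d[k] / total for k in CPU_FIELDS}


d = deltas(before, after)
cs_per_s = d["ctxt"] / INTERVAL_S
shares = cpu_shares(d)
print(f"context switches/s: {cs_per_s:.0f}")
print(", ".join(f"{k} {v:.1f}%" for k, v in shares.items()))
```

This gives about 27,430 context switches/s, just below the 30k to 50k range Scott mentioned. About 83% of CPU ticks went to user mode, about 14% were idle, and IO wait was close to zero.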
On Jul 30, 2011, at 3:02 PM, Filippos wrote:
> Thanks a lot for your answer.
> I will provide some stats, so if you could help me figure out the source of
> the problem, that would be great.
>
> - top -c
> Tasks: 1220 total, 49 running, 1171 sleeping, 0 stopped, 0 zombie
> Cpu(s): 84.1%us, 2.8%sy, 0.0%ni, 12.3%id, 0.1%wa, 0.1%hi, 0.6%si, 0.0%st
> Mem: 98846996k total, 98632044k used, 214952k free, 134320k buffers
> Swap: 50331640k total, 116312k used, 50215328k free, 89445208k cached

84% CPU isn't horrible, and you do have idle CPU time available. So you don't look to be too CPU-bound. However, keep in mind that one process might be CPU intensive and take a long time to run, thereby blocking other processes that depend on its results.

> - SELECT count(procpid) FROM pg_stat_activity -> 422
> - SELECT count(procpid) FROM pg_stat_activity WHERE (NOW() - query_start) > INTERVAL '1 MINUTES' AND current_query = '<IDLE>' -> 108
> - SELECT count(procpid) FROM pg_stat_activity WHERE (NOW() - query_start) > INTERVAL '5 MINUTES' AND current_query = '<IDLE>' -> 45

It would be good to look at getting some connection pooling happening. Your vmstat output shows you generally have CPU available. Can you provide some output from iostat -xk 2?
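You can check Jim's reading of the vmstat output by averaging the per-second rows. The minimal Python sketch below runs over the rows posted earlier in the thread. It skips the first row, because vmstat's first report shows CPU, IO and system figures averaged since boot. It then compares the run queue (r) with the 24 CPUs stated in the original post.

```python
# Rows from `vmstat -n 1 10` as posted in the thread (first row omitted:
# its CPU/IO/system columns are averages since boot).
VMSTAT_ROWS = """\
20 0 116300 423556 134116 89581840 0 0 8336 3038 11118 21139 81 5 13 0 0
24 0 116300 412904 134108 89546840 0 0 8488 9025 10621 22921 81 4 15 0 0
23 0 116300 409388 134084 89513728 0 0 8320 548 11386 20226 82 4 14 0 0
34 0 116300 403688 134088 89509520 0 0 6336 0 9552 20994 83 3 14 0 0
22 1 116300 337972 134104 89518624 0 0 8792 28 8980 20455 83 4 13 0 0
37 0 116300 303956 134116 89528720 0 0 8440 536 9644 20492 84 3 13 0 0
17 1 116300 293212 134112 89532816 0 0 5864 8240 9527 19771 85 3 12 0 0
14 0 116300 282168 134116 89540720 0 0 7772 752 10141 21780 84 3 13 0 0
44 0 116300 278684 134100 89536080 0 0 7352 555 9856 21539 85 2 13 0 0
"""
COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()
NUM_CPUS = 24  # as stated in the original post


def parse(text: str) -> list[dict]:
    """Turn whitespace-separated vmstat rows into dicts keyed by column name."""
    return [dict(zip(COLUMNS, map(int, line.split())))
            for line in text.splitlines() if line.strip()]


def mean(rows: list[dict], col: str) -> float:
    """Average of one column across all rows."""
    return sum(row[col] for row in rows) / len(rows)


rows = parse(VMSTAT_ROWS)
summary = {col: mean(rows, col) for col in ("r", "us", "sy", "id", "wa", "cs")}
print({k: round(v, 1) for k, v in summary.items()})
print(f"runnable processes per CPU: {summary['r'] / NUM_CPUS:.2f}")
```

The rows average about 83% user, 13% idle and 0% IO wait, with roughly 26 runnable processes for 24 CPUs (about 1.09 per CPU). This fits Jim's observation that some CPU is still idle.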
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net