Обсуждение: 回复: postgres cpu 100% need help

Поиск
Список
Период
Сортировка

回复: postgres cpu 100% need help

От
"657985552@qq.com"
Дата:
Dear sir:
         Recently a wired question about postgresql database really bothered me a lot, so i really need your help. Here is the problem, in the most situations the postgre database work very well,  Average 3500tps/s per day, the cpu usage of its process is 3%~10% and every query can be responsed in less than 20ms, but sometimes the cpu usages of its process can suddenly grow up to 90%+ , at that time a simple query can cost  2000+ms. ps: My postgresql version is 9.3.5 and the database is oltp  server.


Here is the output when I strace the one of process when its cpu usages grow up to 100%:

recvfrom(10, "P\0\0\0J\0 \t\tselect o_count from tsh"..., 8192, 0, NULL, NULL) = 120
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000})  = 0 (Timeout)
lseek(16, 0, SEEK_END)                  = 180543488
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
lseek(16, 0, SEEK_END)                  = 180543488
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
lseek(16, 0, SEEK_END)                  = 180543488
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
lseek(16, 0, SEEK_END)                  = 180543488
....more and more  lseek(16, 0, SEEK_END)                  = 180543488 Occasionally select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1o_count\0\0\0\0\0\0\0\0"..., 75, 0, NULL, 0) = 75
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183

ps:my postgres config:
 autovacuum_analyze_scale_factor     | 0.1           
 autovacuum_analyze_threshold        | 50            
 autovacuum_freeze_max_age           | 200000000     
 autovacuum_max_workers                   | 3             
 autovacuum_multixact_freeze_max_age | 400000000     
 autovacuum_naptime                        | 1min          
 autovacuum_vacuum_cost_delay        | 20ms          
 autovacuum_vacuum_cost_limit        | -1            
 autovacuum_vacuum_scale_factor      | 0.2           
 autovacuum_vacuum_threshold         | 50            
 checkpoint_segments                        | 32            
 checkpoint_timeout                           | 5min          
 effective_cache_size                            | 64GB          
 maintenance_work_mem                   | 500MB         
 max_connections                                | 2000          
 max_files_per_process                         | 1000          
 work_mem                                           | 20MB          
 shared_buffers                                     | 25GB          

Are those problems with the configuration?
  




qijia.wang@melot.cn
 
发件人: 657985552@qq.com
发送时间: 2015-10-26 16:14
收件人: 王祺佳
主题: postgres cpu 100% need help
hello everyone:
         I postgresql version is 9.3.5 ,My database is oltp  server normally cpu usage is 3%~10% . Every sql is return in 20ms.Concurrently activety sql is less then 5
 but  sometimes my database cpu grows 90%+ ,the simple select sql use  2000+ms   .

 select count(*) from pg_stat_activity where state<>'idle';
 count 
-------
  126

 I strace -p 86181(the one of process when cpu is 100%)
recvfrom(10, "Q\0\0\0?select * from   tshow.p_hpa"..., 8192, 0, NULL, NULL) = 64
lseek(8, 0, SEEK_END)                   = 32768
lseek(7, 0, SEEK_END)                   = 1155072
lseek(7, 0, SEEK_END)                   = 1155072
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)


zhiwei.li

Re: ??: postgres cpu 100% need help

От
Bill Moran
Дата:
On Tue, 27 Oct 2015 11:30:45 +0800
"657985552@qq.com" <657985552@qq.com> wrote:

> Dear sir:
>          Recently a wired question about postgresql database really bothered me a lot, so i really need your help.
Hereis the problem, in the most situations the postgre database work very well,  Average 3500tps/s per day, the cpu
usageof its process is 3%~10% and every query can be responsed in less than 20ms, but sometimes the cpu usages of its
processcan suddenly grow up to 90%+ , at that time a simple query can cost  2000+ms. ps: My postgresql version is 9.3.5
andthe database is oltp  server. 

9.3.5 is pretty old, you should probably schedule an upgrade.

>  shared_buffers                                     | 25GB

Try setting this to 16GB. It's been a while since I tested on
large-memory/high-load systems, but I seem to remember that
shared_buffers above 16G could cause these sorts of intermittant
stalls.

If that doesn't improve the situation, you'll probably need to
provide more details, specifically the layout of the table in
question, as well as the queries that are active when the
problem occurs, and the contents of the pg_locks table when
the problem is occurring.

--
Bill Moran


Re: ??: postgres cpu 100% need help

От
Merlin Moncure
Дата:
On Tue, Oct 27, 2015 at 12:14 PM, Bill Moran <wmoran@potentialtech.com> wrote:
> On Tue, 27 Oct 2015 11:30:45 +0800
> "657985552@qq.com" <657985552@qq.com> wrote:
>
>> Dear sir:
>>          Recently a wired question about postgresql database really bothered me a lot, so i really need your help.
Hereis the problem, in the most situations the postgre database work very well,  Average 3500tps/s per day, the cpu
usageof its process is 3%~10% and every query can be responsed in less than 20ms, but sometimes the cpu usages of its
processcan suddenly grow up to 90%+ , at that time a simple query can cost  2000+ms. ps: My postgresql version is 9.3.5
andthe database is oltp  server. 
>
> 9.3.5 is pretty old, you should probably schedule an upgrade.
>
>>  shared_buffers                                     | 25GB
>
> Try setting this to 16GB. It's been a while since I tested on
> large-memory/high-load systems, but I seem to remember that
> shared_buffers above 16G could cause these sorts of intermittant
> stalls.
>
> If that doesn't improve the situation, you'll probably need to
> provide more details, specifically the layout of the table in
> question, as well as the queries that are active when the
> problem occurs, and the contents of the pg_locks table when
> the problem is occurring.

possible culprits:
*) io based problems (check iowait, rule this out first)
*) THP compaction (rule this out second)
*) runaway query plan
*) concurrency problems within postgres itself (perf top capture
during load is essential)

maybe some other things I'm not thinking of.

merlin


Re: [GENERAL] 回复: postgres cpu 100% need help

От
Jeff Janes
Дата:
On Mon, Oct 26, 2015 at 8:30 PM, 657985552@qq.com <657985552@qq.com> wrote:
Dear sir:
         Recently a wired question about postgresql database really bothered me a lot, so i really need your help. Here is the problem, in the most situations the postgre database work very well,  Average 3500tps/s per day, the cpu usage of its process is 3%~10% and every query can be responsed in less than 20ms, but sometimes the cpu usages of its process can suddenly grow up to 90%+ , at that time a simple query can cost  2000+ms. ps: My postgresql version is 9.3.5 and the database is oltp  server.

From all the lseeks on the same file, it looks like you are doing a sequential scan on a table for the inside of a nested loop.  What is the query it is running, and what is the execution plan for it?  To get that, run:

explain (analyze, buffers) select o_count from ....

Cheers,

Jeff

Re: [GENERAL] Re: [GENERAL] 回复: postgres cpu 100% need help

От
Merlin Moncure
Дата:
On Tue, Oct 27, 2015 at 2:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Oct 26, 2015 at 8:30 PM, 657985552@qq.com <657985552@qq.com> wrote:
>>
>> Dear sir:
>>          Recently a wired question about postgresql database really
>> bothered me a lot, so i really need your help. Here is the problem, in the
>> most situations the postgre database work very well,  Average 3500tps/s per
>> day, the cpu usage of its process is 3%~10% and every query can be responsed
>> in less than 20ms, but sometimes the cpu usages of its process can suddenly
>> grow up to 90%+ , at that time a simple query can cost  2000+ms. ps: My
>> postgresql version is 9.3.5 and the database is oltp  server.
>
>
> From all the lseeks on the same file, it looks like you are doing a
> sequential scan on a table for the inside of a nested loop.  What is the
> query it is running, and what is the execution plan for it?  To get that,
> run:

this smells like thp compaction:


Re: [GENERAL] Re: [GENERAL] 回复: postgres cpu 100% need help

От
Merlin Moncure
Дата:
On Tue, Oct 27, 2015 at 4:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Oct 27, 2015 at 2:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Mon, Oct 26, 2015 at 8:30 PM, 657985552@qq.com <657985552@qq.com> wrote:
>>>
>>> Dear sir:
>>>          Recently a wired question about postgresql database really
>>> bothered me a lot, so i really need your help. Here is the problem, in the
>>> most situations the postgre database work very well,  Average 3500tps/s per
>>> day, the cpu usage of its process is 3%~10% and every query can be responsed
>>> in less than 20ms, but sometimes the cpu usages of its process can suddenly
>>> grow up to 90%+ , at that time a simple query can cost  2000+ms. ps: My
>>> postgresql version is 9.3.5 and the database is oltp  server.
>>
>>
>> From all the lseeks on the same file, it looks like you are doing a
>> sequential scan on a table for the inside of a nested loop.  What is the
>> query it is running, and what is the execution plan for it?  To get that,
>> run:
>
> this smells like thp compaction:

drat -- sorry for partial email.
check this link:
http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/

if that turns out not to be productive, need to check 'perf top', and
also rule out problems with individual queries.

merlin


Re: ??: postgres cpu 100% need help

От
"657985552@qq.com"
Дата:
Thank you for your reply. i will try to  change it  and  Waiting for it to appear again.the sql is :
 select count(t.*) into o_count from tshow.res_room_weight t,tshow.res_room_info r 
  where t.subcatlg_id=:i_title_id 
  and t.roomid = r.actorid
  and r.levels>=0;

tshow=> \d res_room_info;
                 Table "tshow.res_room_info"
     Column     |            Type             |   Modifiers   
----------------+-----------------------------+---------------
 actorid        | integer                     | not null
 nickname       | text                        | not null
 livetype       | integer                     | 
 people_inroom  | integer                     | 
 poster         | character varying(128)      | 
 actor_level    | integer                     | 
 operatorid     | integer                     | 
 jointime       | timestamp without time zone | 
 signtime       | timestamp without time zone | 
 levels         | integer                     | 
 note           | text                        | 
 leavereason    | text                        | 
 register_city  | integer                     | 
 vedio_level    | integer                     | 
 is_good        | integer                     | default 0
 is_display     | integer                     | 
 live_starttime | timestamp without time zone | 
 live_endtime   | timestamp without time zone | 
 next_starttime | timestamp without time zone | 
 max_count      | integer                     | default 40000
 is_recommend   | integer                     | 
 icon           | integer                     | 
 rich_level     | integer                     | 
 type           | integer                     | 
 room_mode      | integer                     | 
 room_theme     | text                        | 
 portrait       | text                        | 
 gender         | integer                     | default 0
 tag            | text                        | 
 live_poster    | text                        | 
 family_id      | integer                     | 
 room_lock      | integer                     | default 0
Indexes:
    "res_room_info_pkey" PRIMARY KEY, btree (actorid)
    "idx_res_room_info_cityid" btree (register_city)

tshow=> \d tshow.res_room_weight
             Table "tshow.res_room_weight"
    Column    |            Type             | Modifiers 
--------------+-----------------------------+-----------
 subcatlg_id  | integer                     | not null
 roomid       | integer                     | not null
 weight       | integer                     | default 0
 is_recommend | integer                     | 
 update_time  | timestamp without time zone | 
 product_id   | integer                     | default 1
 create_time  | timestamp without time zone | 
Indexes:
    "res_room_weight_pkey" PRIMARY KEY, btree (subcatlg_id, roomid) CLUSTER

tshow=> select count(*) from tshow.res_room_info ;
 count 
-------
 22648
(1 row)

tshow=> select count(*) from tshow.res_room_weight  ;
 count 
-------
 23417

i don't see any  Exclusive lock in  pg_lock view .


657985552@qq.com
 
From: Bill Moran
Date: 2015-10-28 01:14
Subject: Re: [GENERAL]??: postgres cpu 100% need help
On Tue, 27 Oct 2015 11:30:45 +0800
"657985552@qq.com" <657985552@qq.com> wrote:
 
> Dear sir:
>          Recently a wired question about postgresql database really bothered me a lot, so i really need your help. Here is the problem, in the most situations the postgre database work very well,  Average 3500tps/s per day, the cpu usage of its process is 3%~10% and every query can be responsed in less than 20ms, but sometimes the cpu usages of its process can suddenly grow up to 90%+ , at that time a simple query can cost  2000+ms. ps: My postgresql version is 9.3.5 and the database is oltp  server.
 
9.3.5 is pretty old, you should probably schedule an upgrade.
       
>  shared_buffers                                     | 25GB         
 
Try setting this to 16GB. It's been a while since I tested on
large-memory/high-load systems, but I seem to remember that
shared_buffers above 16G could cause these sorts of intermittant
stalls.
 
If that doesn't improve the situation, you'll probably need to
provide more details, specifically the layout of the table in
question, as well as the queries that are active when the
problem occurs, and the contents of the pg_locks table when
the problem is occurring.
 
--
Bill Moran

Re: ??: postgres cpu 100% need help

От
"657985552@qq.com"
Дата:
Thank you for your reply.
1. I'm sure the io is not problem .but  I didn't record the IO status at that time.
2.what is THP compaction ? How can i rule this out?
3. explain 
 tshow=> explain  select count(t.*)  from tshow.res_room_weight t,tshow.res_room_info r 
tshow->   where t.subcatlg_id=46 
tshow->   and t.roomid = r.actorid
tshow->   and r.levels>=0;
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 Aggregate  (cost=24581.50..24581.51 rows=1 width=60)
   ->  Hash Join  (cost=1737.27..24572.14 rows=3744 width=60)
         Hash Cond: (r.actorid = t.roomid)
         ->  Seq Scan on res_room_info r  (cost=0.00..22322.19 rows=22364 width=4)
               Filter: (levels >= 0)
         ->  Hash  (cost=1689.86..1689.86 rows=3793 width=64)
               ->  Seq Scan on res_room_weight t  (cost=0.00..1689.86 rows=3793 width=64)
                     Filter: (subcatlg_id = 46)

4. next i will use perf top to  capture during load is essential


657985552@qq.com
 
Date: 2015-10-28 02:35
Subject: Re: [GENERAL] ??: postgres cpu 100% need help
On Tue, Oct 27, 2015 at 12:14 PM, Bill Moran <wmoran@potentialtech.com> wrote:
> On Tue, 27 Oct 2015 11:30:45 +0800
> "657985552@qq.com" <657985552@qq.com> wrote:
>
>> Dear sir:
>>          Recently a wired question about postgresql database really bothered me a lot, so i really need your help. Here is the problem, in the most situations the postgre database work very well,  Average 3500tps/s per day, the cpu usage of its process is 3%~10% and every query can be responsed in less than 20ms, but sometimes the cpu usages of its process can suddenly grow up to 90%+ , at that time a simple query can cost  2000+ms. ps: My postgresql version is 9.3.5 and the database is oltp  server.
>
> 9.3.5 is pretty old, you should probably schedule an upgrade.
>
>>  shared_buffers                                     | 25GB
>
> Try setting this to 16GB. It's been a while since I tested on
> large-memory/high-load systems, but I seem to remember that
> shared_buffers above 16G could cause these sorts of intermittant
> stalls.
>
> If that doesn't improve the situation, you'll probably need to
> provide more details, specifically the layout of the table in
> question, as well as the queries that are active when the
> problem occurs, and the contents of the pg_locks table when
> the problem is occurring.
 
possible culprits:
*) io based problems (check iowait, rule this out first)
*) THP compaction (rule this out second)
*) runaway query plan
*) concurrency problems within postgres itself (perf top capture
during load is essential)
 
maybe some other things I'm not thinking of.
 
merlin
 

Re: Re: [GENERAL] 回复: postgres cpu 100% need help

От
"657985552@qq.com"
Дата:
Thank you for your reply.
tshow=> explain (analyze, buffers)  select count(t.*)  from tshow.res_room_weight t,tshow.res_room_info r 
tshow->   where t.subcatlg_id=46 
tshow->   and t.roomid = r.actorid
tshow->   and r.levels>=0;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=24581.57..24581.58 rows=1 width=60) (actual time=112.613..112.613 rows=1 loops=1)
   Buffers: shared hit=23439
   ->  Hash Join  (cost=1737.31..24572.21 rows=3744 width=60) (actual time=16.375..111.879 rows=3730 loops=1)
         Hash Cond: (r.actorid = t.roomid)
         Buffers: shared hit=23439
         ->  Seq Scan on res_room_info r  (cost=0.00..22322.20 rows=22365 width=4) (actual time=0.031..86.026 rows=22365 loops=1)
               Filter: (levels >= 0)
               Rows Removed by Filter: 291
               Buffers: shared hit=22039
         ->  Hash  (cost=1689.90..1689.90 rows=3793 width=64) (actual time=16.274..16.274 rows=3793 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 329kB
               Buffers: shared hit=1397
               ->  Seq Scan on res_room_weight t  (cost=0.00..1689.90 rows=3793 width=64) (actual time=0.348..14.663 rows=3793 loops=1)
                     Filter: (subcatlg_id = 46)
                     Rows Removed by Filter: 19641
                     Buffers: shared hit=1397
 Total runtime: 112.754 ms

cost only  112.754 ms . the High load time log is :

2015-10-28 00:00:17.177 CST "SELECT",2015-10-23 19:30:20 CST,276/59546142,0,LOG,00000,"duration: 3300.237 ms  execute <unnamed>:                 SELECT          o_count                 FROM            tshow.p_hpart_getcount_intitle($1)","parameters: $1 = '46'",,,,,,,,""

657985552@qq.com
 
发件人: Jeff Janes
发送时间: 2015-10-28 03:33
收件人: 657985552@qq.com
抄送: pgsql-general
主题: Re: [GENERAL] 回复: postgres cpu 100% need help
On Mon, Oct 26, 2015 at 8:30 PM, 657985552@qq.com <657985552@qq.com> wrote:
Dear sir:
         Recently a wired question about postgresql database really bothered me a lot, so i really need your help. Here is the problem, in the most situations the postgre database work very well,  Average 3500tps/s per day, the cpu usage of its process is 3%~10% and every query can be responsed in less than 20ms, but sometimes the cpu usages of its process can suddenly grow up to 90%+ , at that time a simple query can cost  2000+ms. ps: My postgresql version is 9.3.5 and the database is oltp  server.

From all the lseeks on the same file, it looks like you are doing a sequential scan on a table for the inside of a nested loop.  What is the query it is running, and what is the execution plan for it?  To get that, run:

explain (analyze, buffers) select o_count from ....

Cheers,

Jeff

Re: Re: Re: [GENERAL] 回复: postgres cpu 100% need help

От
John R Pierce
Дата:
On 10/28/2015 12:12 AM, 657985552@qq.com wrote:
Thank you for your reply.
tshow=> explain (analyze, buffers)  select count(t.*)  from tshow.res_room_weight t,tshow.res_room_info r 
tshow->   where t.subcatlg_id=46 
tshow->   and t.roomid = r.actorid
tshow->   and r.levels>=0;

http://explain.depesz.com/s/oKN5

shows where the time is spent in that query, but as you say, that one the timing was fine.    there's a lot of time being spent in a seq_scan of res_room_info.

I would have written that query as the inner join it is, but that probably won't effect how it's executed.

select ... from tshow.res_room_weight t join tshow.res_room_info r on (t.roomid = r.actorid) where  t.subcatlg_id=46 and r.levels>=0;


-- 
john r pierce, recycling bits in santa cruz

Re: Re: [GENERAL] Re: Re: [GENERAL] 回复: postgres cpu 100% need help

От
"657985552@qq.com"
Дата:
thanks ,  I can change the sql . Needed a   little business modify in my application  .but i don't know why  in high load time  I Executing the sql
,it cost  200-300ms . but the log record it cost 2000ms

log message:
 duration: 2042.493 ms  execute <unnamed>:           select o_count from tshow.p_hpart_liveingroomlist_count($1)","parameters: $1 = '46'


657985552@qq.com
 
发件人: John R Pierce
发送时间: 2015-10-28 16:28
收件人: pgsql-general
主题: Re: [GENERAL] Re: Re: [GENERAL] 回复: postgres cpu 100% need help
On 10/28/2015 12:12 AM, 657985552@qq.com wrote:
Thank you for your reply.
tshow=> explain (analyze, buffers)  select count(t.*)  from tshow.res_room_weight t,tshow.res_room_info r 
tshow->   where t.subcatlg_id=46 
tshow->   and t.roomid = r.actorid
tshow->   and r.levels>=0;

http://explain.depesz.com/s/oKN5

shows where the time is spent in that query, but as you say, that one the timing was fine.    there's a lot of time being spent in a seq_scan of res_room_info.

I would have written that query as the inner join it is, but that probably won't effect how it's executed.

select ... from tshow.res_room_weight t join tshow.res_room_info r on (t.roomid = r.actorid) where  t.subcatlg_id=46 and r.levels>=0;


-- 
john r pierce, recycling bits in santa cruz

Re: Re: [GENERAL] Re: Re: [GENERAL] 回复: postgres cpu 100% need help

От
"657985552@qq.com"
Дата:
  I can change the sql . Needed a   little business modify in my application  .but i don't know why  in high load time  I Executing the sql
,it cost  200-300ms . but the log record it cost 2000ms

log message:
 duration: 2042.493 ms  execute <unnamed>:           select o_count from tshow.p_hpart_liveingroomlist_count($1)","parameters: $1 = '46'

thanks 


657985552@qq.com
 
发件人: 657985552@qq.com
发送时间: 2015-10-28 16:53
主题: Re: Re: [GENERAL] Re: Re: [GENERAL] 回复: postgres cpu 100% need help
thanks ,  I can change the sql . Needed a   little business modify in my application  .but i don't know why  in high load time  I Executing the sql
,it cost  200-300ms . but the log record it cost 2000ms

log message:
 duration: 2042.493 ms  execute <unnamed>:           select o_count from tshow.p_hpart_liveingroomlist_count($1)","parameters: $1 = '46'


657985552@qq.com
 
发件人: John R Pierce
发送时间: 2015-10-28 16:28
收件人: pgsql-general
主题: Re: [GENERAL] Re: Re: [GENERAL] 回复: postgres cpu 100% need help
On 10/28/2015 12:12 AM, 657985552@qq.com wrote:
Thank you for your reply.
tshow=> explain (analyze, buffers)  select count(t.*)  from tshow.res_room_weight t,tshow.res_room_info r 
tshow->   where t.subcatlg_id=46 
tshow->   and t.roomid = r.actorid
tshow->   and r.levels>=0;

http://explain.depesz.com/s/oKN5

shows where the time is spent in that query, but as you say, that one the timing was fine.    there's a lot of time being spent in a seq_scan of res_room_info.

I would have written that query as the inner join it is, but that probably won't effect how it's executed.

select ... from tshow.res_room_weight t join tshow.res_room_info r on (t.roomid = r.actorid) where  t.subcatlg_id=46 and r.levels>=0;


-- 
john r pierce, recycling bits in santa cruz

Re: Re: [GENERAL] 回复: postgres cpu 100% need help

От
Jeff Janes
Дата:
On Wed, Oct 28, 2015 at 12:12 AM, 657985552@qq.com <657985552@qq.com> wrote:
> Thank you for your reply.
> tshow=> explain (analyze, buffers)  select count(t.*)  from
> tshow.res_room_weight t,tshow.res_room_info r
> tshow->   where t.subcatlg_id=46
> tshow->   and t.roomid = r.actorid
> tshow->   and r.levels>=0;
>                                                                QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=24581.57..24581.58 rows=1 width=60) (actual
> time=112.613..112.613 rows=1 loops=1)
>    Buffers: shared hit=23439
>    ->  Hash Join  (cost=1737.31..24572.21 rows=3744 width=60) (actual
> time=16.375..111.879 rows=3730 loops=1)
>          Hash Cond: (r.actorid = t.roomid)
>          Buffers: shared hit=23439
>          ->  Seq Scan on res_room_info r  (cost=0.00..22322.20 rows=22365
> width=4) (actual time=0.031..86.026 rows=22365 loops=1)
>                Filter: (levels >= 0)
>                Rows Removed by Filter: 291
>                Buffers: shared hit=22039
>          ->  Hash  (cost=1689.90..1689.90 rows=3793 width=64) (actual
> time=16.274..16.274 rows=3793 loops=1)
>                Buckets: 1024  Batches: 1  Memory Usage: 329kB
>                Buffers: shared hit=1397
>                ->  Seq Scan on res_room_weight t  (cost=0.00..1689.90
> rows=3793 width=64) (actual time=0.348..14.663 rows=3793 loops=1)
>                      Filter: (subcatlg_id = 46)
>                      Rows Removed by Filter: 19641
>                      Buffers: shared hit=1397
>  Total runtime: 112.754 ms
>
> cost only  112.754 ms . the High load time log is :
>
> 2015-10-28 00:00:17.177 CST "SELECT",2015-10-23 19:30:20
> CST,276/59546142,0,LOG,00000,"duration: 3300.237 ms  execute <unnamed>:
> SELECT          o_count                 FROM
> tshow.p_hpart_getcount_intitle($1)","parameters: $1 = '46'",,,,,,,,""


You really need it during the high-load condition.  Perhaps
auto_explain could help.

Cheers,

Jeff


=?gb18030?B?u9i4tKO6IFtHRU5FUkFMXSA/PzogcG9zdGdyZXMg?= =?gb18030?B?Y3B1IDEwMCUgbmVlZCBoZWxw?=

От
"莎士比亚说:"
Дата:
Hi moran and others;

    yesterday i get the pg  problem  again, and i use perf top Observation follows:
PerfTop:   11574 irqs/sec  kernel: 2.2%  exact:  0.0% [4000Hz cycles],  (all, 32 CPUs)     
    81.39%  postgres       [.] s_lock
     5.42%  postgres       [.] LWLockAcquire
     4.59%  postgres       [.] LWLockRelease
     3.06%  postgres       [.] TransactionIdIsInProgress
     0.38%  postgres       [.] PinBuffer
     0.31%  postgres       [.] TransactionIdPrecedes
     0.27%  postgres       [.] UnpinBuffer
     0.19%  postgres       [.] TransactionIdIsCurrentTransactionId
     0.16%  postgres       [.] heap_hot_search_buffer
     0.15%  [kernel]       [k] number.isra.1
     0.14%  [kernel]       [k] kallsyms_expand_symbol.constprop.1
     0.10%  [kernel]       [k] module_get_kallsym
     0.10%  libc-2.17.so   [.] __strcmp_sse42
     0.09%  [kernel]       [k] _raw_spin_lock
     0.09%  postgres       [.] hash_search_with_hash_value

is spin lock problem ?  I need everyone's help to solve the problem.thsnks!

------------------ 原始邮件 ------------------
发件人: "Merlin Moncure";<mmoncure@gmail.com>;
发送时间: 2015年10月28日(星期三) 凌晨2:35
收件人: "Bill Moran"<wmoran@potentialtech.com>;
抄送: "莎士比亚说:"<657985552@qq.com>; "pgsql-general"<pgsql-general@postgresql.org>;
主题: Re: [GENERAL] ??: postgres cpu 100% need help

On Tue, Oct 27, 2015 at 12:14 PM, Bill Moran <wmoran@potentialtech.com> wrote:
> On Tue, 27 Oct 2015 11:30:45 +0800
> "657985552@qq.com" <657985552@qq.com> wrote:
>
>> Dear sir:
>>          Recently a wired question about postgresql database really bothered me a lot, so i really need your help. Here is the problem, in the most situations the postgre database work very well,  Average 3500tps/s per day, the cpu usage of its process is 3%~10% and every query can be responsed in less than 20ms, but sometimes the cpu usages of its process can suddenly grow up to 90%+ , at that time a simple query can cost  2000+ms. ps: My postgresql version is 9.3.5 and the database is oltp  server.
>
> 9.3.5 is pretty old, you should probably schedule an upgrade.
>
>>  shared_buffers                                     | 25GB
>
> Try setting this to 16GB. It's been a while since I tested on
> large-memory/high-load systems, but I seem to remember that
> shared_buffers above 16G could cause these sorts of intermittant
> stalls.
>
> If that doesn't improve the situation, you'll probably need to
> provide more details, specifically the layout of the table in
> question, as well as the queries that are active when the
> problem occurs, and the contents of the pg_locks table when
> the problem is occurring.

possible culprits:
*) io based problems (check iowait, rule this out first)
*) THP compaction (rule this out second)
*) runaway query plan
*) concurrency problems within postgres itself (perf top capture
during load is essential)

maybe some other things I'm not thinking of.

merlin


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] 回复: [GENERAL] ??: postgres cpu 100% need help

От
Pavel Stehule
Дата:


2015-11-08 14:22 GMT+01:00 莎士比亚说: <657985552@qq.com>:
Hi moran and others;

    yesterday i get the pg  problem  again, and i use perf top Observation follows:
PerfTop:   11574 irqs/sec  kernel: 2.2%  exact:  0.0% [4000Hz cycles],  (all, 32 CPUs)     
    81.39%  postgres       [.] s_lock
     5.42%  postgres       [.] LWLockAcquire
     4.59%  postgres       [.] LWLockRelease
     3.06%  postgres       [.] TransactionIdIsInProgress
     0.38%  postgres       [.] PinBuffer
     0.31%  postgres       [.] TransactionIdPrecedes
     0.27%  postgres       [.] UnpinBuffer
     0.19%  postgres       [.] TransactionIdIsCurrentTransactionId
     0.16%  postgres       [.] heap_hot_search_buffer
     0.15%  [kernel]       [k] number.isra.1
     0.14%  [kernel]       [k] kallsyms_expand_symbol.constprop.1
     0.10%  [kernel]       [k] module_get_kallsym
     0.10%  libc-2.17.so   [.] __strcmp_sse42
     0.09%  [kernel]       [k] _raw_spin_lock
     0.09%  postgres       [.] hash_search_with_hash_value

is spin lock problem ?  I need everyone's help to solve the problem.thsnks!

yes, it is spin lock problem. you can try to limit number of active connections - http://www.enterprisedb.com/postgres-plus-edb-blog/amit-kapila/read-scalability-postgresql-95. You can try 9.4, where some issues is fixed

When you hit this issue, the basic prerequisite is test case. If you can run the simulation with this issue, then we can identify bottleneck.

You can try to check lwlocks via systemtap http://postgres.cz/wiki/Monitorov%C3%A1n%C3%AD_lwlocku_pomoc%C3%AD_systemtapu - the article is in Czech language, but this page support Google transalator.

 Regards

Pavel

Re: ??: postgres cpu 100% need help

От
Merlin Moncure
Дата:
On Sun, Nov 8, 2015 at 7:22 AM, 莎士比亚说: <657985552@qq.com> wrote:
> Hi moran and others;
>
>     yesterday i get the pg  problem  again, and i use perf top Observation
> follows:
> PerfTop:   11574 irqs/sec  kernel: 2.2%  exact:  0.0% [4000Hz cycles],
> (all, 32 CPUs)
>     81.39%  postgres       [.] s_lock
>      5.42%  postgres       [.] LWLockAcquire
>      4.59%  postgres       [.] LWLockRelease
>      3.06%  postgres       [.] TransactionIdIsInProgress
>      0.38%  postgres       [.] PinBuffer
>      0.31%  postgres       [.] TransactionIdPrecedes
>      0.27%  postgres       [.] UnpinBuffer
>      0.19%  postgres       [.] TransactionIdIsCurrentTransactionId
>      0.16%  postgres       [.] heap_hot_search_buffer
>      0.15%  [kernel]       [k] number.isra.1
>      0.14%  [kernel]       [k] kallsyms_expand_symbol.constprop.1
>      0.10%  [kernel]       [k] module_get_kallsym
>      0.10%  libc-2.17.so   [.] __strcmp_sse42
>      0.09%  [kernel]       [k] _raw_spin_lock
>      0.09%  postgres       [.] hash_search_with_hash_value
>
> is spin lock problem ?  I need everyone's help to solve the problem.thsnks!

Yup, spinlock problem.   These can be difficult.  What weneed now is
some metrics.   They must be captured during load event:

*) number of active queries with average duration:
SELECT query, count(*), avg(now() - query_start)  FROM
pg_stat_activity where state != 'idle' GROUP BY 1;

*) context switches via "vmstat 1"  get a snapshot during load and
during non load time for comparison

Solution to this problem will probably be one or more of:
1) Significantly downgrade shared_buffers (say, to 4GB)

2) Upgrade database to 9.4 and hope for the best

3) Capture backtrace during load event to determine exactly which path
is going into spinlock

4) Install pgbouncer or another connection pooler to limit active
queries on database

5) Install experimental patches to custom compiled database to test
and verify a hypothetical fix

Out of those 5 things, which are possible for you to do?  Best case
scenario is that you have non-essential server that reproduces the
issue.

merlin


Re: ??: postgres cpu 100% need help

От
Merlin Moncure
Дата:
On Tue, Nov 10, 2015 at 8:26 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sun, Nov 8, 2015 at 7:22 AM, 莎士比亚说: <657985552@qq.com> wrote:
>> Hi moran and others;
>>
>>     yesterday i get the pg  problem  again, and i use perf top Observation
>> follows:
>> PerfTop:   11574 irqs/sec  kernel: 2.2%  exact:  0.0% [4000Hz cycles],
>> (all, 32 CPUs)
>>     81.39%  postgres       [.] s_lock
>>      5.42%  postgres       [.] LWLockAcquire
>>      4.59%  postgres       [.] LWLockRelease
>>      3.06%  postgres       [.] TransactionIdIsInProgress
>>      0.38%  postgres       [.] PinBuffer
>>      0.31%  postgres       [.] TransactionIdPrecedes
>>      0.27%  postgres       [.] UnpinBuffer
>>      0.19%  postgres       [.] TransactionIdIsCurrentTransactionId
>>      0.16%  postgres       [.] heap_hot_search_buffer
>>      0.15%  [kernel]       [k] number.isra.1
>>      0.14%  [kernel]       [k] kallsyms_expand_symbol.constprop.1
>>      0.10%  [kernel]       [k] module_get_kallsym
>>      0.10%  libc-2.17.so   [.] __strcmp_sse42
>>      0.09%  [kernel]       [k] _raw_spin_lock
>>      0.09%  postgres       [.] hash_search_with_hash_value
>>
>> is spin lock problem ?  I need everyone's help to solve the problem.thsnks!
>
> Yup, spinlock problem.   These can be difficult.  What weneed now is
> some metrics.   They must be captured during load event:
>
> *) number of active queries with average duration:
> SELECT query, count(*), avg(now() - query_start)  FROM
> pg_stat_activity where state != 'idle' GROUP BY 1;
>
> *) context switches via "vmstat 1"  get a snapshot during load and
> during non load time for comparison
>
> Solution to this problem will probably be one or more of:
> 1) Significantly downgrade shared_buffers (say, to 4GB)
>
> 2) Upgrade database to 9.4 and hope for the best
>
> 3) Capture backtrace during load event to determine exactly which path
> is going into spinlock
>
> 4) Install pgbouncer or another connection pooler to limit active
> queries on database
>
> 5) Install experimental patches to custom compiled database to test
> and verify a hypothetical fix
>
> Out of those 5 things, which are possible for you to do?  Best case
> scenario is that you have non-essential server that reproduces the
> issue.

Can we also see output of:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag

merlin


Re: ??: postgres cpu 100% need help

От
"657985552@qq.com"
Дата:
hi  Merlin:
 I'm sorry to reply so late. I  don't know exactly the resault :
  > number of active queries with average duration:
  > SELECT query, count(*), avg(now() - query_start)  FROM
  > pg_stat_activity where state != 'idle' GROUP BY 1;
  there are avg 100 active connect during  load even.every query cost 2000 ms+.

 how can I  Capture backtrace during load even?
 
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled 
[always] madvise never
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag 
[always] madvise never


657985552@qq.com
 
Date: 2015-11-10 23:55
Subject: Re: [GENERAL] ??: postgres cpu 100% need help
On Tue, Nov 10, 2015 at 8:26 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sun, Nov 8, 2015 at 7:22 AM, 莎士比亚说: <657985552@qq.com> wrote:
>> Hi moran and others;
>>
>>     yesterday i get the pg  problem  again, and i use perf top Observation
>> follows:
>> PerfTop:   11574 irqs/sec  kernel: 2.2%  exact:  0.0% [4000Hz cycles],
>> (all, 32 CPUs)
>>     81.39%  postgres       [.] s_lock
>>      5.42%  postgres       [.] LWLockAcquire
>>      4.59%  postgres       [.] LWLockRelease
>>      3.06%  postgres       [.] TransactionIdIsInProgress
>>      0.38%  postgres       [.] PinBuffer
>>      0.31%  postgres       [.] TransactionIdPrecedes
>>      0.27%  postgres       [.] UnpinBuffer
>>      0.19%  postgres       [.] TransactionIdIsCurrentTransactionId
>>      0.16%  postgres       [.] heap_hot_search_buffer
>>      0.15%  [kernel]       [k] number.isra.1
>>      0.14%  [kernel]       [k] kallsyms_expand_symbol.constprop.1
>>      0.10%  [kernel]       [k] module_get_kallsym
>>      0.10%  libc-2.17.so   [.] __strcmp_sse42
>>      0.09%  [kernel]       [k] _raw_spin_lock
>>      0.09%  postgres       [.] hash_search_with_hash_value
>>
>> is spin lock problem ?  I need everyone's help to solve the problem.thsnks!
>
> Yup, spinlock problem.   These can be difficult.  What weneed now is
> some metrics.   They must be captured during load event:
>
> *) number of active queries with average duration:
> SELECT query, count(*), avg(now() - query_start)  FROM
> pg_stat_activity where state != 'idle' GROUP BY 1;
>
> *) context switches via "vmstat 1"  get a snapshot during load and
> during non load time for comparison
>
> Solution to this problem will probably be one or more of:
> 1) Significantly downgrade shared_buffers (say, to 4GB)
>
> 2) Upgrade database to 9.4 and hope for the best
>
> 3) Capture backtrace during load event to determine exactly which path
> is going into spinlock
>
> 4) Install pgbouncer or another connection pooler to limit active
> queries on database
>
> 5) Install experimental patches to custom compiled database to test
> and verify a hypothetical fix
>
> Out of those 5 things, which are possible for you to do?  Best case
> scenario is that you have non-essential server that reproduces the
> issue.
 
Can we also see output of:
 
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
 
merlin
 

Re: ??: postgres cpu 100% need help

От
Kevin Grittner
Дата:
On Mon, Nov 23, 2015 at 12:20 AM, 657985552@qq.com <657985552@qq.com> wrote:

> [root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> [root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag
> [always] madvise never

There's your problem.  You need to set those to "never".  You can
get immediate relief by echoing 'never' to those pseudo-files, but
you need to configure your system to set them at OS boot time, too.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ??: postgres cpu 100% need help

От
Merlin Moncure
Дата:
On Mon, Nov 23, 2015 at 9:30 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
> On Mon, Nov 23, 2015 at 12:20 AM, 657985552@qq.com <657985552@qq.com> wrote:
>
>> [root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
>> [always] madvise never
>> [root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag
>> [always] madvise never
>
> There's your problem.  You need to set those to "never".  You can
> get immediate relief by echoing 'never' to those pseudo-files, but
> you need to configure your system to set them at OS boot time, too.

yup.  IMNSHO automatic enabling of THP defrag is one of the worst
performance related mistakes I've ever seen in the history of the
linux kernel (although maybe the blame lies with the distros...I'm not
sure).  It caused a number of terrible outages here before we figured
out the underlying cause.  (of course, everyone blamed postgres until
then)

merlin


Re: ??: postgres cpu 100% need help

От
"657985552@qq.com"
Дата:
hi kevin:
thanks for you help . but the following config Indicate my OS THP is  Disable
[root@pg1 ~]# grep -i HugePages_Total /proc/meminfo 
HugePages_Total:       0
[root@pg1 ~]# cat /proc/sys/vm/nr_hugepages
0


657985552@qq.com
 
Date: 2015-11-23 23:30
Subject: Re: [GENERAL] ??: postgres cpu 100% need help
On Mon, Nov 23, 2015 at 12:20 AM, 657985552@qq.com <657985552@qq.com> wrote:
 
> [root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> [root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag
> [always] madvise never
 
There's your problem.  You need to set those to "never".  You can
get immediate relief by echoing 'never' to those pseudo-files, but
you need to configure your system to set them at OS boot time, too.
 
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
 

Re: ??: postgres cpu 100% need help

От
Tom Lane
Дата:
"657985552@qq.com" <657985552@qq.com> writes:
> thanks for you help . but the following config Indicate my OS THP is  Disable
> [root@pg1 ~]# grep -i HugePages_Total /proc/meminfo
> HugePages_Total:       0
> [root@pg1 ~]# cat /proc/sys/vm/nr_hugepages
> 0

Those values reflect whether *explicit* huge pages are enabled.  They do
not control *transparent* huge pages.

Red Hat's version of all this is documented here:
https://access.redhat.com/solutions/46111

            regards, tom lane


Re: ??: postgres cpu 100% need help

От
"657985552@qq.com"
Дата:
oh .thanks i understand . but i still have a question .
[root@pg1 pgdata]# uname -a
Linux pg1 3.10.0-123.9.3.el7.x86_64 #1 SMP Thu Nov 6 15:06:03 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux   
[root@pg1 pgdata]# cat /etc/redhat-release 
CentOS Linux release 7.0.1406 (Core) 

my os is centos7 .  is there  THP problem in it ?


657985552@qq.com
 
From: Tom Lane
Date: 2015-11-25 10:21
Subject: Re: [GENERAL] ??: postgres cpu 100% need help
"657985552@qq.com" <657985552@qq.com> writes:
> thanks for you help . but the following config Indicate my OS THP is  Disable
> [root@pg1 ~]# grep -i HugePages_Total /proc/meminfo
> HugePages_Total:       0
> [root@pg1 ~]# cat /proc/sys/vm/nr_hugepages
> 0
 
Those values reflect whether *explicit* huge pages are enabled.  They do
not control *transparent* huge pages.
 
Red Hat's version of all this is documented here:
https://access.redhat.com/solutions/46111
 
regards, tom lane

Re: ??: postgres cpu 100% need help

От
Merlin Moncure
Дата:
On Tue, Nov 24, 2015 at 9:57 PM, 657985552@qq.com <657985552@qq.com> wrote:
> oh .thanks i understand . but i still have a question .
> [root@pg1 pgdata]# uname -a
> Linux pg1 3.10.0-123.9.3.el7.x86_64 #1 SMP Thu Nov 6 15:06:03 UTC 2014
> x86_64 x86_64 x86_64 GNU/Linux
> [root@pg1 pgdata]# cat /etc/redhat-release
> CentOS Linux release 7.0.1406 (Core)
>
> my os is centos7 .  is there  THP problem in it ?

Yes. The settings posted above (*/transparent_hugepage/*) are the smoking gun.

I've had the exact same problem as you; suddenly the database slows
down to zero and just as suddenly goes back to normal.  What is
happening here is that the operating system put in some
"optimizations" to help systems manage large amounts of memory
(typical server memory configurations have gone up by several orders
of magnitude since the 4kb page size was chosen).  These optimizations
do not play well with postgres memory access patterns; the operating
system is forced to defragment memory at random intervals which slows
down memory accesss causing spinlock problems.  Basically postgres and
the kernel get into a very high speed argument over memory access.

Lowering shared buffers to around 2GB also provides relief.  This
suggests that clock sweep is a contributor to the problem, in
particular it's maintenance of usage_count (the maintenance of which
IIRC is changing soon to pure atomic update) would be a place to start
sniffing around if we wanted to Really Fix It.   So far though, no one
has been able to reproduce this issue in a non production system.

I guess if we were using (non portable) futexes instead of hand
written spinlocks we'd probably have less problems in this area.
Nevertheless given the huge performance risks I really wonder what
RedHat was thinking when they enabled it by default.

merlin