Server stalls, all CPU 100% system time

Hi,
Since our upgrade of hardware, OS and Postgres we experience server stalls under certain conditions. During a stall
(up to 2 minutes) all CPUs show 100% system time, and all Postgres processes show BIND in top.
Normally the server has a load of < 0.5 (12 cores) with up to 30 connections and 200-400 tps.

Here is top -H during the stall:
Threads: 279 total,  25 running, 254 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us, 99.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

This is under normal circumstances:
Threads: 274 total,   1 running, 273 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.2 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
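
To catch a stall automatically, the two summary lines above could be compared with a small script. This is only a minimal sketch: the function names and the 90% threshold are my own assumptions, not part of any existing tool.

```python
import re

# Matches "<number> <two-letter label>" pairs in top's "%Cpu(s):" summary line.
_FIELD = re.compile(r"([\d.]+)\s+(us|sy|ni|id|wa|hi|si|st)\b")

def parse_cpu_line(line):
    """Return a dict such as {'us': 0.2, 'sy': 99.8, ...} from a %Cpu(s) line."""
    return {label: float(value) for value, label in _FIELD.findall(line)}

def looks_stalled(line, sy_threshold=90.0):
    """Hypothetical heuristic: system time at or above the threshold counts as a stall."""
    return parse_cpu_line(line).get("sy", 0.0) >= sy_threshold

stall = "%Cpu(s):  0.2 us, 99.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st"
normal = "%Cpu(s):  0.2 us,  0.2 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st"
print(looks_stalled(stall), looks_stalled(normal))  # True False
```

Feeding it the output of top in batch mode would allow timestamping each stall, which might help correlate stalls with the slow BIND entries in the log.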

iostat shows under 0.3% load on the drives.

The stalls are mostly reproducible when the server is under its normal load and 20-40 new processes then start
executing SQL statements.
Deactivating HT (Hyper-Threading) seems to have reduced the frequency and length of the stalls.

The log shows entries for slow BINDs (8 seconds):
... LOG:  duration: 8452.654 ms  bind pdo_stmt_00000001: SELECT [20 columns selected] FROM users WHERE users.USERID=$1 LIMIT 1
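
Entries like this could be pulled out of the log with a short filter. This is a sketch only, assuming the "duration: ... ms  bind <name>:" layout shown above; the function name and the 1000 ms cutoff are hypothetical, and whatever precedes "LOG:" depends on log_line_prefix.

```python
import re

# Matches the "duration: <ms> ms  bind <statement name>:" part of a log line.
_BIND = re.compile(r"duration:\s+([\d.]+)\s+ms\s+bind\s+(\S+):")

def slow_binds(lines, min_ms=1000.0):
    """Yield (statement_name, duration_ms) for bind entries at or above min_ms."""
    for line in lines:
        match = _BIND.search(line)
        if match and float(match.group(1)) >= min_ms:
            yield match.group(2), float(match.group(1))

sample = [
    "... LOG:  duration: 8452.654 ms  bind pdo_stmt_00000001: "
    "SELECT [20 columns selected] FROM users WHERE users.USERID=$1 LIMIT 1",
]
print(list(slow_binds(sample)))  # [('pdo_stmt_00000001', 8452.654)]
```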

I have tried to create a test case, but even starting 200 client processes that execute prepared statements does not
reproduce this behaviour on a nearly idle server; it only stalls under normal workload.
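
For reference, the shape of such a burst test can be sketched as below. This is not the actual test case: it uses threads instead of client processes, and run_query is a hypothetical callable that would have to execute the prepared statement through a real database driver. Here it is replaced by a stub so the sketch runs on its own.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def burst(run_query, clients=30, queries_per_client=10):
    """Start `clients` workers at the same moment; return all latencies in seconds."""
    barrier = threading.Barrier(clients)

    def worker(_):
        barrier.wait()  # release every worker together to mimic the burst
        latencies = []
        for _ in range(queries_per_client):
            start = time.perf_counter()
            run_query()
            latencies.append(time.perf_counter() - start)
        return latencies

    with ThreadPoolExecutor(max_workers=clients) as pool:
        results = pool.map(worker, range(clients))
        return [t for per_client in results for t in per_client]

def fake_query():
    """Stub standing in for a real prepared-statement execution (assumption)."""
    time.sleep(0.001)

latencies = burst(fake_query, clients=5, queries_per_client=3)
print(len(latencies), max(latencies))
```

Running the burst while the normal workload is active, rather than on an idle server, would be closer to the conditions under which the stalls appear.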

Hardware details:
2x Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
64 GB RAM

Postgres version: 9.2.2 and 9.2.3

Linux: OpenSUSE 12.2 with Kernel 3.4.6

Postgres config:
max_connections = 200
effective_io_concurrency = 3
max_wal_senders = 2
wal_keep_segments = 2048
max_locks_per_transaction = 500
default_statistics_target = 100
checkpoint_completion_target = 0.9
maintenance_work_mem = 1GB
effective_cache_size = 60GB
work_mem = 384MB
wal_buffers = 8MB
checkpoint_segments = 64
shared_buffers = 15GB


This might be related to this topic:
http://www.postgresql.org/message-id/CANQNgOquOGH7AkqW6ObPafrgxv=J3WsiZg-NgVvbki-qYpoY7Q@mail.gmail.com
(Poor performance after update from SLES11 SP1 to SP2)
I believe the old server was OpenSUSE 11.x.


Thanks for any hint on how to fix this or diagnose the problem.

