Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

From pinker
Subject Re: [GENERAL] core system is getting unresponsive because over 300 cpu load
Date
Msg-id 1507674532671-0.post@n3.nabble.com
In response to Re: [GENERAL] core system is getting unresponsive because over 300cpu load  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [GENERAL] core system is getting unresponsive because over 300cpu load  (John R Pierce <pierce@hogranch.com>)
Re: [GENERAL] core system is getting unresponsive because over 300cpu load  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: [GENERAL] core system is getting unresponsive because over 300cpu load  (Scott Marlowe <scott.marlowe@gmail.com>)
List pgsql-general
Tomas Vondra-4 wrote
> What is "CPU load"? Perhaps you mean "load average"?

Yes, I wasn't precise: I mean system CPU usage. It can be seen in this graph
from yesterday's failure (after 6 p.m.):
<http://www.postgresql-archive.org/file/t342733/cpu.png>
As one can see, the connection spikes follow the CPU spikes...
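
To make the distinction between "load average" and system CPU usage concrete,
here is a minimal sketch, assuming a Linux host. It splits CPU time into user,
system and iowait shares from two samples of the aggregate "cpu" line in
/proc/stat. The function names are illustrative and not taken from any
monitoring tool used in this thread.

```python
import time

# Counter names for the first eight fields of the aggregate "cpu" line.
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(CPU_FIELDS, (int(v) for v in fields[1:1 + len(CPU_FIELDS)])))


def cpu_shares(before, after):
    """Return the fraction of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in deltas}
    return {name: delta / total for name, delta in deltas.items()}


def sample_cpu_shares(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the shares (Linux only)."""
    with open(path) as f:
        first = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.readline())
    return cpu_shares(first, second)
```

Calling `sample_cpu_shares()` on the affected host and looking at the "system"
entry shows how much of the CPU time is spent in the kernel rather than in the
backends. Load average, by contrast, is read from /proc/loadavg and counts
runnable and uninterruptible tasks, not CPU time.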


Tomas Vondra-4 wrote
> Also, what are the basic system parameters (number of cores, RAM), it's
> difficult to help without knowing that.

I actually listed everything in the first post:
80 CPUs and 4 sockets
over 500GB RAM


Tomas Vondra-4 wrote
> Well, 3M transactions over ~2h period is just ~450tps, so nothing
> extreme. Not sure how large the transactions are, of course.

There is quite a lot going on. Most of the transactions are complicated stored procedures.
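
The throughput estimate quoted above is plain arithmetic. A quick sketch,
assuming exactly 3,000,000 transactions and exactly 2 hours (both are rounded
figures in the quote, so the result only needs to land in the same range as the
~450 tps mentioned):

```python
def transactions_per_second(transactions, seconds):
    """Average transaction rate over a measurement window."""
    if seconds <= 0:
        raise ValueError("window must be positive")
    return transactions / seconds


# Rounded inputs taken from the quoted estimate: 3M transactions, ~2h window.
rate = transactions_per_second(3_000_000, 2 * 60 * 60)
print(f"{rate:.0f} tps")
```

An average like this says nothing about how heavy each transaction is, which is
the point made in the reply.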


Tomas Vondra-4 wrote
> Something gets executed on the database. We have no idea what it is, but
> it should be in the system logs. And you should see the process in 'top'
> with large amounts of virtual memory ...

Yes, it would be much easier if it were just a single query at the top of
'top', but most of the CPU is eaten by the system itself and I'm not sure why.
Given the page table size and the anon pages, I suppose it is NUMA related.
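
For anyone who wants to check the same numbers, a minimal sketch, assuming a
Linux host, that pulls the page table size and anonymous pages out of
/proc/meminfo. PageTables, AnonPages and MemTotal are standard fields there;
the helper names and the idea of reporting page tables as a share of RAM are
illustrative choices, not something established in this thread.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its integer value (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key.strip()] = int(parts[0])
    return values


def page_table_report(meminfo):
    """Summarize page table and anonymous memory usage in GiB."""
    kb_per_gib = 1024 ** 2
    page_tables = meminfo.get("PageTables", 0)
    return {
        "page_tables_gib": page_tables / kb_per_gib,
        "anon_pages_gib": meminfo.get("AnonPages", 0) / kb_per_gib,
        "page_tables_share": page_tables / meminfo["MemTotal"],
    }


def read_report(path="/proc/meminfo"):
    """Read the live values on a Linux host."""
    with open(path) as f:
        return page_table_report(parse_meminfo(f.read()))
```

Comparing `read_report()` during a quiet period and during a spike would show
whether the page tables really grow along with the connection count.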



Tomas Vondra-4 wrote
> Another possibility is a run-away query that consumes a lot of work_mem.

That was exactly my first guess. work_mem is set to ~350MB and I see a lot of
stored procedures with unnecessary WITH clauses (i.e. materialization),
immediately followed by an IN query over their results (hash).
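
To see why a ~350MB work_mem matters here, recall that work_mem is a limit per
sort or hash operation, not per connection, so a single query can use it
several times. A rough sketch of the upper bound follows; the backend count and
the number of memory-hungry nodes per query are hypothetical inputs, not
figures from this system.

```python
def worst_case_work_mem_gib(work_mem_mb, active_backends, nodes_per_query):
    """Rough upper bound on memory claimed by sorts/hashes running at once.

    Assumes every active backend runs one query with `nodes_per_query`
    operations that each fill work_mem completely.
    """
    return work_mem_mb * active_backends * nodes_per_query / 1024


# 350 MB comes from the post; 200 backends and 2 nodes (a materialized CTE
# plus a hashed IN) are made-up values for illustration only.
print(f"{worst_case_work_mem_gib(350, 200, 2):.1f} GiB")
```

Real usage is normally far below this bound, but it shows how a burst of
connections running such procedures could claim a large part of the RAM.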



Tomas Vondra-4 wrote
> Measure cache hit ratio (see pg_stat_database.blks_hit and blks_read),
> and then you can decide.

Thank you for the tip. I usually measure this but hadn't here. The result is
0.992969610990056, so increasing it is rather pointless.
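
For reference, a minimal sketch of the usual calculation, assuming the ratio is
defined as blks_hit / (blks_hit + blks_read) over pg_stat_database. The post
does not say which exact formula produced the number above. The SQL is kept as
a string so it can be pasted into psql, and the Python helper does the same
arithmetic on two counters.

```python
# Cluster-wide ratio; nullif avoids division by zero on a fresh cluster.
CACHE_HIT_SQL = """
SELECT sum(blks_hit)::float8
       / nullif(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database;
"""


def cache_hit_ratio(blks_hit, blks_read):
    """Share of block requests served from shared buffers, or None if no data."""
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total
```

Note that blks_read counts blocks requested from the operating system, so some
of those reads may still be served from the OS page cache rather than disk.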


Tomas Vondra-4 wrote
> You may also make the bgwriter more aggressive - that won't really
> improve the hit ratio, it will only make enough room for the backends.

Yes, I probably will.


Tomas Vondra-4 wrote
> But I don't quite see how this could cause the severe problems you have,
> as I assume this is kinda regular behavior on that system. Hard to say
> without more data.

I can provide you with any data you need :)


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general





--
Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html


