Discussion: BUG #9342: CPU / Memory Run-away

BUG #9342: CPU / Memory Run-away

From

cnielsen@atlassian.com

Date:

February 25, 2014, 00:00:27

The following bug has been logged on the website:

Bug reference:      9342
Logged by:          Christopher Nielsen
Email address:      cnielsen@atlassian.com
PostgreSQL version: 9.2.6
Operating system:   CentOS 6.5
Description:


Hello,

Thanks very much for authoring such a full-featured and useful database.
It's been a rock-solid component of many applications I've worked with, and
I am very glad to see it grow.

There is an issue we've run into recently that could be a database bug, and
it would be great to get someone's input.

Over the last few months, and with increasing frequency, we have seen Postgres
run away with CPU and disk resources.

When this occurs, performance becomes degraded, and response time for all
queries increases.

This seems to happen, regardless of query-load, and often happens when
traffic to the application is very low.

Here is some more information that characterizes the issue we found.

Some symptoms include:

    * The machine load is very high (many times the number of CPUs
      available)
    * The CPU is 100% utilized, with 85% user and 15% sys
    * We do not see the increase in iowait that we might expect from a
      storage issue
    * All query results take significantly longer to be delivered
    * The issue can be worked around by restarting the database
        - This gives us downtime we would like to avoid
    * On strace, we often see workers stuck on semop syscalls

Our setup looks like this:

    System Properties:

        PG Version:     9.2.6
        OS and Version: CentOS 6.5 running 2.6.32-431.3.1
        CPU:            80 cores (Intel Xeon)
        Memory:         252 GB

    PostgreSQL Setup:

        * 1 master and 3 slaves streaming replication from the master
        * WALs are archived using rsync in the archive_command
        * WAL files on the master are stored on a separate partition
        * Both pgdata and pg_xlog partitions have CFQ IO scheduler set.
            - We plan to switch both to either deadline or noop,
              if that could help
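
On Linux, the scheduler currently in effect for a block device is the bracketed entry in /sys/block/<device>/queue/scheduler (for example "noop anticipatory deadline [cfq]"). A minimal Python sketch for checking it follows. The helper names and the "sda" device in the usage note are illustrative assumptions, not details from the report.

```python
import re


def active_scheduler(sysfs_line):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_line)
    return match.group(1) if match else None


def read_scheduler(device):
    """Read the active scheduler for a block device such as "sda" (hypothetical name)."""
    # Standard Linux sysfs location; the file lists every available
    # scheduler and wraps the active one in square brackets.
    with open(f"/sys/block/{device}/queue/scheduler") as fh:
        return active_scheduler(fh.read())
```

Calling read_scheduler("sda") on the database host would report which scheduler the pgdata or pg_xlog device is actually using, before and after any switch.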

We have core and stack traces available, and would love to collaborate with
anyone who has time to look into the issue with us, anytime.

Thanks very much again,

Chris

Re: BUG #9342: CPU / Memory Run-away

From

Heikki Linnakangas

Date:

February 25, 2014, 20:06:35

On 02/24/2014 10:47 PM, cnielsen@atlassian.com wrote:
> We have core and stack traces available, and would love to collaborate with
> anyone who has time to look into the issue with us, anytime.

Feel free to post the stack traces on this list, so that everyone can
have a look.

- Heikki

Re: BUG #9342: CPU / Memory Run-away

From

Peter Geoghegan

Date:

February 26, 2014, 00:49:10

On Mon, Feb 24, 2014 at 12:47 PM,  <cnielsen@atlassian.com> wrote:
> This seems to happen, regardless of query-load, and often happens when
> traffic to the application is very low.

Is there anything that these queries share in common? Is there a GIN
index involved?

Speaking of bugs, perhaps you could see to it that this one is fixed?  :-)

http://help.hipchat.com/forums/138883-suggestions-issues/suggestions/5163526-wrong-sort-order-c-locale-used-in-people-list

--
Peter Geoghegan

Re: BUG #9342: CPU / Memory Run-away

From

Christopher Nielsen

Date:

February 27, 2014, 03:10:49

Thank you, Heikki,

>> We have core and stack traces available, and would love to collaborate with
>> anyone who has time to look into the issue with us, anytime.
>
> Feel free to post the stack traces on this list, so that everyone can
> have a look.

Linked is a strace / status / syscall list of what our postgres processes
were doing, at the time of the run-away.

    Archive on Google Drive
    <https://drive.google.com/a/atlassian.com/file/d/0B9wdVVUs5uWVTVVSSXpZN0JxX3M/edit?usp=sharing>
    (Will bonk on "approve" on request)

Also included, is a copy of the postgresql.conf, for reference.

It seemed funny that a db-wide slow-down should happen, independent of
increased query / app load.

Thanks again for any insight, or opinions, into what could be going on,

-Chris

Re: BUG #9342: CPU / Memory Run-away

From

Jeff Janes

Date:

February 27, 2014, 03:33:28

On Wed, Feb 26, 2014 at 4:10 PM, Christopher Nielsen <cnielsen@atlassian.com> wrote:

> Thank you, Heikki,
>
>>> We have core and stack traces available, and would love to collaborate
>>> with anyone who has time to look into the issue with us, anytime.
>>
>> Feel free to post the stack traces on this list, so that everyone can
>> have a look.
>
> Linked is a strace / status / syscall list of what our postgres processes
> were doing, at the time of the run-away.
>
>     Archive on Google Drive
>     <https://drive.google.com/a/atlassian.com/file/d/0B9wdVVUs5uWVTVVSSXpZN0JxX3M/edit?usp=sharing>
>     (Will bonk on "approve" on request)
>
> Also included, is a copy of the postgresql.conf, for reference.
>
> It seemed funny that a db-wide slow-down should happen, independent of
> increased query / app load.
>
> Thanks again for any insight, or opinions, into what could be going on,

Hi Chris,

It is the very high max_connections that is allowing the problem to happen.
If you only have 80 cores, nothing good can happen by allowing 3000
connections.  Once the connections start spending most of their time
fighting each other over internal locks (spinlocks and lwlocks) rather than
doing what they came for, you will never recover, because queries continue
to come in faster than they can be handled.  So by allowing too many
connections, you allow a (probably) transient load spike to turn into a
permanent overload condition.

It looks like you need a connection pooler (or more than one) in front of
your database.

Cheers,

Jeff
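
The idea behind a connection pooler can be illustrated with a toy model: however many clients arrive, only a fixed number are allowed to do work at once, and the rest wait in line instead of contending for locks. The sketch below uses only the Python standard library and is not a real pooler. BoundedPool, simulate, the sleep standing in for a query, and the sizes used are all hypothetical choices made for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BoundedPool:
    """Toy stand-in for a pooler: at most `size` callers hold a slot at once."""

    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)
        self._lock = threading.Lock()
        self.active = 0  # callers currently holding a slot
        self.peak = 0    # highest value `active` has reached

    def run(self, work):
        # Block here (queue) until a slot is free, then do the work.
        with self._slots:
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return work()
            finally:
                with self._lock:
                    self.active -= 1


def simulate(clients, pool_size):
    """Run `clients` concurrent callers through a pool of `pool_size` slots."""
    pool = BoundedPool(pool_size)

    def query():
        time.sleep(0.001)  # placeholder for real database work
        return 1

    with ThreadPoolExecutor(max_workers=clients) as executor:
        results = list(executor.map(lambda _: pool.run(query), range(clients)))
    return pool.peak, sum(results)
```

With simulate(clients=64, pool_size=8), all 64 callers complete, but the recorded peak concurrency never exceeds 8. The same principle applies when a few thousand application connections are funneled into a pool sized closer to the number of cores.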
