Spurious Stalls

From: Christopher Nielsen
Subject: Spurious Stalls
Msg-id: CAJ+wzrb1qhz3xuoeSy5mo8i=E-5OO9Yvm6R+VxLBGaPB=uevqA@mail.gmail.com
List: pgsql-general

Hi Group,

My team has been very happy using Postgres to host Bitbucket.  Thanks very much for all the community contributions to the platform.

Lately, though, we have been experiencing periods of stalling, about once a day for the past week.  When Postgres stalls, we have not been able to recover without restarting the database.

This brings our uptime down to 99.2%, which we'd like to avoid :(  We'd like to do a better job of keeping things running.

It would be great to get your input on this.  Alternatively, if someone is available as a consultant, that would be great too.

Here is some background about the issue.  During these stalls, we found the following symptoms.
  • Running queries do not return.
  • The application sometimes can no longer get new connections.
  • The CPU load increases.
  • There is no I/O wait.
  • There is no swapping.
Our database configuration is attached to this email as postgresql.conf for reference, along with a profile of our hardware and tuning as pg_db_profile.txt.

While the database was unavailable, we also collected a lot of data.  Looking through it, a few things stand out to us that may be problematic or otherwise worth noting.
  • Disk I/O appears to be all write, and little read.
  • In previous incidents with the same symptoms, we have seen pg processes spending much of their time in s_lock.
  • That data is also attached to this email, as files named perf_*.
Additionally, monitoring graphs show the following performance profile.

Problem

As you can probably see below, at 11:54 the DB stops returning rows.

Transactions also stop returning, so the active transaction time keeps climbing.

Consequences of Problem

Once transactions stop returning, we see connections pile up.  Eventually we reach the connection limit, and clients can no longer connect.
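For readers who want to watch for this kind of pile-up, here is a minimal sketch of a headroom check. It assumes a DB-API 2.0 cursor (for example from psycopg2; the post does not say which driver is in use) and a role that can read pg_stat_activity. The function name connection_headroom is hypothetical, and the sketch ignores superuser_reserved_connections.

```python
def connection_headroom(cursor):
    """Return (in_use, limit, fraction_used) for the connected server.

    `cursor` is any DB-API 2.0 cursor; the driver is an assumption,
    not something stated in the original post.
    """
    # pg_stat_activity has one row per server process.
    cursor.execute("SELECT count(*) FROM pg_stat_activity")
    in_use = cursor.fetchone()[0]

    # SHOW returns the setting as text, so convert it to an integer.
    cursor.execute("SHOW max_connections")
    limit = int(cursor.fetchone()[0])

    return in_use, limit, in_use / limit
```

Polling this from a monitoring job and alerting well before the fraction approaches 1.0 would give some warning before clients start being refused.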

CPU utilization rises to nearly 100% in user space and stays there until the database is restarted.

Events Before Problem

This is likely the most useful part.  As the time approaches 11:54, there are periods of increased latency.  There is also a marked increase in write operations in general.
Lastly, about 10 minutes before the outage, Postgres writes temp files at a sustained 30 MB/s.
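To put that rate in perspective, here is a rough back-of-the-envelope calculation. It assumes the 30 MB/s rate holds for the full ten minutes leading up to the outage, which the post suggests but does not state exactly; the helper name temp_volume_mb is hypothetical.

```python
def temp_volume_mb(rate_mb_per_s, minutes):
    """Total temp-file volume in MB for a sustained write rate."""
    return rate_mb_per_s * minutes * 60

# Assumed inputs: 30 MB/s sustained for 10 minutes.
total_mb = temp_volume_mb(30, 10)
print(f"{total_mb} MB, about {total_mb / 1000:.0f} GB")
```

Under that assumption, roughly 18 GB of temp files would have been written before the stall.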


After investigating this, we found a query that was greatly exceeding work_mem.  We've since optimized it, and hopefully that will have a positive effect on the above.
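Queries that spill past work_mem write temporary files, and PostgreSQL can log each one when log_temp_files is set (0 logs every file; a positive value is a size threshold in kB). The sketch below shows one way to pick large temp files out of such a log. It is not the method used in this investigation; the function name and the sample lines are made up for illustration, and real log lines also carry whatever log_line_prefix is configured.

```python
import re

# Matches the message PostgreSQL emits when log_temp_files is enabled.
TEMP_FILE_RE = re.compile(r'temporary file: path "([^"]+)", size (\d+)')

def temp_files_over(log_lines, threshold_bytes):
    """Yield (path, size_bytes) for logged temp files above the threshold."""
    for line in log_lines:
        match = TEMP_FILE_RE.search(line)
        if match and int(match.group(2)) > threshold_bytes:
            yield match.group(1), int(match.group(2))

# Hypothetical sample lines, for illustration only.
sample = [
    'LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4242.0", size 1073741824',
    'STATEMENT:  SELECT ...',
    'LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4243.1", size 8192',
]
big = list(temp_files_over(sample, 64 * 1024 * 1024))
```

The STATEMENT line that follows each temp-file entry in the log identifies the query responsible, which is usually the quickest route to the statement that needs tuning.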

We may not know until the next issue happens, though.

With a problem like this, I am not sure how best to proceed.  I am really looking forward to hearing your thoughts and opinions, if you can share them.

Thanks very much,

-Chris
