Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

From: Merlin Moncure
Subject: Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers
Msg-id: CAHyXU0yQF-nthU_pEUDig+jtjOBNg+QegNoBHyes0n_xQ2r3bA@mail.gmail.com
In reply to: Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers  (Josh Berkus <josh@agliodbs.com>)
List: pgsql-hackers
On Fri, Sep 13, 2013 at 3:20 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 09/13/2013 12:55 PM, Merlin Moncure wrote:
>> what are the specific symptoms of your problem?  anything interesting
>> in pg_locks?  is $client willing to experiment with custom patches?
>
> 3 servers: 1 master, two replicas.
> 32-core Xeon, hyperthreaded to 64 cores
> 512GB RAM each
> s_b set to 8GB
> Load-balanced between all 3
> ~~ 11 different databases
> combined database size around 600GB
> using pgbouncer
>
> Irregularly, during periods of high activity (although not necessarily
> peak activity) one or another of the systems will go into paralysis,
> with all backends apparently waiting on LWLocks (we need more tracing
> information to completely confirm this).  Activity at this time is
> usually somewhere between 50 and 100 concurrent queries (and 80 to 150
> connections).  pg_locks doesn't think anything is waiting on a lock.
>
> What's notable is that sometimes it's just *one* of the replicas which
> goes into paralysis.  If the master gets this issue though, the replicas
> experience it soon afterwards.  Increasing wal_buffers from 16MB to 64MB
> seems to make this issue happen less frequently, but it doesn't go away
> entirely.  Only a restart of the server, or killing all backends, ends
> the lockup.
>
> The workload is OLTP, essentially, around 20/80 write/read.  They use
> PostGIS.  The other notable thing about their workload is that due to an
> ORM defect, they get idle-in-transactions which last from 5 to 15
> seconds several times a minute.
>
> They are willing to use experimental patches, but only if those patches
> can be applied only to a replica.
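
Re the "pg_locks doesn't think anything is waiting on a lock" observation
above: a quick sanity check during one of these stalls is a query along
these lines (just a sketch; LWLock waits never show up in pg_locks, so on
a 9.x server actually confirming LWLock contention still means perf, gdb
backtraces, or a build with LWLOCK_STATS):

  SELECT a.pid, l.locktype, l.relation::regclass AS rel, l.mode, a.query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE NOT l.granted;  -- heavyweight lock waits only; LWLocks never appear here

If that comes back empty while everything is stalled, it's consistent with
the LWLock theory.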

ok, points similar:
*) master/slave config (two slaves for me)
*) 'big' server: 256GB mem, 32 cores
*) approx. 80% reads (perhaps more)
*) some spatial searching (but not very much)
*) OLTP
*) same presentation under load, although in my case it did resolve on its
own, anywhere from 30 secs to half an hour
*) aside from the spike, 100% healthy

points different:
*) application-side pooling: 96 app servers, max 5 connections each
(aside: are you using transaction mode pgbouncer?)
*) I saw gripes about relation extension locks in pg_locks
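Those extension-lock gripes show up in pg_locks with locktype = 'extend';
something like this (just a sketch) makes them easy to spot when the spike
hits:

  SELECT l.relation::regclass AS rel, count(*) AS waiters
  FROM pg_locks l
  WHERE l.locktype = 'extend' AND NOT l.granted
  GROUP BY 1
  ORDER BY 2 DESC;  -- relations with backends queued on their extension lock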

merlin


