Problems with pg_locks explosion

Поиск

Список

Период

Сортировка

От	Armand du Plessis
Тема	Problems with pg_locks explosion
Дата	2 апреля 2013 г. 01:36:02
Msg-id	CANf99sX6dK9cG9PDhXARq70QybO3Tn7EkRwt_YPxp8Lo-AFhmQ@mail.gmail.com обсуждение исходный текст
Ответы	Fwd: Problems with pg_locks explosion Re: Problems with pg_locks explosion Re: Problems with pg_locks explosion
Список	pgsql-performance

Дерево обсуждения

[Apologies, I first sent this to the incorrect list, postgres-admin, in the event you receive it twice]

Hi there,

I'm hoping someone on the list can shed some light on an issue I'm having with our Postgresql cluster. I'm literally tearing out my hair and don't have a deep enough understanding of Postgres to find the problem.

What's happening is I had severe disk/io issues on our original Postgres cluster (9.0.8) and switched to a new instance with a RAID-0 volume array. The machine's CPU usage would hover around 30% and our database would run lightning fast with pg_locks hovering between 100-200.

Within a few seconds something would trigger a massive increase in pg_locks so that it suddenly shoots up to 4000-8000. At this point everything dies. Queries that usually take a few milliseconds takes minutes and everything is unresponsive until I restart postgres.

The instance still idles at this point. The only clue I could find was that it usually starts a few minutes after the checkpoint entries appear in my logs.

Any suggestions would really be appreciated. It's killing our business at the moment. I can supply more info if required but pasted what I thought would be useful below. Not sure what else to change in the settings.

Kind regards,

Armand

It's on Amazon EC2 -

* cc2.8xlarge instance type

* 6 volumes in RAID-0 configuration. (1000 PIOPS)

60.5 GiB of memory
88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core)
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
EBS-Optimized Available: No**

API name: cc2.8xlarge

postgresql.conf

fsync = off

full_page_writes = off

default_statistics_target = 100

maintenance_work_mem = 1GB

checkpoint_completion_target = 0.9

effective_cache_size = 48GB

work_mem = 64MB

wal_buffers = -1

checkpoint_segments = 128

shared_buffers = 32GB

max_connections = 800

effective_io_concurrency = 3 # Down from 6

# - Background Writer -

#bgwriter_delay = 200ms # 10-10000ms between rounds

#bgwriter_lru_maxpages = 100 # 0-1000 max buffers written/round

#bgwriter_lru_multiplier = 2.0 # 0-10.0 multipler on buffers scanned/round

$ free

total used free shared buffers cached

Mem: 61368192 60988180 380012 0 784 44167172

-/+ buffers/cache: 16820224 44547968

Swap: 0 0 0

$ top -c

top - 21:55:51 up 12 days, 12:41, 4 users, load average: 6.03, 16.10, 24.15

top - 21:55:54 up 12 days, 12:41, 4 users, load average: 6.03, 15.94, 24.06

Tasks: 837 total, 6 running, 831 sleeping, 0 stopped, 0 zombie

Cpu(s): 15.7%us, 1.7%sy, 0.0%ni, 81.6%id, 0.3%wa, 0.0%hi, 0.6%si, 0.0%st

Mem: 61368192k total, 54820988k used, 6547204k free, 9032k buffer

[ec2-user@ip-10-155-231-112 ~]$ sudo iostat

Linux 3.2.39-6.88.amzn1.x86_64 () 04/01/2013 _x86_64_ (32 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle

21.00 0.00 1.10 0.26 0.00 77.63

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

xvda 0.21 5.00 2.22 5411830 2401368

xvdk 98.32 1774.67 969.86 1919359965 1048932113

xvdj 98.28 1773.68 969.14 1918288697 1048156776

xvdi 98.29 1773.69 969.61 1918300250 1048662470

xvdh 98.24 1773.92 967.54 1918544618 1046419936

xvdg 98.27 1774.15 968.85 1918790636 1047842846

xvdf 98.32 1775.56 968.69 1920316435 1047668172

md127 733.85 10645.68 5813.70 11513598393 6287682313

What bugs me on this is the throughput percentage on the volumes in Cloudwatch is 100% on all volumes.

The problems seem to overlap with checkpoints.

2013-04-01 21:31:35.592 UTC,,,26877,,5159fa5f.68fd,1,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""

2013-04-01 21:40:35.033 UTC,,,26877,,5159fa5f.68fd,2,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint complete: wrote 100635 buffers (2.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=539.439 s, sync=0.000 s, total=539.441 s; sync files=0, longest=0.000 s, average=0.000 s",,,,,,,,,""

2013-04-01 21:41:35.093 UTC,,,26877,,5159fa5f.68fd,3,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""

В списке pgsql-performance по дате отправления:

Предыдущее

От: Merlin Moncure
Дата: 01 апреля 2013 г., 16:42:51
Сообщение: Re: Question about postmaster's CPU usage

Следующее

От: Vasilis Ventirozos
Дата: 02 апреля 2013 г., 01:56:49
Сообщение: Fwd: Problems with pg_locks explosion

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Problems with pg_locks explosion

Предыдущее

Следующее