Re: very high replay_lag on 3-node cluster

From	Tiemen Ruiten
Subject	Re: very high replay_lag on 3-node cluster
Date	22 July 2019, 13:58:47
Msg-id	CAEkBuzeDHX4PSb4SirzHhWhjsPeCuWdbOByvR=5UrK5Rx0ZxYA@mail.gmail.com
In reply to	Re: very high replay_lag on 3-node cluster ("Jehan-Guillaume (ioguix) de Rorthais" <ioguix@free.fr>)
Responses	Re: very high replay_lag on 3-node cluster ("Jehan-Guillaume (ioguix) de Rorthais" <ioguix@free.fr>)
List	pgsql-general

On Mon, Jul 22, 2019 at 11:28 AM Jehan-Guillaume (ioguix) de Rorthais <ioguix@free.fr> wrote:

> Hi,
>
> On Mon, 22 Jul 2019 11:05:57 +0200
> Tiemen Ruiten <t.ruiten@tech-lab.io> wrote:
> [...]
> > Now to my current issue: I took the advice to add more monitoring on
> > replay lag (using pg_last_xact_replay_timestamp) and things are not looking
> > good. Last night replication lagged by almost 6 hours on one of the
> > nodes[3], but eventually caught up. As you can see in that screenshot,
> > ph-sql-03 is consistently slower to replay than ph-sql-05 (ph-sql-04 is the
> > current master) and there happen to be different SSD's in ph-sql-03
> > (Crucial MX300 vs Crucial MX500 in the other two), which makes me think
> > this is IO related.
>
> Such a difference is quite surprising. Moreover, I suppose you have some
> caching in front of disks (either RAID or SAN?). Could you describe your disk
> stack with more details?
>
> Do you have any detailed metrics about disks and network IO to share?
>
> The network is the same for both nodes?

Hi!

All nodes have only local storage. The filesystem is ZFS, which has its own built-in caching (ARC/L2ARC & ZIL). There are no hardware caching devices present. The disks are connected through a Supermicro AOC-S3008L-L8E HBA (reported by lspci as LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)). This is the layout of the pool:

[root@ph-sql-04 ter]# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:07:51 with 0 errors on Tue Jul 16 03:17:52 2019
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
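
Since ZFS does its own caching here, one number that might be worth comparing across the three nodes is the ARC hit ratio. The following is only a rough sketch: it assumes OpenZFS on Linux, where ARC counters are exposed as "name type data" rows in /proc/spl/kstat/zfs/arcstats, and the function names are made up for illustration.

```python
import os

# Assumed location of the ARC counters (OpenZFS on Linux).
ARCSTATS_PATH = "/proc/spl/kstat/zfs/arcstats"


def parse_arcstats(text):
    """Parse kstat-style 'name type data' rows into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns with a numeric last column;
        # this skips the two header lines at the top of the file.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def arc_hit_ratio(stats):
    """Return hits / (hits + misses), or None if no lookups were counted."""
    hits = stats.get("hits", 0)
    misses = stats.get("misses", 0)
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__" and os.path.exists(ARCSTATS_PATH):
    with open(ARCSTATS_PATH) as fh:
        ratio = arc_hit_ratio(parse_arcstats(fh.read()))
    print("ARC hit ratio since boot:", ratio)
```

The counters are cumulative since boot, so a monitoring item would normally take the difference between two samples rather than use the raw ratio.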

I've attached a graph of network IO on all servers. The network config is identical for all three nodes: 2x bonded gigabit connection to the same stacked switch pair.

Currently I don't have many metrics on raw disk IO. I can add some monitoring items and report those tomorrow if that would be useful. I do have a lot of PostgreSQL-related metrics available (buffers written/read, checkpoints, rows fetched/updated/inserted, etc.).
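
As a possible starting point for those raw disk IO items, here is a minimal sketch that samples /proc/diskstats twice and derives per-device IOPS, average wait per request and utilisation, roughly what iostat -x reports. It assumes a Linux host and the standard /proc/diskstats column order. The function names and the one-second interval are illustrative choices. The device list is taken from the pool layout above.

```python
import os
import time

DISKSTATS_PATH = "/proc/diskstats"
POOL_DEVICES = ["sda", "sdb", "sdc", "sdd", "sde", "sdf", "sdg", "sdh"]


def parse_diskstats(text):
    """Map device name -> cumulative counters from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # not a full statistics row
        stats[f[2]] = {
            "reads": int(f[3]),      # reads completed
            "read_ms": int(f[6]),    # time spent reading (ms)
            "writes": int(f[7]),     # writes completed
            "write_ms": int(f[10]),  # time spent writing (ms)
            "io_ms": int(f[12]),     # time the device was busy (ms)
        }
    return stats


def disk_rates(before, after, interval_s):
    """Turn two cumulative samples into per-device rates."""
    rates = {}
    for dev, cur in after.items():
        prev = before.get(dev)
        if prev is None:
            continue  # device appeared between samples
        reads = cur["reads"] - prev["reads"]
        writes = cur["writes"] - prev["writes"]
        read_ms = cur["read_ms"] - prev["read_ms"]
        write_ms = cur["write_ms"] - prev["write_ms"]
        rates[dev] = {
            "r_iops": reads / interval_s,
            "w_iops": writes / interval_s,
            "r_await_ms": read_ms / reads if reads else 0.0,
            "w_await_ms": write_ms / writes if writes else 0.0,
            "util_pct": 100.0 * (cur["io_ms"] - prev["io_ms"]) / (interval_s * 1000.0),
        }
    return rates


def read_sample():
    with open(DISKSTATS_PATH) as fh:
        return parse_diskstats(fh.read())


if __name__ == "__main__" and os.path.exists(DISKSTATS_PATH):
    first = read_sample()
    time.sleep(1.0)
    second = read_sample()
    for dev, r in disk_rates(first, second, 1.0).items():
        if dev in POOL_DEVICES:
            print(dev, r)
```

Comparing w_await_ms and util_pct between ph-sql-03 and the other two nodes during a lag spike should show whether the MX300 disks are the ones falling behind.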

Attachments

network-pgdb2.PNG
