Обсуждение: What causes streaming replication delay?

Поиск
Список
Период
Сортировка

What causes streaming replication delay?

От
Dianne Skoll
Дата:
Hi,

I have a problem that's driving me insane.  I have two PostgreSQL 9.1
installations (yes, I know... very old.)  One is configured as a master
and the other is a streaming-replication hot-standby that supports
read-only queries.

Every now and then, although the standby claims to be streaming transactions
if you look with "ps", it seems to stall and the replication delay grows
by one second for each second of elapsed time.

If I restart the hot-standby, it consumes WAL files and very quickly
catches up.  And then a few minutes later, it happens again.

So, what are the possible causes of this behaviour?  There are no
long transactions on the standby, nor on the master.  And how can
I find out exactly what query is causing the problem, assuming
it's a query?

The relevant settings in the hot-standby postgresql.conf are:

hot_standby = on
max_standby_archive_delay = 10s
max_standby_streaming_delay = 10s
hot_standby_feedback = on

and on the master:

wal_level = hot_standby
archive_mode = on
archive_command = (a command to send WAL files to the standby)
max_wal_senders = 8

and all other parameters are at their defaults.

Regards,

Dianne.


Re: What causes streaming replication delay?

От
Keith
Дата:


On Wed, Aug 31, 2016 at 11:44 AM, Dianne Skoll <dfs@roaringpenguin.com> wrote:
Hi,

I have a problem that's driving me insane.  I have two PostgreSQL 9.1
installations (yes, I know... very old.)  One is configured as a master
and the other is a streaming-replication hot-standby that supports
read-only queries.

Every now and then, although the standby claims to be streaming transactions
if you look with "ps", it seems to stall and the replication delay grows
by one second for each second of elapsed time.

If I restart the hot-standby, it consumes WAL files and very quickly
catches up.  And then a few minutes later, it happens again.

So, what are the possible causes of this behaviour?  There are no
long transactions on the standby, nor on the master.  And how can
I find out exactly what query is causing the problem, assuming
it's a query?

The relevant settings in the hot-standby postgresql.conf are:

hot_standby = on
max_standby_archive_delay = 10s
max_standby_streaming_delay = 10s
hot_standby_feedback = on

and on the master:

wal_level = hot_standby
archive_mode = on
archive_command = (a command to send WAL files to the standby)
max_wal_senders = 8

and all other parameters are at their defaults.

Regards,

Dianne.


--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


All depends on how you're monitoring your slave lag. Check this blog post and if you're seeing actual byte lag, then you could have cause for concern.

https://www.keithf4.com/monitoring_streaming_slave_lag/

Keith

Re: What causes streaming replication delay?

От
Dianne Skoll
Дата:
On Wed, 31 Aug 2016 12:04:23 -0400
Keith <keith@keithf4.com> wrote:

> All depends on how you're monitoring your slave lag. Check this blog
> post and if you're seeing actual byte lag, then you could have cause
> for concern.

It's not because the master is idle... there is tons of write traffic.

And the way I measure the lag is about as directly as you can: I
periodically insert a timestamp into a specific table on the master,
and then poll the hot-standby until the inserted timestamp has
appeared.

Regards,

Dianne.


Re: What causes streaming replication delay? [SOLVED, I think]

От
Dianne Skoll
Дата:
Hi,

I think I solved it.  This is so ridiculous!

The primary and standby databases are 200km apart and the path between them
has a bandwidth of only 100 Mb/s.  We had bursts of writes that created
huge amounts of replication traffic.  In addition to streaming
replication, we're still copying WAL files over for safety (and also
because this is 9.1...)

We were hitting our bandwidth limit.

By adding the --bwlimit parameter to the rsync command that copies the WAL
files, we can limit the bandwidth usage during the bursts while still
keeping it high enough for the WAL-file archiving to keep up on average.

Keeping an eye on it for 24h before I completely declare it solved.

Regards,

Dianne.


Re: What causes streaming replication delay? [SOLVED, I think]

От
Scott Whitney
Дата:
Also, for the stupid-system-impaired-metric-didn't-put-us-on-the-moon, 200km is like 120 miles.

Or, for the LOCAL network administrator something like 650k feet.

So, we're not talking a direct cable...

;)


-------- Original message --------
From: Dianne Skoll <dfs@roaringpenguin.com>
Date: 08/31/2016 3:24 PM (GMT-06:00)
To: pgsql-admin <pgsql-admin@postgresql.org>
Subject: Re: [ADMIN] What causes streaming replication delay? [SOLVED, I think]

Hi,

I think I solved it.  This is so ridiculous!

The primary and standby databases are 200km apart and the path between them
has a bandwidth of only 100 Mb/s.  We had bursts of writes that created
huge amounts of replication traffic.  In addition to streaming
replication, we're still copying WAL files over for safety (and also
because this is 9.1...)

We were hitting our bandwidth limit.

By adding the --bwlimit parameter to the rsync command that copies the WAL
files, we can limit the bandwidth usage during the bursts while still
keeping it high enough for the WAL-file archiving to keep up on average.

Keeping an eye on it for 24h before I completely declare it solved.

Regards,

Dianne.


--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


Journyx, Inc.
7600 Burnet Road #300
Austin, TX 78757
www.journyx.com

p 512.834.8888 
f 512-834-8858 

Do you receive our promotional emails? You can subscribe or unsubscribe to those emails at http://go.journyx.com/emailPreference/e/4932/714/ 

Re: What causes streaming replication delay? [SOLVED, I think]

От
Eduardo Morras
Дата:
On Wed, 31 Aug 2016 16:22:22 -0400
Dianne Skoll <dfs@roaringpenguin.com> wrote:

> Hi,
>
> I think I solved it.  This is so ridiculous!
>
> The primary and standby databases are 200km apart and the path
> between them has a bandwidth of only 100 Mb/s.  We had bursts of
> writes that created huge amounts of replication traffic.  In addition
> to streaming replication, we're still copying WAL files over for
> safety (and also because this is 9.1...)
>
> We were hitting our bandwidth limit.
>
> By adding the --bwlimit parameter to the rsync command that copies
> the WAL files, we can limit the bandwidth usage during the bursts
> while still keeping it high enough for the WAL-file archiving to keep
> up on average.
>
> Keeping an eye on it for 24h before I completely declare it solved.

You were bitten by the RFC 1925 points 1, 2 and 9. A solution as is point 3 is "a bad idea".

https://tools.ietf.org/html/rfc1925

> Regards,
>
> Dianne.
>
>
> --
> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin


---   ---
Eduardo Morras <emorrasg@yahoo.es>