Обсуждение: Weird spikes in delay for async streaming replication on 9.1

Поиск
Список
Период
Сортировка

Weird spikes in delay for async streaming replication on 9.1

От
"David F. Skoll"
Дата:
Hi,

I have a two-database cluster.  The machines are geographically
separated and the nature of my application is that many read-only
queries can tolerate being "behind the times" by a few seconds.  So
machines near the hot-standby connect to the hot-standby for these
delay-tolerant queries in order to reduce traffic over the relatively
slow link between geographical locations.

I have a monitoring script that tests the actual delay for a
transaction on the master to appear on the hot-standby.  Every few
minutes, my script runs an update on the master and then sits in a
loop checking how long it takes to appear on the hot-standby.  99% of
the time, it's less than a second.

But every once in a while, the time spikes dramatically, to hundreds
or thousands of seconds, and that's too long... the delay-tolerant
queries are not *that* delay-tolerant, so we switch to sending them
all to the master.

See the graph: http://ibin.co/1rdm4ekiWmpM

I've tried to figure out what causes this, and the only events I can
find that correlate are a pg_dump on the master and possibly some
autovacuum jobs kicking off.  So my questions:

1) Can a long-running transaction on the master block subsequent
transactions from being consumed on the hot-standby, or am I totally
out to lunch?

2) If (1) is correct, is it still true in 9.4?

3) If (1) is false, does anyone have plausibly explanations for what
I'm seeing?  I don't think it's the link between the sites, because we
also monitor that and it seems to be fine.

Regards,

David.



Re: Weird spikes in delay for async streaming replication on 9.1

От
"David F. Skoll"
Дата:
Hi,

I posted this a couple of weeks ago and no response... I guess it's
quite mysterious.

Anyway, behavior is consistent.  It seems that long-running
transactions on the primary (even read-only ones, because in my case
it's a pg_dump) can block subsequent transactions from appearing on
the hot-standby until the first transaction finishes.  Is this a known
limitation?  I run 9.1; is it still present in 9.4?

Regards,

David.

========================================================================
[...]

> I have a monitoring script that tests the actual delay for a
> transaction on the master to appear on the hot-standby.  Every few
> minutes, my script runs an update on the master and then sits in a
> loop checking how long it takes to appear on the hot-standby.  99% of
> the time, it's less than a second.

> But every once in a while, the time spikes dramatically, to hundreds
> or thousands of seconds, and that's too long... the delay-tolerant
> queries are not *that* delay-tolerant, so we switch to sending them
> all to the master.

> See the graph: http://ibin.co/1rdm4ekiWmpM

[...]


Re: Weird spikes in delay for async streaming replication on 9.1

От
John Scalia
Дата:
On 2/26/2015 9:44 AM, David F. Skoll wrote:
> Hi,
>
> I posted this a couple of weeks ago and no response... I guess it's
> quite mysterious.
>
> Anyway, behavior is consistent.  It seems that long-running
> transactions on the primary (even read-only ones, because in my case
> it's a pg_dump) can block subsequent transactions from appearing on
> the hot-standby until the first transaction finishes.  Is this a known
> limitation?  I run 9.1; is it still present in 9.4?
>
> Regards,
>
> David.
>
> ========================================================================
> [...]
>
>> I have a monitoring script that tests the actual delay for a
>> transaction on the master to appear on the hot-standby.  Every few
>> minutes, my script runs an update on the master and then sits in a
>> loop checking how long it takes to appear on the hot-standby.  99% of
>> the time, it's less than a second.
>> But every once in a while, the time spikes dramatically, to hundreds
>> or thousands of seconds, and that's too long... the delay-tolerant
>> queries are not *that* delay-tolerant, so we switch to sending them
>> all to the master.
>> See the graph: http://ibin.co/1rdm4ekiWmpM
> [...]

Apologies, David, but as I don't check this list very often, I thought you should have received an answer. Have you
triedwatching the transaction with tcpdump or wireshark? To me  
this sounds like either a network problem or like you're thinking a contention problem. Also, are you running streaming
replication?If so, there shouldn't be ANY delay as the  
standby gets the commit before the primary. Otherwise, how are you replicating?
--
Jay


Re: Weird spikes in delay for async streaming replication on 9.1

От
"David F. Skoll"
Дата:
On Thu, 26 Feb 2015 10:40:31 -0500
John Scalia <jayknowsunix@gmail.com> wrote:

> Apologies, David, but as I don't check this list very often, I
> thought you should have received an answer. Have you tried watching
> the transaction with tcpdump or wireshark?

Yes.  Haven't observed anything unusual.

> To me this sounds like either a network problem or like you're
> thinking a contention problem.

It is not a network problem; we monitor the link between sites.  And it
happens absolutely regularly, like clockwork, when we run our nightly
pg_dump on the master.

> Also, are you running streaming replication? If so, there
> shouldn't be ANY delay as the standby gets the commit before the
> primary. Otherwise, how are you replicating? -- Jay

We are running asynchronous streaming replication.

Regards,

David.


Re: Weird spikes in delay for async streaming replication on 9.1

От
John Scalia
Дата:
On 2/26/2015 10:44 AM, David F. Skoll wrote:
> On Thu, 26 Feb 2015 10:40:31 -0500
> John Scalia <jayknowsunix@gmail.com> wrote:
>
>> Apologies, David, but as I don't check this list very often, I
>> thought you should have received an answer. Have you tried watching
>> the transaction with tcpdump or wireshark?
> Yes.  Haven't observed anything unusual.
>
>> To me this sounds like either a network problem or like you're
>> thinking a contention problem.
> It is not a network problem; we monitor the link between sites.  And it
> happens absolutely regularly, like clockwork, when we run our nightly
> pg_dump on the master.
>
>> Also, are you running streaming replication? If so, there
>> shouldn't be ANY delay as the standby gets the commit before the
>> primary. Otherwise, how are you replicating? -- Jay
> We are running asynchronous streaming replication.
>
OK, if it's asynchronous, is your script checking that primary isn't holding up closing out and transmitting the latest
WALsegment during these times? And if the Standby really  
needs to be up to date, why not try synchronous replication?


Re: Weird spikes in delay for async streaming replication on 9.1

От
"David F. Skoll"
Дата:
On Thu, 26 Feb 2015 10:51:55 -0500
John Scalia <jayknowsunix@gmail.com> wrote:

> OK, if it's asynchronous, is your script checking that primary isn't
> holding up closing out and transmitting the latest WAL segment during
> these times?

No.  The checking script is dead simple.  It's a perl script that does
more-or-less the equivalent of this.  $master_dbi is a DBI handle
pointing to the master database server and $standby_dbi is one pointing
to the standby.  Pseudocode is shown below (the real code actually measures
the duration in much finer increments than 1s, but you get the idea...)

   my $time = time();

   # There's one row in the table with key column 'Timestamp'
   $master_dbi->do("UPDATE table SET value = ? WHERE key = 'Timestamp'", undef, $time);

   my $start = time();
   while(..) {
       $row = $standby_dbi->fetchrow_arrayref("SELECT * FROM table WHERE key = 'Timestamp'");
       # Check that we got back the same $time we put in and if so,
       # break out of the while loop; otherwise pause for a bit.
   }
   # See how long that took
   my $duration = time() - $start;

> And if the Standby really needs to be up to date, why
> not try synchronous replication?

It doesn't need to be totally up-to-date.  It can lag by up to 30s without
impacting our application.  And we don't use synchronous replication because
then a network problem would stall all the write transactions on the primary
and that would be fatal.

Regards,

David.


Re: Weird spikes in delay for async streaming replication on 9.1

От
John Scalia
Дата:
On 2/26/2015 11:00 AM, David F. Skoll wrote:
> On Thu, 26 Feb 2015 10:51:55 -0500
> John Scalia <jayknowsunix@gmail.com> wrote:
>
>> OK, if it's asynchronous, is your script checking that primary isn't
>> holding up closing out and transmitting the latest WAL segment during
>> these times?
> No.  The checking script is dead simple.  It's a perl script that does
> more-or-less the equivalent of this.  $master_dbi is a DBI handle
> pointing to the master database server and $standby_dbi is one pointing
> to the standby.  Pseudocode is shown below (the real code actually measures
> the duration in much finer increments than 1s, but you get the idea...)
>
>     my $time = time();
>
>     # There's one row in the table with key column 'Timestamp'
>     $master_dbi->do("UPDATE table SET value = ? WHERE key = 'Timestamp'", undef, $time);
>
>     my $start = time();
>     while(..) {
>         $row = $standby_dbi->fetchrow_arrayref("SELECT * FROM table WHERE key = 'Timestamp'");
>         # Check that we got back the same $time we put in and if so,
>         # break out of the while loop; otherwise pause for a bit.
>     }
>     # See how long that took
>     my $duration = time() - $start;
>
>> And if the Standby really needs to be up to date, why
>> not try synchronous replication?
> It doesn't need to be totally up-to-date.  It can lag by up to 30s without
> impacting our application.  And we don't use synchronous replication because
> then a network problem would stall all the write transactions on the primary
> and that would be fatal.
Possibly, but that's why WE use a pair of standby servers, not a single one, so that all transactions get committed in
atimely manner. The odds of both standbys failing at the  
same time are really small. Maybe your script should check which is latest WAL segment on each system first? That might
showthat you have a timedelay with getting the info to the  
standby.


Re: Weird spikes in delay for async streaming replication on 9.1

От
"David F. Skoll"
Дата:
On Thu, 26 Feb 2015 11:06:50 -0500
John Scalia <jayknowsunix@gmail.com> wrote:

> Possibly, but that's why WE use a pair of standby servers, not a
> single one, so that all transactions get committed in a timely
> manner. The odds of both standbys failing at the same time are really
> small. Maybe your script should check which is latest WAL segment on
> each system first? That might show that you have a timedelay with
> getting the info to the standby.

That doesn't seem to be a problem.  When I run this:

     watch -n 0.1 'ps auxww|grep [w]al.*streaming'

on both primary and standby, the ID number after "streaming" increments
on both smoothly; there's no significant pause even when the delay starts
growing.

Regards,

David.