Re: [HACKERS] Measuring replay lag

From: Fujii Masao
Subject: Re: [HACKERS] Measuring replay lag
Message-ID: CAHGQGwF-wZfxrukRxta7SXG8uHfq0OB8BSdaxB-mpGbdNUJx+g@mail.gmail.com
In reply to: Re: [HACKERS] Measuring replay lag  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses: Re: [HACKERS] Measuring replay lag  (Thomas Munro <thomas.munro@enterprisedb.com>)
List: pgsql-hackers
On Thu, Dec 22, 2016 at 6:14 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Dec 22, 2016 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I agree that the capability to measure the remote_apply lag is very useful.
>> Also I want to measure the remote_write and remote_flush lags, for example,
>> in order to diagnose the cause of replication lag.
>
> Good idea.  I will think about how to make that work.  There was a
> proposal to make writing and flushing independent[1].  I'd like that
> to go in.  Then the write_lag and flush_lag could diverge
> significantly, and it would be nice to be able to see that effect as
> time (though you could already see it with LSN positions).
>
>> For that, what about maintaining the pairs of send-timestamp and LSN in
>> *sender side* instead of receiver side? That is, walsender adds the pairs
>> of send-timestamp and LSN into the buffer every sampling period.
>> Whenever walsender receives the write, flush and apply locations from
>> walreceiver, it calculates the write, flush and apply lags by comparing
>> the received and stored LSN and comparing the current timestamp and
>> stored send-timestamp.
>
> I thought about that too, but I couldn't figure out how to make the
> sampling work.  If the primary is choosing (LSN, time) pairs to store
> in a buffer, and the standby is sending replies at times of its
> choosing (when wal_receiver_status_interval has been exceeded), then
> you can't accurately measure anything.

Yeah, even if the primary stores (100, 2017-01-17 00:00:00) as an
(LSN, timestamp) pair, for example, the standby may not send back a reply for
LSN 100 itself. The primary may instead receive a reply for a larger LSN, such
as 200. So the lag measured on the primary side would not be very accurate.

But we can calculate the "sync rep" lag by comparing the stored timestamp of
LSN 100 with the time at which the reply for LSN 200 is received. In sync rep,
the transaction waiting for LSN 100 to be replicated is actually released only
after the reply for LSN 200 is received, so the lag calculated this way is
basically accurate as the sync rep lag.
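
The calculation described above could be sketched roughly as follows. This is
only an illustration, not code from any patch: the names LagTracker,
record_send and on_reply are hypothetical, LSNs are treated as plain integers,
and the choice to report the lag of the newest sample covered by a reply (and
to discard older ones) is an assumption.

```python
from collections import deque
from datetime import datetime, timedelta


class LagTracker:
    """Hypothetical sender-side buffer of (LSN, send timestamp) samples."""

    def __init__(self):
        # Samples are appended in increasing LSN order, oldest first.
        self.samples = deque()

    def record_send(self, lsn, sent_at):
        """Remember when WAL up to `lsn` was sent (one sample per period)."""
        self.samples.append((lsn, sent_at))

    def on_reply(self, reply_lsn, received_at):
        """Return the lag for the newest sample with LSN <= reply_lsn.

        The reply does not have to match a stored LSN exactly: a reply for
        LSN 200 also covers a sample stored for LSN 100. Returns None if the
        reply covers no stored sample.
        """
        sent_at = None
        while self.samples and self.samples[0][0] <= reply_lsn:
            _, sent_at = self.samples.popleft()
        if sent_at is None:
            return None
        return received_at - sent_at


tracker = LagTracker()
t0 = datetime(2017, 1, 17, 0, 0, 0)
tracker.record_send(100, t0)

# The standby replies for LSN 200 two seconds later, never for LSN 100 itself.
lag = tracker.on_reply(200, t0 + timedelta(seconds=2))
print(lag)
```

The same buffer could be consulted separately for the write, flush and apply
locations carried in each reply, with one read position per location.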

Therefore I still think that it's better to maintain the pairs of LSN
and timestamp on the *primary* side. Thoughts?

Regards,

-- 
Fujii Masao


