Обсуждение: Replication lag from transaction logs

Поиск
Список
Период
Сортировка

Replication lag from transaction logs

От
Debraj Manna
Дата:
Is it possible to figure out the replication lag from transaction log files in the slave and master without querying postgres on the slave? Basically figuring out the replication lag in a slave when the postgres is down in that slave.

Re: Replication lag from transaction logs

От
Fabio Pardi
Дата:

That I know, if standby is down, you cannot know exactly the last applied transaction log file.

But you could get a gross estimate if you check how many WAL files is the standby behind, just listing the WAL files on the master and on the standby.

Given that you can login on the standby machine.

regards,

fabio pardi


On 18/06/18 14:43, Debraj Manna wrote:
Is it possible to figure out the replication lag from transaction log files in the slave and master without querying postgres on the slave? Basically figuring out the replication lag in a slave when the postgres is down in that slave.

Re: Replication lag from transaction logs

От
Keith
Дата:
You can use pg_controldata on any data directory to get its rough position in WAL replay, even when the database isn't running


On Mon, Jun 18, 2018 at 8:56 AM, Fabio Pardi <f.pardi@portavita.eu> wrote:

That I know, if standby is down, you cannot know exactly the last applied transaction log file.

But you could get a gross estimate if you check how many WAL files is the standby behind, just listing the WAL files on the master and on the standby.

Given that you can login on the standby machine.

regards,

fabio pardi


On 18/06/18 14:43, Debraj Manna wrote:
Is it possible to figure out the replication lag from transaction log files in the slave and master without querying postgres on the slave? Basically figuring out the replication lag in a slave when the postgres is down in that slave.


Re: Replication lag from transaction logs

От
Debraj Manna
Дата:
Thanks Keith I am looking something like the solution you suggested.

Once I run pg_controldata on the slave data directory it will be give me the latest checkpoint like below. Can you let me know a bit more detail on how can I get the approximate lag in the slave in bytes from this?

Latest checkpoint location: 2F9/B38DE758




On Mon 18 Jun, 2018, 6:53 PM Keith, <keith@keithf4.com> wrote:
You can use pg_controldata on any data directory to get its rough position in WAL replay, even when the database isn't running


On Mon, Jun 18, 2018 at 8:56 AM, Fabio Pardi <f.pardi@portavita.eu> wrote:

That I know, if standby is down, you cannot know exactly the last applied transaction log file.

But you could get a gross estimate if you check how many WAL files is the standby behind, just listing the WAL files on the master and on the standby.

Given that you can login on the standby machine.

regards,

fabio pardi


On 18/06/18 14:43, Debraj Manna wrote:
Is it possible to figure out the replication lag from transaction log files in the slave and master without querying postgres on the slave? Basically figuring out the replication lag in a slave when the postgres is down in that slave.


Re: Replication lag from transaction logs

От
Keith
Дата:
The portion before the slash is the actual WAL file itself that it got to start replaying, usually the last few characters of the WAL file name. The portion after the slash is the position in the WAL file itself. Default size of each WAL is 16MB, so if that first portion doesn't match the primary, you can count back that many WAL files x16MB. If it's on the same WAL file, I'm not really sure how to get a more precise byte lag than that.

Keith


On Mon, Jun 18, 2018 at 9:59 AM, Debraj Manna <subharaj.manna@gmail.com> wrote:
Thanks Keith I am looking something like the solution you suggested.

Once I run pg_controldata on the slave data directory it will be give me the latest checkpoint like below. Can you let me know a bit more detail on how can I get the approximate lag in the slave in bytes from this?

Latest checkpoint location: 2F9/B38DE758




On Mon 18 Jun, 2018, 6:53 PM Keith, <keith@keithf4.com> wrote:
You can use pg_controldata on any data directory to get its rough position in WAL replay, even when the database isn't running


On Mon, Jun 18, 2018 at 8:56 AM, Fabio Pardi <f.pardi@portavita.eu> wrote:

That I know, if standby is down, you cannot know exactly the last applied transaction log file.

But you could get a gross estimate if you check how many WAL files is the standby behind, just listing the WAL files on the master and on the standby.

Given that you can login on the standby machine.

regards,

fabio pardi


On 18/06/18 14:43, Debraj Manna wrote:
Is it possible to figure out the replication lag from transaction log files in the slave and master without querying postgres on the slave? Basically figuring out the replication lag in a slave when the postgres is down in that slave.



Re: Replication lag from transaction logs

От
Debraj Manna
Дата:
Thanks Keith this is useful.

One more query if I need to know that if a have fallen too far behind and the WAL is not available. I guess I can do this. Let me know if my understanding is correct.
  • Run pg_controldata <DATA_DIR> on the slave node which has been down for long.
  • It will output the details about the WAL along with the WAL file name from where it will start the replication. (Field to look for in the output– 'Latest checkpoint's REDO WAL file')
  • Then check if the file mentioned in ` 'Latest checkpoint's REDO WAL file' is present in master `pg_wal` directory. If not then slave have fallen too far behind and will not be able to recover from WAL.  

On Mon, Jun 18, 2018 at 7:39 PM, Keith <keith@keithf4.com> wrote:
The portion before the slash is the actual WAL file itself that it got to start replaying, usually the last few characters of the WAL file name. The portion after the slash is the position in the WAL file itself. Default size of each WAL is 16MB, so if that first portion doesn't match the primary, you can count back that many WAL files x16MB. If it's on the same WAL file, I'm not really sure how to get a more precise byte lag than that.

Keith


On Mon, Jun 18, 2018 at 9:59 AM, Debraj Manna <subharaj.manna@gmail.com> wrote:
Thanks Keith I am looking something like the solution you suggested.

Once I run pg_controldata on the slave data directory it will be give me the latest checkpoint like below. Can you let me know a bit more detail on how can I get the approximate lag in the slave in bytes from this?

Latest checkpoint location: 2F9/B38DE758




On Mon 18 Jun, 2018, 6:53 PM Keith, <keith@keithf4.com> wrote:
You can use pg_controldata on any data directory to get its rough position in WAL replay, even when the database isn't running


On Mon, Jun 18, 2018 at 8:56 AM, Fabio Pardi <f.pardi@portavita.eu> wrote:

That I know, if standby is down, you cannot know exactly the last applied transaction log file.

But you could get a gross estimate if you check how many WAL files is the standby behind, just listing the WAL files on the master and on the standby.

Given that you can login on the standby machine.

regards,

fabio pardi


On 18/06/18 14:43, Debraj Manna wrote:
Is it possible to figure out the replication lag from transaction log files in the slave and master without querying postgres on the slave? Basically figuring out the replication lag in a slave when the postgres is down in that slave.




Re: Replication lag from transaction logs

От
Scott Ribe
Дата:
> On Jun 18, 2018, at 9:56 AM, Debraj Manna <subharaj.manna@gmail.com> wrote:
>
> Thanks Keith this is useful.
>
> One more query if I need to know that if a have fallen too far behind and the WAL is not available. I guess I can do
this.Let me know if my understanding is correct. 
>     • Run pg_controldata <DATA_DIR> on the slave node which has been down for long.
>     • It will output the details about the WAL along with the WAL file name from where it will start the replication.
(Fieldto look for in the output– 'Latest checkpoint's REDO WAL file') 
>     • Then check if the file mentioned in ` 'Latest checkpoint's REDO WAL file' is present in master `pg_wal`
directory.If not then slave have fallen too far behind and will not be able to recover from WAL.   

You could also try bringing the slave back up, and monitoring the log for the error about needed WAL file not being
available--thisavoids the race condition between checking that all WAL is available and restarting the slave. 

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/




Re: Replication lag from transaction logs

От
Keith
Дата:
Yes, in general using pg_basebackup is not ideal to monitor for replica lag. I was just providing a means to do so from an offline instance. For normal monitoring, if you want to monitor byte lag, you can query that from the primary system.  Or you can also check for lag from the replica as well.


Keith

On Mon, Jun 18, 2018 at 1:03 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:

> On Jun 18, 2018, at 9:56 AM, Debraj Manna <subharaj.manna@gmail.com> wrote:
>
> Thanks Keith this is useful.
>
> One more query if I need to know that if a have fallen too far behind and the WAL is not available. I guess I can do this. Let me know if my understanding is correct.
>       • Run pg_controldata <DATA_DIR> on the slave node which has been down for long.
>       • It will output the details about the WAL along with the WAL file name from where it will start the replication. (Field to look for in the output– 'Latest checkpoint's REDO WAL file')
>       • Then check if the file mentioned in ` 'Latest checkpoint's REDO WAL file' is present in master `pg_wal` directory. If not then slave have fallen too far behind and will not be able to recover from WAL. 

You could also try bringing the slave back up, and monitoring the log for the error about needed WAL file not being available--this avoids the race condition between checking that all WAL is available and restarting the slave.

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/




Re: Replication lag from transaction logs

От
Keith
Дата:
Sorry, I mean pg_controldata, not pg_basebackup.

On Mon, Jun 18, 2018 at 1:57 PM, Keith <keith@keithf4.com> wrote:
Yes, in general using pg_basebackup is not ideal to monitor for replica lag. I was just providing a means to do so from an offline instance. For normal monitoring, if you want to monitor byte lag, you can query that from the primary system.  Or you can also check for lag from the replica as well.


Keith

On Mon, Jun 18, 2018 at 1:03 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:

> On Jun 18, 2018, at 9:56 AM, Debraj Manna <subharaj.manna@gmail.com> wrote:
>
> Thanks Keith this is useful.
>
> One more query if I need to know that if a have fallen too far behind and the WAL is not available. I guess I can do this. Let me know if my understanding is correct.
>       • Run pg_controldata <DATA_DIR> on the slave node which has been down for long.
>       • It will output the details about the WAL along with the WAL file name from where it will start the replication. (Field to look for in the output– 'Latest checkpoint's REDO WAL file')
>       • Then check if the file mentioned in ` 'Latest checkpoint's REDO WAL file' is present in master `pg_wal` directory. If not then slave have fallen too far behind and will not be able to recover from WAL. 

You could also try bringing the slave back up, and monitoring the log for the error about needed WAL file not being available--this avoids the race condition between checking that all WAL is available and restarting the slave.

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/