Logical replication restart position (restart_lsn)
От | PG Doc comments form |
---|---|
Тема | Logical replication restart position (restart_lsn) |
Дата | |
Msg-id | 161659021377.695.8314130669179661646@wrigleys.postgresql.org обсуждение исходный текст |
Список | pgsql-docs |
The following documentation comment has been logged on the website: Page: https://www.postgresql.org/docs/10/warm-standby.html Description: Hi, I was struggling to understand why my logical replication clients (using the wal2json plugin) are not advancing the restart_lsn as much as I expected. E.g. I regularly see restart_lsn becoming 1GB or more behind confirmed_flush_lsn, but it always catches up eventually (within an hour). The documentation doesn't seem to explain why this can happen, only that "restart_lsn [is] The address (LSN) of oldest WAL which still might be required by the consumer of this slot and thus won't be automatically removed during checkpoints..." <https://www.postgresql.org/docs/13/view-pg-replication-slots.html>. We are advised to monitor logical replication in a similar way to physical replication (e.g. using pg_replication_slots) <https://www.postgresql.org/docs/13/logical-replication-monitoring.html>, presumably to avoid infinite WAL buildup and exhausting disk space on the master. (And similarly, "An important health indicator of streaming replication is the amount of WAL records generated in the primary, but not yet applied in the standby.") However, no advice about monitoring restart_lsn is given, or reasonable levels of restart lag. I found a discussion on pgsql-hackers <https://www.postgresql-archive.org/Movement-of-restart-lsn-position-movement-of-logical-replication-slots-is-very-slow-td6167219.html> where participants said: "In our observation via PSQL the advance command as well do not move the restart_lsn immediately." "The restart_lsn is lsn required by the oldest txn." "even though we update confirmed_flush_lsn every time with the new value but restart_lsn is updated only when candidate_restart_valid has a valid value each time after a call to LogicalConfirmReceivedLocation. We update candidate_restart_valid in LogicalIncreaseRestartDecodingForSlot which is called only during decoding of XLOG_RUNNING_XACTS record." The comment "The restart_lsn is lsn required by the oldest txn" doesn't seem to be exactly true, as I have seen the restart_lsn pointing to snapshots with running transactions, such as this one: rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: E3A/912B8030, prev E3A/912B7F50, desc: RUNNING_XACTS nextXid 487077663 latestCompletedXid 487 077661 oldestRunningXid 487077662; 1 xacts: 487077662 However, I also found that "[a snapshot] has room for at most 64 non-aborted subtransactions per session. If there are more than 64 such subtransactions, the snapshot is marked as suboverflowed. A suboverflowed snapshot does not contain all data required to determine visibility, so PostgreSQL will occasionally have to resort to pg_subtrans." <https://www.cybertec-postgresql.com/en/subtransactions-and-performance-in-postgresql/> (Which is not possible after a server crash, since pg_subtrans is discarded, hence a suboverflowed transaction cannot be a valid restart point?) Therefore, I surmise that restart_lsn must point to a RUNNING_XACTS WAL record (a snapshot?) which is not suboverflowed (does not have the "subxid ovf" flag), and thus the restart_lsn will point to the most recent such record before the client's confirmed_flush_lsn. From observation that also appears to be the case. If this is correct, I think it would be great if it was in the official documentation. Perhaps something like this on the pg_replication_slots page? <https://www.postgresql.org/docs/13/view-pg-replication-slots.html> "The address (LSN) of oldest WAL which still might be required by the consumer of this slot and thus won't be automatically removed during checkpoints. Postgres always restart replication from the most recent snapshot WAL record which is not suboverflowed, before the confirmed_flush_lsn, which could be significantly behind if there are many subtransactions running. Clients can request (on connection) that decoding start from a later LSN (see https://www.postgresql.org/docs/10/logicaldecoding-explanation.html#LOGICALDECODING-REPLICATION-SLOTS)." Or perhaps this information could be added under "48.2.2. Replication Slots", with just a link from the restart_lsn entry? Thanks, Chris.
В списке pgsql-docs по дате отправления:
Предыдущее
От: Daniel GustafssonДата:
Сообщение: Update to reflect that TLS1 and TLSv1.1 are now deprecated
Следующее
От: "Jonathan S. Katz"Дата:
Сообщение: Re: Update to reflect that TLS1 and TLSv1.1 are now deprecated