Re: Logical decoding from promoted standby with same replication slot

From: Jeremy Finzel
Subject: Re: Logical decoding from promoted standby with same replication slot
Msg-id: CAMa1XUgJBo5qaP5BhAobwqutx9NWX2VAc56w_mdZOmMWgPE38Q@mail.gmail.com
In reply to: Logical decoding from promoted standby with same replication slot (Jeremy Finzel <finzelj@gmail.com>)
List: pgsql-hackers

On Fri, Jul 13, 2018 at 2:30 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
Hello -

We are working on several DR scenarios with logical decoding.  Although we are using pglogical, I think our question applies to logical replication in general.

Say we need to drop a logical replication slot on the master for some emergency reason, but we don't want to lose the data permanently.  We can take a point-in-time-recovery snapshot of the master and use it to recover the data in the slot we are about to drop.  Then we drop the slot on the master.

Once we promote the snapshot, we can point our logical subscription at it to pull the lost data.

The problem is that after promotion, logical decoding looks for a timeline 2 file, whereas the file is still on timeline 1.

The WAL file is 00000001000008FD0000003C, for example.  After promotion, it is still 00000001000008FD0000003C in pg_wal.  But logical decoding says ERROR: segment 00000002000008FD0000003C has already been removed (it is looking for a timeline 2 WAL file).  Simply renaming the file actually allows us to stream from the replication slot accurately and recover the data.

But all of this raises the question of whether there is an easier way to do this: why doesn't logical decoding know to look for a timeline 1 file?  It is really helpful to be able to easily recover logically replicated data from a snapshot of a replication slot in case of disaster.

All thoughts very welcome!

Thanks,
Jeremy

I'd like to bump this question with some elaboration on my original question: is it possible to do a *controlled* failover reliably with logical decoding, assuming there are unconsumed changes in the replication slot that the client still needs?

It is rather easy to do a controlled failover if we can verify there are no unconsumed changes in the slot before failover.  Then, we just recreate the slot on the promoted standby while clients are locked out, and we have not missed any data changes.
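
To make the "no unconsumed changes" check concrete, here is a minimal sketch of how the two positions could be compared.  It assumes the values are read from pg_replication_slots.confirmed_flush_lsn and pg_current_wal_lsn() while clients are locked out; the helper names are hypothetical.  A gap of zero suggests the slot is caught up, but a nonzero gap does not by itself prove there are unconsumed logical changes, since WAL also contains records that logical decoding never emits.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN in 'high/low' hex form (e.g. '8FD/3C000028') to an integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def unconsumed_bytes(confirmed_flush_lsn: str, current_wal_lsn: str) -> int:
    """Return how many bytes of WAL lie past the slot's confirmed position.

    Both arguments are assumed to come from the master: the slot's
    confirmed_flush_lsn and the current WAL write position.
    """
    gap = lsn_to_int(current_wal_lsn) - lsn_to_int(confirmed_flush_lsn)
    return max(0, gap)


if __name__ == "__main__":
    # Illustrative values only, not taken from a real system.
    print(unconsumed_bytes("8FD/3C000000", "8FD/3C000100"))
```
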

I am trying to figure out if the problem of following timelines, as described on this wiki page for example: https://wiki.postgresql.org/wiki/Failover_slots, can be worked around in a controlled scenario.  One additional part of this is that after failover I have two WAL files with the same segment number but on different timelines, and the promoted standby is only going to decode from the latter.  Does that mean I am likely to lose data?

Part of the reason I ask is that in testing, I have NOT lost data when doing a controlled failover as described above (i.e. with unconsumed changes in the slot that I need to replay on the promoted standby).  I am trying to figure out if I've gotten lucky or if this method is actually reliable.  By "this method" I mean renaming the WAL files to bump the timeline: these WAL files are simply identical to the ones that were played on the master, and thus ought to yield the same logical decoding information to be consumed.
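
For reference, the rename amounts to changing only the first 8 hex digits of the 24-character segment name, which encode the timeline; the remaining 16 digits identify the segment.  A minimal sketch of that mapping follows (the function names are hypothetical, and this only computes names, it does not touch any files or say anything about whether the rename is safe):

```python
import re

# A WAL segment file name is 24 uppercase hex digits:
# 8 for the timeline ID followed by 16 identifying the segment.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def split_wal_name(name: str) -> tuple[int, str]:
    """Split a segment name into (timeline, 16-digit segment part)."""
    if not WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    return int(name[:8], 16), name[8:]


def with_timeline(name: str, timeline: int) -> str:
    """Return the name of the same segment on a different timeline."""
    if not 1 <= timeline <= 0xFFFFFFFF:
        raise ValueError(f"timeline out of range: {timeline}")
    _, segment = split_wal_name(name)
    return f"{timeline:08X}{segment}"


if __name__ == "__main__":
    # The segment from the original message, moved from timeline 1 to 2.
    print(with_timeline("00000001000008FD0000003C", 2))
```
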


Thank you!
Jeremy
