On Mon, 10 Feb 2020 at 01:29, Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Hi,
>
> While jumping around partically decoded xacts questions [1], I've read
> through the copy replication slots code (9f06d79ef) and found a couple
> of issues.
>
> 1) It seems quite reckless to me to dive into
> DecodingContextFindStartpoint without actual WAL reservation (donors
> slot restart_lsn is used, but it is not acquired). Why risking erroring
> out with WAL removal error if the function advances new slot position to
> updated donors one in the end anyway?
Good catch. It's possible that DecodingContextFindStartpoint could
fail when the restart_lsn of the source slot is advanced and removed
required WAL.
>
> 2) In the end, restart_lsn of new slot is set to updated donors
> one. However, confirmed_flush field is not updated. This is just wrong
> -- we could start decoding too early and stream partially decoded
> transaction.
I think you are right.
>
> I'd probably avoid doing DecodingContextFindStartpoint at all. Its only
> purpose is to assemble consistent snapshot (and establish corresponding
> <restart_lsn, confirmed_flush_lsn> pair), but donor slot must have
> already done that and we could use it as well. Was this considered?
Skipping doing DecodingContextFindStartpoint while creating a new
destination logical slot seems sensible to me.
I've attached the draft patch fixing this issue but I'll continue
investigating it more deeply.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services