Re: long-standing data loss bug in initial sync of logical replication

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: long-standing data loss bug in initial sync of logical replication
Дата
Msg-id 20231118210519.yborkczu2aln6zdd@awork3.anarazel.de
обсуждение исходный текст
Ответ на Re: long-standing data loss bug in initial sync of logical replication  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Ответы Re: long-standing data loss bug in initial sync of logical replication  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Список pgsql-hackers
Hi,

On 2023-11-18 21:45:35 +0100, Tomas Vondra wrote:
> On 11/18/23 19:12, Andres Freund wrote:
> >> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
> >> we're making it conflict with RowExclusive. Which is just DML, and I
> >> think we need to do that.
> > 
> > From what I can tell it needs to to be an AccessExlusiveLock. Completely
> > independent of logical decoding. The way the cache stays coherent is catalog
> > modifications conflicting with anything that builds cache entries. We have a
> > few cases where we do use lower level locks, but for those we have explicit
> > analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
> > could have an old view of the catalog (various CONCURRENTLY) operations.
> > 
> 
> Yeah, I got too focused on the issue I triggered, which seems to be
> fixed by using SRE (still don't understand why ...). But you're probably
> right there may be other cases where SRE would not be sufficient, I
> certainly can't prove it'd be safe.

I think it makes sense here: SRE prevents the problematic "scheduling" in your
test - with SRE no DML started before ALTER PUB ... ADD can commit after.

I'm not sure there are any cases where using SRE instead of AE would cause
problems for logical decoding, but it seems very hard to prove. I'd be very
surprised if just using SRE would not lead to corrupted cache contents in some
situations. The cases where a lower lock level is ok are ones where we just
don't care that the cache is coherent in that moment.

In a way, the logical decoding cache-invalidation situation is a lot more
atomic than the "normal" situation. During normal operation locking is
strictly required to prevent incoherent states when building a cache entry
after a transaction committed, but before the sinval entries have been
queued. But in the logical decoding case that window doesn't exist.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: long-standing data loss bug in initial sync of logical replication
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Use of backup_label not noted in log