RE: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

Поиск
Список
Период
Сортировка
От Floris Van Nee
Тема RE: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure
Дата
Msg-id dd4c68beba5c40c5a7946569d450ec59@Optiver.com
обсуждение исходный текст
Ответ на Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure  (Andres Freund <andres@anarazel.de>)
Ответы Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure
Список pgsql-bugs
>
> I think the problem might actually be much simpler - when
> pgstat_drop_entry() can't free the entry, we don't call
> pgstat_request_entry_refs_gc(), in contrast to
> pgstat_drop_database_and_contents(), pgstat_drop_all_entries().
> Because of that, entry refs to dropped stats entries can survive an effectively
> unbound amount of time. If pgstat_request_entry_refs_gc() had been
> called, the local entry would have been released, before we reused the slot's
> stats entry.  There's no race around the entry being dropped concurrently
> with acquiring an entry references, as external locking needs to prevent an
> entry from being dropped while new references are being established.

This sounds like a reasonable root cause for this thread.

> I can't explain why pgstat_drop_entry() doesn't call
> pgstat_request_entry_refs_gc(). It seems fairly obvious that it ought to?

It looks like pgstat_drop_entry expects callers to take care of this. All callers in pgstat_xact.c
(AtEOXact_PgStat_DroppedStatsfor example) are dropping stats for several relations at once, and then at the end they
callpgstat_request_entry_refs_gc once if needed. I'm guessing this was an optimization. 

However, pgstat_drop_replslot didn't get that memo. Attached patch adds a call to gc in pgstat_drop_replslot to bring
itin line with usage in pgstat_xact.c. On my setup this fixes the issue that occurs on this thread. 

>
> I think there would be a leftover edge-case, namely that
> pgstat_gc_entry_refs() doesn't gc entries with pending data. But we could
> just do that - and it should be correct, afaict. I wonder if the reason we didn't
> is that we introduced PgStat_KindInfo.delete_pending_cb later?

Can you clarify why this would be safe to do? Looking at commit history, it looks like they were added in the same
commit5891c7a8e which introduced the shared memory stats. I can reason why it *should* be safe, but I can't find a
discussionaround why every single function except pgstat_drop_entry uses discard_pending=false if it would be safe to
passtrue everywhere. It seems strange that it'd have been chosen without reason. 

This does mean it's unfortunately unrelated to the issue I reported here
https://www.postgresql.org/message-id/flat/20240505223546.6yvjzgqifuoiii3e%40awork3.anarazel.de#1dfe3ecf6a75ace833444bdc0d268f4a
,because the issue in the current thread is fixed by changing a replication-slot call (which is not used in the
databasefor which I reported it). 

-Floris


Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Mor Lehr
Дата:
Сообщение: Re: Missing semicolumn in anonymous plpgsql block does not raise syntax error
Следующее
От: Sjors Gielen
Дата:
Сообщение: Re: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE