Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5
From: Amit Kapila
Subject: Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5
Date:
Msg-id: CAA4eK1L7CA-A=VMn8fiugZ+CRt+wz473Adrx3nxq8Ougu=O2kQ@mail.gmail.com
In response to: Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5 (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-bugs
On Thu, May 22, 2025 at 6:29 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Amit, Sawada-san,
>
> > Good point. After replaying the transaction, it doesn't matter
> > because we would have already replayed the required invalidation
> > while processing REORDER_BUFFER_CHANGE_INVALIDATION messages.
> > However, for the concurrent abort case it could matter. See my
> > analysis for the same below:
> >
> > Simulation of concurrent abort
> > ------------------------------------------
> > 1) S1: CREATE TABLE d(data text not null);
> > 2) S1: INSERT INTO d VALUES('d1');
> > 3) S2: BEGIN; INSERT INTO d VALUES('d2');
> > 4) S2: INSERT INTO unrelated_tab VALUES(1);
> > 5) S1: ALTER PUBLICATION pb ADD TABLE d;
> > 6) S2: INSERT INTO unrelated_tab VALUES(2);
> > 7) S2: ROLLBACK;
> > 8) S2: INSERT INTO d VALUES('d3');
> > 9) S1: INSERT INTO d VALUES('d4');
> >
> > The problem with this sequence is that, due to streaming, the INSERT
> > from 3) can be decoded *after* 5), at step 6), and to decode that
> > INSERT (which happened before the ALTER) the catalog snapshot and
> > cache state are from *before* the ALTER PUBLICATION. Because the
> > transaction started in 3) doesn't actually modify any catalogs, no
> > invalidations are executed after decoding it. Now assume that, while
> > decoding the INSERT from 4), we detect a concurrent abort; then the
> > distributed invalidation won't be executed, and if we don't have
> > accumulated messages in txn->invalidations, the invalidation from
> > step 5) won't be performed. Data loss can occur at steps 8 and 9.
> > This is just a theory, so I could be missing something.
>
> I verified whether this is real, and succeeded in reproducing it. See
> the appendix for the detailed steps.
>
> > If the above turns out to be a problem, one idea for fixing it is
> > that for the concurrent abort case (both during streaming and while
> > processing a prepared transaction), we still check all the remaining
> > changes and process only the changes related to invalidations. This
> > has to be done before the current txn's changes are freed via
> > ReorderBufferResetTXN->ReorderBufferTruncateTXN.
>
> I roughly implemented that part; PSA the updated version. One concern
> is whether we should consider the case where invalidations can cause
> an ereport(ERROR). If that happens, the walsender will exit at that
> time.

But in the catch part, we are already executing invalidations:

...
        /* make sure there's no cache pollution */
        ReorderBufferExecuteInvalidations(txn->ninvalidations,
                                          txn->invalidations);
...

So the behaviour should be the same.

--
With Regards,
Amit Kapila.
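
To make the idea concrete, here is a minimal sketch of the
invalidation-only processing described above. It is hypothetical and not
the attached patch: the helper name is invented, and it assumes all the
remaining changes are still in memory (a real implementation would also
need to handle changes serialized to disk):

/*
 * Hypothetical sketch, not the attached patch: after a concurrent abort
 * is detected, walk the changes still queued for the transaction and
 * execute only the invalidation messages, so that invalidations
 * distributed by another transaction (e.g. the ALTER PUBLICATION in
 * step 5 above) are not lost when the changes are freed.
 *
 * Assumes all remaining changes are in memory; spilled changes would
 * need to be restored and scanned as well.
 */
static void
ReorderBufferExecuteQueuedInvalidations(ReorderBufferTXN *txn)
{
    dlist_iter  iter;

    dlist_foreach(iter, &txn->changes)
    {
        ReorderBufferChange *change;

        change = dlist_container(ReorderBufferChange, node, iter.cur);

        /* skip everything that is not an invalidation message */
        if (change->action != REORDER_BUFFER_CHANGE_INVALIDATION)
            continue;

        ReorderBufferExecuteInvalidations(change->data.inval.ninvalidations,
                                          change->data.inval.invalidations);
    }
}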
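
Assuming such a helper, the call site for the streaming case would be in
ReorderBufferResetTXN(), just before the remaining changes are freed
(again a sketch, not the attached patch):

    /* execute invalidations still queued among the remaining changes */
    ReorderBufferExecuteQueuedInvalidations(txn);

    /* Discard the changes that we just streamed */
    ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));

The concurrent-abort path for prepared transactions would need the same
treatment.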