Re: pgsql: Improve runtime and output of tests for replication slots checkp
От | Tom Lane |
---|---|
Тема | Re: pgsql: Improve runtime and output of tests for replication slots checkp |
Дата | |
Msg-id | 2852529.1750553810@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Re: pgsql: Improve runtime and output of tests for replication slots checkp (Alexander Korotkov <aekorotkov@gmail.com>) |
Ответы |
Re: pgsql: Improve runtime and output of tests for replication slots checkp
|
Список | pgsql-committers |
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Sat, Jun 21, 2025 at 2:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> But "Wait for the next page to become available" seems awfully >> trusting that there will be another page. Should this be >> using the no-wait code path? > Thank you for the help. It seems to me that problem is deeper. The > code seems to only trying to read till the end of given WAL record, > but can't reach it. According to the values I've seen in XLogCtl, it > seems that RedoRecPtr points somewhere inside of that record's body. > I don't feel confident about to understand what's going on and how to > fix it. Hmm. My theory about what's happening is that we are writing a WAL record that spans across a page boundary, and the asynchronous immediate-stop request comes in and kills that operation, so that the first half of the record is on disk but the second is not. (This outcome would obviously be very timing-dependent.) Then we restart the server, and crash recovery copes with that situation successfully. But pg_logical_slot_get_changes() fails to, because it's coded in such a way that it will wait forever for the second half of the WAL record to appear. The main hole in this theory is: if crash recovery thought that the partial record was invalid, wouldn't it set the end-of-WAL position to the start of that record? And shouldn't that stop pg_logical_slot_get_changes() from trying to process the partial record? So I'm sure there's some details not quite right about this picture. But maybe it's within hailing distance of the truth. > The new tests just spotted the existing bug. Yeah, that's what I think too. The unintentional omission of a pre-shutdown delay in the 046 test has exposed some pre-existing fragility in pg_logical_slot_get_changes(). So I'm not in favor of changing 046 till we understand this better. regards, tom lane
В списке pgsql-committers по дате отправления: