Обсуждение: gistGetFakeLSN() can return incorrect LSNs
Hi,
Tomas encountered a crash with the index prefetching patchset. One of the
patches included therein is a generalization of the gistGetFakeLSN()
mechanism, which is then used by other indexes as well. That triggered an
occasional, hard to locally reproduce, ERROR or PANIC in CI, about
ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000
A bunch of debugging later, it turns out that this is a pre-existing issue.
XLogRecPtr
gistGetFakeLSN(Relation rel)
{
...
else if (RelationIsPermanent(rel))
{
/*
* WAL-logging on this relation will start after commit, so its LSNs
* must be distinct numbers smaller than the LSN at the next commit.
* Emit a dummy WAL record if insert-LSN hasn't advanced after the
* last call.
*/
static XLogRecPtr lastlsn = InvalidXLogRecPtr;
XLogRecPtr currlsn = GetXLogInsertRecPtr();
/* Shouldn't be called for WAL-logging relations */
Assert(!RelationNeedsWAL(rel));
/* No need for an actual record if we already have a distinct LSN */
if (XLogRecPtrIsValid(lastlsn) && lastlsn == currlsn)
currlsn = gistXLogAssignLSN();
lastlsn = currlsn;
XLogFlush(currlsn);
return currlsn;
}
The problem is that GetXLogInsertRecPtr() returns the start of the next
record. That's *most* of the time the same as what XLogInsert() would return
(i.e. one byte past the end of the last record), but not reliably so: If the
last record ends directly at a page boundary, XLogInsert() returns an LSN
directly to the start of the page, but GetXLogInsertRecPtr() will return an
LSN that points to just after the xlog page header.
If you look at the error from above, that's exactly what's happening - the
flush is only up to 0/01BD2000 (0x2000 is 8192, i.e. an 8kB boundary). But the
flush request is to 0/01BD2018, where 0x18 is the size of XLogPageHeaderData.
To be safe, this code would need to use a version of GetXLogInsertRecPtr()
that does use XLogBytePosToEndRecPtr() instead of XLogBytePosToRecPtr().
It's probably not easy to trigger this outside of aggressive test scenarios,
due to needing to avoid the gistXLogAssingLSN() path and having to encounter a
"reason" to flush the buffer immediately after.
However, if I put an XLogFlush() into gistGetFakeLSN() and use
wal_level=minimal, it's a lot easier.
It looks like this was introduced in
commit c6b92041d38
Author: Noah Misch <noah@leadboat.com>
Date: 2020-04-04 12:25:34 -0700
Skip WAL for new relfilenodes, under wal_level=minimal.
Greetings,
Andres Freund
Interesting bug. Your analysis seems correct to me. > On 5 Mar 2026, at 22:10, Andres Freund <andres@anarazel.de> wrote: > > To be safe, this code would need to use a version of GetXLogInsertRecPtr() > that does use XLogBytePosToEndRecPtr() instead of XLogBytePosToRecPtr(). Can't we just take Insert->CurrBytePos without XLogBytePosToEndRecPtr()? Is there a point in alignment before the page header? Best regards, Andrey Borodin.
On Thu, Mar 05, 2026 at 12:10:11PM -0500, Andres Freund wrote: > Tomas encountered a crash with the index prefetching patchset. One of the > patches included therein is a generalization of the gistGetFakeLSN() > mechanism, which is then used by other indexes as well. That triggered an > occasional, hard to locally reproduce, ERROR or PANIC in CI, about > > ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000 > To be safe, this code would need to use a version of GetXLogInsertRecPtr() > that does use XLogBytePosToEndRecPtr() instead of XLogBytePosToRecPtr(). I agree. Thanks for diagnosing it. Feel free to move forward with that strategy, or let me know if you'd like me to do it. > However, if I put an XLogFlush() into gistGetFakeLSN() and use > wal_level=minimal, it's a lot easier. > > > It looks like this was introduced in > > commit c6b92041d38 > Author: Noah Misch <noah@leadboat.com> > Date: 2020-04-04 12:25:34 -0700 > > Skip WAL for new relfilenodes, under wal_level=minimal. Combining that XLogFlush() with wal_level=minimal (your recipe) does make predicate-gist.spec fail, both at that commit and today.
Hi,
On 2026-03-05 23:26:30 +0500, Andrey Borodin wrote:
> Interesting bug. Your analysis seems correct to me.
>
> > On 5 Mar 2026, at 22:10, Andres Freund <andres@anarazel.de> wrote:
> >
> > To be safe, this code would need to use a version of GetXLogInsertRecPtr()
> > that does use XLogBytePosToEndRecPtr() instead of XLogBytePosToRecPtr().
>
> Can't we just take Insert->CurrBytePos without XLogBytePosToEndRecPtr()?
> Is there a point in alignment before the page header?
No, that'd be a completely bogus LSN, as CurrBytePos does not include any
space for page headers, to make the the very contended spinlock'ed section in
ReserveXLogInsertLocation() cheaper:
/*
* The duration the spinlock needs to be held is minimized by minimizing
* the calculations that have to be done while holding the lock. The
* current tip of reserved WAL is kept in CurrBytePos, as a byte position
* that only counts "usable" bytes in WAL, that is, it excludes all WAL
* page headers. The mapping between "usable" byte positions and physical
* positions (XLogRecPtrs) can be done outside the locked region, and
* because the usable byte position doesn't include any headers, reserving
* X bytes from WAL is almost as simple as "CurrBytePos += X".
*/
SpinLockAcquire(&Insert->insertpos_lck);
And we rely on the page LSNs to be correct for the content of newly created
permanent relations with wal_level=minimal, so you can't just return arbitrary
other values.
Greetings,
Andres Freund
Hi, On 2026-03-05 10:52:03 -0800, Noah Misch wrote: > On Thu, Mar 05, 2026 at 12:10:11PM -0500, Andres Freund wrote: > > Tomas encountered a crash with the index prefetching patchset. One of the > > patches included therein is a generalization of the gistGetFakeLSN() > > mechanism, which is then used by other indexes as well. That triggered an > > occasional, hard to locally reproduce, ERROR or PANIC in CI, about > > > > ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000 > > > To be safe, this code would need to use a version of GetXLogInsertRecPtr() > > that does use XLogBytePosToEndRecPtr() instead of XLogBytePosToRecPtr(). > > I agree. Thanks for diagnosing it. Feel free to move forward with that > strategy, or let me know if you'd like me to do it. I'd appreciate if you could do it. > > However, if I put an XLogFlush() into gistGetFakeLSN() and use > > wal_level=minimal, it's a lot easier. > > > > > > It looks like this was introduced in > > > > commit c6b92041d38 > > Author: Noah Misch <noah@leadboat.com> > > Date: 2020-04-04 12:25:34 -0700 > > > > Skip WAL for new relfilenodes, under wal_level=minimal. > > Combining that XLogFlush() with wal_level=minimal (your recipe) does make > predicate-gist.spec fail, both at that commit and today. Good. I've been thinking that we really need to do more verification of LSNs in a bunch of places. When reading in a page outside of recovery, we should probably complain about LSNs that are not from the past, instead of reading them into the buffer pool and then complaining when a hint bit is set on the buffer and the buffer needs to be flushed. When WAL logging of hint bits is enabled and we are writing out a dirty permanent page (leaving aside the FSM), we probably should assert that the page has an LSN that's recent enough to be plausible. It's too bad that PageSetLSN() doesn't have a reference to the buffer, so it's nontrivial to see if the buffer is permanent and thus should always have a valid LSN. I guess we could check if the page is within (BufferBlocks, BLCKSZ*NBlocks) and only do additional verification in that case, but brrr. Greetings, Andres Freund
> On 6 Mar 2026, at 00:27, Andres Freund <andres@anarazel.de> wrote: > > No, that'd be a completely bogus LSN, as CurrBytePos does not include any > space for page headers, to make the the very contended spinlock'ed section in > ReserveXLogInsertLocation() cheaper Thanks! Now I see it's documented that "usable bytes position" is not XLogRecPtr at all. And XLogBytePosToEndRecPtr() is unavoidable for newly created relations in wal_level=minimal. Perhaps we can even add an Assert that PageSetLSN() does not stamp LSNs from future. Sorry for raising off-topic questions, I'll go check this myself. Best regards, Andrey Borodin.
On 3/5/26 21:25, Andres Freund wrote:
> Hi,
>
> On 2026-03-05 10:52:03 -0800, Noah Misch wrote:
>> On Thu, Mar 05, 2026 at 12:10:11PM -0500, Andres Freund wrote:
>>> Tomas encountered a crash with the index prefetching patchset. One of the
>>> patches included therein is a generalization of the gistGetFakeLSN()
>>> mechanism, which is then used by other indexes as well. That triggered an
>>> occasional, hard to locally reproduce, ERROR or PANIC in CI, about
>>>
>>> ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000
>>
>>> To be safe, this code would need to use a version of GetXLogInsertRecPtr()
>>> that does use XLogBytePosToEndRecPtr() instead of XLogBytePosToRecPtr().
>>
>> I agree. Thanks for diagnosing it. Feel free to move forward with that
>> strategy, or let me know if you'd like me to do it.
>
> I'd appreciate if you could do it.
>
Here's a fix for master (and backpatching). It introduces a new function
GetXLogInsertEndRecPtr() and then uses that in gistGetFakeLSN(). I still
need to test this a bit more, but it did fix the issue in our dev branch
(where we saw regular failures). So I'm 99% sure it's fine.
After writing the fix I had the idea to grep for GetXLogInsertRecPtr
calls that might have similar issue (being passed to XLogFlush), and
sure enough walsender does this:
XLogFlush(GetXLogInsertRecPtr());
Which AFAICS has the same issue, right? Funnily enough, this is a very
new call, from 2026/03/06. Before 6eedb2a5fd88 walsender might flush not
far enough, now it may be flushing too far ;-) AFAIK it should call the
same GetXLogInsertEndRecPtr() once we have it.
regards
--
Tomas Vondra
Вложения
On Fri, Mar 13, 2026 at 8:12 PM Tomas Vondra <tomas@vondra.me> wrote: > > On 3/5/26 21:25, Andres Freund wrote: > > Hi, > > > > On 2026-03-05 10:52:03 -0800, Noah Misch wrote: > >> On Thu, Mar 05, 2026 at 12:10:11PM -0500, Andres Freund wrote: > >>> Tomas encountered a crash with the index prefetching patchset. One of the > >>> patches included therein is a generalization of the gistGetFakeLSN() > >>> mechanism, which is then used by other indexes as well. That triggered an > >>> occasional, hard to locally reproduce, ERROR or PANIC in CI, about > >>> > >>> ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000 > >> > >>> To be safe, this code would need to use a version of GetXLogInsertRecPtr() > >>> that does use XLogBytePosToEndRecPtr() instead of XLogBytePosToRecPtr(). > >> > >> I agree. Thanks for diagnosing it. Feel free to move forward with that > >> strategy, or let me know if you'd like me to do it. > > > > I'd appreciate if you could do it. > > > > Here's a fix for master (and backpatching). It introduces a new function > GetXLogInsertEndRecPtr() and then uses that in gistGetFakeLSN(). I still > need to test this a bit more, but it did fix the issue in our dev branch > (where we saw regular failures). So I'm 99% sure it's fine. > > After writing the fix I had the idea to grep for GetXLogInsertRecPtr > calls that might have similar issue (being passed to XLogFlush), and > sure enough walsender does this: > > XLogFlush(GetXLogInsertRecPtr()); > > Which AFAICS has the same issue, right? Funnily enough, this is a very > new call, from 2026/03/06. Before 6eedb2a5fd88 walsender might flush not > far enough, now it may be flushing too far ;-) AFAIK it should call the > same GetXLogInsertEndRecPtr() once we have it. Yes, this is already being discussed at [1], and Anthonin has also proposed a patch there that adds GetXLogInsertEndRecPtr(). Regards, [1] https://postgr.es/m/vzguaguldbcyfbyuq76qj7hx5qdr5kmh67gqkncyb2yhsygrdt@dfhcpteqifux -- Fujii Masao
On 3/13/26 12:33, Fujii Masao wrote:
> On Fri, Mar 13, 2026 at 8:12 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> On 3/5/26 21:25, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2026-03-05 10:52:03 -0800, Noah Misch wrote:
>>>> On Thu, Mar 05, 2026 at 12:10:11PM -0500, Andres Freund wrote:
>>>>> Tomas encountered a crash with the index prefetching patchset. One of the
>>>>> patches included therein is a generalization of the gistGetFakeLSN()
>>>>> mechanism, which is then used by other indexes as well. That triggered an
>>>>> occasional, hard to locally reproduce, ERROR or PANIC in CI, about
>>>>>
>>>>> ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000
>>>>
>>>>> To be safe, this code would need to use a version of GetXLogInsertRecPtr()
>>>>> that does use XLogBytePosToEndRecPtr() instead of XLogBytePosToRecPtr().
>>>>
>>>> I agree. Thanks for diagnosing it. Feel free to move forward with that
>>>> strategy, or let me know if you'd like me to do it.
>>>
>>> I'd appreciate if you could do it.
>>>
>>
>> Here's a fix for master (and backpatching). It introduces a new function
>> GetXLogInsertEndRecPtr() and then uses that in gistGetFakeLSN(). I still
>> need to test this a bit more, but it did fix the issue in our dev branch
>> (where we saw regular failures). So I'm 99% sure it's fine.
>>
>> After writing the fix I had the idea to grep for GetXLogInsertRecPtr
>> calls that might have similar issue (being passed to XLogFlush), and
>> sure enough walsender does this:
>>
>> XLogFlush(GetXLogInsertRecPtr());
>>
>> Which AFAICS has the same issue, right? Funnily enough, this is a very
>> new call, from 2026/03/06. Before 6eedb2a5fd88 walsender might flush not
>> far enough, now it may be flushing too far ;-) AFAIK it should call the
>> same GetXLogInsertEndRecPtr() once we have it.
>
> Yes, this is already being discussed at [1], and Anthonin has also proposed
> a patch there that adds GetXLogInsertEndRecPtr().
>
Thanks, I didn't notice the other thread, to much stuff happening.
I've now pushed the fix (and backpatched all the way to 14), so it
should be possible to use the new function in walsender too. That should
do the trick, I think.
It took me quite a bit of time to reproduce the failure well enough to
make sure the fix is correct. I was hoping it'd allow me to write a
regression test for this bug (and similar ones), but that turned out to
be way trickier because it requires a very precise sequence of WAL
records, aligned just the right way.
I reproduced it on three different platforms (x86-64, 32/64-bit arm),
and every platform requires *completely different* parameters (number of
rows inserted, etc.). On some platforms I had to disable hint bits or
FPWs, etc.
It was good enough for validating the fix, but I gave up on the
regression test. A single extra WAL record breaks the test - an
unexpected hint bit, an autovacuum run updating something, or even a
tiny change from an artibrary commit, and the test stops working.
I'm actually quite amazed it started failing so reliably on our branch
with prefetching. But it only ever failed on CI, which just shows how
finicky the test would be.
I'm attaching the script I used (derived from 018_wal_optimize.tap). To
reproduce the issue you'll have to experiment with the various
parameters (number of rows, FPW, hint bits). What helped me was patching
gistGetFakeLSN() to show calls where currlsn points right after page
header, and using pg_waldump to check there's nothing between the last
ASSIGN_LSN and COMMIT.
After I gave up on a regression test, I had the idea we might add an
assert to XLogFlush() to check we're not flushing LSNs in the page
header. So something like:
Assert((record % XLOG_BLCKSZ == 0) ||
(record % XLOG_BLCKSZ > SizeOfXLogShortPHD));
Surprisingly, that fails because GiST decided to use "1" as a special
LSN during build.
#define GistBuildLSN ((XLogRecPtr) 1)
Of course, GetVictimBuffer() is unaware of that and ends up flushing
this buffers with XLogFlush(record=1). I wonder if this might lead to
some funny issues elsewhere, in code that uses LSN in some way. Backups,
checksums, ... that kind of stuff. I wonder if other index AMs do more
stuff like this.
With (record == 1) added to the assert it seems to work, or at least it
passes check-world for me. But I decided not to include it in the fix.
regards
--
Tomas Vondra