Re: possible bug in handling of contrecords in dd38ff28ad (Fix recovery_prefetch with low maintenance_io_concurrency)

From: Tomas Vondra
Subject: Re: possible bug in handling of contrecords in dd38ff28ad (Fix recovery_prefetch with low maintenance_io_concurrency)
Date:
Msg-id: 10616819-5333-e35e-32c8-c0d138542167@enterprisedb.com
In reply to: Re: possible bug in handling of contrecords in dd38ff28ad (Fix recovery_prefetch with low maintenance_io_concurrency)  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: possible bug in handling of contrecords in dd38ff28ad (Fix recovery_prefetch with low maintenance_io_concurrency)  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers

On 7/2/23 04:09, Thomas Munro wrote:
> On Sun, Jul 2, 2023 at 1:40 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> I think there's some sort of bug in how dd38ff28ad deals with
>> contrecords. Consider something as simple as
>>
>>   pgbench -i -s 100
>>
>> and then running pg_waldump on the WAL segments. I get this for every
>> single one:
>>
>>   pg_waldump: error: error in WAL record at 0/1FFFF98: missing
>>                      contrecord at 0/1FFFFE0
>>
>> This only happens since dd38ff28ad, and revert makes it disappear.
>>
>> It's possible we still have some issue with contrecords, but IIUC we
>> fixed those. So unless there's some unknown one (and considering this
>> seems to happen for *every* WAL segment, that's hard to believe), this
>> seems more like an issue in the error detection.
> 
> Yeah.  That message is due to this part of dd38ff28ad's change:
> 
>     Also add an explicit error message for missing contrecords.  It was a
>     bit strange that we didn't report an error already, and became a latent
>     bug with prefetching, since the internal state that tracks aborted
>     contrecords would not survive retrying, as revealed by
>     026_overwrite_contrecord.pl with this adjustment.  Reporting an error
>     prevents that.
> 
> We can change 'missing contrecord' back to silent end-of-decoding (as
> it was in 14) with the attached.  Here [1] is some analysis of this
> error that I wrote last year.  The reason for my hesitation in pushing
> a fix was something like this:
> 
> 1.  For pg_waldump, it's "you told me to decode only this WAL segment,
> so decoding failed here", which is useless noise
> 2.  For walsender, it's "you told me to shut down, so decoding failed
> here", which is useless noise
> 3.  For crash recovery, "I ran out of data, so decoding failed here",
> which seems like a report-worthy condition, I think?
> 
> When I added that new error I was thinking about that third case.  We
> generally expect to detect the end of WAL replay after a crash with an
> error ("invalid record length ...: wanted 24, got 0" + several other
> possibilities), but in this rare case it would be silent.  The effect
> on the first two cases is cosmetic, but certainly annoying.  Perhaps I
> should go ahead and back-patch the attached change, and then we can
> discuss whether/how we should do a better job of distinguishing "user
> requested artificial end of decoding" from "unexpectedly ran out of
> data" for v17?
> 

Yeah, I think that'd be sensible. IMHO we have a habit of scaring users
with stuff that might be dangerous/bad, but 99% of the time it's
actually fine and perhaps even expected. It's almost as if we're
conditioning people to ignore errors.
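
FWIW the reproduction boils down to roughly this - just a sketch, with
$PGDATA and the segment names standing in for whatever the cluster
actually uses, and the LSNs in the message will obviously differ:

  # generate enough WAL to cross a number of 16MB segment boundaries
  pgbench -i -s 100

  cd "$PGDATA"/pg_wal

  # dump each segment on its own; any segment whose last record
  # continues into the next one ends with the "missing contrecord"
  # error quoted at the top of the thread
  for seg in 0000000100000000000000*; do
      pg_waldump "$seg"
  done

  # giving an explicit end segment lets the decoder complete the record
  # crossing the first boundary; presumably the complaint then just
  # moves to the end of the last segment passed, which is what makes it
  # noise in your case 1
  pg_waldump 000000010000000000000001 000000010000000000000002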


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


