Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"

From Heikki Linnakangas
Subject Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Msg-id f03d9166-ad12-2a3c-f605-c1873ee86ae4@iki.fi
In response to Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
List pgsql-bugs
On 23/06/2021 03:50, Thomas Munro wrote:
> On Wed, Jun 23, 2021 at 2:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thomas Munro <thomas.munro@gmail.com> writes:
>>> Your analysis seems right to me.  We have to worry about both things:
>>> atomicity of writes on power failure (assumed to be sector-level,
>>> hence our 512 byte struct -- all good), and atomicity of concurrent
>>> reads and writes (we can't assume anything at all, so r/w locking is
>>> the simplest way to get a consistent read).  Shouldn't relmap_redo()
>>> also acquire the lock exclusively?
>>
>> Shouldn't we instead file a kernel bug report?  I seem to recall that
>> POSIX guarantees atomicity of these things up to some operation size.
>> Or is that just for pipe I/O?
> 
> The spec doesn't cover us according to some opinions, at least:
> 
> https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic
> 
> But at the same time, the behaviour seems quite surprising given the
> parameters involved and how at least I thought this stuff worked in
> practice (ie what the rules about the visibility of writes that
> precede reads imply for the unspoken locking rule that must be the
> obvious reasonable implementation, and the reality of the inode-level
> read/write locking plainly visible in the source).  It's possible that
> it's not working as designed in some weird edge case.  I guess the
> next thing to do is write a minimal repro and find an expert to ask
> about what it's supposed to do.

That would be nice. At this point, though, I'm convinced that POSIX
doesn't give the guarantees we want, or even if it does, there are a
lot of systems out there that don't respect them. Do we rely on that
anywhere other than in load_relmap_file()? I don't think we do.
Let's just add the lock there.
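
To make "add the lock" concrete, here is a minimal sketch of the idea: readers take a shared lock and the writer takes an exclusive one, so a reader can never observe a half-written 512-byte image. It is not the server's code. It swaps in advisory flock() for the server's own lock, and the names (MAP_SIZE, pack_map, write_map, read_map) and the file layout (payload padded with zero bytes, then a CRC-32) are assumptions made up for the example.

```python
import fcntl
import os
import struct
import zlib

MAP_SIZE = 512  # one traditional sector; layout below is hypothetical


def pack_map(payload: bytes) -> bytes:
    """Pad the payload to MAP_SIZE - 4 bytes and append a CRC-32 of the body."""
    if len(payload) > MAP_SIZE - 4:
        raise ValueError("payload too large")
    body = payload.ljust(MAP_SIZE - 4, b"\0")
    return body + struct.pack("<I", zlib.crc32(body))


def write_map(path: str, payload: bytes) -> None:
    """Write the whole image while holding an exclusive advisory lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until no reader or writer holds it
        os.pwrite(fd, pack_map(payload), 0)
        os.fsync(fd)
    finally:
        os.close(fd)                     # closing the descriptor drops the lock


def read_map(path: str) -> bytes:
    """Read the image under a shared lock, then verify its checksum."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH)   # many readers may hold this at once
        data = os.pread(fd, MAP_SIZE, 0)
    finally:
        os.close(fd)
    if len(data) != MAP_SIZE:
        raise ValueError("short read")
    body, (crc,) = data[:-4], struct.unpack("<I", data[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("incorrect checksum")
    return body.rstrip(b"\0")            # sketch only: assumes no trailing NULs in payload
```

The checksum still catches corruption on disk; the lock only removes the case where a concurrent read() sees part of the old image and part of the new one.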

Now, that leaves the question of pg_control. That's a different
situation. It doesn't rely on read() and write() being atomic across
processes, but on a 512-byte sector write not being torn on power
failure. How strong is that guarantee? It used to be common wisdom
with hard drives, and it was carried over to SSDs, although I'm not
sure it was ever, strictly speaking, guaranteed. What about the new
kid on the block: Persistent Memory? I found this article:
https://lwn.net/Articles/686150/. So at the hardware level, Persistent
Memory only guarantees atomicity at the cache line level (64 bytes).
To provide the traditional 512-byte sector atomicity, there's a
feature in Linux called BTT (Block Translation Table). Perhaps we
should add a note to the docs that you should enable that.
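
A small simulation shows why the atomicity unit matters. It is only an illustration under assumed names (SECTOR, CACHE_LINE, with_crc, crc_ok, torn_write) and an assumed layout (508 bytes of body followed by a CRC-32); it models a power failure as "only the first N units of the new image reached the media".

```python
import zlib

SECTOR = 512      # traditional atomic write unit
CACHE_LINE = 64   # atomic unit of persistent memory, per the LWN article


def with_crc(body: bytes) -> bytes:
    """Append a little-endian CRC-32 of the body."""
    return body + zlib.crc32(body).to_bytes(4, "little")


def crc_ok(image: bytes) -> bool:
    """Check the trailing CRC-32 against the rest of the image."""
    return zlib.crc32(image[:-4]) == int.from_bytes(image[-4:], "little")


def torn_write(old: bytes, new: bytes, units_written: int, unit: int = CACHE_LINE) -> bytes:
    """Image left behind if only the first units_written units of new were persisted."""
    cut = units_written * unit
    return new[:cut] + old[cut:]


old = with_crc(b"A" * (SECTOR - 4))
new = with_crc(b"B" * (SECTOR - 4))

# With a 64-byte unit, a crash after 3 of 8 cache lines leaves a mix of both images.
torn = torn_write(old, new, 3)

# With a 512-byte unit, the only possible outcomes are "all old" or "all new".
sector_outcomes = [torn_write(old, new, n, unit=SECTOR) for n in (0, 1)]
```

Here crc_ok(torn) is false, so the mixed image is detected but not recoverable, whereas both entries of sector_outcomes pass the check. That is the gap BTT is meant to close.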

We haven't heard of broken control files from the field, so that
doesn't seem to be a problem in practice, at least not yet. Still, I
would sleep better if the control file had more redundancy. For
example, have two copies of it on disk. At startup, read both copies,
and if they're both valid, ignore the one with the older timestamp.
When updating it, write over the older copy. That way, if you crash in
the middle of updating it, the old copy is still intact.
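
The two-copy scheme could look roughly like the sketch below. Everything in it is hypothetical: the single-file, two-slot layout, the names (SLOT, encode, decode, read_slots, load, store), and the caller-supplied integer "stamp" standing in for the timestamp. It is meant to show the read and update rules, not a proposed on-disk format.

```python
import os
import struct
import zlib

SLOT = 512                      # each copy occupies one sector-sized slot
HEADER = struct.Struct("<QH")   # stamp (stand-in for the timestamp), payload length


def encode(stamp: int, payload: bytes) -> bytes:
    """Build one slot image: header, payload, zero padding, CRC-32."""
    if len(payload) > SLOT - 4 - HEADER.size:
        raise ValueError("payload too large")
    body = (HEADER.pack(stamp, len(payload)) + payload).ljust(SLOT - 4, b"\0")
    return body + struct.pack("<I", zlib.crc32(body))


def decode(image: bytes):
    """Return (stamp, payload), or None if the slot is missing or fails its CRC."""
    if len(image) != SLOT:
        return None
    body, (crc,) = image[:-4], struct.unpack("<I", image[-4:])
    if zlib.crc32(body) != crc:
        return None
    stamp, length = HEADER.unpack_from(body)
    return stamp, body[HEADER.size:HEADER.size + length]


def read_slots(path: str):
    """Decode both slots; a missing file or short file yields None entries."""
    try:
        with open(path, "rb") as f:
            data = f.read(2 * SLOT)
    except FileNotFoundError:
        data = b""
    return [decode(data[i * SLOT:(i + 1) * SLOT]) for i in range(2)]


def load(path: str):
    """Startup rule: among the valid copies, use the one with the newest stamp."""
    valid = [s for s in read_slots(path) if s is not None]
    return max(valid, key=lambda s: s[0]) if valid else None


def store(path: str, stamp: int, payload: bytes) -> None:
    """Update rule: overwrite an invalid slot if there is one, else the older copy."""
    slots = read_slots(path)
    if slots[0] is None:
        target = 0
    elif slots[1] is None:
        target = 1
    else:
        target = 0 if slots[0][0] < slots[1][0] else 1
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, encode(stamp, payload), target * SLOT)
        os.fsync(fd)
    finally:
        os.close(fd)
```

Because store() never touches the newest valid copy, a crash in the middle of an update can damage only the slot being overwritten, and load() falls back to the other one.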

- Heikki


