Discussion: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17064
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 14beta1
Operating system:   Ubuntu 20.04
Description:

The following script:
===
for i in `seq 100`; do
  createdb db$i
done

# Based on the contents of the regression test "vacuum"
echo "
CREATE TABLE pvactst (i INT);
INSERT INTO pvactst SELECT i FROM generate_series(1,10000) i;
DELETE FROM pvactst;
VACUUM pvactst;
DROP TABLE pvactst;

VACUUM FULL pg_database;
" >/tmp/vacuum.sql

for n in `seq 10`; do
  echo "iteration $n"
  for i in `seq 100`; do
    ( { for f in `seq 100`; do cat /tmp/vacuum.sql; done } | psql -d db$i ) >psql-$i.log 2>&1 &
  done
  wait
  grep -C5 FATAL psql*.log && break;
done
===
detects sporadic FATAL errors:

iteration 1
psql-56.log-DROP TABLE
psql-56.log-VACUUM
psql-56.log-CREATE TABLE
psql-56.log-INSERT 0 10000
psql-56.log-DELETE 10000
psql-56.log:FATAL: relation mapping file "global/pg_filenode.map" contains incorrect checksum
psql-56.log-server closed the connection unexpectedly
psql-56.log-	This probably means the server terminated abnormally
psql-56.log-	before or while processing the request.
psql-56.log-connection to server was lost

(Discovered when running multiple installchecks against a single instance. Except for this failure, installcheck x 100 (with several tests excluded) is rather stable.)

Reproduced on REL9_6_STABLE .. master.
Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
From:
Heikki Linnakangas
Date:
On 18/06/2021 18:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference: 17064
> Logged by: Alexander Lakhin
> Email address: exclusion@gmail.com
> PostgreSQL version: 14beta1
> Operating system: Ubuntu 20.04
> Description:
>
> The following script:
> ===
> for i in `seq 100`; do
>   createdb db$i
> done
>
> # Based on the contents of the regression test "vacuum"
> echo "
> CREATE TABLE pvactst (i INT);
> INSERT INTO pvactst SELECT i FROM generate_series(1,10000) i;
> DELETE FROM pvactst;
> VACUUM pvactst;
> DROP TABLE pvactst;
>
> VACUUM FULL pg_database;
> " >/tmp/vacuum.sql
>
> for n in `seq 10`; do
>   echo "iteration $n"
>   for i in `seq 100`; do
>     ( { for f in `seq 100`; do cat /tmp/vacuum.sql; done } | psql -d db$i ) >psql-$i.log 2>&1 &
>   done
>   wait
>   grep -C5 FATAL psql*.log && break;
> done
> ===
> detects sporadic FATAL errors:
> iteration 1
> psql-56.log-DROP TABLE
> psql-56.log-VACUUM
> psql-56.log-CREATE TABLE
> psql-56.log-INSERT 0 10000
> psql-56.log-DELETE 10000
> psql-56.log:FATAL: relation mapping file "global/pg_filenode.map" contains incorrect checksum
> psql-56.log-server closed the connection unexpectedly
> psql-56.log-	This probably means the server terminated abnormally
> psql-56.log-	before or while processing the request.
> psql-56.log-connection to server was lost

Hmm, the simplest explanation would be that the read() or write() on the relmapper file is not atomic. We assume that it is, and don't use a lock in load_relmap_file() because of that. Is there anything unusual about the filesystem, mount options or the kernel you're using? I could not reproduce this on my laptop. Does the attached patch fix it for you?

If that's the cause, it is easy to fix by taking the RelationMappingLock in load_relmap_file(), like in the attached patch. But if the write is not atomic, you might have a bigger problem: we also rely on the atomicity when writing the pg_control file. If that becomes corrupt because of a partial write, the server won't start up. If it's just a race condition between the read and the write, or only the read() is not atomic, maybe pg_control is OK, but I'd like to understand the issue better before just adding a lock to load_relmap_file().

- Heikki
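The principle behind the proposed fix can be sketched outside PostgreSQL. This is not the actual patch (which takes RelationMappingLock via LWLockAcquire in C); it is a minimal Python model, with hypothetical names, of why guarding both the read and the write with one lock prevents a reader from ever observing a half-written 512-byte image.

```python
# Minimal sketch (not PostgreSQL code) of the fix's principle: readers of the
# relmap file take the same lock writers hold, so a read can never observe a
# partially applied write.  A plain mutex stands in for RelationMappingLock;
# in the real code the reader would take it in shared mode (LW_SHARED).
import threading

class RelMap:
    def __init__(self):
        self._lock = threading.Lock()          # stands in for RelationMappingLock
        self._file = bytearray(b"\x11" * 512)  # the 512-byte on-disk image

    def write_relmap_file(self, payload: bytes):
        assert len(payload) == 512
        with self._lock:                       # exclusive, like LW_EXCLUSIVE
            self._file[:] = payload

    def load_relmap_file(self) -> bytes:
        with self._lock:                       # shared mode would suffice
            return bytes(self._file)

m = RelMap()
m.write_relmap_file(b"\x22" * 512)
snapshot = m.load_relmap_file()
# Under the lock, a snapshot contains bytes from exactly one write.
assert len(set(snapshot)) == 1
```

The design point is that the lock removes any reliance on the kernel's read()/write() atomicity, which is exactly the guarantee in question in this thread.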
Attachments
On Tue, Jun 22, 2021 at 9:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm, the simplest explanation would be that the read() or write() on the
> relmapper file is not atomic. We assume that it is, and don't use a lock
> in load_relmap_file() because of that. Is there anything unusual about
> the filesystem, mount options or the kernel you're using? I could not
> reproduce this on my laptop. Does the attached patch fix it for you?

I have managed to reproduce this twice on a laptop running Linux 5.10.0-2-amd64, after trying many things for several hours. Both times I was using ext4 in a loopback file (the underlying filesystem is XFS; I had no luck there, hence the hunch that I should try ext4, though that may not be significant) with fsync=off (ditto).

> If that's the cause, it is easy to fix by taking the RelationMappingLock
> in load_relmap_file(), like in the attached patch. But if the write is
> not atomic, you might have a bigger problem: we also rely on the
> atomicity when writing the pg_control file. If that becomes corrupt
> because of a partial write, the server won't start up. If it's just a
> race condition between the read/write, or only the read() is not atomic,
> maybe pg_control is OK, but I'd like to understand the issue better
> before just adding a lock to load_relmap_file().

Your analysis seems right to me. We have to worry about both things: atomicity of writes on power failure (assumed to be sector-level, hence our 512 byte struct -- all good), and atomicity of concurrent reads and writes (we can't assume anything at all, so r/w locking is the simplest way to get a consistent read). Shouldn't relmap_redo() also acquire the lock exclusively?
Thomas Munro <thomas.munro@gmail.com> writes:
> Your analysis seems right to me. We have to worry about both things:
> atomicity of writes on power failure (assumed to be sector-level,
> hence our 512 byte struct -- all good), and atomicity of concurrent
> reads and writes (we can't assume anything at all, so r/w locking is
> the simplest way to get a consistent read). Shouldn't relmap_redo()
> also acquire the lock exclusively?

Shouldn't we instead file a kernel bug report? I seem to recall that POSIX guarantees atomicity of these things up to some operation size. Or is that just for pipe I/O?

If we can't assume atomicity of relmapper file I/O, I wonder about pg_control as well. But on the whole, what I'm smelling is a moderately recently introduced kernel bug. We've been doing it this way for years and heard no previous reports.

			regards, tom lane
Hello,

22.06.2021 16:00, Thomas Munro wrote:
> On Tue, Jun 22, 2021 at 9:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Hmm, the simplest explanation would be that the read() or write() on the
>> relmapper file is not atomic. We assume that it is, and don't use a lock
>> in load_relmap_file() because of that. Is there anything unusual about
>> the filesystem, mount options or the kernel you're using? I could not
>> reproduce this on my laptop. Does the attached patch fix it for you?
> I have managed to reproduce this twice on a laptop running Linux
> 5.10.0-2-amd64, after trying many things for several hours. Both
> times I was using ext4 in a loopback file (underlying is xfs, I had no
> luck there hence hunch that I should try ext4, may not be significant
> though) with fsync=off (ditto).

I'm sorry, I forgot that I've set "fsync=off" in my postgresql.conf (to avoid NVMe-specific slowdown on fsyncs). It really does matter: with fsync=on the demo script passes 20 iterations successfully.

I reproduce the issue on Ubuntu 20.04 with kernel 5.9.15, ext4 (without any specific options) on NVMe storage, and a Ryzen 3700X. It was first encountered on Debian 10 with kernel 4.19.0, ext4 on software RAID built on NVMe storage too, and a Xeon 5220.

The attached patch fixes it for me (with fsync=off): 3 runs of 20 iterations each completed without the error (without the patch I get the error on the first iteration).

Best regards,
Alexander
On Tue, Jun 22, 2021 at 10:11:06AM -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> Your analysis seems right to me. We have to worry about both things:
>> atomicity of writes on power failure (assumed to be sector-level,
>> hence our 512 byte struct -- all good), and atomicity of concurrent
>> reads and writes (we can't assume anything at all, so r/w locking is
>> the simplest way to get a consistent read). Shouldn't relmap_redo()
>> also acquire the lock exclusively?

You mean anything calling write_relmap_file(), right?

> Shouldn't we instead file a kernel bug report? I seem to recall that
> POSIX guarantees atomicity of these things up to some operation size.
> Or is that just for pipe I/O?

Even if this is recognized as a kernel bug, it seems to me that we'd better cope with it by adding an extra lock anyway, for instances that may run into this issue in the future, no? Just to be on the safe side.

> If we can't assume atomicity of relmapper file I/O, I wonder about
> pg_control as well. But on the whole, what I'm smelling is a moderately
> recently introduced kernel bug. We've been doing it this way for
> years and heard no previous reports.

True. PG_CONTROL_MAX_SAFE_SIZE relies on that. Now, the only things updating the control file are the startup process and the checkpointer, so that's less prone to conflicts than the problem reported here, and the code takes ControlFileLock where necessary.
--
Michael
Attachments
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Jun 22, 2021 at 10:11:06AM -0400, Tom Lane wrote:
>> Shouldn't we instead file a kernel bug report? I seem to recall that
>> POSIX guarantees atomicity of these things up to some operation size.
>> Or is that just for pipe I/O?

> Even if this is recognized as a bug report, it seems to me that we'd
> better cope with an extra lock for instances that may run into this
> issue anyway in the future, no? Just to be on the safe side.

Based on the evidence at hand, we can't really say whether the writes are non-atomic. If that's the case then we have much worse problems than just needing to add a lock. Thus, I think we need to get some kernel people involved.

			regards, tom lane
On Wed, Jun 23, 2021 at 2:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Your analysis seems right to me. We have to worry about both things:
> > atomicity of writes on power failure (assumed to be sector-level,
> > hence our 512 byte struct -- all good), and atomicity of concurrent
> > reads and writes (we can't assume anything at all, so r/w locking is
> > the simplest way to get a consistent read). Shouldn't relmap_redo()
> > also acquire the lock exclusively?
>
> Shouldn't we instead file a kernel bug report? I seem to recall that
> POSIX guarantees atomicity of these things up to some operation size.
> Or is that just for pipe I/O?

The spec doesn't cover us, according to some opinions at least:

https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic

But at the same time, the behaviour seems quite surprising given the parameters involved and how at least I thought this stuff worked in practice (ie what the rules about the visibility of writes that precede reads imply for the unspoken locking rule that must be the obvious reasonable implementation, and the reality of the inode-level read/write locking plainly visible in the source). It's possible that it's not working as designed in some weird edge case. I guess the next thing to do is write a minimal repro and find an expert to ask about what it's supposed to do.
On Wed, Jun 23, 2021 at 12:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> repro

Here's a quick-and-dirty repro -- just touch test-file and then run it. It fails about 0.1% of its read loops on ext4, for me, but outputs nothing on the other file systems and OSes I tried (XFS, FreeBSD UFS, APFS). It seems a bit like there is a tendency toward 64 byte (cache line?) granularity in the mashed data it prints out.
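The attached C repro is not shown in the archive, but the idea it describes can be approximated in a few lines. The sketch below is an assumption about how such a repro works (file name, loop counts, and the 0x11/0x22 patterns are mine): a forked writer repeatedly overwrites the same 512 bytes with one of two uniform patterns while the parent reads them back; any read that returns a mix of the two patterns was torn. POSIX fork()/pwrite()/pread() are required, so this is Unix-only.

```python
# Rough Python analogue of the torn-read repro described above (assumed
# shape, not the actual attachment).  One process alternates whole-buffer
# writes of 0x11s and 0x22s at offset 0; the other counts reads that come
# back containing a mixture of both patterns.
import os

def is_torn(buf: bytes) -> bool:
    """True if buf mixes bytes from two different whole-buffer writes."""
    return len(set(buf)) > 1

def run(path="test-file", loops=2000):
    with open(path, "wb") as f:
        f.write(b"\x11" * 512)                # start from a uniform image
    pid = os.fork()
    if pid == 0:                              # child: the writer
        fd = os.open(path, os.O_WRONLY)
        for i in range(loops):
            os.pwrite(fd, (b"\x11" if i % 2 else b"\x22") * 512, 0)
        os._exit(0)
    fd = os.open(path, os.O_RDONLY)           # parent: the reader
    torn = sum(is_torn(os.pread(fd, 512, 0)) for _ in range(loops))
    os.waitpid(pid, 0)
    os.close(fd)
    os.unlink(path)
    return torn

if __name__ == "__main__":
    print(run())   # nonzero means torn (non-atomic) reads were observed
```

On most filesystems this should print 0; a nonzero count would correspond to the ext4 behaviour reported in this thread.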
Attachments
23.06.2021 05:29, Thomas Munro wrote: > On Wed, Jun 23, 2021 at 12:50 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> repro > Here's a quick-and-dirty repro -- just touch test-file and then run > it. It fails about 0.1% of its read loops on ext4, for me, but > outputs nothing on other file systems and OSes I tried (XFS, FBSD UFS, > APFS). Seems a bit like there is a tendency for 64 byte (cache line?) > granularity in the mashed data it prints out. Yes, this repro fails for me too (on ext4). 11111111111111111111111111111111222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 11111111111111111111111111111111111111111111111111111111111111111111111111111111222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222221111111111111111111111111111111122222222222222222222222222222222111111111111111111111111111111112222222222222222222222222222222211111111111111111111111111111111222222222222222222222222222222222222222222222222 
11111111111111111111111111111111111111111111111111111111111111112222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 But granularity is 16-byte. (I've seen the fail count 19763 out of 1000000 loops.) Best regards, Alexander
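The digit walls above are the repro's output: each character records which writer's pattern a byte came from, so the lengths of the runs of '1's and '2's reveal the tearing granularity. A quick way to gauge it (this helper is mine, not from the thread) is the GCD of the run lengths:

```python
# Estimate tearing granularity from a '1'/'2' pattern string like the repro
# output quoted above: take the GCD of all run lengths.
from math import gcd
from functools import reduce
from itertools import groupby

def tear_granularity(sample: str) -> int:
    """GCD of the run lengths in a string of pattern markers."""
    runs = [len(list(g)) for _, g in groupby(sample)]
    return reduce(gcd, runs)

print(tear_granularity("1" * 32 + "2" * 16 + "1" * 16))   # -> 16
```

Applied to the second output line above, this yields 16, matching Alexander's 16-byte observation; Thomas's sample suggested 64 bytes (a cache line).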
Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
From:
Heikki Linnakangas
Date:
On 23/06/2021 03:50, Thomas Munro wrote:
> On Wed, Jun 23, 2021 at 2:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thomas Munro <thomas.munro@gmail.com> writes:
>>> Your analysis seems right to me. We have to worry about both things:
>>> atomicity of writes on power failure (assumed to be sector-level,
>>> hence our 512 byte struct -- all good), and atomicity of concurrent
>>> reads and writes (we can't assume anything at all, so r/w locking is
>>> the simplest way to get a consistent read). Shouldn't relmap_redo()
>>> also acquire the lock exclusively?
>>
>> Shouldn't we instead file a kernel bug report? I seem to recall that
>> POSIX guarantees atomicity of these things up to some operation size.
>> Or is that just for pipe I/O?
>
> The spec doesn't cover us according to some opinions, at least:
>
> https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic
>
> But at the same time, the behaviour seems quite surprising given the
> parameters involved and how at least I thought this stuff worked in
> practice (ie what the rules about the visibility of writes that
> precede reads imply for the unspoken locking rule that must be the
> obvious reasonable implementation, and the reality of the inode-level
> read/write locking plainly visible in the source). It's possible that
> it's not working as designed in some weird edge case. I guess the
> next thing to do is write a minimal repro and find an expert to ask
> about what it's supposed to do.

That would be nice. At this point, though, I'm convinced that POSIX doesn't give the guarantees we want, or even if it does, there are a lot of systems out there that don't respect that.

Do we rely on that anywhere else than in load_relmap_file()? I don't think we do. Let's just add the lock there.

Now, that leaves the question of pg_control. That's a different situation. It doesn't rely on read() and write() being atomic across processes, but on a 512-byte sector write not being torn on power failure. How strong is that guarantee? It used to be common wisdom with hard drives, and it was carried over to SSDs, although I'm not sure it was ever strictly speaking guaranteed. What about the new kid on the block, Persistent Memory? I found this article: https://lwn.net/Articles/686150/. At the hardware level, Persistent Memory only guarantees atomicity at cache line level (64 bytes). To provide the traditional 512-byte sector atomicity, there's a feature in Linux called BTT. Perhaps we should add a note to the docs that you should enable it.

We haven't heard of broken control files from the field, so that doesn't seem to be a problem in practice, at least not yet. Still, I would sleep better if the control file had more redundancy. For example, have two copies of it on disk. At startup, read both copies, and if they're both valid, ignore the one with the older timestamp. When updating it, write over the older copy. That way, if you crash in the middle of updating it, the old copy is still intact.

- Heikki
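The two-copy scheme described above can be sketched concretely. This is an illustrative Python model, not a patch: the file layout (two 512-byte slots), the use of a monotonic sequence number in place of a timestamp, and all names are my assumptions. Each slot holds a sequence number, the payload, and a CRC; an update overwrites whichever slot holds the older (or invalid) copy, so a crash mid-write can only damage the copy being replaced.

```python
# Sketch of a two-copy control file: updates alternate slots, reads pick the
# newest copy whose CRC validates.  Layout and names are assumptions.
import os, struct, zlib

SLOT = 512                                   # one "sector" per copy
SEQ = struct.Struct("<Q")                    # update sequence number
CRC = struct.Struct("<I")
PAYLOAD = SLOT - SEQ.size - CRC.size

def _pack(seq, payload):
    body = SEQ.pack(seq) + payload.ljust(PAYLOAD, b"\0")
    return body + CRC.pack(zlib.crc32(body))

def _unpack(slot):
    body, (crc,) = slot[:-CRC.size], CRC.unpack(slot[-CRC.size:])
    if zlib.crc32(body) != crc:
        return None                          # torn or corrupt copy
    return SEQ.unpack(body[:SEQ.size])[0], body[SEQ.size:]

def _both(path):
    with open(path, "rb") as f:
        raw = f.read(2 * SLOT)
    return [_unpack(raw[i * SLOT:(i + 1) * SLOT]) for i in (0, 1)]

def init(path, payload):
    with open(path, "wb") as f:
        f.write(_pack(1, payload) + _pack(0, payload))
        f.flush()
        os.fsync(f.fileno())

def update(path, payload):
    seqs = [c[0] if c else -1 for c in _both(path)]
    target = 0 if seqs[0] <= seqs[1] else 1  # overwrite the older copy
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, _pack(max(seqs) + 1, payload), target * SLOT)
        os.fsync(fd)
    finally:
        os.close(fd)

def load(path):
    valid = [c for c in _both(path) if c is not None]
    if not valid:
        raise IOError("both control file copies are corrupt")
    return max(valid)[1].rstrip(b"\0")       # newest valid copy wins
```

(The null-stripping on load is a demo shortcut; a real format would store the payload length. The key property is that a torn write of one slot leaves the previous update fully intact in the other.)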
On Wed, Jun 23, 2021 at 7:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Let's just add the lock there.

+1, no doubt about that.

> Now, that leaves the question with pg_control. That's a different
> situation. It doesn't rely on read() and write() being atomic across
> processes, but on a 512 sector write not being torn on power failure.
> How strong is that guarantee? It used to be common wisdom with hard
> drives, and it was carried over to SSDs although I'm not sure if it was
> ever strictly speaking guaranteed. ...

Right, it's always been tacit; no standard relevant to userspace mentions any of this AFAIK.

> ... What about the new kid on the block:
> Persistent Memory? I found this article:
> https://lwn.net/Articles/686150/. So at hardware level, Persistent
> Memory only guarantees atomicity at cache line level (64 bytes). To
> provide the traditional 512 byte sector atomicity, there's a feature in
> Linux called BTT. Perhaps we should add a note to the docs that you
> should enable that.

Right, also called sector mode. I don't know enough about that to comment really, but... if my google-fu is serving me, you can't actually use interesting sector sizes like 8KB (you have to choose 512 or 4096 bytes), so you'd have to pay for *two* synthetic atomic page schemes: BTT and our full page writes. That makes me wonder... if you need to leave full page writes on anyway, maybe it would be a better trade-off to do double writes of our special atomic files (relmapper files and control file) so that we could safely turn BTT off and avoid double taxation for relation data. Just a thought. No pmem experience here, I could be way off.

> We haven't heard of broken control files from the field, so that doesn't
> seem to be a problem in practice, at least not yet. Still, I would sleep
> better if the control file had more redundancy. For example, have two
> copies of it on disk. At startup, read both copies, and if they're both
> valid, ignore the one with older timestamp. When updating it, write over
> the older copy. That way, if you crash in the middle of updating it, the
> old copy is still intact.

+1, with a flush in between so that only one can be borked no matter how the storage works. It is interesting how few reports there are on the mailing list of control file CRC check failures though, if I'm searching for the right thing[1].

[1] https://www.postgresql.org/search/?m=1&q=calculated+CRC+checksum+does+not+match+value+stored+in+file&l=&d=-1&s=r
Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
From:
Heikki Linnakangas
Date:
On 23/06/2021 12:45, Thomas Munro wrote:
> On Wed, Jun 23, 2021 at 7:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Let's just add the lock there.
>
> +1, no doubt about that.

Committed that. Thanks for the report, Alexander!

>> ... What about the new kid on the block:
>> Persistent Memory? I found this article:
>> https://lwn.net/Articles/686150/. So at hardware level, Persistent
>> Memory only guarantees atomicity at cache line level (64 bytes). To
>> provide the traditional 512 byte sector atomicity, there's a feature in
>> Linux called BTT. Perhaps we should add a note to the docs that you
>> should enable that.
>
> Right, also called sector mode. I don't know enough about that to
> comment really, but... if my google-fu is serving me, you can't
> actually use interesting sector sizes like 8KB (you have to choose 512
> or 4096 bytes), so you'll have to pay for *two* synthetic atomic page
> schemes: BTT and our full page writes. That makes me wonder... if you
> need to leave full page writes on anyway, maybe it would be a better
> trade-off to do double writes of our special atomic files (relmapper
> files and control file) so that we could safely turn BTT off and avoid
> double-taxation for relation data. Just a thought. No pmem
> experience here, I could be way off.

Yeah, you wouldn't want to turn on BTT for anything other than the pg_control file. That's the only place where we rely on sector atomicity, I believe. For everything else, it just adds overhead. Not sure how much overhead; maybe it doesn't matter in practice.

>> We haven't heard of broken control files from the field, so that doesn't
>> seem to be a problem in practice, at least not yet. Still, I would sleep
>> better if the control file had more redundancy. For example, have two
>> copies of it on disk. At startup, read both copies, and if they're both
>> valid, ignore the one with older timestamp. When updating it, write over
>> the older copy. That way, if you crash in the middle of updating it, the
>> old copy is still intact.
>
> +1, with a flush in between so that only one can be borked no matter
> how the storage works. It is interesting how few reports there are on
> the mailing list of a control file CRC check failures though, if I'm
> searching for the right thing[1].
>
> [1] https://www.postgresql.org/search/?m=1&q=calculated+CRC+checksum+does+not+match+value+stored+in+file&l=&d=-1&s=r

If anyone wants to write a patch for that, I'd be happy to review it. And if anyone has access to a system with pmem hardware, it would be interesting to try to reproduce a torn sector and a broken control file by pulling the power plug.

- Heikki
On Thu, Jun 24, 2021 at 7:52 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 23/06/2021 12:45, Thomas Munro wrote:
> > On Wed, Jun 23, 2021 at 7:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >> Let's just add the lock there.
> >
> > +1, no doubt about that.
>
> Committed that. Thanks for the report, Alexander!

I think you missed relmap_redo (including a misleading comment).

> If anyone wants to write a patch for that, I'd be happy to review it. And
> if anyone has access to a system with pmem hardware, it would be
> interesting to try to reproduce a torn sector and broken control file by
> pulling the power plug.

I have been working on a kind of experimental file system for simulating torn sectors (and other interesting file system phenomena) as part of some work on recovery scenario testing. It's not quite ready to share yet, but it can simulate that exact failure...
Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
From:
Heikki Linnakangas
Date:
On 24/06/2021 11:06, Thomas Munro wrote:
> On Thu, Jun 24, 2021 at 7:52 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 23/06/2021 12:45, Thomas Munro wrote:
>>> On Wed, Jun 23, 2021 at 7:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>> Let's just add the lock there.
>>>
>>> +1, no doubt about that.
>>
>> Committed that. Thanks for the report, Alexander!
>
> I think you missed relmap_redo (including a misleading comment).

Fixed, thanks!

>> If anyone wants to write a patch for that, I'd be happy to review it. And
>> if anyone has access to a system with pmem hardware, it would be
>> interesting to try to reproduce a torn sector and broken control file by
>> pulling the power plug.
>
> I have been working on a kind of experimental file system for
> simulating torn sectors (and other interesting file system phenomena)
> as part of some work on recovery scenario testing, not quite ready to
> share yet but it can simulate that exact failure...

Cool! We know what happens if the pg_control file is torn, though. What I'd like to know is whether it can happen in practice with pmem, and how likely it is. For curiosity mostly; I think we have already established that it can happen, and it would be nice to protect against it in any case, even if it's rare.

- Heikki
> We haven't heard of broken control files from the field, so that doesn't
> seem to be a problem in practice, at least not yet. Still, I would sleep
> better if the control file had more redundancy. For example, have two
> copies of it on disk. At startup, read both copies, and if they're both
> valid, ignore the one with older timestamp. When updating it, write over
> the older copy. That way, if you crash in the middle of updating it, the
> old copy is still intact.

Seems like a good idea. I somehow doubt that accessing pmem through old-school read()/write() interfaces is the future of databases, but ideally this should work correctly, and the dependency is indeed unnecessary if we are prepared to jump through more hoops in just a couple of places. There may also be other benefits.

In hindsight, it's a bit strange that we don't have explicit documentation of this requirement. There is some related (and rather dated) discussion of sectors in wal.sgml, but nothing to say that we need 512-byte atomic sectors for correct operation, unless I've managed to miss it (even though it's well known among people who read the source code).

I experimented with a slightly different approach, attached, and a TAP test to exercise it. Instead of alternating between two copies, I tried writing out both copies every time with a synchronisation barrier in between (the same double-write principle some other databases use to deal with torn data pages). I think it's mostly equivalent to your scheme, though the updates are of course slower. I was thinking that there may be other benefits to having two copies of the "current" version around, for resilience (though perhaps they should be in separate files, not done here), and maybe it's better to avoid having to invent a timestamp scheme. Or maybe the two ideas should be combined: when both CRC checks pass, you could still be more careful about which one you choose than I have been here. Or maybe trying to be resilient against handwavy unknown forms of corruption is a waste of time. I'm not proposing anything here, I was just trying out ideas, for discussion.
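The double-write variant Thomas describes (write both copies every time, with a barrier in between, rather than alternating) can also be sketched. Again an illustrative model, not the attached patch: file layout and names are assumptions, and fsync() stands in for the synchronisation barrier. The barrier ensures copy A is durable before copy B is touched, so at any instant at most one copy can be torn.

```python
# Sketch of the double-write variant: both copies carry the same (current)
# contents; a crash can tear at most one of them because of the fsync
# barrier between the two writes.  Layout and names are assumptions.
import os, struct, zlib

SLOT = 512

def _seal(payload):
    body = payload.ljust(SLOT - 4, b"\0")
    return body + struct.pack("<I", zlib.crc32(body))

def write_both(path, payload):
    sealed = _seal(payload)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, sealed, 0)
        os.fsync(fd)        # barrier: copy A durable before copy B is touched
        os.pwrite(fd, sealed, SLOT)
        os.fsync(fd)
    finally:
        os.close(fd)

def read_any(path):
    with open(path, "rb") as f:
        raw = f.read(2 * SLOT)
    for off in (0, SLOT):
        body, crc = raw[off:off + SLOT - 4], raw[off + SLOT - 4:off + SLOT]
        if struct.pack("<I", zlib.crc32(body)) == crc:
            return body.rstrip(b"\0")
    raise IOError("both copies failed the CRC check")
```

Compared with the alternating scheme, this is slower per update (two writes, two flushes) but both surviving copies are always the current version, and no timestamp or sequence number is needed to pick between them.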