Discussion: Creation of an empty table is not fsync'd at checkpoint
If you create an empty table, it is not fsync'd. As soon as you insert a row into it, register_dirty_segment() gets called, and after that, the next checkpoint will fsync it. But before that, the creation itself is never fsync'd. That's obviously not great.

The lack of an fsync is a bit hard to prove because it requires a hardware failure, or a simulation of one, and can be affected by filesystem options too. But I was able to demonstrate a problem with these steps:

1. Create a VM with two virtual disks. Use ext4, with the 'data=writeback' option (I'm not sure if that's required). Install PostgreSQL on one of the virtual disks.

2. Start the server, and create a tablespace on the other disk:

CREATE TABLESPACE foospc LOCATION '/data/heikki';

3. Do this:

CREATE TABLE foo (i int) TABLESPACE foospc;
CHECKPOINT;

4. Immediately after that, kill the VM. I used:

killall -9 qemu-system-x86_64

5. Restart the VM, restart PostgreSQL. Now when you try to use the table, you get an error:

postgres=# select * from foo;
ERROR:  could not open file "pg_tblspc/81921/PG_15_202201271/5/98304": No such file or directory

I was not able to reproduce this without the tablespace on a different virtual disk; I presume that's because ext4 orders the writes so that the checkpoint implicitly always flushes the creation of the file to disk. I tried data=writeback but it didn't make a difference. But with a separate disk, it happens every time.

I think the simplest fix is to call register_dirty_segment() from mdcreate(). As in the attached. Thoughts?

- Heikki
Attachments
Hi,

On 2022-01-27 19:55:45 +0200, Heikki Linnakangas wrote:
> If you create an empty table, it is not fsync'd. As soon as you insert a row
> into it, register_dirty_segment() gets called, and after that, the next
> checkpoint will fsync it. But before that, the creation itself is never
> fsync'd. That's obviously not great.

Good catch.

> The lack of an fsync is a bit hard to prove because it requires a hardware
> failure, or a simulation of it, and can be affected by filesystem options
> too.

I've been thinking that we need a validation layer for fsyncs; this is too hard to get right without testing, and crash testing is unlikely to catch problems quickly and is resource-intensive.

My thought was that we could keep a shared hash table of all files created or dirtied at the fd layer, with the filename as the key and the current LSN as the value. We'd delete files from it when they're fsynced. At checkpoints we'd go through the list and see if there are any files from before the redo pointer that aren't yet fsynced. All of this in assert builds only, of course.

> 1. Create a VM with two virtual disks. Use ext4, with the 'data=writeback'
> option (I'm not sure if that's required). Install PostgreSQL on one of the
> virtual disks.
>
> 2. Start the server, and create a tablespace on the other disk:
>
> CREATE TABLESPACE foospc LOCATION '/data/heikki';
>
> 3. Do this:
>
> CREATE TABLE foo (i int) TABLESPACE foospc;
> CHECKPOINT;
>
> 4. Immediately after that, kill the VM. I used:
>
> killall -9 qemu-system-x86_64
>
> 5. Restart the VM, restart PostgreSQL. Now when you try to use the table,
> you get an error:
>
> postgres=# select * from foo;
> ERROR:  could not open file "pg_tblspc/81921/PG_15_202201271/5/98304": No
> such file or directory

I think it might be good to have a bunch of in-core scripting to do this kind of thing. I think we can get lower iteration times by using Linux device-mapper targets to refuse writes.
> I was not able to reproduce this without the tablespace on a different
> virtual disk, I presume because ext4 orders the writes so that the
> checkpoint implicitly always flushes the creation of the file to disk.

It's likely that the control file sync at the end of a checkpoint has the side effect of also forcing the file creation to be durable if it's on the same tablespace (it'd not make the file contents durable, but they don't exist here, so ...).

> I tried data=writeback but it didn't make a difference. But with a separate
> disk, it happens every time.

My understanding is that data=writeback just stops a journal flush of filesystem metadata from also flushing all previously dirtied data. Which is often good, because that's a lot of unnecessary writes. Even with data=writeback, prior file system journal contents are still synced to disk.

> I think the simplest fix is to call register_dirty_segment() from
> mdcreate(). As in the attached. Thoughts?

> diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
> index d26c915f90e..2dfd80ca66b 100644
> --- a/src/backend/storage/smgr/md.c
> +++ b/src/backend/storage/smgr/md.c
> @@ -225,6 +225,9 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
>  	mdfd = &reln->md_seg_fds[forkNum][0];
>  	mdfd->mdfd_vfd = fd;
>  	mdfd->mdfd_segno = 0;
> +
> +	if (!SmgrIsTemp(reln))
> +		register_dirty_segment(reln, forkNum, mdfd);

I think we might want different handling for unlogged tables and for relations that are synced to disk in other ways (e.g. index builds). But I've not re-read the relevant code.

Greetings,

Andres Freund
On Fri, Jan 28, 2022 at 7:28 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-27 19:55:45 +0200, Heikki Linnakangas wrote:
> > I was not able to reproduce this without the tablespace on a different
> > virtual disk, I presume because ext4 orders the writes so that the
> > checkpoint implicitly always flushes the creation of the file to disk.
>
> It's likely that the control file sync at the end of a checkpoint has the side
> effect of also forcing the file creation to be durable if on the same
> tablespace (it'd not make the file contents durable, but they don't exist
> here, so ...).

It might be possible to avoid that on xfs or pretty much any other file system. I wasn't following this closely, but even with ext4's recent fast commit changes, its fsync implementation still deliberately synchronises data for other file descriptors as a side effect, as summarised in [1], unlike xfs and other systems. So they've caught up with xfs's concurrent writes (and gone further than xfs by doing it also for buffered I/O, giving up even page-level atomicity, as discussed in a couple of other threads), but have not yet decided to pull the trigger on just-fsync-what-I-asked-for.

[1] https://lwn.net/Articles/842385/
On Fri, Jan 28, 2022 at 6:55 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I think the simplest fix is to call register_dirty_segment() from
> mdcreate(). As in the attached. Thoughts?

+1
Hi,

On 2022-01-28 08:01:01 +1300, Thomas Munro wrote:
> On Fri, Jan 28, 2022 at 7:28 AM Andres Freund <andres@anarazel.de> wrote:
> > It's likely that the control file sync at the end of a checkpoint has the side
> > effect of also forcing the file creation to be durable if on the same
> > tablespace (it'd not make the file contents durable, but they don't exist
> > here, so ...).
>
> It might be possible to avoid that on xfs or pretty much any other
> file system. I wasn't following this closely, but even with ext4's
> recent fast commit changes, its fsync implementation still
> deliberately synchronises data for other file descriptors as a side
> effect as summarised in [1], unlike xfs and other systems.

Not data, just metadata, right? Well, and volatile write caches (by virtue of doing an otherwise unnecessary REQ_PREFLUSH). With data=writeback, the contents of file A are not flushed to disk (as opposed to from the disk write cache) when file B is fsynced. Before and after the fast commit feature.

> So they've caught up with xfs's concurrent writes (and gone further than xfs
> by doing it also for buffered I/O, giving up even page-level atomicity, as
> discussed in a couple of other threads), but not yet decided to pull the
> trigger on just-fsync-what-I-asked-for.

I think the page level locking and the fsyncing changes above are pretty much independent?

Greetings,

Andres Freund
On Fri, Jan 28, 2022 at 8:17 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-28 08:01:01 +1300, Thomas Munro wrote:
> > It might be possible to avoid that on xfs or pretty much any other
> > file system. I wasn't following this closely, but even with ext4's
> > recent fast commit changes, its fsync implementation still
> > deliberately synchronises data for other file descriptors as a side
> > effect as summarised in [1], unlike xfs and other systems.
>
> Not data, just metadata, right? Well, and volatile write caches (by virtue of
> doing an otherwise unnecessary REQ_PREFLUSH). With data=writeback, the
> contents of file A are not flushed to disk (as opposed to from the disk write
> cache) when file B is fsynced. Before and after the fast commit feature.

Pass. [/me suppresses the urge to go and test]. I'm just going by the words in that article and Ts'o's linked email about file entanglement and the global barrier nature of fsync; I'm not studying the code.

> I think the page level locking and the fsyncing changes above are pretty much
> independent?

Right, the concurrency thing is completely unrelated and an old change (maybe so old that I'm actually thinking of ext3?). I was just commenting on two historical reasons why people used to switch to xfs (especially other databases using DIO).
I see that Noah's results showed "atomicity" (for some definition) for read/write in 3.10, but not for pread/pwrite, with the difference gone in 4.9. So I guess somewhere between those releases is where our unlocked control file access became problematic. But perhaps it wasn't fundamental, just a quirk of the implementation of read/write with an implicit position; if we'd used pread/pwrite for the control file we'd have seen it sooner. *shrug*
On Fri, Jan 28, 2022 at 8:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jan 28, 2022 at 6:55 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > I think the simplest fix is to call register_dirty_segment() from
> > mdcreate(). As in the attached. Thoughts?
>
> +1

[Testing]

Erm, so now I see my new table in checkpoint's activities:

openat(AT_FDCWD,"base/5/16399",O_RDWR,00) = 20 (0x14)
fsync(20) = 0 (0x0)

... but we still never synchronize "base/5". According to our project's reading of the POSIX tea leaves, we should be doing that to nail down the directory entry.
On 28/01/2022 00:11, Thomas Munro wrote:
> Erm, so now I see my new table in checkpoint's activities:
>
> openat(AT_FDCWD,"base/5/16399",O_RDWR,00) = 20 (0x14)
> fsync(20) = 0 (0x0)
>
> ... but we still never synchronize "base/5". According to our
> project's reading of the POSIX tea leaves, we should be doing that to
> nail down the directory entry.

Really? 'base/5' is fsync'd by initdb, when it's created. I didn't think we tried to fsync() the directory when a new file is created in it. We do that with durable_rename() and durable_unlink(), but not with file creation.

Hmm, if a relation is dropped, we use plain unlink() to delete it (at the next checkpoint). Should we use durable_unlink() there, or otherwise arrange to fsync() the parent directory?

- Heikki
Hi,

On 2022-01-28 00:39:22 +0200, Heikki Linnakangas wrote:
> On 28/01/2022 00:11, Thomas Munro wrote:
> > ... but we still never synchronize "base/5". According to our
> > project's reading of the POSIX tea leaves we should be doing that to
> > nail down the directory entry.
>
> Really? 'base/5' is fsync'd by initdb, when it's created. I didn't think we
> try to fsync() the directory, when a new file is created in it.

I've not heard concrete reports of it being needed (whereas the directory fsync being needed after a rename() is pretty easy to reproduce). There are some technical reasons why it'd make sense for it to only really be needed for things after the initial file creation, but I'm not sure it's a good idea to rely on that.

From the filesystem's point of view, a file doesn't necessarily know which directory it is in, and it can be in several at once due to hardlinks. Scheduling all the directory entries to be made durable is only really feasible if all metadata operations are globally ordered - which we don't want for scalability reasons. But when initially creating a file, it's always in the context of a directory. I assume that most journalled filesystems are going to have the creation of the directory entry and of the file itself be related journalling operations. But I wouldn't bet on that for all of them, and theoretically we claim to be usable on non-journalled filesystems as well...

Greetings,

Andres Freund
On Fri, Jan 28, 2022 at 12:36 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-28 00:39:22 +0200, Heikki Linnakangas wrote:
> > Really? 'base/5' is fsync'd by initdb, when it's created. I didn't think we
> > try to fsync() the directory, when a new file is created in it.
>
> I've not heard concrete reports of it being needed (whereas the directory
> fsync being needed after a rename() is pretty easy to reproduce). There are
> some technical reasons why it'd make sense for it to only really be needed
> for things after the initial file creation, but I'm not sure it's a good idea
> to rely on that.

I don't personally know of any system where that would break either. I base my paranoia on man pages and the Austin Group's famous open ticket 0000672. Those don't mention special treatment for O_CREAT, though I get the technical reason why it's different. I think we should probably do something about this even if it's hypothetical, but I'm happy to call it a separate topic and follow up later (like commit aca74843 did for SLRUs).
On Fri, Jan 28, 2022 at 11:39 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm, if a relation is dropped, we use plain unlink() to delete it (at
> the next checkpoint). Should we use durable_unlink() there, or otherwise
> arrange to fsync() the parent directory?

Hmmmmm. I think the latter might be a good idea, but not because of the file we unlink after checkpoint. Rationale: on commit, we truncate all segments, and unlink segments > 0. After checkpoint, we unlink segment 0 (the tombstone preventing relfilenode recycling). So it might close some storage-leak windows if we did:

1. register_dirty_segment() on truncated segment 0, so that at checkpoint it is fsynced. That means that if we lose power between the checkpoint and the unlink(), at least its size is durably zero and not 1GB. I don't know how to completely avoid leaking empty tombstone files if you lose power in that window. durable_unlink() may narrow the window, but it's still on the wrong side of a checkpoint and won't be replayed in crash recovery. I hope we can get rid of tombstones completely, as Dilip is attempting in [1].

2. fsync() the containing directory as part of checkpointing, so the unlinks of non-tombstone segments are made durable at checkpoint. Otherwise, after a checkpoint, you might lose power, come back up, and find the segments still present in the directory and, worse, not truncated. AFAICS we have no defences against zombie segments > 0 if the tombstone is gone, allowing recycling. Zombie segments could appear to be concatenated onto the next user of the relfilenode. That problem goes away with Dilip's project.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BfTEcYQvc18aEbrJjPri99A09JZcEXzXjT5h57S%2BAgkw%40mail.gmail.com
This patch required a trivial rebase after conflicting with bfcf1b3480, so I've attached a v2 to get the CFBot to run it.

--
Daniel Gustafsson
Attachments
On 03/07/2023 15:59, Daniel Gustafsson wrote:
> This patch required a trivial rebase after conflicting with bfcf1b3480 so I've
> attached a v2 to get the CFBot to run this.

Thank you! Pushed to all supported branches (without forgetting the new REL_16_STABLE :-D).

--
Heikki Linnakangas
Neon (https://neon.tech)