Обсуждение: Creation of an empty table is not fsync'd at checkpoint

Поиск
Список
Период
Сортировка

Creation of an empty table is not fsync'd at checkpoint

От
Heikki Linnakangas
Дата:
If you create an empty table, it is not fsync'd. As soon as you insert a 
row to it, register_dirty_segment() gets called, and after that, the 
next checkpoint will fsync it. But before that, the creation itself is 
never fsync'd. That's obviously not great.

The lack of an fsync is a bit hard to prove because it requires a 
hardware failure, or a simulation of it, and can be affected by 
filesystem options too. But I was able to demonstrate a problem with 
these steps:

1. Create a VM with two virtual disks. Use ext4, with 'data=writeback' 
option (I'm not sure if that's required). Install PostgreSQL on one of 
the virtual disks.

2. Start the server, and create a tablespace on the other disk:

CREATE TABLESPACE foospc LOCATION '/data/heikki';

3. Do this:

CREATE TABLE foo (i int) TABLESPACE foospc;
CHECKPOINT;

4. Immediately after that, kill the VM. I used:

killall -9 qemu-system-x86_64

5. Restart the VM, restart PostgreSQL. Now when you try to use the 
table, you get an error:

postgres=# select * from crashtest ;
ERROR:  could not open file "pg_tblspc/81921/PG_15_202201271/5/98304": 
No such file or directory

I was not able to reproduce this without the tablespace on a different 
virtual disk, I presume because ext4 orders the writes so that the 
checkpoint implicitly always flushes the creation of the file to disk. I 
tried data=writeback but it didn't make a difference. But with a 
separate disk, it happens every time.

I think the simplest fix is to call register_dirty_segment() from 
mdcreate(). As in the attached. Thoughts?

- Heikki
Вложения

Re: Creation of an empty table is not fsync'd at checkpoint

От
Andres Freund
Дата:
Hi,

On 2022-01-27 19:55:45 +0200, Heikki Linnakangas wrote:
> If you create an empty table, it is not fsync'd. As soon as you insert a row
> to it, register_dirty_segment() gets called, and after that, the next
> checkpoint will fsync it. But before that, the creation itself is never
> fsync'd. That's obviously not great.

Good catch.


> The lack of an fsync is a bit hard to prove because it requires a hardware
> failure, or a simulation of it, and can be affected by filesystem options
> too.

I've been thinking that we need a validation layer for fsyncs, it's too hard
to get right without testing, and crash testing is not likel enough to catch
problems quickly / resource intensive.

My thought was that we could keep a shared hash table of all files created /
dirtied at the fd layer, with the filename as key and the value the current
LSN. We'd delete files from it when they're fsynced. At checkpoints we go
through the list and see if there's any files from before the redo that aren't
yet fsynced.  All of this in assert builds only, of course.



> 
> 1. Create a VM with two virtual disks. Use ext4, with 'data=writeback'
> option (I'm not sure if that's required). Install PostgreSQL on one of the
> virtual disks.
> 
> 2. Start the server, and create a tablespace on the other disk:
> 
> CREATE TABLESPACE foospc LOCATION '/data/heikki';
> 
> 3. Do this:
> 
> CREATE TABLE foo (i int) TABLESPACE foospc;
> CHECKPOINT;
> 
> 4. Immediately after that, kill the VM. I used:
> 
> killall -9 qemu-system-x86_64
> 
> 5. Restart the VM, restart PostgreSQL. Now when you try to use the table,
> you get an error:
> 
> postgres=# select * from crashtest ;
> ERROR:  could not open file "pg_tblspc/81921/PG_15_202201271/5/98304": No
> such file or directory

I think it might be good to have a bunch of in-core scripting to do this kind
of thing.  I think we can get lower iteration time by using linux
device-mapper targets to refuse writes.


> I was not able to reproduce this without the tablespace on a different
> virtual disk, I presume because ext4 orders the writes so that the
> checkpoint implicitly always flushes the creation of the file to disk.

It's likely that the control file sync at the end of a checkpoint has the side
effect of also forcing the file creation to be durable if on the same
tablespace (it'd not make the file contents durable, but they don't exist
here, so ...).


> I tried data=writeback but it didn't make a difference. But with a separate
> disk, it happens every time.

My understanding is that data=writeback just stops a journal flush of
filesystem metadata from also flushing all previously dirtied data. Which
often is good, because that's a lot of unnecessary writes. Even with
data=writeback prior file system journal contents are still synced to disk.


> I think the simplest fix is to call register_dirty_segment() from
> mdcreate(). As in the attached. Thoughts?

> diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
> index d26c915f90e..2dfd80ca66b 100644
> --- a/src/backend/storage/smgr/md.c
> +++ b/src/backend/storage/smgr/md.c
> @@ -225,6 +225,9 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
>      mdfd = &reln->md_seg_fds[forkNum][0];
>      mdfd->mdfd_vfd = fd;
>      mdfd->mdfd_segno = 0;
> +
> +    if (!SmgrIsTemp(reln))
> +        register_dirty_segment(reln, forkNum, mdfd);

I think we might want different handling for unlogged tables and for relations
that are synced to disk in other ways (e.g. index builds). But I've not
re-read the relevant code.

Greetings,

Andres Freund



Re: Creation of an empty table is not fsync'd at checkpoint

От
Thomas Munro
Дата:
On Fri, Jan 28, 2022 at 7:28 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-27 19:55:45 +0200, Heikki Linnakangas wrote:
> > I was not able to reproduce this without the tablespace on a different
> > virtual disk, I presume because ext4 orders the writes so that the
> > checkpoint implicitly always flushes the creation of the file to disk.
>
> It's likely that the control file sync at the end of a checkpoint has the side
> effect of also forcing the file creation to be durable if on the same

> tablespace (it'd not make the file contents durable, but they don't exist
> here, so ...).

It might be possible to avoid that on xfs or pretty much any other
file system.  I wasn't following this closely, but even with ext4's
recent fast commit changes, its fsync implementation still
deliberately synchronises data for other file descriptors as a side
effect as summarised in [1], unlike xfs and other systems.  So they've
caught up with xfs's concurrent writes (and gone further than xfs by
doing it also for buffered I/O giving up even page-level atomicity, as
discussed in a couple of other threads), but not yet decided to pull
the trigger on just-fsync-what-I-asked-for.

[1] https://lwn.net/Articles/842385/



Re: Creation of an empty table is not fsync'd at checkpoint

От
Thomas Munro
Дата:
On Fri, Jan 28, 2022 at 6:55 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I think the simplest fix is to call register_dirty_segment() from
> mdcreate(). As in the attached. Thoughts?

+1



Re: Creation of an empty table is not fsync'd at checkpoint

От
Andres Freund
Дата:
Hi,

On 2022-01-28 08:01:01 +1300, Thomas Munro wrote:
> On Fri, Jan 28, 2022 at 7:28 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-01-27 19:55:45 +0200, Heikki Linnakangas wrote:
> > > I was not able to reproduce this without the tablespace on a different
> > > virtual disk, I presume because ext4 orders the writes so that the
> > > checkpoint implicitly always flushes the creation of the file to disk.
> >
> > It's likely that the control file sync at the end of a checkpoint has the side
> > effect of also forcing the file creation to be durable if on the same
> 
> > tablespace (it'd not make the file contents durable, but they don't exist
> > here, so ...).
> 
> It might be possible to avoid that on xfs or pretty much any other
> file system.  I wasn't following this closely, but even with ext4's
> recent fast commit changes, its fsync implementation still
> deliberately synchronises data for other file descriptors as a side
> effect as summarised in [1], unlike xfs and other systems.

Not data, just metadata, right? Well, and volatile write caches (by virtue of
doing an otherwise unnecessary REQ_PREFLUSH). With data=writeback file
contents for file a are not flushed to disk (rather than from disk write
cache) when file b is fsynced. Before/After the fast commit feature.


> So they've caught up with xfs's concurrent writes (and gone further than xfs
> by doing it also for buffered I/O giving up even page-level atomicity, as
> discussed in a couple of other threads), but not yet decided to pull the
> trigger on just-fsync-what-I-asked-for.

I don't think the page level locking and the changes above / fsyncing are
pretty much independent?

Greetings,

Andres Freund



Re: Creation of an empty table is not fsync'd at checkpoint

От
Thomas Munro
Дата:
On Fri, Jan 28, 2022 at 8:17 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-28 08:01:01 +1300, Thomas Munro wrote:
> > It might be possible to avoid that on xfs or pretty much any other
> > file system.  I wasn't following this closely, but even with ext4's
> > recent fast commit changes, its fsync implementation still
> > deliberately synchronises data for other file descriptors as a side
> > effect as summarised in [1], unlike xfs and other systems.
>
> Not data, just metadata, right? Well, and volatile write caches (by virtue of
> doing an otherwise unnecessary REQ_PREFLUSH). With data=writeback file
> contents for file a are not flushed to disk (rather than from disk write
> cache) when file b is fsynced. Before/After the fast commit feature.

Pass.  [/me suppresses the urge to go and test].  I'm just going by
words in that article and T'so's linked email about file entanglement
and the global barrier nature of fsync.  I'm not studying the code.

> > So they've caught up with xfs's concurrent writes (and gone further than xfs
> > by doing it also for buffered I/O giving up even page-level atomicity, as
> > discussed in a couple of other threads), but not yet decided to pull the
> > trigger on just-fsync-what-I-asked-for.
>
> I don't think the page level locking and the changes above / fsyncing are
> pretty much independent?

Right, the concurrency thing is completely unrelated and an old change
(maybe so old that I'm actually thinking of ext3?).  I was just
commenting on two historical reasons why people used to switch to xfs
(especially other databases using DIO).  I see that Noah's results
showed "atomicity" (for some definition) for read/write in 3.10, but
not for pread/pwrite, with the difference gone in 4.9, so I guess
perhaps somewhere between those releases is where our unlocked control
file access became problematic, but perhaps it wasn't fundamental,
just a quirk of the implementation of read/write with implicit
position, and if we'd used pread/pwrite for the control file we'd have
seen it sooner.  *shrug*



Re: Creation of an empty table is not fsync'd at checkpoint

От
Thomas Munro
Дата:
On Fri, Jan 28, 2022 at 8:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jan 28, 2022 at 6:55 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > I think the simplest fix is to call register_dirty_segment() from
> > mdcreate(). As in the attached. Thoughts?
>
> +1

[Testing]

Erm, so now I see my new table in checkpoint's activities:

openat(AT_FDCWD,"base/5/16399",O_RDWR,00)     = 20 (0x14)
fsync(20)                     = 0 (0x0)

... but we still never synchronize "base/5".  According to our
project's reading of the POSIX tea leaves we should be doing that to
nail down the directory entry.



Re: Creation of an empty table is not fsync'd at checkpoint

От
Heikki Linnakangas
Дата:
On 28/01/2022 00:11, Thomas Munro wrote:
> On Fri, Jan 28, 2022 at 8:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> On Fri, Jan 28, 2022 at 6:55 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I think the simplest fix is to call register_dirty_segment() from
>>> mdcreate(). As in the attached. Thoughts?
>>
>> +1
> 
> [Testing]
> 
> Erm, so now I see my new table in checkpoint's activities:
> 
> openat(AT_FDCWD,"base/5/16399",O_RDWR,00)     = 20 (0x14)
> fsync(20)                     = 0 (0x0)
> 
> ... but we still never synchronize "base/5".  According to our
> project's reading of the POSIX tea leaves we should be doing that to
> nail down the directory entry.

Really? 'base/5' is fsync'd by initdb, when it's created. I didn't think 
we try to fsync() the directory, when a new file is created in it. We do 
that with durable_rename() and durable_unlink(), but not with file creation.

Hmm, if a relation is dropped, we use plain unlink() to delete it (at 
the next checkpoint). Should we use durable_unlink() there, or otherwise 
arrange to fsync() the parent directory?

- Heikki



Re: Creation of an empty table is not fsync'd at checkpoint

От
Andres Freund
Дата:
Hi,

On 2022-01-28 00:39:22 +0200, Heikki Linnakangas wrote:
> On 28/01/2022 00:11, Thomas Munro wrote:
> > ... but we still never synchronize "base/5".  According to our
> > project's reading of the POSIX tea leaves we should be doing that to
> > nail down the directory entry.
> 
> Really? 'base/5' is fsync'd by initdb, when it's created. I didn't think we
> try to fsync() the directory, when a new file is created in it.

I've not heard of concrete reports of it being needed (whereas the directory
fsync being needed after a rename() is pretty easy to be reproduce). There's
some technical reasons why it'd make sense for it to only be really needed for
things after the initial file creation, but I'm not sure it's a good idea to
rely on it.

From the filesystem POV a file doesn't necessarily know which directory it is
in, and it can be several at once due to hardlinks. Scheduling all the
directories entries to be durable only really is feasible if all metadata
operations are globally ordered - which we don't really want for scalability
reasons.

But when initially creating a file, it's always in the context of a
directory. I assume that most journalled filesystem is going to have the
creation of the directory entry and of the file itself be related journalling
operations. But I wouldn't bet it's all, and theoretically we claim to be
usable on non-journalled filesystems as well...

Greetings,

Andres Freund



Re: Creation of an empty table is not fsync'd at checkpoint

От
Thomas Munro
Дата:
On Fri, Jan 28, 2022 at 12:36 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-28 00:39:22 +0200, Heikki Linnakangas wrote:
> > On 28/01/2022 00:11, Thomas Munro wrote:
> > > ... but we still never synchronize "base/5".  According to our
> > > project's reading of the POSIX tea leaves we should be doing that to
> > > nail down the directory entry.
> >
> > Really? 'base/5' is fsync'd by initdb, when it's created. I didn't think we
> > try to fsync() the directory, when a new file is created in it.
>
> I've not heard of concrete reports of it being needed (whereas the directory
> fsync being needed after a rename() is pretty easy to be reproduce). There's
> some technical reasons why it'd make sense for it to only be really needed for
> things after the initial file creation, but I'm not sure it's a good idea to
> rely on it.

I don't personally know of any system where that would break either.
I base my paranoia on man pages and the Austin Group's famous open
ticket 0000672.  Those don't mention special treatment for O_CREAT,
though I get the technical reason why it's different.  I think we
should probably do something about that even if it's hypothetical, but
I'm happy to call it a separate topic and follow up later (like commit
aca74843 did for SLRUs).



Re: Creation of an empty table is not fsync'd at checkpoint

От
Thomas Munro
Дата:
On Fri, Jan 28, 2022 at 11:39 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm, if a relation is dropped, we use plain unlink() to delete it (at
> the next checkpoint). Should we use durable_unlink() there, or otherwise
> arrange to fsync() the parent directory?

Hmmmmm.  I think the latter might be a good idea, but not because of
the file we unlink after checkpoint.  Rationale:  On commit, we
truncate all segments, and unlink segments > 0.  After checkpoint, we
unlink segment 0 (the tombstone preventing relfilenode recycling).
So, it might close some storage leak windows if we did:

1.  register_dirty_segment() on truncated segment 0, so that at
checkpoint it is fsynced.  That means that if we lose power between
the checkpoint and the unlink(), at least its size is durably zero and
not 1GB.  I don't know how to completely avoid leaking empty tombstone
files if you lose power in that window.  durable_unlink() may narrow
the window but it's still on the wrong side of a checkpoint and won't
be replayed on crash recovery.  I hope we can get rid of tombstones
completely as Dilip is attempting in [1].

2.  fsync() the containing directory as part of checkpointing, so the
unlinks of non-tombstone segments are made durable at checkpoint.
Otherwise, after checkpoint, you might lose power, come back up and
find the segments are still present in the directory, and worse, not
truncated.

AFAICS we have no defences against zombie segments > 0 if the
tombstone is gone, allowing recycling.  Zombie segments could appear
to be concatenated to the next user of the relfilenode.  That problem
goes away with Dilip's project.

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BfTEcYQvc18aEbrJjPri99A09JZcEXzXjT5h57S%2BAgkw%40mail.gmail.com



Re: Creation of an empty table is not fsync'd at checkpoint

От
Daniel Gustafsson
Дата:
This patch required a trivial rebase after conflicting with bfcf1b3480 so I've
attached a v2 to get the CFBot to run this.

--
Daniel Gustafsson


Вложения

Re: Creation of an empty table is not fsync'd at checkpoint

От
Heikki Linnakangas
Дата:
On 03/07/2023 15:59, Daniel Gustafsson wrote:
> This patch required a trivial rebase after conflicting with bfcf1b3480 so I've
> attached a v2 to get the CFBot to run this.

Thank you! Pushed to all supported branches. (Without forgetting the new 
REL_16_STABLE :-D ).

-- 
Heikki Linnakangas
Neon (https://neon.tech)