Thread: Avoiding smgrimmedsync() during nbtree index builds


Avoiding smgrimmedsync() during nbtree index builds

From:
Andres Freund
Date:
Hi,

Every nbtree index build currently does an smgrimmedsync at the end:

/*
 * Read tuples in correct sort order from tuplesort, and load them into
 * btree leaves.
 */
static void
_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
...
    /*
     * When we WAL-logged index pages, we must nonetheless fsync index files.
     * Since we're building outside shared buffers, a CHECKPOINT occurring
     * during the build has no way to flush the previously written data to
     * disk (indeed it won't know the index even exists).  A crash later on
     * would replay WAL from the checkpoint, therefore it wouldn't replay our
     * earlier WAL entries. If we do not fsync those pages here, they might
     * still not be on disk when the crash occurs.
     */
    if (wstate->btws_use_wal)
    {
        RelationOpenSmgr(wstate->index);
        smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
    }

In cases where we create lots of small indexes, e.g. because of an initial
schema load, partition creation or something like that, that turns out
to be a major limiting factor (unless one turns fsync off).


One way to address that would be to put newly built indexes into s_b
(using a strategy, to avoid blowing out the whole cache), instead of
using smgrwrite() etc directly. But that's a discussion with a bit more
complex tradeoffs.
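
For illustration, a rough sketch of that approach (the function names are
real, but the flow is hand-waved rather than a worked-out design):

/*
 * Sketch: build index pages through shared buffers with a bulk-write
 * strategy, so the build reuses a small ring of buffers instead of
 * evicting large parts of the existing cache.
 */
BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKWRITE);
Buffer      buf;

buf = ReadBufferExtended(index, MAIN_FORKNUM, P_NEW,
                         RBM_ZERO_AND_LOCK, strategy);
/* ... fill in the page contents ... */
START_CRIT_SECTION();
MarkBufferDirty(buf);
log_newpage_buffer(buf, true);
END_CRIT_SECTION();
UnlockReleaseBuffer(buf);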


What I wonder is why the issue addressed in the comment I copied above
can't more efficiently be addressed using sync requests, like we do for
other writes?  It's possibly a bit more complicated than just passing
skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?
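
Concretely, that would look something like this in _bt_blwritepage() (a
sketch; the checkpoint-interaction caveats below still apply):

/*
 * Pass skipFsync=false so a sync request for the segment is queued with
 * the checkpointer, instead of skipping it and relying on the
 * smgrimmedsync() at the end of _bt_load().
 */
smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM,
           wstate->btws_pages_written++,
           (char *) wstate->btws_zeropage,
           false);

/* ... and the smgrimmedsync() call at the end of _bt_load() goes away */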


A quick hack (probably not quite correct!) to evaluate the benefit shows
that the attached script takes 2m17.223s with the smgrimmedsync and
0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
the former case, CPU bound in the latter.

Creating lots of tables with indexes (directly or indirectly through
relations having a toast table) is pretty common, particularly after the
introduction of partitioning.


Thinking through the correctness of replacing smgrimmedsync() with sync
requests, the potential problems that I can see are:

1) redo point falls between the log_newpage() and the
   write()/register_dirty_segment() in smgrextend/smgrwrite.
2) redo point falls between write() and register_dirty_segment()

But both of these are fine in the context of initially filling a newly
created relfilenode, as far as I can tell? Otherwise the current
smgrimmedsync() approach wouldn't be safe either, as far as I can tell?

Greetings,

Andres Freund

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Heikki Linnakangas
Date:
On 21/01/2021 22:36, Andres Freund wrote:
> Hi,
> 
> Every nbtree index build currently does an smgrimmedsync at the end:
> 
> /*
>   * Read tuples in correct sort order from tuplesort, and load them into
>   * btree leaves.
>   */
> static void
> _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
> ...
>     /*
>      * When we WAL-logged index pages, we must nonetheless fsync index files.
>      * Since we're building outside shared buffers, a CHECKPOINT occurring
>      * during the build has no way to flush the previously written data to
>      * disk (indeed it won't know the index even exists).  A crash later on
>      * would replay WAL from the checkpoint, therefore it wouldn't replay our
>      * earlier WAL entries. If we do not fsync those pages here, they might
>      * still not be on disk when the crash occurs.
>      */
>     if (wstate->btws_use_wal)
>     {
>         RelationOpenSmgr(wstate->index);
>         smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
>     }
> 
> In cases we create lots of small indexes, e.g. because of an initial
> schema load, partition creation or something like that, that turns out
> to be a major limiting factor (unless one turns fsync off).
> 
> 
> One way to address that would be to put newly built indexes into s_b
> (using a strategy, to avoid blowing out the whole cache), instead of
> using smgrwrite() etc directly. But that's a discussion with a bit more
> complex tradeoffs.
> 
> 
> What I wonder is why the issue addressed in the comment I copied above
> can't more efficiently be addressed using sync requests, like we do for
> other writes?  It's possibly bit more complicated than just passing
> skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?

Makes sense.

> A quick hack (probably not quite correct!) to evaluate the benefit shows
> that the attached script takes 2m17.223s with the smgrimmedsync and
> 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> the former case, CPU bound in the latter.
> 
> Creating lots of tables with indexes (directly or indirectly through
> relations having a toast table) is pretty common, particularly after the
> introduction of partitioning.
> 
> 
> Thinking through the correctness of replacing smgrimmedsync() with sync
> requests, the potential problems that I can see are:
> 
> 1) redo point falls between the log_newpage() and the
>     write()/register_dirty_segment() in smgrextend/smgrwrite.
> 2) redo point falls between write() and register_dirty_segment()
> 
> But both of these are fine in the context of initially filling a newly
> created relfilenode, as far as I can tell? Otherwise the current
> smgrimmedsync() approach wouldn't be safe either, as far as I can tell?

Hmm. If the redo point falls between write() and the 
register_dirty_segment(), and the checkpointer finishes the whole 
checkpoint before register_dirty_segment(), you are not safe. That can't 
happen with write from the buffer manager, because the checkpointer 
would block waiting for the flush of the buffer to finish.
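
To spell the hazardous interleaving out (a sketch):

/*
 * backend:      write() index page            -- in the kernel cache only
 * checkpointer: computes the new redo point,
 *               processes pending-ops queue,  -- our segment isn't in the
 *               completes the checkpoint         queue yet
 * backend:      register_dirty_segment()      -- too late: the request is
 *                                                only absorbed by the
 *                                                *next* checkpoint
 * <crash>                                     -- replay starts at the
 *                                                completed checkpoint and
 *                                                never restores the page
 */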

- Heikki



Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Andres Freund
Date:
Hi,

On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> On 21/01/2021 22:36, Andres Freund wrote:
> > A quick hack (probably not quite correct!) to evaluate the benefit shows
> > that the attached script takes 2m17.223s with the smgrimmedsync and
> > 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> > the former case, CPU bound in the latter.
> >
> > Creating lots of tables with indexes (directly or indirectly through
> > relations having a toast table) is pretty common, particularly after the
> > introduction of partitioning.
> >
> >
> > Thinking through the correctness of replacing smgrimmedsync() with sync
> > requests, the potential problems that I can see are:
> >
> > 1) redo point falls between the log_newpage() and the
> >     write()/register_dirty_segment() in smgrextend/smgrwrite.
> > 2) redo point falls between write() and register_dirty_segment()
> >
> > But both of these are fine in the context of initially filling a newly
> > created relfilenode, as far as I can tell? Otherwise the current
> > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
>
> Hmm. If the redo point falls between write() and the
> register_dirty_segment(), and the checkpointer finishes the whole checkpoint
> before register_dirty_segment(), you are not safe. That can't happen with
> write from the buffer manager, because the checkpointer would block waiting
> for the flush of the buffer to finish.

Hm, right.

The easiest way to address that race would be to just record the redo
pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
checkpoint started since the start of the index build.

Another approach would be to utilize PGPROC.delayChkpt, but I would
rather not unnecessarily expand the use of that.

It's kind of interesting - in my aio branch I moved the
register_dirty_segment() to before the actual asynchronous write (due to
availability of the necessary data), which ought to be safe because of
the buffer interlocking. But that doesn't work here, or for other places
doing writes without going through s_b.  It'd be great if we could come
up with a general solution, but I don't immediately see anything great.

The best I can come up with is adding helper functions to wrap some of
the complexity for "unbuffered" writes of doing an immedsync iff the
redo pointer changed. Something very roughly like

typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
void unbuffered_prep(UnbufferedWriteState* state);
void unbuffered_write(UnbufferedWriteState* state, ...);
void unbuffered_extend(UnbufferedWriteState* state, ...);
void unbuffered_finish(UnbufferedWriteState* state);

which wouldn't just do the dance to avoid the immedsync() if possible,
but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
if we get that [1]).
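
Fleshed out just slightly, prep/finish might look about like this (the
smgr/fork arguments to unbuffered_finish() are additions for the sketch,
and this assumes the intervening smgrwrite()/smgrextend() calls passed
skipFsync=false so sync requests are already queued):

void
unbuffered_prep(UnbufferedWriteState *state)
{
    /* remember where the redo pointer was when the writes started */
    state->redo = GetRedoRecPtr();
    state->numwrites = 0;
}

void
unbuffered_finish(UnbufferedWriteState *state,
                  SMgrRelation smgr, ForkNumber forknum)
{
    /*
     * If a checkpoint began since unbuffered_prep(), our sync requests
     * may have missed it, so sync immediately; otherwise leave the
     * fsync to the checkpointer.
     */
    if (GetRedoRecPtr() != state->redo)
        smgrimmedsync(smgr, forknum);
}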

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20210112193431.2edcz776qjen7kao%40alap3.anarazel.de



Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Melanie Plageman
Date:
So, I've written a patch which avoids doing the immediate fsync for
index builds either by using shared buffers or by queueing sync requests
for the checkpointer. If a checkpoint starts during the index build and
the backend is not using shared buffers for the index build, it will
need to do the fsync.

The reviewer will notice that _bt_load() extends the index relation for
the metapage before beginning the actual load of leaf pages but does not
actually write the metapage until the end of the index build. When using
shared buffers, it was difficult to create block 0 of the index after
creating all of the other blocks, as the block number is assigned inside
of ReadBuffer_common(), and it doesn't really work with the current
bufmgr API to extend a relation with a caller-specified block number.

I am not entirely sure of the correctness of doing an smgrextend() (when
not using shared buffers) without writing any WAL. However, the metapage
contents are not written until after WAL logging them later in
_bt_blwritepage(), so, perhaps it is okay?

I am also not fond of the change to the signature of _bt_uppershutdown()
that this implementation forces. Before doing the index build, I now
reserve (pin and lock) a shared buffer for the metapage (when using
shared buffers) or allocate a page for it (when not), and must then pass
that along to _bt_uppershutdown() after the rest of the index build is
done. I don't know that it seems
incorrect -- more that it feels a bit messy (and inefficient) to hold
onto that shared buffer or memory for the duration of the index build,
during which I have no intention of doing anything with that buffer or
memory. However, the alternative I devised was to change
ReadBuffer_common() or to add a new ReadBufferExtended() mode which
indicated that the caller would specify the block number and whether or
not it was an extend, which also didn't seem right.

For the extensions of the index done during index build, I use
ReadBufferExtended() directly instead of _bt_getbuf() for a few reasons.
I thought (am not sure) that I don't need to do
LockRelationForExtension() during index build. Also, I decided to use
RBM_ZERO_AND_LOCK mode so that I had an exclusive lock on the buffer
content instead of doing _bt_lockbuf() (which is what _bt_getbuf()
does). And, most of the places I added the call to ReadBufferExtended(),
the non-shared buffer code path is already initializing the page, so it
made more sense to just share that codepath.
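
Concretely, the shape is roughly the following (a sketch - the extra
_bt_uppershutdown() argument is illustrative, not a final signature):

/* reserve block 0 for the metapage before loading any leaf pages */
metabuf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW,
                             RBM_ZERO_AND_LOCK, NULL);
Assert(BufferGetBlockNumber(metabuf) == BTREE_METAPAGE);

/* ... load the leaf pages and build the upper levels ... */

/* only now fill in and write out the metapage */
_bt_uppershutdown(wstate, state, metabuf);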

I considered whether or not it made sense to add a new btree utility
function which calls ReadBufferExtended() in this way, but I wasn't
sure how much that would buy me. The other place it might be usable is
btvacuumpage(), but that case is different enough that I'm not
even sure what the function would be called -- basically it would just
be an alternative to _bt_getbuf() for a couple of somewhat unrelated edge
cases.

On Thu, Jan 21, 2021 at 5:51 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> > On 21/01/2021 22:36, Andres Freund wrote:
> > > A quick hack (probably not quite correct!) to evaluate the benefit shows
> > > that the attached script takes 2m17.223s with the smgrimmedsync and
> > > 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> > > the former case, CPU bound in the latter.
> > >
> > > Creating lots of tables with indexes (directly or indirectly through
> > > relations having a toast table) is pretty common, particularly after the
> > > introduction of partitioning.
> > >

Moving the builds of indexes that would fit in shared buffers back into
shared buffers eliminates the need to write them out and fsync them when
they will be used right away and thus read straight back into shared
buffers. This avoids some of the unnecessary fsyncs Andres is talking
about here, as well as some of the extra IO required to write the indexes
and then read them back into shared buffers.

I have dummy criteria for whether or not to use shared buffers (if the
number of tuples to be indexed is > 1000). I am considering using a
threshold of some percentage of the size of shared buffers as the
actual criteria for determining where to do the index build.

> > >
> > > Thinking through the correctness of replacing smgrimmedsync() with sync
> > > requests, the potential problems that I can see are:
> > >
> > > 1) redo point falls between the log_newpage() and the
> > >     write()/register_dirty_segment() in smgrextend/smgrwrite.
> > > 2) redo point falls between write() and register_dirty_segment()
> > >
> > > But both of these are fine in the context of initially filling a newly
> > > created relfilenode, as far as I can tell? Otherwise the current
> > > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
> >
> > Hmm. If the redo point falls between write() and the
> > register_dirty_segment(), and the checkpointer finishes the whole checkpoint
> > before register_dirty_segment(), you are not safe. That can't happen with
> > write from the buffer manager, because the checkpointer would block waiting
> > for the flush of the buffer to finish.
>
> Hm, right.
>
> The easiest way to address that race would be to just record the redo
> pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
> checkpoint started since the start of the index build.
>
> Another approach would be to utilize PGPROC.delayChkpt, but I would
> rather not unnecessarily expand the use of that.
>
> It's kind of interesting - in my aio branch I moved the
> register_dirty_segment() to before the actual asynchronous write (due to
> availability of the necessary data), which ought to be safe because of
> the buffer interlocking. But that doesn't work here, or for other places
> doing writes without going through s_b.  It'd be great if we could come
> up with a general solution, but I don't immediately see anything great.
>
> The best I can come up with is adding helper functions to wrap some of
> the complexity for "unbuffered" writes of doing an immedsync iff the
> redo pointer changed. Something very roughly like
>
> typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
> void unbuffered_prep(UnbufferedWriteState* state);
> void unbuffered_write(UnbufferedWriteState* state, ...);
> void unbuffered_extend(UnbufferedWriteState* state, ...);
> void unbuffered_finish(UnbufferedWriteState* state);
>
> which wouldn't just do the dance to avoid the immedsync() if possible,
> but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
> if we get that [1]).
>

Regarding the implementation, I think having an API to do these
"unbuffered" or "direct" writes outside of shared buffers is a good
idea. In this specific case, the proposed API would not change the code
much. I would just wrap the small diffs I added to the beginning and end
of _bt_load() in the API calls for unbuffered_prep() and
unbuffered_finish() and then tuck away the second half of
_bt_blwritepage() in unbuffered_write()/unbuffered_extend(). I figured I
would do so after ensuring the correctness of the logic in this patch.
Then I will work on a patch which implements the unbuffered_write() API
and demonstrates its utility with at least a few of the most compelling
use cases in the code.

- Melanie

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Melanie Plageman
Date:
On Mon, May 3, 2021 at 5:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Thu, Jan 21, 2021 at 5:51 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> > > On 21/01/2021 22:36, Andres Freund wrote:
> > > >
> > > > Thinking through the correctness of replacing smgrimmedsync() with sync
> > > > requests, the potential problems that I can see are:
> > > >
> > > > 1) redo point falls between the log_newpage() and the
> > > >     write()/register_dirty_segment() in smgrextend/smgrwrite.
> > > > 2) redo point falls between write() and register_dirty_segment()
> > > >
> > > > But both of these are fine in the context of initially filling a newly
> > > > created relfilenode, as far as I can tell? Otherwise the current
> > > > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
> > >
> > > Hmm. If the redo point falls between write() and the
> > > register_dirty_segment(), and the checkpointer finishes the whole checkpoint
> > > before register_dirty_segment(), you are not safe. That can't happen with
> > > write from the buffer manager, because the checkpointer would block waiting
> > > for the flush of the buffer to finish.
> >
> > Hm, right.
> >
> > The easiest way to address that race would be to just record the redo
> > pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
> > checkpoint started since the start of the index build.
> >
> > Another approach would be to utilize PGPROC.delayChkpt, but I would
> > rather not unnecessarily expand the use of that.
> >
> > It's kind of interesting - in my aio branch I moved the
> > register_dirty_segment() to before the actual asynchronous write (due to
> > availability of the necessary data), which ought to be safe because of
> > the buffer interlocking. But that doesn't work here, or for other places
> > doing writes without going through s_b.  It'd be great if we could come
> > up with a general solution, but I don't immediately see anything great.
> >
> > The best I can come up with is adding helper functions to wrap some of
> > the complexity for "unbuffered" writes of doing an immedsync iff the
> > redo pointer changed. Something very roughly like
> >
> > typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
> > void unbuffered_prep(UnbufferedWriteState* state);
> > void unbuffered_write(UnbufferedWriteState* state, ...);
> > void unbuffered_extend(UnbufferedWriteState* state, ...);
> > void unbuffered_finish(UnbufferedWriteState* state);
> >
> > which wouldn't just do the dance to avoid the immedsync() if possible,
> > but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
> > if we get that [1]).
> >
>
> Regarding the implementation, I think having an API to do these
> "unbuffered" or "direct" writes outside of shared buffers is a good
> idea. In this specific case, the proposed API would not change the code
> much. I would just wrap the small diffs I added to the beginning and end
> of _bt_load() in the API calls for unbuffered_prep() and
> unbuffered_finish() and then tuck away the second half of
> _bt_blwritepage() in unbuffered_write()/unbuffered_extend(). I figured I
> would do so after ensuring the correctness of the logic in this patch.
> Then I will work on a patch which implements the unbuffered_write() API
> and demonstrates its utility with at least a few of the most compelling
> use cases in the code.
>

I've taken a pass at writing the API for "direct" or "unbuffered" writes
and extends. It introduces the suggested functions: unbuffered_prep(),
unbuffered_finish(), unbuffered_write(), and unbuffered_extend().

This is a rough cut -- corrections welcome and encouraged!

unbuffered_prep() saves the xlog redo pointer at the time it is called.
Then, if the redo pointer hasn't changed by the time unbuffered_finish()
is called, the backend can avoid calling smgrimmedsync(). Note that this
only works if intervening calls to smgrwrite() and smgrextend() pass
skipFsync=false.
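
In other words, a caller like _bt_load() ends up with roughly this shape
(a sketch; argument lists abbreviated):

unbuffered_prep(&wstate->ub_wstate);   /* records the current redo pointer */

/* each write/extend goes through a wrapper, which passes skipFsync=false
 * underneath so the checkpointer gets a sync request for the segment */
unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, blkno, page, false);

/* calls smgrimmedsync() only if the redo pointer has moved since prep */
unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);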

unbuffered_write() and unbuffered_extend() might be usable even when
unbuffered_prep() and unbuffered_finish() are not -- for example
hash indexes do something I don't entirely understand in which they call
smgrextend() directly when allocating buckets but then initialize the
new bucket pages using the bufmgr machinery.

I also intend to move accounting of pages written and extended into the
unbuffered_write() and unbuffered_extend() functions using the functions
I propose in [1] to populate a new view, pg_stat_buffers. Then this
"unbuffered" IO would be counted in stats.

I picked the name "direct" for the directory in /src/backend/storage
because I thought that these functions are analogous to direct IO in
Linux -- in that they are doing IO without going through Postgres bufmgr
-- unPGbuffered, basically. Other suggestions were "raw" and "relIO".
Raw seemed confusing since raw device IO is pretty far from what is
happening here. RelIO didn't seem like it belonged next to bufmgr (to
me). However, direct and unbuffered will both soon become fraught
terminology with the introduction of AIO and direct IO to Postgres...

- Melanie

[1] https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Melanie Plageman
Date:
On Mon, May 3, 2021 at 5:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> So, I've written a patch which avoids doing the immediate fsync for
> index builds either by using shared buffers or by queueing sync requests
> for the checkpointer. If a checkpoint starts during the index build and
> the backend is not using shared buffers for the index build, it will
> need to do the fsync.

I've attached a rebased version of the patch (old patch doesn't apply).

With the patch applied (compiled at O2), creating twenty empty tables in
a transaction with a text column and an index on another column (like in
the attached SQL [make a test_idx schema first]) results in a fairly
consistent 15-30% speedup on my laptop (timings still in the tens of ms -
from avg 65 ms down to avg 50 ms - so run-to-run variation affects the %
a lot). Reducing the number of fsync calls from 40 to 1 is likely what
causes this difference.

- Melanie

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Melanie Plageman
Date:
On Fri, Nov 19, 2021 at 3:11 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, May 3, 2021 at 5:24 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > So, I've written a patch which avoids doing the immediate fsync for
> > index builds either by using shared buffers or by queueing sync requests
> > for the checkpointer. If a checkpoint starts during the index build and
> > the backend is not using shared buffers for the index build, it will
> > need to do the fsync.
>
> I've attached a rebased version of the patch (old patch doesn't apply).
>
> With the patch applied (compiled at O2), creating twenty empty tables in
> a transaction with a text column and an index on another column (like in
> the attached SQL [make a test_idx schema first]) results in a fairly
> consistent 15-30% speedup on my laptop (timings still in the tens of ms -
> from avg 65 ms down to avg 50 ms - so run-to-run variation affects the %
> a lot). Reducing the number of fsync calls from 40 to 1 is likely what
> causes this difference.

Correction for the above: I haven't worked on mac in a while and didn't
realize that wal_sync_method=fsync was not enough to ensure that all
buffered data would actually be flushed to disk on mac (which was
required for my test).

Setting wal_sync_method to fsync_writethrough with my small test I see
over a 5-6X improvement in time taken - from 1 second average to 0.2
seconds average. And running Andres' "createlots.sql" test, I see around
a 16x improvement - from around 11 minutes to around 40 seconds. I ran
it on a laptop running macos and other than wal_sync_method, I only
changed shared_buffers (to 1GB).

- Melanie



Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Andres Freund
Date:
Hi,

On 2021-11-19 15:11:57 -0500, Melanie Plageman wrote:
> From 2130175c5d794f60c5f15d6eb1b626c81fb7c39b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Thu, 15 Apr 2021 07:01:01 -0400
> Subject: [PATCH v2] Index build avoids immed fsync
> 
> Avoid immediate fsync for just built indexes either by using shared
> buffers or by leveraging checkpointer's SyncRequest queue. When a
> checkpoint begins during the index build, if not using shared buffers,
> the backend will have to do its own fsync.
> ---
>  src/backend/access/nbtree/nbtree.c  |  39 +++---
>  src/backend/access/nbtree/nbtsort.c | 186 +++++++++++++++++++++++-----
>  src/backend/access/transam/xlog.c   |  14 +++
>  src/include/access/xlog.h           |   1 +
>  4 files changed, 188 insertions(+), 52 deletions(-)
> 
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index 40ad0956e0..a2e32f18e6 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -150,30 +150,29 @@ void
>  btbuildempty(Relation index)
>  {
>      Page        metapage;
> +    Buffer metabuf;
>  
> -    /* Construct metapage. */
> -    metapage = (Page) palloc(BLCKSZ);
> -    _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
> -
> +    // TODO: test this.

Shouldn't this path have plenty coverage?


>      /*
> -     * Write the page and log it.  It might seem that an immediate sync would
> -     * be sufficient to guarantee that the file exists on disk, but recovery
> -     * itself might remove it while replaying, for example, an
> -     * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
> -     * this even when wal_level=minimal.
> +     * Construct metapage.
> +     * Because we don't need to lock the relation for extension (since
> +     * noone knows about it yet) and we don't need to initialize the
> +     * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
> +     * (with P_NEW and BT_WRITE) is overkill.

Isn't the more relevant operation the log_newpage_buffer()?


> However, it might be worth
> +     * either modifying it or adding a new helper function instead of
> +     * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
> +     * because we want to hold an exclusive lock on the buffer content
>       */

"modifying it" refers to what?

I don't see a problem using ReadBufferExtended() here. Why would you like to
replace it with something else?



> +    /*
> +     * Based on the number of tuples, either create a buffered or unbuffered
> +     * write state. if the number of tuples is small, make a buffered write
> +     * if the number of tuples is larger, then we make an unbuffered write state
> +     * and must ensure that we check the redo pointer to know whether or not we
> +     * need to fsync ourselves
> +     */
>  
>      /*
>       * Finish the build by (1) completing the sort of the spool file, (2)
>       * inserting the sorted tuples into btree pages and (3) building the upper
>       * levels.  Finally, it may also be necessary to end use of parallelism.
>       */
> -    _bt_leafbuild(buildstate.spool, buildstate.spool2);
> +    if (reltuples > 1000)

I'm ok with some random magic constant here, but it seems worth putting it in
some constant / #define to make it more obvious.

> +        _bt_leafbuild(buildstate.spool, buildstate.spool2, false);
> +    else
> +        _bt_leafbuild(buildstate.spool, buildstate.spool2, true);

Why duplicate the function call?


>  /*
>   * allocate workspace for a new, clean btree page, not linked to any siblings.
> + * If index is not built in shared buffers, buf should be InvalidBuffer
>   */
>  static Page
> -_bt_blnewpage(uint32 level)
> +_bt_blnewpage(uint32 level, Buffer buf)
>  {
>      Page        page;
>      BTPageOpaque opaque;
>  
> -    page = (Page) palloc(BLCKSZ);
> +    if (buf)
> +        page = BufferGetPage(buf);
> +    else
> +        page = (Page) palloc(BLCKSZ);
>  
>      /* Zero the page and set up standard page header info */
>      _bt_pageinit(page, BLCKSZ);

Ick, that seems pretty ugly API-wise and subsequently too likely to lead to
pfree()ing a page that's actually in shared buffers. I think it'd make sense
to separate the allocation from the initialization bits?


> @@ -635,8 +657,20 @@ _bt_blnewpage(uint32 level)
>   * emit a completed btree page, and release the working storage.
>   */
>  static void
> -_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> +_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
>  {
> +    if (wstate->use_shared_buffers)
> +    {
> +        Assert(buf);
> +        START_CRIT_SECTION();
> +        MarkBufferDirty(buf);
> +        if (wstate->btws_use_wal)
> +            log_newpage_buffer(buf, true);
> +        END_CRIT_SECTION();
> +        _bt_relbuf(wstate->index, buf);
> +        return;
> +    }
> +
>      /* XLOG stuff */
>      if (wstate->btws_use_wal)
>      {
> @@ -659,7 +693,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>          smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
>                     wstate->btws_pages_written++,
>                     (char *) wstate->btws_zeropage,
> -                   true);
> +                   false);
>      }

Is there a good place to document the way we ensure durability for this path?


> +    /*
> +     * Extend the index relation upfront to reserve the metapage
> +     */
> +    if (wstate->use_shared_buffers)
> +    {
> +        /*
> +         * We should not need to LockRelationForExtension() as no one else knows
> +         * about this index yet?
> +         * Extend the index relation by one block for the metapage. _bt_getbuf()
> +         * is not used here as it does _bt_pageinit() which is one later by

*done


> +         * _bt_initmetapage(). We will fill in the metapage and write it out at
> +         * the end of index build when we have all of the information required
> +         * for the metapage. However, we initially extend the relation for it to
> +         * occupy block 0 because it is much easier when using shared buffers to
> +         * extend the relation with a block number that is always increasing by
> +         * 1.

Not quite following what you're trying to get at here. There generally is no
way to extend a relation except by increasing block numbers?


> @@ -1425,7 +1544,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
>       * still not be on disk when the crash occurs.
>       */
>      if (wstate->btws_use_wal)
> -        smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +    {
> +        if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
> +            smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +    }
>  }
>  
>  /*

This needs documentation. The whole comment above isn't accurate anymore afaict?


> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 1616448368..63fd212787 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8704,6 +8704,20 @@ GetRedoRecPtr(void)
>      return RedoRecPtr;
>  }
>  
> +bool
> +RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> +{
> +    XLogRecPtr    ptr;
> +
> +    SpinLockAcquire(&XLogCtl->info_lck);
> +    ptr = XLogCtl->RedoRecPtr;
> +    SpinLockRelease(&XLogCtl->info_lck);
> +
> +    if (RedoRecPtr < ptr)
> +        RedoRecPtr = ptr;
> +    return RedoRecPtr != comparator_ptr;
> +}

What's the deal with the < comparison?


Greetings,

Andres Freund



Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Melanie Plageman
Date:
I have attached a v3 which includes two commits -- one of which
implements the directmgr API and uses it and the other which adds
functionality to use either directmgr or bufmgr API during index build.

Also registering for march commitfest.

On Thu, Dec 9, 2021 at 2:33 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-11-19 15:11:57 -0500, Melanie Plageman wrote:
> > From 2130175c5d794f60c5f15d6eb1b626c81fb7c39b Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Thu, 15 Apr 2021 07:01:01 -0400
> > Subject: [PATCH v2] Index build avoids immed fsync
> >
> > Avoid immediate fsync for just built indexes either by using shared
> > buffers or by leveraging checkpointer's SyncRequest queue. When a
> > checkpoint begins during the index build, if not using shared buffers,
> > the backend will have to do its own fsync.
> > ---
> >  src/backend/access/nbtree/nbtree.c  |  39 +++---
> >  src/backend/access/nbtree/nbtsort.c | 186 +++++++++++++++++++++++-----
> >  src/backend/access/transam/xlog.c   |  14 +++
> >  src/include/access/xlog.h           |   1 +
> >  4 files changed, 188 insertions(+), 52 deletions(-)
> >
> > diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> > index 40ad0956e0..a2e32f18e6 100644
> > --- a/src/backend/access/nbtree/nbtree.c
> > +++ b/src/backend/access/nbtree/nbtree.c
> > @@ -150,30 +150,29 @@ void
> >  btbuildempty(Relation index)
> >  {
> >       Page            metapage;
> > +     Buffer metabuf;
> >
> > -     /* Construct metapage. */
> > -     metapage = (Page) palloc(BLCKSZ);
> > -     _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
> > -
> > +     // TODO: test this.
>
> Shouldn't this path have plenty coverage?

Yep. Sorry.

> >       /*
> > -      * Write the page and log it.  It might seem that an immediate sync would
> > -      * be sufficient to guarantee that the file exists on disk, but recovery
> > -      * itself might remove it while replaying, for example, an
> > -      * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
> > -      * this even when wal_level=minimal.
> > +      * Construct metapage.
> > +      * Because we don't need to lock the relation for extension (since
> > +      * noone knows about it yet) and we don't need to initialize the
> > +      * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
> > +      * (with P_NEW and BT_WRITE) is overkill.
>
> Isn't the more relevant operation the log_newpage_buffer()?

Returning to this after some time away, many of my comments no longer
make sense to me either. I can't actually tell which diff your question
applies to because this comment was copy-pasted in two different places
in my code. Also, I've removed this comment and added new ones. So,
given all that, is there still something about log_newpage_buffer() I
should be commenting on?

> > However, it might be worth
> > +      * either modifying it or adding a new helper function instead of
> > +      * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
> > +      * because we want to hold an exclusive lock on the buffer content
> >        */
>
> "modifying it" refers to what?
>
> I don't see a problem using ReadBufferExtended() here. Why would you like to
> replace it with something else?

I would just disregard these comments now.

> > +     /*
> > +      * Based on the number of tuples, either create a buffered or unbuffered
> > +      * write state. if the number of tuples is small, make a buffered write
> > +      * if the number of tuples is larger, then we make an unbuffered write state
> > +      * and must ensure that we check the redo pointer to know whether or not we
> > +      * need to fsync ourselves
> > +      */
> >
> >       /*
> >        * Finish the build by (1) completing the sort of the spool file, (2)
> >        * inserting the sorted tuples into btree pages and (3) building the upper
> >        * levels.  Finally, it may also be necessary to end use of parallelism.
> >        */
> > -     _bt_leafbuild(buildstate.spool, buildstate.spool2);
> > +     if (reltuples > 1000)
>
> I'm ok with some random magic constant here, but it seems worth putting it in
> some constant / #define to make it more obvious.

Done.

> > +             _bt_leafbuild(buildstate.spool, buildstate.spool2, false);
> > +     else
> > +             _bt_leafbuild(buildstate.spool, buildstate.spool2, true);
>
> Why duplicate the function call?

Fixed.

> >  /*
> >   * allocate workspace for a new, clean btree page, not linked to any siblings.
> > + * If index is not built in shared buffers, buf should be InvalidBuffer
> >   */
> >  static Page
> > -_bt_blnewpage(uint32 level)
> > +_bt_blnewpage(uint32 level, Buffer buf)
> >  {
> >       Page            page;
> >       BTPageOpaque opaque;
> >
> > -     page = (Page) palloc(BLCKSZ);
> > +     if (buf)
> > +             page = BufferGetPage(buf);
> > +     else
> > +             page = (Page) palloc(BLCKSZ);
> >
> >       /* Zero the page and set up standard page header info */
> >       _bt_pageinit(page, BLCKSZ);
>
> Ick, that seems pretty ugly API-wise and subsequently too likely to lead to
> pfree()ing a page that's actually in shared buffers. I think it'd make sense
> to separate the allocation from the initialization bits?

Fixed.

> > @@ -635,8 +657,20 @@ _bt_blnewpage(uint32 level)
> >   * emit a completed btree page, and release the working storage.
> >   */
> >  static void
> > -_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> > +_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
> >  {
> > +     if (wstate->use_shared_buffers)
> > +     {
> > +             Assert(buf);
> > +             START_CRIT_SECTION();
> > +             MarkBufferDirty(buf);
> > +             if (wstate->btws_use_wal)
> > +                     log_newpage_buffer(buf, true);
> > +             END_CRIT_SECTION();
> > +             _bt_relbuf(wstate->index, buf);
> > +             return;
> > +     }
> > +
> >       /* XLOG stuff */
> >       if (wstate->btws_use_wal)
> >       {
> > @@ -659,7 +693,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> >               smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> >                                  wstate->btws_pages_written++,
> >                                  (char *) wstate->btws_zeropage,
> > -                                true);
> > +                                false);
> >       }
>
> Is there a good place to document the way we ensure durability for this path?

I added some new comments. Let me know if you think that I am still
missing this documentation.

> > +     /*
> > +      * Extend the index relation upfront to reserve the metapage
> > +      */
> > +     if (wstate->use_shared_buffers)
> > +     {
> > +             /*
> > +              * We should not need to LockRelationForExtension() as no one else knows
> > +              * about this index yet?
> > +              * Extend the index relation by one block for the metapage. _bt_getbuf()
> > +              * is not used here as it does _bt_pageinit() which is one later by
>
> *done
>
>
> > +              * _bt_initmetapage(). We will fill in the metapage and write it out at
> > +              * the end of index build when we have all of the information required
> > +              * for the metapage. However, we initially extend the relation for it to
> > +              * occupy block 0 because it is much easier when using shared buffers to
> > +              * extend the relation with a block number that is always increasing by
> > +              * 1.
>
> Not quite following what you're trying to get at here. There generally is no
> way to extend a relation except by increasing block numbers?

I've updated this comment too. It should make more sense now.

> > @@ -1425,7 +1544,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
> >        * still not be on disk when the crash occurs.
> >        */
> >       if (wstate->btws_use_wal)
> > -             smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> > +     {
> > +             if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
> > +                     smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> > +     }
> >  }
> >
> >  /*
>
> This needs documentation. The whole comment above isn't accurate anymore afaict?

Should be correct now.

> > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> > index 1616448368..63fd212787 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -8704,6 +8704,20 @@ GetRedoRecPtr(void)
> >       return RedoRecPtr;
> >  }
> >
> > +bool
> > +RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> > +{
> > +     XLogRecPtr      ptr;
> > +
> > +     SpinLockAcquire(&XLogCtl->info_lck);
> > +     ptr = XLogCtl->RedoRecPtr;
> > +     SpinLockRelease(&XLogCtl->info_lck);
> > +
> > +     if (RedoRecPtr < ptr)
> > +             RedoRecPtr = ptr;
> > +     return RedoRecPtr != comparator_ptr;
> > +}
>
> What's the deal with the < comparison?

I saw that GetRedoRecPtr() does this and thought maybe I should do the
same in this function. I'm not quite sure where I should be getting the
redo pointer.

Maybe I should just call GetRedoRecPtr() and compare it to the one I
saved? I suppose I also thought that maybe someone else in the future
would like to have a function like RedoRecPtrChanged().

- Melanie

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Melanie Plageman
Date:
On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> I have attached a v3 which includes two commits -- one of which
> implements the directmgr API and uses it and the other which adds
> functionality to use either directmgr or bufmgr API during index build.
>
> Also registering for march commitfest.

Forgot directmgr.h. Attached v4

- Melanie

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Justin Pryzby
Date:
On Tue, Jan 11, 2022 at 12:10:54PM -0500, Melanie Plageman wrote:
> On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > I have attached a v3 which includes two commits -- one of which
> > implements the directmgr API and uses it and the other which adds
> > functionality to use either directmgr or bufmgr API during index build.
> >
> > Also registering for march commitfest.
> 
> Forgot directmgr.h. Attached v4

Thanks - I had reconstructed it first ;)

I think the ifndef should be outside the includes:

> +++ b/src/include/storage/directmgr.h
..
> +#include "access/xlogdefs.h"
..
> +#ifndef DIRECTMGR_H
> +#define DIRECTMGR_H
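
i.e.:

#ifndef DIRECTMGR_H
#define DIRECTMGR_H

#include "access/xlogdefs.h"
/* ... remaining includes and declarations ... */

#endif                            /* DIRECTMGR_H */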

This is failing on windows CI when I use initdb --data-checksums, as attached.

https://cirrus-ci.com/task/5612464120266752
https://api.cirrus-ci.com/v1/artifact/task/5612464120266752/regress_diffs/src/test/regress/regression.diffs

+++ c:/cirrus/src/test/regress/results/bitmapops.out    2022-01-13 00:47:46.704621200 +0000
..
+ERROR:  could not read block 0 in file "base/16384/30310": read only 0 of 8192 bytes

-- 
Justin

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Robert Haas
Date:
On Wed, Sep 29, 2021 at 2:36 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> unbuffered_write() and unbuffered_extend() might be able to be used even
> if unbuffered_prep() and unbuffered_finish() are not used -- for example
> hash indexes do something I don't entirely understand in which they call
> smgrextend() directly when allocating buckets but then initialize the
> new bucket pages using the bufmgr machinery.

My first thought was that someone might do this to make sure that we
don't run out of disk space after initializing some but not all of the
buckets. Someone might have some reason for wanting to avoid that
corner case. However, in _hash_init() that explanation doesn't make
any sense, because an abort would destroy the entire relation. And in
_hash_alloc_buckets() the variable "zerobuf" points to a buffer that
is not, in fact, all zeroes. So my guess is this is just old, crufty
code - either whatever reasons somebody had for doing it that way are
no longer valid, or there wasn't any good reason even at the time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Justin Pryzby
Date:
On Thu, Jan 13, 2022 at 09:52:55AM -0600, Justin Pryzby wrote:
> This is failing on windows CI when I use initdb --data-checksums, as attached.
> 
> https://cirrus-ci.com/task/5612464120266752
> https://api.cirrus-ci.com/v1/artifact/task/5612464120266752/regress_diffs/src/test/regress/regression.diffs
> 
> +++ c:/cirrus/src/test/regress/results/bitmapops.out    2022-01-13 00:47:46.704621200 +0000
> ..
> +ERROR:  could not read block 0 in file "base/16384/30310": read only 0 of 8192 bytes

The failure isn't consistent, so I double checked my report.  I have some more
details:

The problem occurs maybe only ~25% of the time.

The issue is in the 0001 patch.

data-checksums isn't necessary to hit the issue.

errlocation says: LOCATION:  mdread, md.c:686 (the only place the error
exists)

With Andres' windows crash patch, I obtained a backtrace - attached.
https://cirrus-ci.com/task/5978171861368832

https://api.cirrus-ci.com/v1/artifact/task/5978171861368832/crashlog/crashlog-postgres.exe_0fa8_2022-01-16_02-54-35-291.txt

Maybe it's a race condition or synchronization problem that no other code
path tends to hit.

Separate from this issue, I wonder if it'd be useful to write a DEBUG log
showing when btree uses shared_buffers vs fsync.  And a regression test which
first SETs client_min_messages=debug to capture the debug log to demonstrate
when/that the new code path is being hit.  I'm not sure if that would be good to
merge, but it may be useful for now.

-- 
Justin

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Justin Pryzby
Date:
On Sun, Jan 16, 2022 at 02:25:59PM -0600, Justin Pryzby wrote:
> On Thu, Jan 13, 2022 at 09:52:55AM -0600, Justin Pryzby wrote:
> > This is failing on windows CI when I use initdb --data-checksums, as attached.
> > 
> > https://cirrus-ci.com/task/5612464120266752
> > https://api.cirrus-ci.com/v1/artifact/task/5612464120266752/regress_diffs/src/test/regress/regression.diffs
> > 
> > +++ c:/cirrus/src/test/regress/results/bitmapops.out    2022-01-13 00:47:46.704621200 +0000
> > ..
> > +ERROR:  could not read block 0 in file "base/16384/30310": read only 0 of 8192 bytes
> 
> The failure isn't consistent, so I double checked my report.  I have some more
> details:
> 
> The problem occurs maybe only ~25% of the time.
> 
> The issue is in the 0001 patch.
> 
> data-checksums isn't necessary to hit the issue.
> 
> errlocation says: LOCATION:  mdread, md.c:686 (the only place the error
> exists)
> 
> With Andres' windows crash patch, I obtained a backtrace - attached.
> https://cirrus-ci.com/task/5978171861368832
>
https://api.cirrus-ci.com/v1/artifact/task/5978171861368832/crashlog/crashlog-postgres.exe_0fa8_2022-01-16_02-54-35-291.txt
> 
> Maybe its a race condition or synchronization problem that nowhere else tends
> to hit.

I meant to say that I had not seen this issue anywhere but windows.

But now, by chance, I still had the 0001 patch in my tree, and hit the same
issue on linux:

https://cirrus-ci.com/task/4550618281934848
+++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tuplesort.out    2022-01-17 16:06:35.759108172
+0000
 EXPLAIN (COSTS OFF)
 SELECT id, noabort_increasing, noabort_decreasing FROM abbrev_abort_uuids ORDER BY noabort_increasing LIMIT 5;
+ERROR:  could not read block 0 in file "base/16387/t3_36794": read only 0 of 8192 bytes



Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Andres Freund
Date:
Hi,

On 2022-01-11 12:10:54 -0500, Melanie Plageman wrote:
> On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > I have attached a v3 which includes two commits -- one of which
> > implements the directmgr API and uses it and the other which adds
> > functionality to use either directmgr or bufmgr API during index build.
> >
> > Also registering for march commitfest.
> 
> Forgot directmgr.h. Attached v4

Are you looking at the failures Justin pointed out? Something isn't quite
right yet. See https://postgr.es/m/20220116202559.GW14051%40telsasoft.com and
the subsequent mail about it also triggering on once on linux.


> Thus, the backend must ensure that
> either the Redo pointer has not moved or that the data is fsync'd before
> freeing the page.

"freeing"?


> This is not a problem with pages written in shared buffers because the
> checkpointer will block until all buffers that were dirtied before it
> began finish before it moves the Redo pointer past their associated WAL
> entries.

> This commit makes two main changes:
> 
> 1) It wraps smgrextend() and smgrwrite() in functions from a new API
>    for writing data outside of shared buffers, directmgr.
> 
> 2) It saves the XLOG Redo pointer location before doing the write or
>    extend. It also adds an fsync request for the page to the
>    checkpointer's pending-ops table. Then, after doing the write or
>    extend, if the Redo pointer has moved (meaning a checkpoint has
>    started since it saved it last), then the backend fsync's the page
>    itself. Otherwise, it lets the checkpointer take care of fsync'ing
>    the page the next time it processes the pending-ops table.

Why combine those two into one commit?


> @@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
>      /* Now extend the file */
>      while (vm_nblocks_now < vm_nblocks)
>      {
> -        PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
> -
> -        smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
> +        // TODO: aren't these pages empty? why checksum them
> +        unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);

Yea, it's a bit odd. PageSetChecksumInplace() will just return immediately:

    /* If we don't need a checksum, just return */
    if (PageIsNew(page) || !DataChecksumsEnabled())
        return;

OTOH, it seems easier to have it there than to later forget it, when
e.g. adding some actual initial content to the pages during the smgrextend().



> @@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
>  
>      wstate.heap = btspool->heap;
>      wstate.index = btspool->index;
> +    wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
> +    wstate.ub_wstate.redo = InvalidXLogRecPtr;
>      wstate.inskey = _bt_mkscankey(wstate.index, NULL);
>      /* _bt_mkscankey() won't set allequalimage without metapage */
>      wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
> @@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>          if (!wstate->btws_zeropage)
>              wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
>          /* don't set checksum for all-zero page */
> -        smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> -                   wstate->btws_pages_written++,
> -                   (char *) wstate->btws_zeropage,
> -                   true);
> +        unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage,
true);
>      }

There's a bunch of long lines in here...


> -    /*
> -     * When we WAL-logged index pages, we must nonetheless fsync index files.
> -     * Since we're building outside shared buffers, a CHECKPOINT occurring
> -     * during the build has no way to flush the previously written data to
> -     * disk (indeed it won't know the index even exists).  A crash later on
> -     * would replay WAL from the checkpoint, therefore it wouldn't replay our
> -     * earlier WAL entries. If we do not fsync those pages here, they might
> -     * still not be on disk when the crash occurs.
> -     */
>      if (wstate->btws_use_wal)
> -        smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +        unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
>  }

The API of unbuffered_finish() only sometimes getting called, but
unbuffered_prep() being unconditional, strikes me as prone to bugs. Perhaps
it'd make sense to pass in whether the relation needs to be synced or not instead?
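
E.g. something like (a sketch; the parameter name is invented):

/* caller states up front whether durability handling is needed, making
 * unbuffered_finish() safe to call unconditionally */
void unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal_sync);

unbuffered_prep(&wstate.ub_wstate, /* do_wal_sync */ wstate.btws_use_wal);
/* ... writes ... */
unbuffered_finish(&wstate.ub_wstate, MAIN_FORKNUM); /* no-op if !do_wal_sync */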



>  spgbuildempty(Relation index)
>  {
>      Page        page;
> +    UnBufferedWriteState wstate;
> +
> +    wstate.smgr_rel = RelationGetSmgr(index);
> +    unbuffered_prep(&wstate);

I don't think that's actually safe, and one of the instances could be the
cause of the bug CI is seeing:

 * Note: since a relcache flush can cause the file handle to be closed again,
 * it's unwise to hold onto the pointer returned by this function for any
 * long period.  Recommended practice is to just re-execute RelationGetSmgr
 * each time you need to access the SMgrRelation.  It's quite cheap in
 * comparison to whatever an smgr function is going to do.
 */
static inline SMgrRelation
RelationGetSmgr(Relation rel)


Greetings,

Andres Freund



Re: Avoiding smgrimmedsync() during nbtree index builds

From:
Melanie Plageman
Date:
Hi,
v5 attached and all email feedback addressed below

On Thu, Jan 13, 2022 at 12:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Sep 29, 2021 at 2:36 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > unbuffered_write() and unbuffered_extend() might be able to be used even
> > if unbuffered_prep() and unbuffered_finish() are not used -- for example
> > hash indexes do something I don't entirely understand in which they call
> > smgrextend() directly when allocating buckets but then initialize the
> > new bucket pages using the bufmgr machinery.
>
> My first thought was that someone might do this to make sure that we
> don't run out of disk space after initializing some but not all of the
> buckets. Someone might have some reason for wanting to avoid that
> corner case. However, in _hash_init() that explanation doesn't make
> any sense, because an abort would destroy the entire relation. And in
> _hash_alloc_buckets() the variable "zerobuf" points to a buffer that
> is not, in fact, all zeroes. So my guess is this is just old, crufty
> code - either whatever reasons somebody had for doing it that way are
> no longer valid, or there wasn't any good reason even at the time.

I notice in the comment before _hash_alloc_buckets() is called, it says

/*
    * We treat allocation of buckets as a separate WAL-logged action.
    * Even if we fail after this operation, won't leak bucket pages;
    * rather, the next split will consume this space. In any case, even
    * without failure we don't use all the space in one split operation.
    */

Does this mean that it is okay that these pages are written outside of
shared buffers and, though skipFsync is passed as false, a checkpoint
starting and finishing between writing the WAL and
register_dirty_segment() followed by a crash could result in lost data?

On Thu, Jan 13, 2022 at 10:52 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I think the ifndef should be outside the includes:

Thanks, fixed!

On Sun, Jan 16, 2022 at 3:26 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Separate from this issue, I wonder if it'd be useful to write a DEBUG log
> showing when btree uses shared_buffers vs fsync.  And a regression test which
> first SETs client_min_messages=debug to capture the debug log to demonstrate
> when/that new code path is being hit.  I'm not sure if that would be good to
> merge, but it may be useful for now.

I will definitely think about doing this.

On Mon, Jan 17, 2022 at 12:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Sun, Jan 16, 2022 at 02:25:59PM -0600, Justin Pryzby wrote:
> > On Thu, Jan 13, 2022 at 09:52:55AM -0600, Justin Pryzby wrote:
> > > This is failing on windows CI when I use initdb --data-checksums, as attached.
> > >
> > > https://cirrus-ci.com/task/5612464120266752
> > > https://api.cirrus-ci.com/v1/artifact/task/5612464120266752/regress_diffs/src/test/regress/regression.diffs
> > >
> > > +++ c:/cirrus/src/test/regress/results/bitmapops.out        2022-01-13 00:47:46.704621200 +0000
> > > ..
> > > +ERROR:  could not read block 0 in file "base/16384/30310": read only 0 of 8192 bytes
> >
> > The failure isn't consistent, so I double checked my report.  I have some more
> > details:
> >
> > The problem occurs maybe only ~25% of the time.
> >
> > The issue is in the 0001 patch.
> >
> > data-checksums isn't necessary to hit the issue.
> >
> > errlocation says: LOCATION:  mdread, md.c:686 (the only place the error
> > exists)
> >
> > With Andres' windows crash patch, I obtained a backtrace - attached.
> > https://cirrus-ci.com/task/5978171861368832
> >
> > https://api.cirrus-ci.com/v1/artifact/task/5978171861368832/crashlog/crashlog-postgres.exe_0fa8_2022-01-16_02-54-35-291.txt
> >
> > Maybe it's a race condition or synchronization problem that nowhere else tends
> > to hit.
>
> I meant to say that I had not seen this issue anywhere but windows.
>
> But now, by chance, I still had the 0001 patch in my tree, and hit the same
> issue on linux:
>
> https://cirrus-ci.com/task/4550618281934848
> +++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tuplesort.out     2022-01-17 16:06:35.759108172 +0000
>  EXPLAIN (COSTS OFF)
>  SELECT id, noabort_increasing, noabort_decreasing FROM abbrev_abort_uuids ORDER BY noabort_increasing LIMIT 5;
> +ERROR:  could not read block 0 in file "base/16387/t3_36794": read only 0 of 8192 bytes

Yes, I think this is due to the problem Andres mentioned with my saving
the SMgrRelation and then trying to use it after a relcache flush. The
new patch version addresses this by always re-executing
RelationGetSmgr() as recommended in the comments.
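
Concretely, the change amounts to the following pattern (simplified from
the diffs quoted in this thread):

/*
 * v4 (unsafe): the SMgrRelation pointer is cached across calls, and a
 * relcache flush can close the underlying handle out from under it.
 */
wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
/* ... much later ... */
smgrextend(wstate->ub_wstate.smgr_rel, MAIN_FORKNUM, blkno, page, true);

/*
 * v5 (safe): re-execute RelationGetSmgr() at each use; it is cheap
 * compared to whatever the smgr function is going to do.
 */
smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno, page, true);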

On Sun, Jan 23, 2022 at 4:55 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-11 12:10:54 -0500, Melanie Plageman wrote:
> > On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > Thus, the backend must ensure that
> > either the Redo pointer has not moved or that the data is fsync'd before
> > freeing the page.
>
> "freeing"?

Yes, I agree this wording was confusing/incorrect. I meant before it
moves on (I said freeing because it usually pfrees() the page in memory
that it was writing from). I've changed the commit message.

>
> > This is not a problem with pages written in shared buffers because the
> > checkpointer will block until all buffers that were dirtied before it
> > began finish before it moves the Redo pointer past their associated WAL
> > entries.
>
> > This commit makes two main changes:
> >
> > 1) It wraps smgrextend() and smgrwrite() in functions from a new API
> >    for writing data outside of shared buffers, directmgr.
> >
> > 2) It saves the XLOG Redo pointer location before doing the write or
> >    extend. It also adds an fsync request for the page to the
> >    checkpointer's pending-ops table. Then, after doing the write or
> >    extend, if the Redo pointer has moved (meaning a checkpoint has
> >    started since it saved it last), then the backend fsync's the page
> >    itself. Otherwise, it lets the checkpointer take care of fsync'ing
> >    the page the next time it processes the pending-ops table.
>
> Why combine those two into one commit?

I've separated it into three commits -- the above two + a separate
commit that actually has the btree index use the self-fsync
optimization.

> > @@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
> >       /* Now extend the file */
> >       while (vm_nblocks_now < vm_nblocks)
> >       {
> > -             PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
> > -
> > -             smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
> > +             // TODO: aren't these pages empty? why checksum them
> > +             unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);
>
> Yea, it's a bit odd. PageSetChecksumInplace() will just return immediately:
>
>         /* If we don't need a checksum, just return */
>         if (PageIsNew(page) || !DataChecksumsEnabled())
>                 return;
>
> OTOH, it seems easier to have it there than to later forget it, when
> e.g. adding some actual initial content to the pages during the smgrextend().

I've left these as is and removed the comment.

> > @@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
> >
> >       wstate.heap = btspool->heap;
> >       wstate.index = btspool->index;
> > +     wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
> > +     wstate.ub_wstate.redo = InvalidXLogRecPtr;
> >       wstate.inskey = _bt_mkscankey(wstate.index, NULL);
> >       /* _bt_mkscankey() won't set allequalimage without metapage */
> >       wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
> > @@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> >               if (!wstate->btws_zeropage)
> >                       wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
> >               /* don't set checksum for all-zero page */
> > -             smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> > -                                wstate->btws_pages_written++,
> > -                                (char *) wstate->btws_zeropage,
> > -                                true);
> > +             unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage,true);
> >       }
>
> There's a bunch of long lines in here...

Fixed.

> > -     /*
> > -      * When we WAL-logged index pages, we must nonetheless fsync index files.
> > -      * Since we're building outside shared buffers, a CHECKPOINT occurring
> > -      * during the build has no way to flush the previously written data to
> > -      * disk (indeed it won't know the index even exists).  A crash later on
> > -      * would replay WAL from the checkpoint, therefore it wouldn't replay our
> > -      * earlier WAL entries. If we do not fsync those pages here, they might
> > -      * still not be on disk when the crash occurs.
> > -      */
> >       if (wstate->btws_use_wal)
> > -             smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> > +             unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
> >  }
>
> The API of unbuffered_finish() only sometimes getting called, but
> unbuffered_prep() being unconditional, strikes me as prone to bugs. Perhaps
> it'd make sense to pass in whether the relation needs to be synced or not instead?

I've fixed this. Now unbuffered_prep() and unbuffered_finish() will
always be called. I've added a few options to unbuffered_prep() to
indicate whether smgrimmedsync() should be called at the end, whether
skipFsync should be passed as true or false to smgrextend() and
smgrwrite(), and whether the self-fsync-avoidance optimization should
be used.

I found it best to do it this way because simply passing whether or not
to do the sync to unbuffered_finish() did not allow me to distinguish
between the case in which the sync should not be done ever (because the
caller did not call smgrimmedsync() or because the relation does not
require WAL) and when smgrimmedsync() should only be done if the redo
pointer has changed (in the case of the optimization).

I thought it actually made for a better API to specify up front (in
unbuffered_prep()) whether or not the caller should be prepared to do
the fsync itself and whether or not it wanted to do the optimization.
It feels less prone to error and omission.
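
To illustrate, the shape of the v5 interface is roughly the following
(a sketch only; the exact parameter names in the attached patch may
differ):

void
unbuffered_prep(UnBufferedWriteState *wstate,
                bool do_immedsync,      /* smgrimmedsync() at finish? */
                bool request_fsync,     /* skipFsync=false in smgr calls? */
                bool do_optimization);  /* self-fsync only if redo moved */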

> >  spgbuildempty(Relation index)
> >  {
> >       Page            page;
> > +     UnBufferedWriteState wstate;
> > +
> > +     wstate.smgr_rel = RelationGetSmgr(index);
> > +     unbuffered_prep(&wstate);
>
> I don't think that's actually safe, and one of the instances could be the
> cause of the bug CI is seeing:
>
>  * Note: since a relcache flush can cause the file handle to be closed again,
>  * it's unwise to hold onto the pointer returned by this function for any
>  * long period.  Recommended practice is to just re-execute RelationGetSmgr
>  * each time you need to access the SMgrRelation.  It's quite cheap in
>  * comparison to whatever an smgr function is going to do.
>  */
> static inline SMgrRelation
> RelationGetSmgr(Relation rel)

Yes, I've changed this in the attached v5.

One question I have is whether callers other than the btree index build
could benefit from the self-fsync-avoidance optimization.

Also, after taking another look at the GiST index build, I notice that
smgrimmedsync() is not done anywhere and skipFsync is always passed as
true, so what happens if a checkpoint completes and a crash occurs
between WAL-logging and whenever the dirty pages make it to permanent
storage?

- Melanie

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From
Dmitry Dolgov
Date:
> On Wed, Feb 09, 2022 at 01:49:30PM -0500, Melanie Plageman wrote:
> Hi,
> v5 attached and all email feedback addressed below

Thanks for the patch, it looks quite good.

I don't see it in the discussion, so I am naturally curious -- why is
directmgr not used for the bloom index, e.g. in blbuildempty()?

> On Sun, Jan 16, 2022 at 3:26 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Separate from this issue, I wonder if it'd be useful to write a DEBUG log
> > showing when btree uses shared_buffers vs fsync.  And a regression test which
> > first SETs client_min_messages=debug to capture the debug log to demonstrate
> > when/that new code path is being hit.  I'm not sure if that would be good to
> > merge, but it may be useful for now.

I can't find the thread right away, but I vaguely remember a similar
situation where such an approach, as the main way to test a patch,
caused some disagreement. Of course, for the development phase it would
indeed be convenient.



Re: Avoiding smgrimmedsync() during nbtree index builds

From
Justin Pryzby
Date:
Rebased to appease cfbot.

I ran these patches under a branch which shows code coverage in cirrus.  It
looks good to my eyes.
https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html

Are these patches being considered for v15 ?

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From
Melanie Plageman
Date:
On Wed, Mar 2, 2022 at 8:09 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> Rebased to appease cfbot.
>
> I ran these patches under a branch which shows code coverage in cirrus.  It
> looks good to my eyes.
> https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html

Thanks for doing that and for the rebase! Since I made updates, the
attached version 6 is also rebased.

To Dmitry's question:

On Sun, Feb 13, 2022 at 9:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Feb 09, 2022 at 01:49:30PM -0500, Melanie Plageman wrote:
>
> I don't see it in the discussion, so naturally curious -- why directmgr
> is not used for bloom index, e.g. in blbuildempty?

Thanks for pointing this out. blbuildempty() is also included now. Bloom
doesn't seem to use smgr* anywhere except blbuildempty(), so that is the
only place I made changes in the bloom index build.

v6 has the following updates/changes:

- removed erroneous extra calls to unbuffered_prep() and
  unbuffered_finish() in GiST and btree index builds

- removed unnecessary includes

- some comments were updated to accurately reflect use of directmgr

- smgrwrite doesn't WAL-log anymore. This one I'm not sure about. I
  think it makes sense that unbuffered_extend() of non-empty pages of
  WAL-logged relations (or the init fork of unlogged relations) should
  do log_newpage(), but I wasn't so sure about smgrwrite().

  Currently all callers of smgrwrite() do log_newpage() and anyone using
  mdwrite() will end up writing the whole page. However, it seems
  possible that a caller of the directmgr API might wish to do a write
  to a particular offset and, either because of that, or, for some other
  reason, require different logging than that done in log_newpage().

  I didn't want to make a separate wrapper function for WAL-logging in
  directmgr because it felt like one more step to forget.

- heapam_relation_set_new_filenode doesn't use directmgr API anymore for
  unlogged relations. It doesn't extend or write, so it seemed like a
  special case better left alone.

  Note that the ambuildempty() functions which also write to the init
  fork of an unlogged relation still use the directmgr API. It is a bit
  confusing: they pass do_wal=true to unbuffered_prep() even though the
  relations are unlogged, because the init fork must be logged and
  fsync'd.

- interface changes to unbuffered_prep()
  I removed the parameters to unbuffered_prep() which required the user
  to specify if fsync should be added to pendingOps or done with
  smgrimmedsync(). Understanding all of the combinations of these
  parameters and when they were needed was confusing and the interface
  felt like a foot gun. Special cases shouldn't use this interface.

  I prefer the idea that users of this API expect that
  1) empty pages won't be checksummed or WAL logged
  2) for relations that are WAL-logged, when the build is done, the
  relation will be fsync'd by the backend (unless the fsync optimization
  is being used)
  3) the only case in which fsync requests are added to the pendingOps
  queue is when the fsync optimization is being used (which saves the
  redo pointer and checks it later to determine whether the backend
  needs to fsync itself; see the sketch at the end of this list)

  I also added the parameter "do_wal" to unbuffered_prep() and the
  UnBufferedWriteState struct. This is used when extending the file to
  determine whether or not to log_newpage(). unbuffered_extend() and
  unbuffered_write() no longer take do_wal as a parameter.

  Note that callers need to pass do_wal=true/false to unbuffered_prep()
  based on whether or not they want log_newpage() called during
  unbuffered_extend()--not simply based on whether or not the relation
  in question is WAL-logged.

  do_wal is the only member of the UnBufferedWriteState struct in the
  first patch in the set, but I think it makes sense to keep the struct
  around since I anticipate that the patch containing the other members
  needed for the fsync optimization will be committed at the same time.

  One final note on unbuffered_prep() -- I am thinking of renaming
  "do_optimization" to "try_optimization" or maybe
  "request_fsync_optimization". The interface (of unbuffered_prep())
  would be better if we always tried to do the optimization when
  relevant (when the relation is WAL-logged).

- freespace map, visimap, and hash index don't use directmgr API anymore
  For most use cases that write/extend outside shared buffers, it isn't
  safe to rely on requesting fsync from the checkpointer.

  visimap, fsm, and hash index all request fsync from checkpointer for
  the pages they write with smgrextend() and don't smgrimmedsync() when
  finished adding pages (even when the hash index is WAL-logged).

  Supporting these exceptions made the interface incoherent, so I cut
  them.

- added unbuffered_extend_range()
  This one is a bit unfortunate. Because GiST index build uses
  log_newpages() to log a whole page range but calls smgrextend() for
  each of those pages individually, I couldn't use the
  unbuffered_extend() function easily.

  So, I thought it might be useful in other contexts as well to have a
  function which calls smgrextend() for a range of pages and then calls
  log_newpages(). I've added this.

  However, there are two parts of the GiST index build's flushing of
  ready pages that didn't work with this either.

  The first is that it does an error check on the block numbers one at a
  time while looping through them to write the pages. To retain this
  check, I loop through the ready pages an extra time before calling
  unbuffered_extend(), which is probably not acceptable.

  Also, GiST needs to use a custom LSN for the pages. To achieve this, I
  added a "custom_lsn" parameter to unbuffered_extend_range(). This
  isn't great either. If this was a more common case, I could pass the
  custom LSN to unbuffered_prep().
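
As promised above, here is a sketch of the fsync optimization
(simplified; GetRedoRecPtr() and RedoRecPtrChanged() are the helpers
the patch uses):

/*
 * unbuffered_prep(), with the optimization requested: remember the
 * redo pointer before any writes happen.
 */
wstate->redo = GetRedoRecPtr();

/*
 * Each write/extend passes skipFsync=false, queueing a sync request
 * with the checkpointer.
 */
smgrextend(smgrrel, forknum, blocknum, (char *) page, false);

/*
 * unbuffered_finish(): if a checkpoint began after unbuffered_prep(),
 * it may not have seen our sync request, so fsync here ourselves.
 */
if (RedoRecPtrChanged(wstate->redo))
    smgrimmedsync(smgrrel, forknum);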

- Melanie

Attachments

Re: Avoiding smgrimmedsync() during nbtree index builds

From
Andres Freund
Date:
Hi,

> From a06407b19c8d168ea966e45c0e483b38d49ddc12 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 4 Mar 2022 14:48:39 -0500
> Subject: [PATCH v6 1/4] Add unbuffered IO API

I think this or one of the following patches should update src/backend/access/transam/README


> @@ -164,6 +164,16 @@ void
>  blbuildempty(Relation index)
>  {
>      Page        metapage;
> +    UnBufferedWriteState wstate;
> +
> +    /*
> +     * Though this is an unlogged relation, pass do_wal=true since the init
> +     * fork of an unlogged index must be wal-logged and fsync'd. This currently
> +     * has no effect, as unbuffered_write() does not do log_newpage()
> +     * internally. However, were this to be replaced with unbuffered_extend(),
> +     * do_wal must be true to ensure the data is logged and fsync'd.
> +     */
> +    unbuffered_prep(&wstate, true);

Wonder if unbuffered_write should have an assert checking that no writes to
INIT_FORKNUM are non-durable? Looks like it's pretty easy to forget...
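
Say, something like (untested, and assuming the do_wal flag on the
write state is visible at that point):

Assert(forknum != INIT_FORKNUM || wstate->do_wal);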

I'd choose unbuffered_begin over _prep().


>      /* Construct metapage. */
>      metapage = (Page) palloc(BLCKSZ);
> @@ -176,18 +186,13 @@ blbuildempty(Relation index)
>       * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
>       * this even when wal_level=minimal.
>       */
> -    PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
> -    smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> -              (char *) metapage, true);
> +    unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
> +            BLOOM_METAPAGE_BLKNO, metapage);
> +
>      log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
>                  BLOOM_METAPAGE_BLKNO, metapage, true);
>  
> -    /*
> -     * An immediate sync is required even if we xlog'd the page, because the
> -     * write did not go through shared_buffers and therefore a concurrent
> -     * checkpoint may have moved the redo pointer past our xlog record.
> -     */
> -    smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
> +    unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
>  }

I mildly prefer complete over finish, but ...



> - * We can't use the normal heap_insert function to insert into the new
> - * heap, because heap_insert overwrites the visibility information.
> - * We use a special-purpose raw_heap_insert function instead, which
> - * is optimized for bulk inserting a lot of tuples, knowing that we have
> - * exclusive access to the heap.  raw_heap_insert builds new pages in
> - * local storage.  When a page is full, or at the end of the process,
> - * we insert it to WAL as a single record and then write it to disk
> - * directly through smgr.  Note, however, that any data sent to the new
> - * heap's TOAST table will go through the normal bufmgr.
> + * We can't use the normal heap_insert function to insert into the new heap,
> + * because heap_insert overwrites the visibility information. We use a
> + * special-purpose raw_heap_insert function instead, which is optimized for
> + * bulk inserting a lot of tuples, knowing that we have exclusive access to the
> + * heap.  raw_heap_insert builds new pages in local storage.  When a page is
> + * full, or at the end of the process, we insert it to WAL as a single record
> + * and then write it to disk directly through directmgr.  Note, however, that
> + * any data sent to the new heap's TOAST table will go through the normal
> + * bufmgr.

Part of this just reflows existing lines that seem otherwise unchanged, making
it harder to see the actual change.



> @@ -643,13 +644,6 @@ _bt_blnewpage(uint32 level)
>  static void
>  _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>  {
> -    /* XLOG stuff */
> -    if (wstate->btws_use_wal)
> -    {
> -        /* We use the XLOG_FPI record type for this */
> -        log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
> -    }
> -
>      /*
>       * If we have to write pages nonsequentially, fill in the space with
>       * zeroes until we come back and overwrite.  This is not logically
> @@ -661,33 +655,33 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>      {
>          if (!wstate->btws_zeropage)
>              wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
> -        /* don't set checksum for all-zero page */
> -        smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> -                   wstate->btws_pages_written++,
> -                   (char *) wstate->btws_zeropage,
> -                   true);
> +
> +        unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
> +                MAIN_FORKNUM, wstate->btws_pages_written++,
> +                wstate->btws_zeropage, true);
>      }

For a bit I thought the true argument to unbuffered_extend was about
durability or registering fsyncs. Perhaps worth making it a flags
argument with an enum for the flag values?
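
E.g. something like (names made up):

typedef enum
{
    UNBUFFERED_WRITE_EMPTY = 1 << 0,    /* skip checksum and WAL-logging */
} UnbufferedWriteFlags;

unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
                  MAIN_FORKNUM, wstate->btws_pages_written++,
                  wstate->btws_zeropage, UNBUFFERED_WRITE_EMPTY);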


> diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
> new file mode 100644
> index 0000000000..42c37daa7a
> --- /dev/null
> +++ b/src/backend/storage/direct/directmgr.c
> @@ -0,0 +1,98 @@

Now that the API is called unbuffered, the filename / directory seem
confusing.


> +void
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
> +{
> +    wstate->do_wal = do_wal;
> +}
> +
> +void
> +unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation
> +        smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page, bool
> +        empty)
> +{
> +    /*
> +     * Don't checksum empty pages
> +     */
> +    if (!empty)
> +        PageSetChecksumInplace(page, blocknum);
> +
> +    smgrextend(smgrrel, forknum, blocknum, (char *) page, true);
> +
> +    /*
> +     * Don't WAL-log empty pages
> +     */
> +    if (!empty && wstate->do_wal)
> +        log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
> +                    blocknum, page, true);
> +}
> +
> +void
> +unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
> +        ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page *pages,
> +        bool empty, XLogRecPtr custom_lsn)
> +{
> +    for (int i = 0; i < num_pages; i++)
> +    {
> +        Page        page = pages[i];
> +        BlockNumber blkno = blocknums[i];
> +
> +        if (!XLogRecPtrIsInvalid(custom_lsn))
> +            PageSetLSN(page, custom_lsn);
> +
> +        if (!empty)
> +            PageSetChecksumInplace(page, blkno);
> +
> +        smgrextend(smgrrel, forknum, blkno, (char *) page, true);
> +    }
> +
> +    if (!empty && wstate->do_wal)
> +        log_newpages(&(smgrrel)->smgr_rnode.node, forknum, num_pages,
> +                blocknums, pages, true);
> +}
> +
> +void
> +unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, ForkNumber
> +        forknum, BlockNumber blocknum, Page page)
> +{
> +    PageSetChecksumInplace(page, blocknum);
> +
> +    smgrwrite(smgrrel, forknum, blocknum, (char *) page, true);
> +}

Seems several of these should have some minimal documentation?


> +/*
> + * When writing data outside shared buffers, a concurrent CHECKPOINT can move
> + * the redo pointer past our WAL entries and won't flush our data to disk. If
> + * the database crashes before the data makes it to disk, our WAL won't be
> + * replayed and the data will be lost.
> + * Thus, if a CHECKPOINT begins between unbuffered_prep() and
> + * unbuffered_finish(), the backend must fsync the data itself.
> + */

Hm. The last sentence sounds like this happens conditionally, but it doesn't
at this point.



> +typedef struct UnBufferedWriteState
> +{
> +    /*
> +     * When writing WAL-logged relation data outside of shared buffers, there
> +     * is a risk of a concurrent CHECKPOINT moving the redo pointer past the
> +     * data's associated WAL entries. To avoid this, callers in this situation
> +     * must fsync the pages they have written themselves. This is necessary
> +     * only if the relation is WAL-logged or in special cases such as the init
> +     * fork of an unlogged index.
> +     */
> +    bool do_wal;
> +} UnBufferedWriteState;
> +/*
> + * prototypes for functions in directmgr.c
> + */

Newline in between end of struct and comment.

> +extern void
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);

In headers we don't put the return type on a separate line :/





> --- a/contrib/bloom/blinsert.c
> +++ b/contrib/bloom/blinsert.c
> @@ -173,7 +173,7 @@ blbuildempty(Relation index)
>       * internally. However, were this to be replaced with unbuffered_extend(),
>       * do_wal must be true to ensure the data is logged and fsync'd.
>       */
> -    unbuffered_prep(&wstate, true);
> +    unbuffered_prep(&wstate, true, false);

This makes me think this really should be a flag argument...

I also don't like the current name of the parameter; "do_optimization"
doesn't explain much.


> +bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> +{

newline after return type.

>  void
> -unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal, bool
> +        do_optimization)

See earlier comments about documentation and parameter naming...


> +     * These callers can optionally use the following optimization: attempt to
> +     * use the sync request queue and fall back to fsync'ing the pages
> +     * themselves if the Redo pointer moves between the start and finish of
> +     * their write. In order to do this, they must set do_optimization to true
> +     * so that the redo pointer is saved before the write begins.
>       */

When do we not want this?



> From 17fb22142ade65fdbe8c90889e49d0be60ba45e4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 4 Mar 2022 15:53:05 -0500
> Subject: [PATCH v6 3/4] BTree index use unbuffered IO optimization
> 
> While building a btree index, the backend can avoid fsync'ing all of the
> pages if it uses the optimization introduced in a prior commit.
> 
> This can substantially improve performance when many indexes are being
> built during DDL operations.
> ---
>  src/backend/access/nbtree/nbtree.c  | 2 +-
>  src/backend/access/nbtree/nbtsort.c | 6 +++++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index 6b78acefbe..fc5cce4603 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -161,7 +161,7 @@ btbuildempty(Relation index)
>       * internally. However, were this to be replaced with unbuffered_extend(),
>       * do_wal must be true to ensure the data is logged and fsync'd.
>       */
> -    unbuffered_prep(&wstate, true, false);
> +    unbuffered_prep(&wstate, true, true);
>  
>      /* Construct metapage. */
>      metapage = (Page) palloc(BLCKSZ);
> diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
> index d6d0d4b361..f1b9e2e24e 100644
> --- a/src/backend/access/nbtree/nbtsort.c
> +++ b/src/backend/access/nbtree/nbtsort.c
> @@ -1189,7 +1189,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
>      int64        tuples_done = 0;
>      bool        deduplicate;
>  
> -    unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
> +    /*
> +     * The fsync optimization done by directmgr is only relevant if
> +     * WAL-logging, so pass btws_use_wal for this parameter.
> +     */
> +    unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, wstate->btws_use_wal);
>  
>      deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
>          BTGetDeduplicateItems(wstate->index);

Why just here?



> From 377c195bccf2dd2529e64d0d453104485f7662b7 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 4 Mar 2022 15:52:45 -0500
> Subject: [PATCH v6 4/4] Use shared buffers when possible for index build
> 
> When there are not too many tuples, building the index in shared buffers
> makes sense. It allows the buffer manager to handle how best to do the
> IO.
> ---

Perhaps it'd be worth making this an independent patch that could be applied
separately?


>  src/backend/access/nbtree/nbtree.c  |  32 ++--
>  src/backend/access/nbtree/nbtsort.c | 273 +++++++++++++++++++++-------
>  2 files changed, 223 insertions(+), 82 deletions(-)
> 
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index fc5cce4603..d3982b9388 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -152,34 +152,24 @@ void
>  btbuildempty(Relation index)
>  {
>      Page        metapage;
> -    UnBufferedWriteState wstate;
> +    Buffer metabuf;
>  
>      /*
> -     * Though this is an unlogged relation, pass do_wal=true since the init
> -     * fork of an unlogged index must be wal-logged and fsync'd. This currently
> -     * has no effect, as unbuffered_write() does not do log_newpage()
> -     * internally. However, were this to be replaced with unbuffered_extend(),
> -     * do_wal must be true to ensure the data is logged and fsync'd.
> +     * Allocate a buffer for metapage and initialize metapage.
>       */
> -    unbuffered_prep(&wstate, true, true);
> -
> -    /* Construct metapage. */
> -    metapage = (Page) palloc(BLCKSZ);
> +    metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK,
> +            NULL);
> +    metapage = BufferGetPage(metabuf);
>      _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
>  
>      /*
> -     * Write the page and log it.  It might seem that an immediate sync would
> -     * be sufficient to guarantee that the file exists on disk, but recovery
> -     * itself might remove it while replaying, for example, an
> -     * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
> -     * this even when wal_level=minimal.
> +     * Mark metapage buffer as dirty and XLOG it
>       */
> -    unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
> -            BTREE_METAPAGE, metapage);
> -    log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
> -                BTREE_METAPAGE, metapage, true);
> -
> -    unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
> +    START_CRIT_SECTION();
> +    MarkBufferDirty(metabuf);
> +    log_newpage_buffer(metabuf, true);
> +    END_CRIT_SECTION();
> +    _bt_relbuf(index, metabuf);
>  }

I don't understand why this patch changes btbuildempty()? That data is never
accessed again, so it doesn't really seem beneficial to shuffle it through
shared buffers, even if we benefit from using s_b for the main fork...



> +    /*
> +     * If not using shared buffers, for a WAL-logged relation, save the redo
> +     * pointer location in case a checkpoint begins during the index build.
> +     */
> +    if (wstate->_bt_bl_unbuffered_prep)
> +        wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
> +                wstate->btws_use_wal, wstate->btws_use_wal);

Code would probably look cleaner if an empty callback were used when no
_bt_bl_unbuffered_prep callback is needed.
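
I.e. something like (hypothetical):

static void
unbuffered_prep_noop(UnBufferedWriteState *wstate, bool do_wal,
                     bool do_optimization)
{
    /* nothing to do when the build goes through shared buffers */
}

/* ... and the call site then needs no NULL check: */
wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
                               wstate->btws_use_wal, wstate->btws_use_wal);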



>  /*
> - * allocate workspace for a new, clean btree page, not linked to any siblings.
> + * Set up workspace for a new, clean btree page, not linked to any siblings.
> + * Caller must allocate the passed in page.

More interesting bit seems to be whether the passed-in page needs to be
initialized?


> @@ -1154,20 +1285,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
>           * back one slot.  Then we can dump out the page.
>           */
>          _bt_slideleft(s->btps_page);
> -        _bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
> +        wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
> +        s->btps_buf = InvalidBuffer;
>          s->btps_page = NULL;    /* writepage freed the workspace */
>      }

Do we really have to add underscores even to struct members? That just looks
wrong.


Greetings,

Andres Freund



Re: Avoiding smgrimmedsync() during nbtree index builds

From
Greg Stark
Date:
It looks like this patch received a review from Andres and hasn't been
updated since. I'm not sure, but the review looks to me like it's not
ready to commit and needs some cleanup, or at least a check on a few
things.

I guess it's not going to get bumped in the next few days so I'll move
it to the next CF.



Re: Avoiding smgrimmedsync() during nbtree index builds

From
Heikki Linnakangas
Date:
On 05/03/2022 00:03, Melanie Plageman wrote:
> On Wed, Mar 2, 2022 at 8:09 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>>
>> Rebased to appease cfbot.
>>
>> I ran these patches under a branch which shows code coverage in cirrus.  It
>> looks good to my eyes.
>> https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html
> 
> thanks for doing that and for the rebase! since I made updates, the
> attached version 6 is also rebased.

I'm surprised by all the changes in nbtsort.c to choose between using 
the buffer manager or the new API. I would've expected the new API to 
abstract that away. Otherwise we need to copy that logic to all the 
index AMs.

I'd suggest an API along the lines of:

/*
  * Start building a relation in bulk.
  *
  * If the relation is going to be small, we will use the buffer manager,
  * but if it's going to be large, this will skip the buffer manager and
  * write the pages directly to disk.
  */
bulk_init(SmgrRelation smgr, BlockNumber estimated_size)

/*
  * Extend relation by one page
  */
bulk_extend(SmgrRelation, BlockNumber, Page)

/*
  * Finish building the relation
  *
  * This will fsync() the data to disk, if required.
  */
bulk_finish()


Behind this interface, you can encapsulate the logic to choose whether 
to use the buffer manager or not. And I think bulk_extend() could do the 
WAL-logging too.

Or you could make the interface look more like the buffer manager:

/* as above */
bulk_init(SmgrRelation smgr, BlockNumber estimated_size)
bulk_finish()

/*
  * Get a buffer for writing out a page.
  *
  * The contents of the buffer are uninitialized. The caller
  * must fill it in before releasing it.
  */
BulkBuffer
bulk_getbuf(SmgrRelation smgr, BlockNumber blkno)

/*
  * Release buffer. It will be WAL-logged and written out to disk.
  * Not necessarily immediately, but at bulk_finish() at latest.
  *
  * NOTE: There is no way to read back the page after you release
  * it, until you finish the build with bulk_finish().
  */
void
bulk_releasebuf(SmgrRelation smgr, BulkBuffer buf)


With this interface, you could also batch multiple pages and WAL-log 
them all in one WAL record with log_newpage_range(), for example.
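
A caller like nbtsort.c could then be reduced to roughly the following
(first variant; all names are from the sketch above, so hypothetical):

bulk_init(RelationGetSmgr(index), estimated_nblocks);

for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    bulk_extend(RelationGetSmgr(index), blkno, pages[blkno]);

bulk_finish();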

- Heikki



Re: Avoiding smgrimmedsync() during nbtree index builds

From
Jacob Champion
Date:
This entry has been waiting on author input for a while (our current
threshold is roughly two weeks), so I've marked it Returned with
Feedback.

Once you think the patchset is ready for review again, you (or any
interested party) can resurrect the patch entry by visiting

    https://commitfest.postgresql.org/38/3508/

and changing the status to "Needs Review", and then changing the
status again to "Move to next CF". (Don't forget the second step;
hopefully we will have streamlined this in the near future!)

Thanks,
--Jacob