Re: Avoiding smgrimmedsync() during nbtree index builds

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Avoiding smgrimmedsync() during nbtree index builds
Дата
Msg-id 20211209193351.e5lzpvxgettg5bx6@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: Avoiding smgrimmedsync() during nbtree index builds  (Melanie Plageman <melanieplageman@gmail.com>)
Ответы Re: Avoiding smgrimmedsync() during nbtree index builds  (Melanie Plageman <melanieplageman@gmail.com>)
Список pgsql-hackers
Hi,

On 2021-11-19 15:11:57 -0500, Melanie Plageman wrote:
> From 2130175c5d794f60c5f15d6eb1b626c81fb7c39b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Thu, 15 Apr 2021 07:01:01 -0400
> Subject: [PATCH v2] Index build avoids immed fsync
> 
> Avoid immediate fsync for just built indexes either by using shared
> buffers or by leveraging checkpointer's SyncRequest queue. When a
> checkpoint begins during the index build, if not using shared buffers,
> the backend will have to do its own fsync.
> ---
>  src/backend/access/nbtree/nbtree.c  |  39 +++---
>  src/backend/access/nbtree/nbtsort.c | 186 +++++++++++++++++++++++-----
>  src/backend/access/transam/xlog.c   |  14 +++
>  src/include/access/xlog.h           |   1 +
>  4 files changed, 188 insertions(+), 52 deletions(-)
> 
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index 40ad0956e0..a2e32f18e6 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -150,30 +150,29 @@ void
>  btbuildempty(Relation index)
>  {
>      Page        metapage;
> +    Buffer metabuf;
>  
> -    /* Construct metapage. */
> -    metapage = (Page) palloc(BLCKSZ);
> -    _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
> -
> +    // TODO: test this.

Shouldn't this path have plenty coverage?


>      /*
> -     * Write the page and log it.  It might seem that an immediate sync would
> -     * be sufficient to guarantee that the file exists on disk, but recovery
> -     * itself might remove it while replaying, for example, an
> -     * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
> -     * this even when wal_level=minimal.
> +     * Construct metapage.
> +     * Because we don't need to lock the relation for extension (since
> +     * noone knows about it yet) and we don't need to initialize the
> +     * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
> +     * (with P_NEW and BT_WRITE) is overkill.

Isn't the more relevant operation the log_newpage_buffer()?


> However, it might be worth
> +     * either modifying it or adding a new helper function instead of
> +     * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
> +     * because we want to hold an exclusive lock on the buffer content
>       */

"modifying it" refers to what?

I don't see a problem using ReadBufferExtended() here. Why would you like to
replace it with something else?



> +    /*
> +     * Based on the number of tuples, either create a buffered or unbuffered
> +     * write state. if the number of tuples is small, make a buffered write
> +     * if the number of tuples is larger, then we make an unbuffered write state
> +     * and must ensure that we check the redo pointer to know whether or not we
> +     * need to fsync ourselves
> +     */
>  
>      /*
>       * Finish the build by (1) completing the sort of the spool file, (2)
>       * inserting the sorted tuples into btree pages and (3) building the upper
>       * levels.  Finally, it may also be necessary to end use of parallelism.
>       */
> -    _bt_leafbuild(buildstate.spool, buildstate.spool2);
> +    if (reltuples > 1000)

I'm ok with some random magic constant here, but it seems worht putting it in
some constant / #define to make it more obvious.

> +        _bt_leafbuild(buildstate.spool, buildstate.spool2, false);
> +    else
> +        _bt_leafbuild(buildstate.spool, buildstate.spool2, true);

Why duplicate the function call?


>  /*
>   * allocate workspace for a new, clean btree page, not linked to any siblings.
> + * If index is not built in shared buffers, buf should be InvalidBuffer
>   */
>  static Page
> -_bt_blnewpage(uint32 level)
> +_bt_blnewpage(uint32 level, Buffer buf)
>  {
>      Page        page;
>      BTPageOpaque opaque;
>  
> -    page = (Page) palloc(BLCKSZ);
> +    if (buf)
> +        page = BufferGetPage(buf);
> +    else
> +        page = (Page) palloc(BLCKSZ);
>  
>      /* Zero the page and set up standard page header info */
>      _bt_pageinit(page, BLCKSZ);

Ick, that seems pretty ugly API-wise and subsequently too likely to lead to
pfree()ing a page that's actually in shared buffers. I think it'd make sense
to separate the allocation from the initialization bits?


> @@ -635,8 +657,20 @@ _bt_blnewpage(uint32 level)
>   * emit a completed btree page, and release the working storage.
>   */
>  static void
> -_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> +_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
>  {
> +    if (wstate->use_shared_buffers)
> +    {
> +        Assert(buf);
> +        START_CRIT_SECTION();
> +        MarkBufferDirty(buf);
> +        if (wstate->btws_use_wal)
> +            log_newpage_buffer(buf, true);
> +        END_CRIT_SECTION();
> +        _bt_relbuf(wstate->index, buf);
> +        return;
> +    }
> +
>      /* XLOG stuff */
>      if (wstate->btws_use_wal)
>      {
> @@ -659,7 +693,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>          smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
>                     wstate->btws_pages_written++,
>                     (char *) wstate->btws_zeropage,
> -                   true);
> +                   false);
>      }

Is there a good place to document the way we ensure durability for this path?


> +    /*
> +     * Extend the index relation upfront to reserve the metapage
> +     */
> +    if (wstate->use_shared_buffers)
> +    {
> +        /*
> +         * We should not need to LockRelationForExtension() as no one else knows
> +         * about this index yet?
> +         * Extend the index relation by one block for the metapage. _bt_getbuf()
> +         * is not used here as it does _bt_pageinit() which is one later by

*done


> +         * _bt_initmetapage(). We will fill in the metapage and write it out at
> +         * the end of index build when we have all of the information required
> +         * for the metapage. However, we initially extend the relation for it to
> +         * occupy block 0 because it is much easier when using shared buffers to
> +         * extend the relation with a block number that is always increasing by
> +         * 1.

Not quite following what you're trying to get at here. There generally is no
way to extend a relation except by increasing block numbers?


> @@ -1425,7 +1544,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
>       * still not be on disk when the crash occurs.
>       */
>      if (wstate->btws_use_wal)
> -        smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +    {
> +        if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
> +            smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +    }
>  }
>  
>  /*

This needs documentation. The whole comment above isn't accurate anymore afaict?


> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 1616448368..63fd212787 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8704,6 +8704,20 @@ GetRedoRecPtr(void)
>      return RedoRecPtr;
>  }
>  
> +bool
> +RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> +{
> +    XLogRecPtr    ptr;
> +
> +    SpinLockAcquire(&XLogCtl->info_lck);
> +    ptr = XLogCtl->RedoRecPtr;
> +    SpinLockRelease(&XLogCtl->info_lck);
> +
> +    if (RedoRecPtr < ptr)
> +        RedoRecPtr = ptr;
> +    return RedoRecPtr != comparator_ptr;
> +}

What's the deal with the < comparison?


Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: John Naylor
Дата:
Сообщение: do only critical work during single-user vacuum?
Следующее
От: Andres Freund
Дата:
Сообщение: Re: A test for replay of regression tests