Re: Error with index on unlogged table

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Error with index on unlogged table
Дата
Msg-id 20151203213508.GB28762@awork2.anarazel.de
обсуждение исходный текст
Ответ на Re: Error with index on unlogged table  (Michael Paquier <michael.paquier@gmail.com>)
Ответы Re: Error with index on unlogged table  (Michael Paquier <michael.paquier@gmail.com>)
Список pgsql-hackers
On 2015-11-27 16:59:20 +0900, Michael Paquier wrote:
> Attached is a patch that fixes the issue for me in master and 9.5.
> Actually in the last patch I forgot a call to smgrwrite to ensure that
> the INIT_FORKNUM is correctly synced to disk when those pages are
> replayed at recovery, letting the reset routines for unlogged
> relations do their job correctly. I have noticed as well that we need
> to do the same for gin and brin relations. In this case I think that
> we could limit the flush to unlogged relations, my patch does it
> unconditionally though to generalize the logic. Thoughts?

I think it's a good idea to limit this to unlogged relations. For a
dataset that fits into shared_buffers and creates many relations, the
additional disk writes could be noticeable.

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> index 99337b0..d7964ac 100644
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS)
>      brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
>                         BRIN_CURRENT_VERSION);
>      MarkBufferDirty(metabuf);
> -    log_newpage_buffer(metabuf, false);
> +    /*
> +     * When this full page image is replayed, there is no guarantee that
> +     * this page will be present to disk when replayed particularly for
> +     * unlogged relations, hence enforce it to be flushed to disk.
> +     */

The grammar seems a bit off here.

> +    /*
> +     * Initialize and xlog metabuffer and root buffer. When those full
> +     * pages are replayed, it is not guaranteed that those relation
> +     * init forks will be flushed to disk after replaying them, hence
> +     * enforce those pages to be flushed to disk at replay, only the
> +     * last record will enforce a flush for performance reasons and
> +     * because it is actually unnecessary to do it multiple times.
> +     */

That comment needs some love too. I think it really only makes sense if
you know the current state. There's also some language polishing needed.

> @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS)
>      START_CRIT_SECTION();
>      GinInitMetabuffer(MetaBuffer);
>      MarkBufferDirty(MetaBuffer);
> -    log_newpage_buffer(MetaBuffer, false);
> +    log_newpage_buffer(MetaBuffer, false, false);
>      GinInitBuffer(RootBuffer, GIN_LEAF);
>      MarkBufferDirty(RootBuffer);
> -    log_newpage_buffer(RootBuffer, false);
> +    log_newpage_buffer(RootBuffer, false, true);
>      END_CRIT_SECTION();
>
Why isn't the metapage flushed to disk?

> --- a/src/backend/access/spgist/spginsert.c
> +++ b/src/backend/access/spgist/spginsert.c
> @@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
>                (char *) page, true);
>      if (XLogIsNeeded())
>          log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> -                    SPGIST_METAPAGE_BLKNO, page, false);
> +                    SPGIST_METAPAGE_BLKNO, page, false, false);
>  
>      /* Likewise for the root page. */
>      SpGistInitPage(page, SPGIST_LEAF);
> @@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
>                (char *) page, true);
>      if (XLogIsNeeded())
>          log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> -                    SPGIST_ROOT_BLKNO, page, true);
> +                    SPGIST_ROOT_BLKNO, page, true, false);
>

I don't think it's correct to not flush in these cases. Yes, the primary
does an smgrimmedsync() - but that's not done on the standby.

> @@ -9392,9 +9396,28 @@ xlog_redo(XLogReaderState *record)
>           * when checksums are enabled. There is no difference in handling
>           * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
>           * code just to distinguish them for statistics purposes.
> +         *
> +         * XLOG_FPI_FOR_SYNC records are generated when a relation needs to
> +         * be sync'ed just after replaying a full page. This is important
> +         * to match the master behavior in the case where a page is written
> +         * directly to disk without going through shared buffers, particularly
> +         * when writing init forks for index relations.
>           */

How about

FPI_FOR_SYNC records indicate that the page immediately needs to be
written to disk, not just to shared buffers. This is important if the
on-disk state is to be the authoritative, not the state in shared
buffers. E.g. because on-disk files may later be copied directly.

> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
> index 0ddde72..eb22a51 100644
> --- a/src/backend/commands/tablecmds.c
> +++ b/src/backend/commands/tablecmds.c
> @@ -9920,7 +9920,8 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
>           * space.
>           */
>          if (use_wal)
> -            log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
> +            log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page,
> +                        false, false);

Shouldn't this flush data if it's an init fork? Currently that's an
academic distinction, given there'll never be a page, but it still seems
prudent.

>  extern void InitXLogInsert(void);
> diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
> index ad1eb4b..91445f1 100644
> --- a/src/include/catalog/pg_control.h
> +++ b/src/include/catalog/pg_control.h
> @@ -73,6 +73,7 @@ typedef struct CheckPoint
>  #define XLOG_END_OF_RECOVERY            0x90
>  #define XLOG_FPI_FOR_HINT                0xA0
>  #define XLOG_FPI                        0xB0
> +#define XLOG_FPI_FOR_SYNC                0xC0


I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too
ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or
instead adding actual record data and a 'flags' field in there? I
slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are
different, XLOG_FPI_FOR_SYNC not so much.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: Error with index on unlogged table
Следующее
От: Pavel Raiskup
Дата:
Сообщение: Re: broken tests