Re: Error with index on unlogged table

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Error with index on unlogged table
Дата
Msg-id 20151208130954.GT4934@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: Error with index on unlogged table  (Michael Paquier <michael.paquier@gmail.com>)
Ответы Re: Error with index on unlogged table  (Michael Paquier <michael.paquier@gmail.com>)
Список pgsql-hackers
Hi,

On 2015-12-04 21:57:54 +0900, Michael Paquier wrote:
> On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund <andres@anarazel.de> wrote:
> >> Let's go for XLOG_FPI_FLUSH.
> >
> > I think the other way is a bit better, because we can add new flags
> > without changing the WAL format.
> 
> Hm. On the contrary, I think that it would make more sense to have a
> flag as well for FOR_HINT honestly, those are really the same
> operations, and FOR_HINT is just here for statistic purposes.

I don't think it's all that much the same operation. And WRT statistics
purpose: Being able to easily differentiate FOR_HINT is important for
pg_xlogdump --stats, but not at all for XLOG_FPI_FLUSH.

> >> --- a/src/backend/access/transam/xloginsert.c
> >> +++ b/src/backend/access/transam/xloginsert.c
> >> @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
> >>   * If the page follows the standard page layout, with a PageHeader and unused
> >>   * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows
> >>   * the unused space to be left out from the WAL record, making it smaller.
> >> + *
> >> + * If 'is_flush' is set to TRUE, relation will be requested to flush
> >> + * immediately its data at replay after replaying this full page image.
> >>   */
> >
> > s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the
> > OS immediately after replaying the record'?
> 
> s/OS/stable storage?

It's not flushed to stable storage here. Just to the OS.

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> index 99337b0..fff48ab 100644
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS)
>      brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
>                         BRIN_CURRENT_VERSION);
>      MarkBufferDirty(metabuf);
> -    log_newpage_buffer(metabuf, false);
> +
> +    /*
> +     * For unlogged relations, this page should be immediately flushed
> +     * to disk after being replayed. This is necessary to ensure that the
> +     * initial on-disk state of unlogged relations is preserved as the
> +     * on-disk files are copied directly at the end of recovery.
> +     */
> +    log_newpage_buffer(metabuf, false,
> +        index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
>      END_CRIT_SECTION();
>  
>      UnlockReleaseBuffer(metabuf);
> diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
> index f876f62..572fe20 100644
> --- a/src/backend/access/brin/brin_pageops.c
> +++ b/src/backend/access/brin/brin_pageops.c
> @@ -865,7 +865,7 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
>      page = BufferGetPage(buffer);
>      brin_page_init(page, BRIN_PAGETYPE_REGULAR);
>      MarkBufferDirty(buffer);
> -    log_newpage_buffer(buffer, true);
> +    log_newpage_buffer(buffer, true, false);
>      END_CRIT_SECTION();
>  
>      /*
> diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
> index 49e9185..17c168a 100644
> --- a/src/backend/access/gin/gininsert.c
> +++ b/src/backend/access/gin/gininsert.c
> @@ -450,14 +450,22 @@ ginbuildempty(PG_FUNCTION_ARGS)
>          ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
>      LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
>  
> -    /* Initialize and xlog metabuffer and root buffer. */
> +    /*
> +     * Initialize and xlog metabuffer and root buffer. For unlogged
> +     * relations, those pages need to be immediately flushed to disk
> +     * after being replayed. This is necessary to ensure that the
> +     * initial on-disk state of unlogged relations is preserved when
> +     * they get reset at the end of recovery.
> +     */
>      START_CRIT_SECTION();
>      GinInitMetabuffer(MetaBuffer);
>      MarkBufferDirty(MetaBuffer);
> -    log_newpage_buffer(MetaBuffer, false);
> +    log_newpage_buffer(MetaBuffer, false,
> +        index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
>      GinInitBuffer(RootBuffer, GIN_LEAF);
>      MarkBufferDirty(RootBuffer);
> -    log_newpage_buffer(RootBuffer, false);
> +    log_newpage_buffer(RootBuffer, false,
> +        index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
>      END_CRIT_SECTION();
>  
>      /* Unlock and release the buffers. */
> diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
> index 53bccf6..6a20031 100644
> --- a/src/backend/access/gist/gist.c
> +++ b/src/backend/access/gist/gist.c
> @@ -84,7 +84,8 @@ gistbuildempty(PG_FUNCTION_ARGS)
>      START_CRIT_SECTION();
>      GISTInitBuffer(buffer, F_LEAF);
>      MarkBufferDirty(buffer);
> -    log_newpage_buffer(buffer, true);
> +    log_newpage_buffer(buffer, true,
> +        index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
>      END_CRIT_SECTION();
>

I might be missing something here - but isn't it pretty much guaranteed
that all these are unlogged relations? Otherwise *buildempty() wouldn't
have been called, right?

>      else if (info == XLOG_BACKUP_END)
>      {
> @@ -178,9 +183,6 @@ xlog_identify(uint8 info)
>          case XLOG_FPI:
>              id = "FPI";
>              break;
> -        case XLOG_FPI_FOR_HINT:
> -            id = "FPI_FOR_HINT";
> -            break;
>      }

Really don't want to do that.

> @@ -9391,14 +9394,34 @@ xlog_redo(XLogReaderState *record)
>           * resource manager needs to generate conflicts, it has to define a
>           * separate WAL record type and redo routine.
>           *
> -         * XLOG_FPI_FOR_HINT records are generated when a page needs to be
> -         * WAL- logged because of a hint bit update. They are only generated
> -         * when checksums are enabled. There is no difference in handling
> -         * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
> -         * code just to distinguish them for statistics purposes.
> +         * Records flagged with 'for_hint_bits' are generated when a page needs
> +         * to be WAL- logged because of a hint bit update. They are only
> +         * generated when checksums are enabled. There is no difference in
> +         * handling records when this flag is set, it is used for statistics
> +         * purposes.
> +         *
> +         * Records flagged with 'is_flush' indicate that the page immediately
> +         * needs to be written to disk, not just to shared buffers. This is
> +         * important if the on-disk state is to be the authoritative, not the
> +         * state in shared buffers. E.g. because on-disk files may later be
> +         * copied directly.
>           */
>          if (XLogReadBufferForRedo(record, 0, &buffer) != BLK_RESTORED)
>              elog(ERROR, "unexpected XLogReadBufferForRedo result when restoring backup block");
> +
> +        if (xlrec.is_flush)
> +        {
> +            RelFileNode    rnode;
> +            ForkNumber    forknum;
> +            BlockNumber    blkno;
> +            SMgrRelation srel;
> +
> +            (void) XLogRecGetBlockTag(record, 0, &rnode, &forknum, &blkno);
> +            srel = smgropen(rnode, InvalidBackendId);
> +            smgrwrite(srel, forknum, blkno, BufferGetPage(buffer), false);
> +            smgrclose(srel);
> +        }

That'd leave the dirty bit set on the buffer...

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: Error with index on unlogged table
Следующее
От: Masahiko Sawada
Дата:
Сообщение: Tab-comletion for RLS