Re: Relation bulk write facility

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Relation bulk write facility
Дата
Msg-id 20231119000416.fuyrbo6av7dytmf6@awork3.anarazel.de
обсуждение исходный текст
Ответ на Re: Relation bulk write facility  (Heikki Linnakangas <hlinnaka@iki.fi>)
Ответы Re: Relation bulk write facility  (Heikki Linnakangas <hlinnaka@iki.fi>)
Список pgsql-hackers
Hi,

On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
> The new facility makes it easier to optimize bulk loading, as the
> logic for buffering, WAL-logging, and syncing the relation only needs
> to be implemented once. It's also less error-prone: We have had a
> number of bugs in how a relation is fsync'd - or not - at the end of a
> bulk loading operation. By centralizing that logic to one place, we
> only need to write it correctly once.

One thing I'd like to use the centralized handling for is to track such
writes in pg_stat_io. I don't mean as part of the initial patch, just that
that's another reason I like the facility.


> The new facility is faster for small relations: Instead of of calling
> smgrimmedsync(), we register the fsync to happen at next checkpoint,
> which avoids the fsync latency. That can make a big difference if you
> are e.g. restoring a schema-only dump with lots of relations.

I think this would be a huge win for people running their application tests
against postgres.


> +    bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
> +
>      nblocks = smgrnblocks(src, forkNum);
>  
>      for (blkno = 0; blkno < nblocks; blkno++)
>      {
> +        Page        page;
> +
>          /* If we got a cancel signal during the copy of the data, quit */
>          CHECK_FOR_INTERRUPTS();
>  
> -        smgrread(src, forkNum, blkno, buf.data);
> +        page = bulkw_alloc_buf(bulkw);
> +        smgrread(src, forkNum, blkno, page);
>  
>          if (!PageIsVerifiedExtended(page, blkno,
>                                      PIV_LOG_WARNING | PIV_REPORT_STAT))
> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
>           * page this is, so we have to log the full page including any unused
>           * space.
>           */
> -        if (use_wal)
> -            log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
> -
> -        PageSetChecksumInplace(page, blkno);
> -
> -        /*
> -         * Now write the page.  We say skipFsync = true because there's no
> -         * need for smgr to schedule an fsync for this write; we'll do it
> -         * ourselves below.
> -         */
> -        smgrextend(dst, forkNum, blkno, buf.data, true);
> +        bulkw_write(bulkw, blkno, page, false);

I wonder if bulkw_alloc_buf() is a good name - if you naively read this
change, it looks like it'll just leak memory. It also might be taken to be
valid until freed, which I don't think is the case?


> +/*-------------------------------------------------------------------------
> + *
> + * bulk_write.c
> + *      Efficiently and reliably populate a new relation
> + *
> + * The assumption is that no other backends access the relation while we are
> + * loading it, so we can take some shortcuts.  Alternatively, you can use the
> + * buffer manager as usual, if performance is not critical, but you must not
> + * mix operations through the buffer manager and the bulk loading interface at
> + * the same time.

From "Alternatively" onward this is is somewhat confusing.


> + * We bypass the buffer manager to avoid the locking overhead, and call
> + * smgrextend() directly.  A downside is that the pages will need to be
> + * re-read into shared buffers on first use after the build finishes.  That's
> + * usually a good tradeoff for large relations, and for small relations, the
> + * overhead isn't very significant compared to creating the relation in the
> + * first place.

FWIW, I doubt the "isn't very significant" bit is really true.


> + * One tricky point is that because we bypass the buffer manager, we need to
> + * register the relation for fsyncing at the next checkpoint ourselves, and
> + * make sure that the relation is correctly fsync by us or the checkpointer
> + * even if a checkpoint happens concurrently.

"fsync'ed" or such? Otherwise this reads awkwardly for me.



> + *
> + *
> + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + *      src/backend/storage/smgr/bulk_write.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/xloginsert.h"
> +#include "access/xlogrecord.h"
> +#include "storage/bufmgr.h"
> +#include "storage/bufpage.h"
> +#include "storage/bulk_write.h"
> +#include "storage/proc.h"
> +#include "storage/smgr.h"
> +#include "utils/rel.h"
> +
> +#define MAX_BUFFERED_PAGES XLR_MAX_BLOCK_ID
> +
> +typedef struct BulkWriteBuffer
> +{
> +    Page        page;
> +    BlockNumber blkno;
> +    bool        page_std;
> +    int16        order;
> +} BulkWriteBuffer;
> +

The name makes it sound like this struct itself contains a buffer - but it's
just pointing to one. *BufferRef or such maybe?

I was wondering how you dealt with the alignment of buffers given the struct
definition, which is what lead me to look at this...


> +/*
> + * Bulk writer state for one relation fork.
> + */
> +typedef struct BulkWriteState
> +{
> +    /* Information about the target relation we're writing */
> +    SMgrRelation smgr;

Isn't there a danger of this becoming a dangling pointer? At least until
https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
is merged?


> +    ForkNumber    forknum;
> +    bool        use_wal;
> +
> +    /* We keep several pages buffered, and WAL-log them in batches */
> +    int            nbuffered;
> +    BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
> +
> +    /* Current size of the relation */
> +    BlockNumber pages_written;
> +
> +    /* The RedoRecPtr at the time that the bulk operation started */
> +    XLogRecPtr    start_RedoRecPtr;
> +
> +    Page        zeropage;        /* workspace for filling zeroes */

We really should just have one such page in shared memory somewhere... For WAL
writes as well.

But until then - why do you allocate the page? Seems like we could just use a
static global variable?


> +/*
> + * Write all buffered pages to disk.
> + */
> +static void
> +bulkw_flush(BulkWriteState *bulkw)
> +{
> +    int            nbuffered = bulkw->nbuffered;
> +    BulkWriteBuffer *buffers = bulkw->buffers;
> +
> +    if (nbuffered == 0)
> +        return;
> +
> +    if (nbuffered > 1)
> +    {
> +        int            o;
> +
> +        qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
> +
> +        /*
> +         * Eliminate duplicates, keeping the last write of each block.
> +         * (buffer_cmp uses 'order' as the last sort key)
> +         */

Huh - which use cases would actually cause duplicate writes?


Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alena Rybakina
Дата:
Сообщение: Re: [PoC] Reducing planning time when tables have many partitions
Следующее
От: Noah Misch
Дата:
Сообщение: Re: reindexing an invalid index should not use ERRCODE_INDEX_CORRUPTED