Re: Add 64-bit XIDs into PostgreSQL 15

From: Andres Freund
Subject: Re: Add 64-bit XIDs into PostgreSQL 15
Message-ID: 20220128224307.f2h3aebujskzjwcl@alap3.anarazel.de
In reply to: Re: Add 64-bit XIDs into PostgreSQL 15 (Pavel Borisov <pashkin.elfe@gmail.com>)
Responses: Re: Add 64-bit XIDs into PostgreSQL 15 (Pavel Borisov <pashkin.elfe@gmail.com>)
List: pgsql-hackers

Hi,

On 2022-01-24 16:38:54 +0400, Pavel Borisov wrote:
> +64-bit Transaction ID's (XID)
> +=============================
> +
> +A limited number (N = 2^32) of XID's required to do vacuum freeze to prevent
> +wraparound every N/2 transactions. This causes performance degradation due
> +to the need to exclusively lock tables while being vacuumed. In each
> +wraparound cycle, SLRU buffers are also being cut.

What exclusive lock?
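
For context on the N/2 figure in the quoted paragraph: 32-bit XIDs are compared modulo 2^32, so an XID is seen as older only while it is fewer than 2^31 transactions behind. The following is a minimal sketch of that comparison next to a plain 64-bit one. The function names are invented for illustration, the special permanent XIDs are ignored, and none of this is code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: circular comparison of two 32-bit XIDs.
 * The modulo-2^32 difference is reinterpreted as a signed value.
 */
static bool
xid32_precedes(uint32_t a, uint32_t b)
{
    int32_t diff = (int32_t) (a - b);

    return diff < 0;
}

/* With 64 bits no wraparound is expected, so a plain comparison is enough. */
static bool
xid64_precedes(uint64_t a, uint64_t b)
{
    return a < b;
}

static void
xid_compare_demo(void)
{
    /* Ordinary case. */
    assert(xid32_precedes(100u, 200u));

    /* Still correct shortly after the 32-bit counter wraps. */
    assert(xid32_precedes(4294967290u, 5u));

    /* More than 2^31 ahead: the newer 32-bit XID now looks older. */
    assert(xid32_precedes(5u + 2147483649u, 5u));

    /* The same pair compares in the expected order with 64 bits. */
    assert(xid64_precedes(UINT64_C(5), UINT64_C(5) + UINT64_C(2147483649)));
}
```

The third assertion is the case that freezing exists to prevent: once two live XIDs are more than 2^31 apart, the 32-bit comparison gives the wrong answer.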


> +"Double XMAX" page format
> +---------------------------------
> +
> +At first read of a heap page after pg_upgrade from 32-bit XID PostgreSQL
> +version pd_special area with a size of 16 bytes should be added to a page.
> +Though a page may not have space for this. Then it can be converted to a
> +temporary format called "double XMAX".
>
> +All tuples after pg-upgrade would necessarily have xmin = FrozenTransactionId.

Why would a tuple after pg-upgrade necessarily have xmin =
FrozenTransactionId? A pg_upgrade doesn't scan the tables, so the pg_upgrade
itself doesn't do anything to xmins.

I guess you mean that the xmin cannot be needed anymore, because no older
transaction can be running?
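
To make the choice between the two page formats concrete, here is a rough sketch of the decision the quoted text describes: add the 16-byte special area if the page has room, otherwise fall back to "double XMAX". The struct, enum and function names are hypothetical, the header is reduced to three offsets, and space that could be reclaimed from dead tuples is not considered.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_XID_SPECIAL_SIZE 16u /* size of the new special area per the quoted text */

/* Hypothetical, reduced page header: only the offsets needed here. */
typedef struct SketchPageHeader
{
    uint16_t pd_lower;   /* end of the line pointer array */
    uint16_t pd_upper;   /* start of tuple data */
    uint16_t pd_special; /* start of special space (page end if none) */
} SketchPageHeader;

typedef enum SketchConversion
{
    SKETCH_ADD_SPECIAL, /* shift tuples and add the 16-byte special area */
    SKETCH_DOUBLE_XMAX  /* no room: use the temporary format */
} SketchConversion;

static SketchConversion
sketch_choose_conversion(const SketchPageHeader *hdr)
{
    unsigned free_bytes;

    /* Basic sanity of the offsets before doing arithmetic on them. */
    assert(hdr->pd_lower <= hdr->pd_upper);
    assert(hdr->pd_upper <= hdr->pd_special);

    free_bytes = (unsigned) hdr->pd_upper - (unsigned) hdr->pd_lower;

    return free_bytes >= SKETCH_XID_SPECIAL_SIZE
        ? SKETCH_ADD_SPECIAL
        : SKETCH_DOUBLE_XMAX;
}

static void
sketch_conversion_demo(void)
{
    SketchPageHeader roomy = {64, 8000, 8192}; /* plenty of free space */
    SketchPageHeader full = {200, 208, 8192};  /* only 8 bytes free */

    assert(sketch_choose_conversion(&roomy) == SKETCH_ADD_SPECIAL);
    assert(sketch_choose_conversion(&full) == SKETCH_DOUBLE_XMAX);
}
```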


> +In-memory tuple format
> +----------------------
> +
> +In-memory tuple representation consists of two parts:
> +- HeapTupleHeader from disk page (contains all heap tuple contents, not only
> +header)
> +- HeapTuple with additional in-memory fields
> +
> +HeapTuple for each tuple in memory stores t_xid_base/t_multi_base - a copies of
> +page's pd_xid_base/pd_multi_base. With tuple's 32-bit t_xmin and t_xmax from
> +HeapTupleHeader they are used to calculate actual 64-bit XMIN and XMAX:
> +
> +XMIN = t_xmin + t_xid_base.                     (3)
> +XMAX = t_xmax + t_xid_base/t_multi_base.        (4)

What identifies a HeapTuple as having this additional data?
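
For reference, formulas (3) and (4) from the quoted text can be sketched as follows. The struct is a hypothetical stand-in for the in-memory tuple, not the patch's actual HeapTuple definition, and the xmax_is_multi flag is an assumption about how the base for XMAX is selected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the in-memory fields named in the quoted text. */
typedef struct SketchTuple
{
    uint32_t t_xmin;        /* 32-bit value stored on the page */
    uint32_t t_xmax;        /* 32-bit value stored on the page */
    uint64_t t_xid_base;    /* copy of the page's pd_xid_base */
    uint64_t t_multi_base;  /* copy of the page's pd_multi_base */
    bool     xmax_is_multi; /* assumption: selects which base applies to xmax */
} SketchTuple;

/* Formula (3): XMIN = t_xmin + t_xid_base. */
static uint64_t
sketch_xmin64(const SketchTuple *tup)
{
    return tup->t_xid_base + tup->t_xmin;
}

/* Formula (4): XMAX = t_xmax + t_xid_base or t_multi_base. */
static uint64_t
sketch_xmax64(const SketchTuple *tup)
{
    uint64_t base = tup->xmax_is_multi ? tup->t_multi_base : tup->t_xid_base;

    return base + tup->t_xmax;
}

static void
sketch_tuple_demo(void)
{
    SketchTuple tup = {
        .t_xmin = 1000,
        .t_xmax = 42,
        .t_xid_base = UINT64_C(5000000000),   /* beyond the 32-bit range */
        .t_multi_base = UINT64_C(7000000000),
        .xmax_is_multi = true
    };

    assert(sketch_xmin64(&tup) == UINT64_C(5000001000));
    assert(sketch_xmax64(&tup) == UINT64_C(7000000042));
}
```

Note that the result is only meaningful if the 32-bit value and the base were read consistently with each other, since both inputs come from the page.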


> +The downside of this is that we can not use tuple's XMIN and XMAX right away.
> +We often need to re-read t_xmin and t_xmax - which could actually be pointers
> +into a page in shared buffers and therefore they could be updated by any other
> +backend.

Ugh, that's not great.


> +Upgrade from 32-bit XID versions
> +--------------------------------
> +
> +pg_upgrade doesn't change pages format itself. It is done lazily after.
> +
> +1. At first heap page read, tuples on a page are repacked to free 16 bytes
> +at the end of a page, possibly freeing space from dead tuples.

That will cause a *massive* torrent of writes after an upgrade. Isn't this
practically making pg_upgrade useless?  Imagine a huge cluster where most of
the pages are all-frozen, upgraded using link mode.
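
A back-of-the-envelope sketch of the write volume, assuming every heap page is eventually read and converted, the default 8 kB block size, and one uncompressed full-page image in WAL per converted page (the convert_page() code quoted further down registers the buffer with GENERIC_XLOG_FULL_IMAGE). The names are hypothetical and WAL record overhead is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ UINT64_C(8192) /* PostgreSQL's default block size */

typedef struct SketchRewriteCost
{
    uint64_t pages;      /* heap pages that get converted */
    uint64_t heap_bytes; /* bytes rewritten in the data files */
    uint64_t wal_bytes;  /* bytes of full-page images in WAL */
} SketchRewriteCost;

/* Estimate the cost of lazily converting a heap of the given size. */
static SketchRewriteCost
sketch_rewrite_cost(uint64_t heap_size_bytes)
{
    SketchRewriteCost cost;

    assert(heap_size_bytes % SKETCH_BLCKSZ == 0);

    cost.pages = heap_size_bytes / SKETCH_BLCKSZ;
    cost.heap_bytes = cost.pages * SKETCH_BLCKSZ;
    /* Assumption: one uncompressed full-page image per converted page. */
    cost.wal_bytes = cost.pages * SKETCH_BLCKSZ;

    return cost;
}
```

Under these assumptions a 1 TiB heap is 134,217,728 pages, which means roughly 1 TiB of data file writes plus roughly 1 TiB of WAL, on a cluster where link mode otherwise avoids copying the data at all.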


What happens if the first access happens on a replica?


What is the approach for dealing with multixact files? They have xids
embedded?  And currently the SLRUs will break if you just let the offsets SLRU
grow without bounds.



> +void
> +convert_page(Relation rel, Page page, Buffer buf, BlockNumber blkno)
> +{
> +    PageHeader    hdr = (PageHeader) page;
> +    GenericXLogState *state = NULL;
> +    Page    tmp_page = page;
> +    uint16    checksum;
> +
> +    if (!rel)
> +        return;
> +
> +    /* Verify checksum */
> +    if (hdr->pd_checksum)
> +    {
> +        checksum = pg_checksum_page((char *) page, blkno);
> +        if (checksum != hdr->pd_checksum)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INDEX_CORRUPTED),
> +                     errmsg("page verification failed, calculated checksum %u but expected %u",
> +                            checksum, hdr->pd_checksum)));
> +    }
> +
> +    /* Start xlog record */
> +    if (!XactReadOnly && XLogIsNeeded() && RelationNeedsWAL(rel))
> +    {
> +        state = GenericXLogStart(rel);
> +        tmp_page = GenericXLogRegisterBuffer(state, buf, GENERIC_XLOG_FULL_IMAGE);
> +    }
> +
> +    PageSetPageSizeAndVersion((hdr), PageGetPageSize(hdr),
> +                              PG_PAGE_LAYOUT_VERSION);
> +
> +    if (was_32bit_xid(hdr))
> +    {
> +        switch (rel->rd_rel->relkind)
> +        {
> +            case 'r':
> +            case 'p':
> +            case 't':
> +            case 'm':
> +                convert_heap(rel, tmp_page, buf, blkno);
> +                break;
> +            case 'i':
> +                /* no need to convert index */
> +            case 'S':
> +                /* no real need to convert sequences */
> +                break;
> +            default:
> +                elog(ERROR,
> +                     "Conversion for relkind '%c' is not implemented",
> +                     rel->rd_rel->relkind);
> +        }
> +    }
> +
> +    /*
> +     * Mark buffer dirty unless this is a read-only transaction (e.g. query
> +     * is running on hot standby instance)
> +     */
> +    if (!XactReadOnly)
> +    {
> +        /* Finish xlog record */
> +        if (XLogIsNeeded() && RelationNeedsWAL(rel))
> +        {
> +            Assert(state != NULL);
> +            GenericXLogFinish(state);
> +        }
> +
> +        MarkBufferDirty(buf);
> +    }
> +
> +    hdr = (PageHeader) page;
> +    hdr->pd_checksum = pg_checksum_page((char *) page, blkno);
> +}

Wait. So you just modify the page without WAL logging or marking it dirty on a
standby? I fail to see how that can be correct.

Imagine the cluster is promoted, the page is dirtied, and we write it
out. You'll have written out a completely changed page, without any WAL
logging. There are plenty of other scenarios.
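
The promotion scenario can be walked through with a toy model. This is only a sketch: the struct and functions are hypothetical, a 16-byte string stands in for a page, and the control flow mirrors just the XactReadOnly branches of the quoted convert_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model of one page on one node. */
typedef struct SketchNode
{
    bool read_only;          /* stands in for XactReadOnly on a hot standby */
    bool buffer_dirty;       /* whether the buffer will be written out */
    int  conversions_logged; /* WAL records that describe a conversion */
    char buffer[16];         /* page image in shared buffers */
    char disk[16];           /* page image on disk */
} SketchNode;

/* Mirrors the quoted flow: WAL and dirtying are skipped when read-only. */
static void
sketch_convert_page(SketchNode *node)
{
    if (!node->read_only)
        node->conversions_logged++;

    /* The in-place change happens in either case. */
    strcpy(node->buffer, "new-format");

    if (!node->read_only)
        node->buffer_dirty = true;
}

/* Some later change after promotion that dirties the same buffer. */
static void
sketch_unrelated_change(SketchNode *node)
{
    assert(!node->read_only);
    node->buffer_dirty = true;
}

/* Write the buffer out if it is dirty. */
static void
sketch_flush(SketchNode *node)
{
    if (node->buffer_dirty)
    {
        memcpy(node->disk, node->buffer, sizeof(node->disk));
        node->buffer_dirty = false;
    }
}

static void
sketch_promotion_demo(void)
{
    SketchNode node = {
        .read_only = true,
        .buffer = "old-format",
        .disk = "old-format"
    };

    sketch_convert_page(&node);    /* first read happens on the standby */
    assert(!node.buffer_dirty && node.conversions_logged == 0);

    node.read_only = false;        /* promotion */
    sketch_unrelated_change(&node);
    sketch_flush(&node);

    /* The converted page reached disk, yet no WAL describes the conversion. */
    assert(strcmp(node.disk, "new-format") == 0);
    assert(node.conversions_logged == 0);
}
```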


Greetings,

Andres Freund


