Re: finding changed blocks using WAL scanning

From: Ashwin Agrawal
Subject: Re: finding changed blocks using WAL scanning
Date:
Msg-id: CALfoeitNTWGZsV8uPN8soHFyvbNM8p5ppVfFgEDdgJMknEAptg@mail.gmail.com
In reply to: finding changed blocks using WAL scanning  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

On Wed, Apr 10, 2019 at 2:50 PM Robert Haas <robertmhaas@gmail.com> wrote:
Over at https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com
I mentioned parsing the WAL to extract block references so that
incremental backup could efficiently determine which blocks needed to
be copied.  Ashwin replied in
http://postgr.es/m/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrQQsm8kFO4A@mail.gmail.com
to mention that the same approach could be useful for pg_upgrade:

Thank you for initiating this separate thread. Just a typo above: not pg_upgrade but pg_rewind.

Let me first explain my thoughts on how to leverage this for pg_rewind, or really for any type of incremental recovery. I would love to hear thoughts on it.

Currently, incremental recovery of any form, whether a replica goes down and comes back up or we are trying to bring back the old primary after failing over to a replica, requires *all* the WAL from the point of disconnect to be present. So it is all-or-nothing in those terms: if the WAL is available, we can recover incrementally; otherwise, we have to perform a full basebackup. If we come up with this mechanism to find and store the blocks changed by the WAL, we can provide an intermediate level of incremental recovery that is better than full recovery.

WAL allows tuple-level granularity for recovery (ignoring FPIs for a moment). The modified-block information extracted from the WAL provides block-level incremental recovery even when the WAL itself is no longer available.

So, pg_basebackup (or some other tool, or just an option to it) and pg_rewind can leverage the changed blocks to perform recovery when the WAL can't be retained due to space constraints.

pg_rewind can also be optimized: it currently copies from source to target the blocks that were present in the target's WAL in order to rewind, and such blocks could easily be skipped from being copied again.

Depending on the pattern and volume of changes in the WAL, just copying over the changed blocks could prove more efficient than replaying all the WAL for incremental recovery. For example, if updates repeatedly touch the same hot blocks, a large volume of WAL may reference only a small set of distinct blocks, and copying each of those blocks once is cheaper than replaying every record.

It seems to me that there are basically two ways of storing this kind
of information, plus a bunch of variants.  One way is to store files
that cover a range of LSNs, and basically contain a synopsis of the
WAL for those LSNs.  You omit all the actual data and just mention
which blocks were changed by some record in that part of the WAL.  In
this type of scheme, the storage required is roughly proportional to
the volume of WAL for which you wish to retain data.  Pruning old data
is easy; just remove the files that provide information about LSNs
that you don't care about any more.  The other way is to store data
about each block, or each range of blocks, or all the blocks that hash
onto a certain slot; for each, store the newest LSN that has modified
that block, or a block in that range, or a block that hashes onto
that slot.  In this system, storage is roughly proportional to the
size of the database cluster, except maybe in the hashing case, but I
*think* that case will degrade unless you basically expand the map to
be roughly proportional to the size of the cluster anyway.  I may be
wrong.
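
For concreteness, here is a minimal sketch of what each layout's on-disk record might look like. These structs are invented purely for illustration (only XLogRecPtr mirrors a real PostgreSQL type, a 64-bit LSN); neither corresponds to anything in the tree.

#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* an LSN, as in PostgreSQL */

/*
 * Option 1: one file per LSN range, listing every block touched by
 * some record in that range.  Storage grows with WAL volume; pruning
 * is just deleting files whose ranges are no longer interesting.
 */
typedef struct BlockRefSummaryHeader
{
    XLogRecPtr  start_lsn;          /* first LSN this file covers */
    XLogRecPtr  end_lsn;            /* last LSN this file covers */
    uint32_t    nrefs;              /* number of block references below */
    /* followed by nrefs (dbOid, relNumber, forkNum, blockNum) entries */
} BlockRefSummaryHeader;

/*
 * Option 2: one slot per block (or block range, or hash bucket),
 * remembering the newest LSN that modified it.  Storage grows with
 * cluster size rather than WAL volume.
 */
typedef struct BlockLastLsnEntry
{
    uint32_t    dbOid;              /* database */
    uint32_t    relNumber;          /* relation */
    uint32_t    blockNum;           /* block, block range, or hash slot */
    XLogRecPtr  newest_lsn;         /* newest record that modified it */
} BlockLastLsnEntry;

The pruning asymmetry falls out directly: in the first layout, old data lives in whole files keyed by LSN range, while in the second it is interleaved across the entire map.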

Of these two variants, I am inclined to prefer the version where each
file is a summary of the block references within some range of LSNs.
It seems simpler to implement to me.  You just read a bunch of WAL
files and then when you get tired you stop and emit an output file.
You need to protect yourself against untimely crashes.  One way is to
stick a checksum into the output file.  After you finish writing it,
fsync() it before you start writing the next one.  After a restart,
read the latest such file and see if the checksum is OK.  If not,
regenerate it; if so, assume it's good and move on.  Files other than
the last one can be assumed good.  Another way is to create the file
with a temporary name, fsync() it, and then rename it into place and
fsync() again.  The background worker that generates the files can
have a GUC to remove them when they are older than some threshold
amount of time, or you can keep them forever and let the user manually
remove stuff they no longer want based on LSN.  That's pretty much it.

+1 for the first option. Seems simpler and straightforward.
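
For concreteness, here is a plain-POSIX sketch of the second crash-protection recipe described above: write under a temporary name, fsync() the file, rename() it into place, then fsync() the directory so the rename itself survives a crash. The function name, paths, and buffer handling are all hypothetical, error handling is abbreviated, and PostgreSQL's own file utilities would differ.

#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

static int
write_summary_atomically(const char *dir, const char *name,
                         const void *buf, size_t len)
{
    char        tmppath[1024];
    char        finalpath[1024];
    int         fd;

    snprintf(tmppath, sizeof(tmppath), "%s/%s.tmp", dir, name);
    snprintf(finalpath, sizeof(finalpath), "%s/%s", dir, name);

    /* write the summary under a temporary name */
    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len ||
        fsync(fd) != 0)                 /* make the contents durable */
    {
        close(fd);
        return -1;
    }
    close(fd);

    /* atomically swap the finished file into place */
    if (rename(tmppath, finalpath) != 0)
        return -1;

    /* fsync the directory so the rename survives a crash */
    fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

With the checksum variant instead, only the newest file would ever need verification after a restart, since each earlier file was fsync()ed before the next one was started.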
