Thread: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Michael Paquier
Date:
Hi all,

As of today, the replication protocol has a command called BASE_BACKUP
that allows a client connected through the replication protocol to
retrieve a full backup from the server through a connection stream. Its
current options are described here:
http://www.postgresql.org/docs/9.3/static/protocol-replication.html

This command puts the server in backup mode by calling
do_pg_start_backup, sends the backup, and then finalizes it with
do_pg_stop_backup. Thanks to that, it is also possible to take
backups even from standby nodes, as the stream contains the
backup_label file necessary for recovery. The full backup is sent in
tar format for obvious performance reasons, to limit the amount of data
sent through the stream, and the server contains the code necessary to
send the data in the correct format. This also forces the client to
perform some decoding if the received base backup needs to be
analyzed on the fly, doing something similar to what pg_basebackup
does now when the backup format is plain.

I would like to propose the following things to extend BASE_BACKUP to
retrieve a backup from a stream:
- Addition of an option FORMAT to control the output format of the
backup, with possible values 'plain' and 'tar'. The default is tar for
backward compatibility. The purpose of this option is to make it
easier for backup tools working with postgres to retrieve a backup
and analyze it on the fly, filtering and analyzing the data while it
is being received without all the tar decoding that is otherwise
necessary, which more or less amounts to copying portions of the
pg_basebackup code.
- Addition of an option called INCREMENTAL to send an incremental
backup to the client. This option takes an LSN as input and sends back
to the client the relation pages (in the shape of reduced relation
files) that are newer than the specified LSN, determined by looking at
pd_lsn in PageHeaderData. In this case the LSN needs to be determined
by the client based on the latest full backup taken. This option is
particularly interesting for reducing the amount of data taken between
two backups, even if it increases the restore time, as the client needs
to reconstitute a base backup depending on the recovery target and the
pages modified. The client would be in charge of rebuilding pages from
the incremental backup by scanning all the blocks that need to be
updated relative to the full backup, as the LSN from which the
incremental backup is taken is known. But this is not really something
the server cares about... Such things are actually done by pg_rman as
well.
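
A minimal sketch of the page selection that INCREMENTAL describes,
assuming the default 8192-byte block size, a little-endian machine, and
pd_lsn stored in the first 8 bytes of each page as two 32-bit halves
(high, then low). The function names are hypothetical and exist only
for illustration; this is not code from the server or from pg_rman.

```python
import struct

BLCKSZ = 8192  # default PostgreSQL block size; assumed here


def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' in hexadecimal into an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def page_lsn(page):
    """Read pd_lsn from the start of the page header.

    It is stored as two 32-bit halves (high, then low); little-endian
    byte order is an assumption of this sketch.
    """
    hi, lo = struct.unpack_from("<II", page, 0)
    return (hi << 32) | lo


def changed_blocks(relation_bytes, threshold_lsn):
    """Yield (block_number, page) for pages whose pd_lsn is newer than the threshold."""
    for offset in range(0, len(relation_bytes), BLCKSZ):
        page = relation_bytes[offset:offset + BLCKSZ]
        if page_lsn(page) > threshold_lsn:
            yield offset // BLCKSZ, page
```

Note that every page of the relation is still read; only the pages
that pass the test would be sent to the client.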

As a next step, I would imagine that pg_basebackup could be extended
to take incremental backups as well. Having another tool able to
rebuild backups based on a full backup with incremental information
would be nice as well.
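
A rough sketch of what such a rebuild tool might do for a single
relation file, assuming the incremental data arrives as (block number,
page) pairs and the default 8192-byte block size. Both the pair
representation and the function name are hypothetical; the proposal
does not define the on-wire layout of a "reduced relation file".

```python
BLCKSZ = 8192  # default PostgreSQL block size; assumed here


def apply_incremental(full_relation, changed):
    """Return a copy of a relation file with the changed blocks overlaid.

    `changed` is an iterable of (block_number, page) pairs, a hypothetical
    representation of the reduced relation file from the proposal.
    """
    rebuilt = bytearray(full_relation)
    for block_number, page in changed:
        start = block_number * BLCKSZ
        end = start + BLCKSZ
        if end > len(rebuilt):
            # The relation grew since the full backup: pad with zero pages.
            rebuilt.extend(bytes(end - len(rebuilt)))
        rebuilt[start:end] = page
    return bytes(rebuilt)
```

The sketch does not handle relations that were truncated or dropped
after the full backup; a real tool would need extra information for
those cases.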

This is of course not material for 9.4; for now I would just like to
take the temperature on such things and gather opinions...

Regards
-- 
Michael



Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Andres Freund
Date:
Hi,

On 2014-01-14 21:47:43 +0900, Michael Paquier wrote:
> I would like to propose the following things to extend BASE_BACKUP to
> retrieve a backup from a stream:
> - Addition of an option FORMAT, to control the output format of
> backup, with possible options as 'plain' and 'tar'. Default is tar for
> backward compatibility purposes. The purpose of this option is to make
> easier for backup tools playing with postgres to retrieve and backup
> and analyze it on the fly, the purpose being to filter and analyze the
> data while it is being received without all the tar decoding
> necessary, what would consist in copying portions of pg_basebackup
> code more or less.

We'd need our own serialization format since we're dealing with more
than one file, what would be the point?

> - Addition of an option called INCREMENTAL to send an incremental
> backup to the client. This option uses as input an LSN, and sends back
> to client relation pages (in the shape of reduced relation files) that
> are newer than the LSN specified by looking at pd_lsn of
> PageHeaderData. In this case the LSN needs to be determined by client
> based on the latest full backup taken. This option is particularly
> interesting to reduce the amount of data taken between two backups,
> even if it increases the restore time as client needs to reconstitute
> a base backup depending on the recovery target and the pages modified.
> Client would be in charge of rebuilding pages from incremental backup
> by scanning all the blocks that need to be updated based on the full
> backup as the LSN from which incremental backup is taken is known. But
> this is not really something the server cares about... Such things are
> actually done by pg_rman as well.

Why not just rely on WAL replay since you're relying on the consistency
of the standby anyway?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Heikki Linnakangas
Date:
On 01/14/2014 02:47 PM, Michael Paquier wrote:
> I would like to propose the following things to extend BASE_BACKUP to
> retrieve a backup from a stream:
> - Addition of an option FORMAT, to control the output format of
> backup, with possible options as 'plain' and 'tar'. Default is tar for
> backward compatibility purposes. The purpose of this option is to make
> easier for backup tools playing with postgres to retrieve and backup
> and analyze it on the fly, the purpose being to filter and analyze the
> data while it is being received without all the tar decoding
> necessary, what would consist in copying portions of pg_basebackup
> code more or less.

Umm, you have to somehow mark in the protocol where one file begins and 
another one ends. The 'tar' format seems perfectly OK for that purpose. 
What exactly would the 'plain' format do?

> - Addition of an option called INCREMENTAL to send an incremental
> backup to the client. This option uses as input an LSN, and sends back
> to client relation pages (in the shape of reduced relation files) that
> are newer than the LSN specified by looking at pd_lsn of
> PageHeaderData. In this case the LSN needs to be determined by client
> based on the latest full backup taken. This option is particularly
> interesting to reduce the amount of data taken between two backups,
> even if it increases the restore time as client needs to reconstitute
> a base backup depending on the recovery target and the pages modified.
> Client would be in charge of rebuilding pages from incremental backup
> by scanning all the blocks that need to be updated based on the full
> backup as the LSN from which incremental backup is taken is known. But
> this is not really something the server cares about... Such things are
> actually done by pg_rman as well.

How does the server find all the pages with LSN > the threshold? If it 
needs to scan the whole database, it's not all that useful. I guess it 
would be better than nothing, but I think you might as well just use rsync.

- Heikki



Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Magnus Hagander
Date:
On Tue, Jan 14, 2014 at 1:47 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Hi all,
>
> As of today, replication protocol has a command called BASE_BACKUP to
> allow a client connecting with the replication protocol to retrieve a
> full backup from server through a connection stream. The description
> of its current options are here:
> http://www.postgresql.org/docs/9.3/static/protocol-replication.html
>
> This command is in charge to put the server in start backup by using
> do_pg_start_backup, then it sends the backup, and finalizes the backup
> with do_pg_stop_backup. Thanks to that it is as well possible to get
> backups from even standby nodes as the stream contains the
> backup_label file necessary for recovery. Full backup is sent in tar
> format for obvious performance reasons to limit the amount of data
> sent through the stream, and server contains necessary coding to send
> the data in correct format. This forces the client as well to perform
> some decoding if the output of the base backup received needs to be
> analyzed on the fly but doing something similar to what now
> pg_basebackup does when the backup format is plain.
>
> I would like to propose the following things to extend BASE_BACKUP to
> retrieve a backup from a stream:
> - Addition of an option FORMAT, to control the output format of
> backup, with possible options as 'plain' and 'tar'. Default is tar for
> backward compatibility purposes. The purpose of this option is to make
> easier for backup tools playing with postgres to retrieve and backup
> and analyze it on the fly, the purpose being to filter and analyze the
> data while it is being received without all the tar decoding
> necessary, what would consist in copying portions of pg_basebackup
> code more or less.

How would this be different/better than the tar format? pg_basebackup already does this analysis, for example, when it comes to recovery.conf. 
The tar format is really easy to analyze as a stream, that's one of the reasons we picked it...
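
As an illustration of reading a tar stream on the fly, here is a small
sketch using Python's standard tarfile module in forward-only mode. It
assumes the stream has already been wrapped in a file-like object; how
that object is obtained from a replication connection is outside the
sketch, and the function name is made up for the example.

```python
import tarfile


def scan_backup_stream(stream):
    """Yield (name, size, data) for each regular file in a tar stream.

    Mode "r|" reads strictly forward, so it also works on a
    non-seekable source such as a socket wrapped in a file object.
    """
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        for member in archive:
            if member.isfile():
                # In stream mode the data must be read before moving on.
                data = archive.extractfile(member).read()
                yield member.name, member.size, data
```

Each member header carries the file name and size, which is what marks
where one file ends and the next begins.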

 
> - Addition of an option called INCREMENTAL to send an incremental
> backup to the client. This option uses as input an LSN, and sends back
> to client relation pages (in the shape of reduced relation files) that
> are newer than the LSN specified by looking at pd_lsn of
> PageHeaderData. In this case the LSN needs to be determined by client
> based on the latest full backup taken. This option is particularly
> interesting to reduce the amount of data taken between two backups,
> even if it increases the restore time as client needs to reconstitute
> a base backup depending on the recovery target and the pages modified.
> Client would be in charge of rebuilding pages from incremental backup
> by scanning all the blocks that need to be updated based on the full
> backup as the LSN from which incremental backup is taken is known. But
> this is not really something the server cares about... Such things are
> actually done by pg_rman as well.

This sounds a lot like DIFFERENTIAL in other databases? Or I guess it's the same underlying technology, depending only on if you go back to the full base backup, or to the last incremental one.

But if you look at the terms otherwise, I think incremental often refers to what we call WAL.

Either way - if we can do this in a safe way, it sounds like a good idea. It would be sort of like rsync, except relying on the fact that we can look at the LSN and don't have to compare the actual files, right?


> As a next step, I would imagine that pg_basebackup could be extended
> to take incremental backups as well. Having another tool able to
> rebuild backups based on a full backup with incremental information
> would be nice as well.

I would say those are requirements, not just next step and "nice as well" :)


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Michael Paquier
Date:
On Tue, Jan 14, 2014 at 10:01 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> - Addition of an option called INCREMENTAL to send an incremental
>> backup to the client. This option uses as input an LSN, and sends back
>> to client relation pages (in the shape of reduced relation files) that
>> are newer than the LSN specified by looking at pd_lsn of
>> PageHeaderData. In this case the LSN needs to be determined by client
>> based on the latest full backup taken. This option is particularly
>> interesting to reduce the amount of data taken between two backups,
>> even if it increases the restore time as client needs to reconstitute
>> a base backup depending on the recovery target and the pages modified.
>> Client would be in charge of rebuilding pages from incremental backup
>> by scanning all the blocks that need to be updated based on the full
>> backup as the LSN from which incremental backup is taken is known. But
>> this is not really something the server cares about... Such things are
>> actually done by pg_rman as well.
>
>
> How does the server find all the pages with LSN > the threshold? If it needs
> to scan the whole database, it's not all that useful. I guess it would be
> better than nothing, but I think you might as well just use rsync.
Yes, it would be necessary to scan the whole database as the LSN to be
checked is kept in PageHeaderData :). Perhaps it is not that
performant, but my initial thought was that perhaps the amount of data
necessary to maintain incremental backups could balance with the
amount of WAL necessary to keep and limit the whole amount on disk.
-- 
Michael



Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Andres Freund
Date:
On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
> Either way - if we can do this in a safe way, it sounds like a good idea.
> It would be sort of like rsync, except relying on the fact that we can look
> at the LSN and don't have to compare the actual files, right?

Which is an advantage, yes. On the other hand, it doesn't fix problems
with a subtly broken replica, e.g. after a bug in replay, or disk
corruption.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Michael Paquier
Date:
On Tue, Jan 14, 2014 at 10:12 PM, Magnus Hagander <magnus@hagander.net> wrote:
> On Tue, Jan 14, 2014 at 1:47 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> As of today, replication protocol has a command called BASE_BACKUP to
>> allow a client connecting with the replication protocol to retrieve a
>> full backup from server through a connection stream. The description
>> of its current options are here:
>> http://www.postgresql.org/docs/9.3/static/protocol-replication.html
>>
>> This command is in charge to put the server in start backup by using
>> do_pg_start_backup, then it sends the backup, and finalizes the backup
>> with do_pg_stop_backup. Thanks to that it is as well possible to get
>> backups from even standby nodes as the stream contains the
>> backup_label file necessary for recovery. Full backup is sent in tar
>> format for obvious performance reasons to limit the amount of data
>> sent through the stream, and server contains necessary coding to send
>> the data in correct format. This forces the client as well to perform
>> some decoding if the output of the base backup received needs to be
>> analyzed on the fly but doing something similar to what now
>> pg_basebackup does when the backup format is plain.
>>
>> I would like to propose the following things to extend BASE_BACKUP to
>> retrieve a backup from a stream:
>> - Addition of an option FORMAT, to control the output format of
>> backup, with possible options as 'plain' and 'tar'. Default is tar for
>> backward compatibility purposes. The purpose of this option is to make
>> easier for backup tools playing with postgres to retrieve and backup
>> and analyze it on the fly, the purpose being to filter and analyze the
>> data while it is being received without all the tar decoding
>> necessary, what would consist in copying portions of pg_basebackup
>> code more or less.
>
>
> How would this be different/better than the tar format? pg_basebackup
> already does this analysis, for example, when it comes to recovery.conf.
> The tar format is really easy to analyze as a stream, that's one of the
> reasons we picked it...
>
>
>>
>> - Addition of an option called INCREMENTAL to send an incremental
>>
>> backup to the client. This option uses as input an LSN, and sends back
>> to client relation pages (in the shape of reduced relation files) that
>> are newer than the LSN specified by looking at pd_lsn of
>> PageHeaderData. In this case the LSN needs to be determined by client
>> based on the latest full backup taken. This option is particularly
>> interesting to reduce the amount of data taken between two backups,
>> even if it increases the restore time as client needs to reconstitute
>> a base backup depending on the recovery target and the pages modified.
>> Client would be in charge of rebuilding pages from incremental backup
>> by scanning all the blocks that need to be updated based on the full
>> backup as the LSN from which incremental backup is taken is known. But
>> this is not really something the server cares about... Such things are
>> actually done by pg_rman as well.
>
>
> This sounds a lot like DIFFERENTIAL in other databases? Or I guess it's the
> same underlying technology, depending only on if you go back to the full
> base backup, or to the last incremental one.
Yes, that's actually an LSN-based differential. I have had my head in
pg_rman for a couple of weeks, where a similar idea is called incremental.

>
> But if you look at the terms otherwise, I think incremental often refers to
> what we call WAL.
>
> Either way - if we can do this in a safe way, it sounds like a good idea. It
> would be sort of like rsync, except relying on the fact that we can look at
> the LSN and don't have to compare the actual files, right?
Yep, that's the idea.
-- 
Michael



Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Magnus Hagander
Date:
On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
> > Either way - if we can do this in a safe way, it sounds like a good idea.
> > It would be sort of like rsync, except relying on the fact that we can look
> > at the LSN and don't have to compare the actual files, right?
>
> Which is an advantage, yes. On the other hand, it doesn't fix problems
> with a subtly broken replica, e.g. after a bug in replay, or disk
> corruption.


Right. But neither does rsync, right? 


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Andres Freund
Date:
On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote:
> On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
> > > Either way - if we can do this in a safe way, it sounds like a good idea.
> > > It would be sort of like rsync, except relying on the fact that we can
> > look
> > > at the LSN and don't have to compare the actual files, right?
> >
> > Which is an advantage, yes. On the other hand, it doesn't fix problems
> > with a subtly broken replica, e.g. after a bug in replay, or disk
> > corruption.
> >
> >
> Right. But neither does rsync, right?

Hm? Rsync's really only safe with --checksum and with that it definitely
should fix those?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Magnus Hagander
Date:
On Tue, Jan 14, 2014 at 2:16 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Jan 14, 2014 at 10:01 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> >> - Addition of an option called INCREMENTAL to send an incremental
> >> backup to the client. This option uses as input an LSN, and sends back
> >> to client relation pages (in the shape of reduced relation files) that
> >> are newer than the LSN specified by looking at pd_lsn of
> >> PageHeaderData. In this case the LSN needs to be determined by client
> >> based on the latest full backup taken. This option is particularly
> >> interesting to reduce the amount of data taken between two backups,
> >> even if it increases the restore time as client needs to reconstitute
> >> a base backup depending on the recovery target and the pages modified.
> >> Client would be in charge of rebuilding pages from incremental backup
> >> by scanning all the blocks that need to be updated based on the full
> >> backup as the LSN from which incremental backup is taken is known. But
> >> this is not really something the server cares about... Such things are
> >> actually done by pg_rman as well.
> >
> >
> > How does the server find all the pages with LSN > the threshold? If it needs
> > to scan the whole database, it's not all that useful. I guess it would be
> > better than nothing, but I think you might as well just use rsync.
> Yes, it would be necessary to scan the whole database as the LSN to be
> checked is kept in PageHeaderData :). Perhaps it is not that
> performant, but my initial thought was that perhaps the amount of data
> necessary to maintain incremental backups could balance with the
> amount of WAL necessary to keep and limit the whole amount on disk.

It wouldn't be worse performance wise than a full backup. That one also has to read all the blocks after all... You're decreasing network traffic and client storage, with the same I/O on the server side. Seems worthwhile. 

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Magnus Hagander
Date:
On Tue, Jan 14, 2014 at 2:41 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote:
> > On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> > > On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
> > > > Either way - if we can do this in a safe way, it sounds like a good idea.
> > > > It would be sort of like rsync, except relying on the fact that we can
> > > look
> > > > at the LSN and don't have to compare the actual files, right?
> > >
> > > Which is an advantage, yes. On the other hand, it doesn't fix problems
> > > with a subtly broken replica, e.g. after a bug in replay, or disk
> > > corruption.
> > >
> > >
> > Right. But neither does rsync, right?
>
> Hm? Rsync's really only safe with --checksum and with that it definitely
> should fix those?


I think we're talking about different scenarios.

I thought you were talking about a backup taken from a replica, that already has corruption. rsync checksums surely aren't going to help with that? 


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Andres Freund
Date:
On 2014-01-14 14:42:36 +0100, Magnus Hagander wrote:
> On Tue, Jan 14, 2014 at 2:41 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote:
> > > On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund <andres@2ndquadrant.com
> > >wrote:
> > >
> > > > On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
> > > > > Either way - if we can do this in a safe way, it sounds like a good
> > idea.
> > > > > It would be sort of like rsync, except relying on the fact that we
> > can
> > > > look
> > > > > at the LSN and don't have to compare the actual files, right?
> > > >
> > > > Which is an advantage, yes. On the other hand, it doesn't fix problems
> > > > with a subtly broken replica, e.g. after a bug in replay, or disk
> > > > corruption.
> > > >
> > > >
> > > Right. But neither does rsync, right?
> >
> > Hm? Rsync's really only safe with --checksum and with that it definitely
> > should fix those?
> >
> >
> I think we're talking about difference scenarios.

Sounds like it.

> I thought you were talking about a backup taken from a replica, that
> already has corruption. rsync checksums surely aren't going to help with
> that?

I was talking about updating a standby using such an incremental or
differential backup from the primary (or a standby higher up in the
cascade). If your standby is corrupted in any way a rsync --checksum
will certainly correct errors if it syncs from a correct source?
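
To make the two scenarios concrete, the sketch below contrasts
selecting blocks by pd_lsn with comparing block contents by checksum.
It assumes 8192-byte pages, little-endian pd_lsn in the first 8 bytes,
and source and target files of equal length; the function names are
hypothetical. The content comparison only stands in for the idea
behind rsync --checksum and does not reproduce rsync's actual
algorithm.

```python
import hashlib
import struct

BLCKSZ = 8192  # default PostgreSQL block size; assumed here


def pages(relation):
    """Split a relation file image into (block_number, page) pairs."""
    for offset in range(0, len(relation), BLCKSZ):
        yield offset // BLCKSZ, relation[offset:offset + BLCKSZ]


def page_lsn(page):
    """pd_lsn: two 32-bit halves at the start of the page (little-endian assumed)."""
    hi, lo = struct.unpack_from("<II", page, 0)
    return (hi << 32) | lo


def blocks_newer_than(source, threshold_lsn):
    """Blocks an LSN-based incremental backup would send."""
    return [n for n, page in pages(source) if page_lsn(page) > threshold_lsn]


def blocks_with_different_content(source, target):
    """Blocks a content comparison would find, whatever their LSN says."""
    different = []
    for (n, src), (_, dst) in zip(pages(source), pages(target)):
        if hashlib.sha256(src).digest() != hashlib.sha256(dst).digest():
            different.append(n)
    return different
```

A block that is damaged on the target but whose LSN on the source is
older than the threshold shows up only in the second list.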

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Magnus Hagander
Date:
On Jan 14, 2014 2:44 PM, "Andres Freund" <andres@2ndquadrant.com> wrote:
>
> On 2014-01-14 14:42:36 +0100, Magnus Hagander wrote:
> > On Tue, Jan 14, 2014 at 2:41 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> > > On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote:
> > > > On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > >
> > > > > On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
> > > > > > Either way - if we can do this in a safe way, it sounds like a good
> > > idea.
> > > > > > It would be sort of like rsync, except relying on the fact that we
> > > can
> > > > > look
> > > > > > at the LSN and don't have to compare the actual files, right?
> > > > >
> > > > > Which is an advantage, yes. On the other hand, it doesn't fix problems
> > > > > with a subtly broken replica, e.g. after a bug in replay, or disk
> > > > > corruption.
> > > > >
> > > > >
> > > > Right. But neither does rsync, right?
> > >
> > > Hm? Rsync's really only safe with --checksum and with that it definitely
> > > should fix those?
> > >
> > >
> > I think we're talking about difference scenarios.
>
> Sounds like it.
>
> > I thought you were talking about a backup taken from a replica, that
> > already has corruption. rsync checksums surely aren't going to help with
> > that?
>
> I was talking about updating a standby using such an incremental or
> differential backup from the primary (or a standby higher up in the
> cascade). If your standby is corrupted in any way a rsync --checksum
> will certainly correct errors if it syncs from a correct source?

Sure, but as I understand it that's not at all the scenario that the
suggested functionality is for. You can still use rsync for that, I
don't think anybody suggested removing that ability. Replicas weren't
the target...

/Magnus

Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Jim Nasby
Date:
On 1/14/14, 7:41 AM, Magnus Hagander wrote:
>     Yes, it would be necessary to scan the whole database as the LSN to be
>     checked is kept in PageHeaderData :). Perhaps it is not that
>     performant, but my initial thought was that perhaps the amount of data
>     necessary to maintain incremental backups could balance with the
>     amount of WAL necessary to keep and limit the whole amount on disk.
>
>
> It wouldn't be worse performance wise than a full backup. That one also has to read all the blocks after all...
> You're decreasing network traffic and client storage, with the same I/O on the server side. Seems worthwhile.

If there's enough demand, it probably wouldn't be that hard to keep a copy of the page LSNs in a fork; you only need to
ensure that the LSN in the fork must be older than the LSN on disk could possibly be, and you wouldn't have to update
the fork every time.

BTW, an incremental backup could possibly be useful as a way to catch a streaming replica up that's fallen way behind.
The write IO would be sequential instead of trying to random-write while processing each WAL record.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Extending BASE_BACKUP in replication protocol: incremental backup and backup format

From
Heikki Linnakangas
Date:
On 01/15/2014 08:46 AM, Jim Nasby wrote:
> On 1/14/14, 7:41 AM, Magnus Hagander wrote:
>>     Yes, it would be necessary to scan the whole database as the LSN
>> to be
>>     checked is kept in PageHeaderData :). Perhaps it is not that
>>     performant, but my initial thought was that perhaps the amount of
>> data
>>     necessary to maintain incremental backups could balance with the
>>     amount of WAL necessary to keep and limit the whole amount on disk.
>>
>> It wouldn't be worse performance wise than a full backup. That one
>> also has to read all the blocks after all... You're decreasing network
>> traffic and client storage, with the same I/O on the server side.
>> Seems worthwhile.
>
> If there's enough demand, it probably wouldn't be that hard to keep a
> copy of the page LSNs in a fork; you only need to ensure that the LSN in
> the fork must be older than the LSN on disk could possibly be, and you
> wouldn't have to update the fork every time.

That's backwards. You need to ensure that the LSN in the fork >= that on 
disk. Otherwise the backup will incorrectly conclude that a page doesn't 
need to be backed up because it hasn't been modified.
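
A small sketch of why the direction of the inequality matters. It
models the hypothetical fork as a plain list holding one recorded LSN
per block, which is an assumption of the example and not an existing
PostgreSQL structure; the function names are likewise made up.

```python
def blocks_to_back_up(fork_lsns, threshold_lsn):
    """Select blocks whose recorded LSN is newer than the threshold."""
    return [n for n, lsn in enumerate(fork_lsns) if lsn > threshold_lsn]


def is_safe_summary(fork_lsns, page_lsns):
    """True if every recorded LSN is at least the LSN actually on the page."""
    return all(recorded >= actual for recorded, actual in zip(fork_lsns, page_lsns))


# Real page LSNs: block 1 and block 2 changed after the threshold of 200.
page_lsns = [100, 250, 300]

# A fork that lags behind for block 1 misses that block entirely.
stale_fork = [100, 150, 300]

# A fork that overestimates sends an extra block but never skips one.
conservative_fork = [300, 400, 300]
```

With a threshold of 200, the stale fork selects only block 2, while
the conservative fork selects all three blocks, including the two that
really changed.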

> BTW, an incremental backup could possibly be useful as a way to catch a
> streaming replica up that's fallen way behind. The write IO would be
> sequential instead of trying to random-write while processing each WAL
> record.

Yeah. And it would work even if you no longer have all the WAL files 
available.

- Heikki