Re: File based Incremental backup v8

From: Fujii Masao
Subject: Re: File based Incremental backup v8
Msg-id: CAHGQGwGH12XbMB9v2mum2Z0aKfkbVoVU=SLokfkjGaKFcEhTww@mail.gmail.com
In reply to: Re: File based Incremental backup v8  (Marco Nenciarini <marco.nenciarini@2ndquadrant.it>)
Responses: Re: File based Incremental backup v8  (Bruce Momjian <bruce@momjian.us>)
List: pgsql-hackers
On Thu, Mar 5, 2015 at 1:59 AM, Marco Nenciarini
<marco.nenciarini@2ndquadrant.it> wrote:
> Hi Fujii,
>
> Il 03/03/15 11:48, Fujii Masao ha scritto:
>> On Tue, Mar 3, 2015 at 12:36 AM, Marco Nenciarini
>> <marco.nenciarini@2ndquadrant.it> wrote:
>>> Il 02/03/15 14:21, Fujii Masao ha scritto:
>>>> On Thu, Feb 12, 2015 at 10:50 PM, Marco Nenciarini
>>>> <marco.nenciarini@2ndquadrant.it> wrote:
>>>>> Hi,
>>>>>
>>>>> I've attached an updated version of the patch.
>>>>
>>>> basebackup.c:1565: warning: format '%lld' expects type 'long long
>>>> int', but argument 8 has type '__off_t'
>>>> basebackup.c:1565: warning: format '%lld' expects type 'long long
>>>> int', but argument 8 has type '__off_t'
>>>> pg_basebackup.c:865: warning: ISO C90 forbids mixed declarations and code
>>>>
>>>
>>> I'll add an explicit cast at those two lines.
>>>
>>>> When I applied three patches and compiled the code, I got the above warnings.
>>>>
>>>> How can we get the full backup that we can use for the archive recovery, from
>>>> the first full backup and subsequent incremental backups? What commands should
>>>> we use for that, for example? It's better to document that.
>>>>
>>>
>>> I've sent a python PoC that supports the plain format only (not the tar one).
>>> I'm currently rewriting it in C (with also the tar support) and I'll send a new patch containing it ASAP.
>>
>> Yeah, if special tool is required for that purpose, the patch should include it.
>>
>
> I'm working on it. The interface will be exactly the same as that of the PoC script I attached to
>
> 54C7CDAD.6060900@2ndquadrant.it
>
>>>> What does "1" of the heading line in backup_profile mean?
>>>>
>>>
>>> Nothing. It's a version number. If you think it's misleading I will remove it.
>>
>> A version number of file format of backup profile? If it's required for
>> the validation of backup profile file as a safe-guard, it should be included
>> in the profile file. For example, it might be useful to check whether
>> pg_basebackup executable is compatible with the "source" backup that
>> you specify. But more info might be needed for such validation.
>>
>
> The current implementation bails out with an error if the header line is
> different from what it expects. It also reports an error if the 2nd line is
> not the start WAL location. That's all that pg_basebackup needs to start a
> new incremental backup. All the other information is useful to reconstruct a
> full backup in case of an incremental backup, or maybe to check the
> completeness of an archived full backup.
> Initially the profile was present only in incremental backups, but after
> some discussion on list we agreed to always write it.

Don't we need more checks on the compatibility of the backup-target database
cluster and the source incremental backup? Without such checks, I'm afraid
we can easily get corrupted incremental backups. For example, shouldn't
pg_basebackup emit an error if the target and source have different system IDs,
like walreceiver does? What happens if the timeline ID is different between the
source and target? What happens if the source was taken from the standby but
the new incremental backup will be taken from the master? Do we need to check them?
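
To make the question concrete, here is a minimal Python sketch of the kind of
safeguard being asked about. It is not taken from the patch: the ClusterInfo
record, its field names, and the decision to reject a timeline mismatch
outright are all assumptions made for illustration. Whether a differing
timeline should really be an error is exactly the open question above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterInfo:
    """Hypothetical identity record (field names are illustrative only)."""
    system_id: int    # database system identifier of the cluster
    timeline_id: int  # timeline the backup was (or will be) taken on


def check_compatible(source: ClusterInfo, target: ClusterInfo) -> None:
    """Raise ValueError if an incremental backup should not be attempted.

    'source' describes the earlier backup, 'target' the running cluster.
    """
    if source.system_id != target.system_id:
        raise ValueError(
            "system identifier differs between source backup and target cluster"
        )
    # Assumption: conservatively refuse when the timelines differ.
    if source.timeline_id != target.timeline_id:
        raise ValueError(
            "timeline ID differs between source backup and target cluster"
        )
```

A caller would run check_compatible() before sending any file and abort the
backup on ValueError, so a mismatch never produces a partially written
incremental backup.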

>>>> Sorry if this has been already discussed so far. Why is a backup profile file
>>>> necessary? Maybe it's necessary in the future, but currently seems not.
>>>
>>> It's necessary because it's the only way to detect deleted files.
>>
>> Maybe I'm missing something. Seems we can detect that even without a profile.
>> For example, please imagine the case where the file has been deleted since
>> the last full backup and then the incremental backup is taken. In this case,
>> that deleted file exists only in the full backup. We can detect the deletion of
>> the file by checking both full and incremental backups.
>>
>
> When you take an incremental backup, only changed files are sent. Without
> the backup_profile in the incremental backup, you cannot detect a deleted
> file, because it's indistinguishable from a file that is not changed.

Yeah, you're right!
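
The point can be sketched in a few lines of Python. This is only an
illustration under assumptions: the function name and the idea of passing
plain sets of relative paths are hypothetical, and the real backup_profile
carries more than file names.

```python
from __future__ import annotations


def classify_files(
    full_files: set[str],
    incr_profile: set[str],
    incr_sent: set[str],
) -> dict[str, set[str]]:
    """Classify files when merging a full and an incremental backup.

    full_files   -- paths present in the full backup
    incr_profile -- paths listed in the incremental backup's profile,
                    i.e. every file that existed at incremental time
    incr_sent    -- paths actually shipped in the incremental backup
    """
    return {
        # In the full backup but absent from the profile: deleted since then.
        "deleted": full_files - incr_profile,
        # Shipped in the incremental backup: changed or newly created.
        "changed_or_new": set(incr_sent),
        # Listed in the profile but not shipped: take it from the full backup.
        "unchanged": (full_files & incr_profile) - incr_sent,
    }
```

Without incr_profile, only full_files and incr_sent are known, so the
"deleted" and "unchanged" sets cannot be told apart.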

>>>> Have we really reached consensus on the current design, especially that
>>>> basically every file needs to be read to check whether it has been modified
>>>> since the last backup, even when *no* modification has happened since then?
>>>
>>> The real problem here is that there is currently no way to detect that a
>>> file is not changed since the last backup. We agreed to not use file system
>>> timestamps as they are not reliable for that purpose.
>>
>> TBH I prefer timestamp-based approach in the first version of incremental backup
>> even if it's less reliable than LSN-based one. I think that some users who are
>> using timestamp-based rsync (i.e., default mode) for the backup would be
>> satisfied with timestamp-based one.
>
> The original design was to compare size+timestamp+checksums (only if
> everything else matches and the file has been modified after the start of
> the backup), but the feedback from the list was that we cannot trust the
> filesystem mtime and we must use LSN instead.
>
>>
>>> Using LSN has a significant advantage over using checksums, as we can start
>>> the full copy as soon as we find a block with an LSN greater than the
>>> threshold.
>>> There are two cases: 1) the file is changed, so we can assume that we detect
>>> it after reading 50% of the file, then we send it taking advantage of the
>>> file system cache; 2) the file is not changed, so we read it without sending
>>> anything.
>>> It will end up producing I/O comparable to a normal backup.
>>
>> Yeah, it might make the situation better than today. But I'm afraid that
>> many users might get disappointed about that behavior of an incremental
>> backup after the release...
>
> I don't get what you mean here. Can you elaborate on this point?

The proposed version of LSN-based incremental backup has some limitations
which may disappoint users (e.g., every database file needs to be read even
when there has been no modification in the database since the last backup,
which may make the backup take longer than users expect). So I'm afraid that
the set of users who can benefit from the feature might be very limited. IOW,
I'm just sticking to the idea of the timestamp-based one :) But I should drop
it if the majority on the list prefers the LSN-based one even with such
limitations.
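
The read cost described above can be seen in a small Python sketch of an
LSN-based scan. It is not the patch's code: the function names are
hypothetical, the default 8192-byte block size is assumed, and it applies only
to files that use the standard page layout, where pd_lsn occupies the first
8 bytes of each page header as two native-endian 32-bit values.

```python
import struct

BLCKSZ = 8192  # assumed: PostgreSQL's default block size


def page_lsn(page: bytes) -> int:
    """Return the page's pd_lsn as a 64-bit integer."""
    xlogid, xrecoff = struct.unpack_from("=II", page, 0)
    return (xlogid << 32) | xrecoff


def file_changed_since(path: str, threshold_lsn: int) -> tuple[bool, int]:
    """Scan a file block by block; return (changed, blocks_read).

    Stops at the first block whose LSN exceeds the threshold. An unchanged
    file is always read to the end.
    """
    blocks_read = 0
    with open(path, "rb") as f:
        while True:
            page = f.read(BLCKSZ)
            if len(page) < BLCKSZ:
                break  # end of file; a trailing partial block is ignored here
            blocks_read += 1
            if page_lsn(page) > threshold_lsn:
                return True, blocks_read
    return False, blocks_read
```

The early return helps only for changed files. For an unchanged file,
blocks_read equals the file's block count, which is the limitation discussed
above.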

Regards,

--
Fujii Masao


