Re: Proposed WAL changes

From: Tom Lane
Subject: Re: Proposed WAL changes
Msg-id: 28407.984079830@sss.pgh.pa.us
In reply to: RE: Proposed WAL changes ("Mikheev, Vadim" <vmikheev@SECTORBASE.COM>)
List: pgsql-hackers

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> No, but I want a system that's not brittle. You seem to be content to
>> design a system that is reliable as long as the WAL log is OK 
>> but loses the entire database unrecoverably as soon as one bit goes bad
>> in the log.

> I don't see how absence of old checkpoint forces losing entire database.

As the code stood last week that's what would happen, because the system
would not restart unless pg_control pointed to a valid checkpoint
record.  I addressed that in a way that seemed good to me.

Now, from what you've said in this conversation you would rather have
the system scan XLOG to decide where to replay from if it cannot read
the last checkpoint record.  That would be okay with me, but even with
that approach I do not think it's safe to truncate the log to nothing
as soon as we've written a checkpoint record.  I want to see a reasonable
amount of log data there at all times.  I don't insist that "reasonable
amount" necessarily means "back to the prior checkpoint" --- but that's
a simple and easy-to-implement interpretation.
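
The "back to the prior checkpoint" interpretation can be sketched as a rule for choosing the oldest log segment to retain. This is only an illustration: the segment size, the CheckpointPositions struct, and the function name are hypothetical and are not taken from the PostgreSQL source.

```c
#include <stdint.h>

/* Hypothetical log segment size; the real value is not stated here. */
#define SEG_SIZE (16u * 1024u * 1024u)

/* Byte offsets in the log of the two most recent checkpoint records. */
typedef struct {
    uint64_t prior_ckpt_pos;   /* checkpoint before the latest one */
    uint64_t latest_ckpt_pos;  /* checkpoint just written */
} CheckpointPositions;

/*
 * Return the number of the oldest segment that must be kept so that the
 * log still reaches back to the prior checkpoint. Segments with a lower
 * number could be removed under this rule.
 */
static uint64_t oldest_segment_to_keep(const CheckpointPositions *c)
{
    return c->prior_ckpt_pos / SEG_SIZE;
}
```

Truncating at latest_ckpt_pos instead would correspond to the "truncate the log to nothing" behaviour argued against above.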

> You probably will get better consistency by re-applying modifications
> which supposed to be in data files already but it seems questionable
> to me.

It's not a guarantee, no, but it gives you a better probability of
recovering recent changes when things are hosed.

BTW, can we really trust checkpoint to mean that all data file changes
are down on disk?  I see that the actual implementation of checkpoint is
    write out all dirty shmem buffers;
    sync();
    if (IsUnderPostmaster)
        sleep(2);
    sync();
    write checkpoint record to XLOG;
    fsync XLOG;

Now HP's man page for sync() says

    The writing, although scheduled, is not necessarily complete upon
    return from sync.

I can assure you that 2 seconds is nowhere near enough to ensure that a
sync is complete on my workstation... and I doubt that "scheduled" means
"guaranteed to complete before any subsequently-requested I/O is done".
I think it's entirely possible that the checkpoint record will hit the
disk before the last heap buffer does.

Therefore, even without considering disk drive write reordering, I do
not believe that a checkpoint guarantees very much, and so I think it's
pretty foolish to delete the preceding XLOG data immediately afterwards.
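
To illustrate the ordering concern, here is a minimal sketch, assuming a POSIX system where fsync() on a file descriptor does not return until the write has completed or an error is detected. It is not the checkpoint code discussed above: the helper names write_durably and checkpoint_sketch, the file paths passed to them, and the record contents are all hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Write len bytes of buf to path and return 0 only after fsync() has
 * reported success for that file. Returns -1 on any failure.
 */
static int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {            /* sketch: any write error is fatal */
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t) n;
    }

    if (fsync(fd) != 0) {       /* wait for this file, unlike sync() */
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Hypothetical ordering: the data file is synced first, and the
 * checkpoint record is written only if that sync succeeded.
 */
static int checkpoint_sketch(const char *data_path, const char *log_path)
{
    static const char page[] = "heap page contents";
    static const char rec[] = "checkpoint record";

    if (write_durably(data_path, page, sizeof page - 1) != 0)
        return -1;
    return write_durably(log_path, rec, sizeof rec - 1);
}
```

Even with this ordering, the drive's own write reordering mentioned above is outside what the sketch can control.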


>> Perhaps the checkpoint creation rule should be "every M seconds *or*
>> every N megabytes of log, whichever comes first".

> I like this! Regardless usability of keeping older checkpoint (especially
> in future, with log archiving) your rule is worth in any case.

Okay, I'll see if I can do something with this idea.
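
The "every M seconds *or* every N megabytes of log, whichever comes first" rule could be sketched roughly as follows. The CheckpointPolicy struct, its field names, and checkpoint_due are hypothetical names chosen for illustration; the sketch says nothing about how the actual patch implements the idea.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical tunables corresponding to M and N in the rule. */
typedef struct {
    double   max_interval_secs;  /* M: seconds between checkpoints */
    uint64_t max_log_bytes;      /* N: log volume limit, in bytes */
} CheckpointPolicy;

/*
 * Return true when a checkpoint is due: either enough time has passed
 * since the last checkpoint, or enough log has been written since then.
 */
static bool checkpoint_due(const CheckpointPolicy *p,
                           time_t last_ckpt, time_t now,
                           uint64_t log_bytes_since_ckpt)
{
    if (difftime(now, last_ckpt) >= p->max_interval_secs)
        return true;
    if (log_bytes_since_ckpt >= p->max_log_bytes)
        return true;
    return false;
}
```

A caller would evaluate checkpoint_due periodically and after each log write, resetting both counters whenever a checkpoint completes.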

Other than what we've discussed, do you have any comments/objections to
my proposed patch?  I've been holding off committing it so that you have
time to review it...
        regards, tom lane

