Re: Mount options for Ext3?
| From | Kevin Brown | 
|---|---|
| Subject | Re: Mount options for Ext3? | 
| Date | |
| Msg-id | 20030125041319.GE28252@filer | 
| In response to | Re: Mount options for Ext3? (Tom Lane <tgl@sss.pgh.pa.us>) | 
| Responses | WAL replay logic (was Re: Mount options for Ext3?) | 
| List | pgsql-performance | 
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > I was presuming that when a savepoint occurs, a marker is written to
> > the log indicating which transactions had been committed to the data
> > files, and that this marker was paid attention to during database
> > startup.
>
> Not quite. The marker says that all datafile updates described by
> log entries before point X have been flushed to disk by the checkpoint
> --- and, therefore, if we need to restart we need only replay log
> entries occurring after the last checkpoint's point X.
>
> This has nothing directly to do with which transactions are committed
> or not committed. If we based checkpoint behavior on that, we'd need
> to maintain an indefinitely large amount of WAL log to cope with
> long-running transactions.

Ah. My apologies for my imprecise wording. I should have said
"...indicating which transactions had been written to the data files"
instead of "...had been committed to the data files", and meant to say
"checkpoint" but instead said "savepoint". I'll try to do better
here.

> The actual checkpoint algorithm is
>
>     take note of current logical end of WAL (this will be point X)
>     write() all dirty buffers in shared buffer arena
>     sync() to ensure that above writes, as well as previous ones,
>         are on disk
>     put checkpoint record referencing point X into WAL; write and
>         fsync WAL
>     update pg_control with new checkpoint record, fsync it
>
> Since pg_control is what's examined after restart, the checkpoint is
> effectively committed when the pg_control write hits disk. At any
> instant before that, a crash would result in replaying from the
> prior checkpoint's point X. The algorithm is correct if and only if
> the pg_control write hits disk after all the other writes mentioned.

[...]

> > So suppose the marker makes it to the log but not all of the data the
> > marker refers to makes it to the data files. Then the system crashes.
>
> I think that this analysis is not relevant to what we're doing.

Agreed. The context of that analysis is when synchronous writes by
the database are turned off and one is left to rely on the operating
system to do the right thing. Clearly it doesn't apply when
synchronous writes are enabled. As long as only one process handles a
checkpoint, an operating system that guarantees that a process' writes
are committed to disk in the same order that they were requested,
combined with a journalling filesystem that at least wrote all data
prior to committing the associated metadata transactions, would be
sufficient to guarantee the integrity of the database even if all
synchronous writes by the database were turned off. This would hold
even if the operating system reordered writes from multiple processes.
It suggests an operating system feature that could be considered
highly desirable (and relates to the discussion elsewhere about
trading off shared buffers against OS file cache: it's often better to
rely on the abilities of the OS rather than roll your own mechanism).

One question I have is: in the event of a crash, why not simply replay
all the transactions found in the WAL? Is the startup time of the
database that badly affected if pg_control is ignored?

If there exists somewhere a reasonably succinct description of the
reasoning behind the current transaction management scheme (including
an analysis of the pros and cons), I'd love to read it and quit
bugging you. :-)

-- 
Kevin Brown                                           kevin@sysexperts.com
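
The checkpoint sequence Tom describes, and the restart rule that follows from it, can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions, not PostgreSQL's implementation: `ToyDatabase`, its fields, and the `crash_before_control` flag are hypothetical names, the WAL is a Python list treated as always durable, and write(), sync() and fsync are collapsed into plain dictionary updates.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDatabase:
    """Toy model of WAL + checkpoint + restart (all names are hypothetical)."""
    wal: list = field(default_factory=list)        # log records, assumed durable
    buffers: dict = field(default_factory=dict)    # dirty pages held in memory
    datafiles: dict = field(default_factory=dict)  # pages that reached disk
    control_redo: int = 0                          # stand-in for pg_control: point X

    def update(self, key, value):
        # Log the change first, then modify the in-memory buffer.
        self.wal.append(("update", key, value))
        self.buffers[key] = value

    def checkpoint(self, crash_before_control=False):
        point_x = len(self.wal)                   # note current logical end of WAL
        self.datafiles.update(self.buffers)       # write dirty buffers and sync them
        self.buffers.clear()
        self.wal.append(("checkpoint", point_x))  # checkpoint record referencing X
        if crash_before_control:
            return                                # simulated crash: control file untouched
        self.control_redo = point_x               # last step: update the control file

    def crash_and_recover(self):
        # Memory is lost; replay only the log records at or after point X.
        self.buffers.clear()
        replayed = 0
        for record in self.wal[self.control_redo:]:
            if record[0] == "update":
                _, key, value = record
                self.datafiles[key] = value
                replayed += 1
        return replayed


db = ToyDatabase()
db.update("a", 1)
db.update("b", 2)
db.checkpoint()
db.update("a", 3)
print(db.crash_and_recover(), db.datafiles)
```

In this model a checkpoint only takes effect once `control_redo` is updated, which mirrors the point that the control-file write must reach disk after all the other writes. If the simulated crash happens just before that step, recovery starts from the prior point X and replays more records, but ends in the same state because replaying an update is idempotent here.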