WAL write of full pages

From: Bruce Momjian
Subject: WAL write of full pages
Msg-id: 200403151922.i2FJMWb18195@candle.pha.pa.us
Replies: Re: WAL write of full pages  (Shridhar Daithankar <shridhar@frodo.hserus.net>)
         Re: WAL write of full pages  (Dennis Haney <davh@diku.dk>)
List: pgsql-hackers

Our current WAL implementation writes copies of full pages to WAL before
modifying the page on disk.  This is done to protect against pages being
corrupted by a partial write if the operating system crashes during a
page write.

For example, suppose an 8k block is being written to a heap file.  
First the backend issues a write(), which copies the page into the
kernel buffer cache.  Later, the kernel sends the write request to the
drive.  Even if the file system uses 8k blocks, the disk is typically
made up of 512-byte sectors, so the OS translates the 8k block into a
run of contiguous disk sectors, in this case 16.  There is no guarantee
that all 16 sectors will be written: perhaps 8 are written and then the
system crashes, or perhaps part of a 512-byte sector is written and the
remainder is left unchanged.  In all these cases,
restarting the system will yield corrupt heap blocks.
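
To make the failure concrete, here is a minimal Python sketch of a torn
write.  It is only an illustration, not PostgreSQL code: the function
name torn_write is made up, and it assumes sectors reach the disk in
order from the start of the page, which a real drive does not promise.

```python
SECTOR_SIZE = 512
PAGE_SIZE = 8192
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE  # 16 sectors per 8k page


def torn_write(old_page: bytes, new_page: bytes, sectors_written: int) -> bytes:
    """Return what is on disk if only the first `sectors_written` sectors
    of new_page landed before the crash (hypothetical helper)."""
    if len(old_page) != PAGE_SIZE or len(new_page) != PAGE_SIZE:
        raise ValueError("pages must be exactly PAGE_SIZE bytes")
    if not 0 <= sectors_written <= SECTORS_PER_PAGE:
        raise ValueError("sectors_written out of range")
    cut = sectors_written * SECTOR_SIZE
    # New contents up to the cut, stale contents after it.
    return new_page[:cut] + old_page[cut:]


old_page = b"A" * PAGE_SIZE
new_page = b"B" * PAGE_SIZE

# Crash after 8 of the 16 sectors were written.
on_disk = torn_write(old_page, new_page, 8)

# The block is neither the old version nor the new one.
is_torn = on_disk != old_page and on_disk != new_page
```

The resulting block is half new and half old, so neither version of the
page can be trusted after restart.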

The WAL writes copies of full pages so that on restore, it can check
each page to make sure it hasn't been corrupted.  The system records an
LSN (log sequence number) on every page.  When a page is modified, its
pre-change image is written to WAL, but not fsync'ed.  Later, if a
backend wants to write a page, it must make sure the LSN of the page is
between the LSN of the last checkpoint and the LSN of the last fsync by
a committed transaction.  Only in those cases can the page be written,
because we are sure that a copy of the page is in the WAL in case there
is a partial write.
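
The rule above can be written down as a small predicate.  This is a
literal rendering of the description in this message, not the actual
backend code: LSNs are modelled as plain integers, the name
page_write_allowed is hypothetical, and treating both bounds as
inclusive is an assumption.

```python
def page_write_allowed(page_lsn: int, checkpoint_lsn: int, flushed_lsn: int) -> bool:
    """True if the page's LSN lies between the last checkpoint's LSN and
    the LSN of the last WAL fsync, as described above (hypothetical helper)."""
    return checkpoint_lsn <= page_lsn <= flushed_lsn


checkpoint_lsn = 1000   # LSN of the last checkpoint
flushed_lsn = 1500      # LSN of the last fsync by a committed transaction

# The page image is in WAL that has already been fsync'ed.
safe = page_write_allowed(1200, checkpoint_lsn, flushed_lsn)

# The page image is in WAL that has not reached disk yet.
not_yet = page_write_allowed(1600, checkpoint_lsn, flushed_lsn)
```

In the second case the WAL holding the page image is not yet on disk, so
a partial write of the page could not be repaired from it.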

Now, as you can imagine, these WAL page writes take up a considerable
amount of space in the WAL, and cause slowness, but no one has come up
with a way to recover from partial page writes without them.  The only
way to minimize page writes is to increase checkpoint_segments and
checkpoint_timeout so that checkpoints are less frequent, and pages have
to be written fewer times to the WAL because old copies of the pages
remain in WAL longer.
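
A toy model shows why fewer checkpoints mean fewer page copies.  It
assumes a page needs a new full copy in WAL only the first time it is
modified after a checkpoint, and it uses an abstract time unit instead
of the real checkpoint_segments and checkpoint_timeout settings.  The
function name and the workload are invented for illustration.

```python
def count_full_page_images(modifications, checkpoint_interval):
    """Count full-page copies written to WAL.

    modifications: iterable of (time, page_id) pairs sorted by time.
    checkpoint_interval: time units between checkpoints (toy stand-in
    for the effect of checkpoint_segments / checkpoint_timeout).
    """
    if checkpoint_interval <= 0:
        raise ValueError("checkpoint_interval must be positive")
    imaged = set()        # pages already copied since the last checkpoint
    current_cycle = None
    count = 0
    for t, page_id in modifications:
        cycle = t // checkpoint_interval
        if cycle != current_cycle:
            # A checkpoint happened: older copies no longer count.
            imaged.clear()
            current_cycle = cycle
        if page_id not in imaged:
            imaged.add(page_id)
            count += 1
    return count


# Four hot pages, each modified once per time unit for 60 units.
mods = [(t, p) for t in range(60) for p in range(4)]

frequent = count_full_page_images(mods, 5)     # checkpoint every 5 units
infrequent = count_full_page_images(mods, 30)  # checkpoint every 30 units
```

With this workload, checkpointing every 5 units produces 48 page copies
while checkpointing every 30 units produces 8, at the price of keeping
more WAL around and a longer recovery.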

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073