Re: WAL write of full pages

From: Bruce Momjian
Subject: Re: WAL write of full pages
Msg-id: 200403161342.i2GDgr922671@candle.pha.pa.us
In reply to: Re: WAL write of full pages  (Shridhar Daithankar <shridhar@frodo.hserus.net>)
Responses: Re: WAL write of full pages  (Shridhar Daithankar <shridhar@frodo.hserus.net>)
           Some one deleted pg_database entry how to fix it?  (Dave Cramer <pg@fastcrypt.com>)
List: pgsql-hackers

Shridhar Daithankar wrote:
> Hi,
> 
> I was thinking the other way round. What if we write to WAL pages only those
> portions which we need to modify and let the kernel do the job the way it sees fit?
> What will happen if it fails?

So you are saying only write the part of the page that we modify?  I
think the kernel reads in the entire page, makes the modification, then
writes it.  However, we still don't know that our 1.5k of changes made it
onto the platters completely.

> > Our current WAL implementation writes copies of full pages to WAL before
> > modifying the page on disk.  This is done to prevent partial pages from
> > being corrupted in case the operating system crashes during a page
> > write.  
> 
> Assume a WAL page is zeroed at the start and a 128-byte block is later written to it.
> Then how exactly is writing 128 bytes different from writing an entire 8K page,
> especially when we control neither the kernel/buffer cache nor the disk?
> 
> What is partial? PostgreSQL will always flush an entire data block to a WAL page,
> won't it? If write returns, we can assume it is written.

If write returns, it means the data is in the kernel cache, not on the
disks.  Fsync is the only thing that forces it to disk, and it is slow.
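
The distinction can be sketched in a few lines. This is a minimal illustration, not PostgreSQL code: the function name durable_append and the file path are hypothetical, and only the standard os.write/os.fsync calls are used. After the write loop the data is known to be in the kernel cache only; the fsync call is what asks the kernel to push it to the storage device.

```python
import os


def durable_append(path: str, payload: bytes) -> int:
    """Append payload to path, then fsync. Returns the number of bytes written.

    Hypothetical helper for illustration; not part of PostgreSQL.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        written = 0
        # os.write() may write fewer bytes than requested, so loop until
        # the whole payload has been handed to the kernel.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        # At this point the data is only known to be in the kernel cache.
        # fsync() blocks until the kernel reports it flushed to the device,
        # which is why it is the slow step.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)
```

Skipping the os.fsync call would make the function return sooner, but a crash shortly afterwards could lose the appended bytes even though write() had returned.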

> > For example, suppose an 8k block is being written to a heap file.  
> > First the backend issues a write(), which copies the page into the
> > kernel buffer cache.  Later, the kernel sends the write request to the
> > drive. Even if the file system uses 8k blocks, the disk is typically
> > made up of 512-byte sectors, so the OS translates the 8k block into a
> > contiguous run of disk sectors, in this case 16.  There is no
> > guarantee that all 16 sectors will be written --- perhaps 8 could be
> > written, then the system crashes, or perhaps part of a 512-byte sector
> > is written, but the remainder left unchanged.  In all these cases,
> > restarting the system will yield corrupt heap blocks.
> 
> We are hoping to prevent WAL page corruption, which is part of file system
> corruption. Do we propose to tackle file system corruption in order to guarantee
> WAL integrity?

We assume the file system will come back with an xlog directory with
files in it because we fsync it.

> > The WAL writes copies of full pages so that on restore, it can check
> > each page to make sure it hasn't been corrupted.  The system records an
> > LSN (log sequence number) on every page.  When a page is modified, its
> > pre-change image is written to WAL, but not fsync'ed.  Later, if a
> > backend wants to write a page, it must make sure the LSN of the page is
> > between the LSN of the last checkpoint and the LSN of the last fsync by
> > a committed transaction.  Only in those cases can the page be written
> > because we are sure that a copy of the page is in the WAL in case there
> > is a partial write.
> 
> Do we have per page checksum? It could be in control log, not necessarily in 
> WAL. But just asking since I don't know.

Yes, in WAL.
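
As a rough illustration of how a checksum stored with a log record lets recovery reject a damaged or truncated record, here is a small sketch. The record layout (4-byte length, 4-byte CRC, payload) and the function names are assumptions made up for this example, and zlib.crc32 is a stand-in; none of this is PostgreSQL's actual WAL record format.

```python
import struct
import zlib
from typing import Optional

HEADER = struct.Struct("<II")  # hypothetical layout: payload length, CRC-32


def pack_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC-32."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(len(payload), crc) + payload


def unpack_record(buf: bytes) -> Optional[bytes]:
    """Return the payload, or None if the record is truncated or fails its CRC."""
    if len(buf) < HEADER.size:
        return None
    length, crc = HEADER.unpack_from(buf, 0)
    payload = buf[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        return None  # record was cut short, e.g. by a crash mid-write
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        return None  # contents do not match what was originally logged
    return payload
```

A reader that gets None treats the record as never having been written, which is the behavior wanted for the tail of a log after a crash.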

> > Now, as you can imagine, these WAL page writes take up a considerable
> > amount of space in the WAL, and cause slowness, but no one has come up
> > with a way to recover from partial page writes without it.  The only way to
> > minimize page writes is to increase checkpoint_segments and
> > checkpoint_timeout so that checkpoints are less frequent, and pages have
> > to be written fewer times to the WAL because old copies of the pages
> > remain in WAL longer.
> 
> If I am not mistaken, we rely upon WAL being consistent to ensure transaction
> recovery. We write() WAL and fsync/open/close it to make sure it goes on disk
> before data pages. What else can we do?
> 
> I cannot see why writing an 8K block is any safer than writing just the
> changes.
> 
> I may be dead wrong but just putting my thoughts together..

The problem is that we need to record what was on the page before we
made the modification because there is no way to know that a write
hasn't corrupted some part of the page.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 

