Re: WAL write of full pages

From: Marty Scholes
Subject: Re: WAL write of full pages
Date:
Msg-id 40560DC3.10004@outputservices.com
In reply to: WAL write of full pages  (Bruce Momjian <pgman@candle.pha.pa.us>)
Replies: Re: WAL write of full pages  (Greg Stark <gsstark@mit.edu>)
Re: WAL write of full pages  (Rod Taylor <pg@rbt.ca>)
Re: WAL write of full pages  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: WAL write of full pages  (Manfred Spraul <manfred@colorfullife.com>)
List: pgsql-hackers
If I understand WAL correctly (and I may not), it is essentially a write 
cache for writes to the data files, because:

1. Data file writes are notoriously random, and writing the log is 
sequential.  Ironically, the sectors mapped by the OS to the disk are 
likely not at all sequential, but they likely are more sequential than 
the random data writes.

2. Log writing allows use of small, super fast drives (e.g. Solid State 
Disks) to speed up total database performance.  You can have slower 
drives for the large files in the database and still get acceptable 
performance.

3. WAL allows for syncing only the pages changed.  For example, suppose 
14 transactions are in flight and each one modifies 40 pages of a data 
file.  When one transaction commits, 560 pages are dirty, but only 40 
need to be written.  Without very close control over which dirty buffers 
reach the OS (and Pg may have this, I am not sure), all 560 pages may get 
written instead of the 40 that actually need to be written.

My only complaint concerns larger systems that have a single (or 
mirrored) large array.  If I have a very fast array of some sort that 
has proper caching, and my data files are on the array, look at my 
options for log files:

1. Put them on the array.
Pros:
* Fastest "drive" available
* RAID, so most reliable "drive" available
Cons:
* All changes get dumped twice: once for WAL, once at checkpoint.
* The array is no slower on random writes than sequential ones, which 
means that the benefits of writing to WAL vs. the data files are lost.

2. Put them on an actual (or mirrored actual) spindle
Pros:
* Keeps WAL and data file I/O separate
Cons:
* All of the non array drives are still slower than the array

3. Put them on mirrored solid state disks or another array
Pros:
* Very fast
* WAL and data file I/O is separate
Cons:
* Big $.  Extremely large $/GB ratio.
* If an array, hordes of unused space.

I suspect (but cannot prove) that performance would jump for systems 
like ours if WAL was done away with entirely and the individual data 
files were synchronized on commit.

Is there a simple way to turn off WAL in the config files so that I may 
do some benchmarking?
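(As far as I know WAL cannot simply be switched off, but as a rough 
benchmarking approximation one can at least stop it from being fsync'ed 
in postgresql.conf -- unsafe for real data, of course:)

```
# postgresql.conf -- benchmarking approximation only; this does not stop
# WAL from being written, it only stops fsync'ing it, so an OS crash can
# leave the cluster corrupt
fsync = false
```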


Bruce Momjian wrote:
> Our current WAL implementation writes copies of full pages to WAL before
> modifying the page on disk.  This is done to prevent partial pages from
> being corrupted in case the operating system crashes during a page
> write.  
> 
> For example, suppose an 8k block is being written to a heap file.  
> First the backend issues a write(), which copies the page into the
> kernel buffer cache.  Later, the kernel sends the write request to the
> drive. Even if the file system uses 8k blocks, the disk is typically
> made up of 512-byte sectors, so the OS translates the 8k block into a
> contiguous number of disk sectors, in this case 16.  There is no
> guarantee that all 16 sectors will be written --- perhaps 8 could be
> written, then the system crashes, or perhaps part of an 512-byte sector
> is written, but the remainder left unchanged.  In all these cases,
> restarting the system will yield corrupt heap blocks.
> 
> The WAL writes copies of full pages so that on restore, it can check
> each page to make sure it hasn't been corrupted.  The system records an
> LSN (log sequence number) on every page.  When a page is modified, its
> pre-change image is written to WAL, but not fsync'ed.  Later, if a
> backend wants to write a page, it must make sure the LSN of the page is
> between the LSN of the last checkpoint and the LSN of the last fsync by
> a committed transaction.  Only in those cases can the page be written,
> because we are sure that a copy of the page is in the WAL in case there
> is a partial write.
> 
> Now, as you can imagine, these WAL page writes take up a considerable
> amount of space in the WAL, and cause slowness, but no one has come up
> with a way to recover from partial page writes without them.  The only
> way to minimize page writes is to increase checkpoint_segments and
> checkpoint_timeout so that checkpoints are less frequent, and pages have
> to be written fewer times to the WAL because old copies of the pages
> remain in WAL longer.
> 
