Thread: Hint bits vs. OS readahead

Hint bits vs. OS readahead

From:
Andrew Gierth
Date:
While waiting for a large restore to complete (and investigating why
parts of it were so slow), I came across this scenario. This isn't
quite the same as some previous discussion of hint bits, but I thought
it was something that could probably be taken into account in future.
This also may be relevant to the tuplestore discussion.

The environment: pg 8.3.5 on FreeBSD/amd64 7.1-prerelease; 32GB RAM;
16 cores of 2.93GHz Xeon 7350; 4x300GB 15krpm SAS data drives in
software RAID10. shared_buffers=1700MB, maintenance_work_mem=1GB

The scenario: pg_restore of a dump containing a large partitioned table
(a dozen partitions of ~10GB each). The actual loading of the data
proceeds as expected, the interesting part is the creation of indexes
afterwards.

Watching the progress of the backend, a large proportion of the time
is taken up by the heap scan to retrieve the data. The problem is, of
course, that the backend settles down into an access pattern like this:
  lseek(0x64,0x3ef7c000,SEEK_SET)
  read(0x64,0x864123340,0x2000)  => 8192/0x2000
  lseek(0x64,0x3ef3e000,SEEK_SET)
  write(0x64,0x864125340,0x2000) => 8192/0x2000

where fd 0x64 is the table heap file; the read is obvious, the write is
caused by writing a previously hinted page back to disk when the backend
wants to reuse the buffer. Notice that this write is happening in the
same backend (and on the same fd).
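
To make this easier to play with outside a full restore, here is a
minimal sketch (not from the original trace; the test file name and the
32-block write lag are arbitrary assumptions) that replays the same
interleaved pattern on a single descriptor, so the effect on a given
OS's readahead can be timed directly:

/*
 * Hypothetical microbenchmark: sequential forward reads with a
 * dirty-page write trailing a fixed distance behind, all on one fd,
 * mimicking the backend's behaviour above.  Expects an existing
 * large file named "testfile".
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLKSZ 8192              /* PostgreSQL page size */
#define LAG   32                /* assumed write lag, in blocks */

int
main(void)
{
    int     fd = open("testfile", O_RDWR);
    char    buf[BLKSZ];
    off_t   nblocks, i;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    nblocks = lseek(fd, 0, SEEK_END) / BLKSZ;

    for (i = LAG; i < nblocks; i++)
    {
        /* sequential forward read */
        lseek(fd, i * BLKSZ, SEEK_SET);
        if (read(fd, buf, BLKSZ) != BLKSZ)
            break;

        /* write back a page a fixed distance behind the read position */
        lseek(fd, (i - LAG) * BLKSZ, SEEK_SET);
        if (write(fd, buf, BLKSZ) != BLKSZ)
            break;
    }
    close(fd);
    return 0;
}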

At least on unpatched FreeBSD this access pattern destroys OS-level
readahead, though lower-level readahead on the actual disk drives
themselves hides this fact to a significant extent (each read() call
forces a SCSI transaction, but this transaction completes quite
quickly due to read caching on the drive).

In order to test how bad the effect was, I patched FreeBSD to use
separate sequential-behaviour tracking for reads and writes (this
patch turns out to be trivial, affecting only a couple of dozen
lines). The effect was fairly dramatic; the total time taken for
CREATE INDEX was cut by a factor of slightly better than 2 (typically
from ~700 seconds per partition to ~320 seconds on my data).

[for the patch see http://www.rhodiumtoad.org.uk/junk/seq.patch.txt ]
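
Reduced to a sketch, the idea is roughly the following (illustrative
only, not the actual FreeBSD code; the names are made up): keep the
expected-next-offset and run count separately per direction, so a
trailing write no longer resets the read-side heuristic:

/*
 * Illustrative sketch only -- not the FreeBSD patch.  The point is that
 * the sequential-access score used to drive readahead for reads is no
 * longer clobbered by interleaved writes, and vice versa.
 */
#include <sys/types.h>

struct seqtrack
{
    off_t   next_offset;    /* where the next sequential access would land */
    int     seqcount;       /* length of the current sequential run */
};

struct file_seq_state
{
    struct seqtrack rd;     /* tracked separately for reads ... */
    struct seqtrack wr;     /* ... and for writes */
};

/* returns the run length; 0 means treat this access as random */
int
sequential_score(struct seqtrack *st, off_t offset, size_t len)
{
    if (offset == st->next_offset)
        st->seqcount++;     /* continues the previous access: sequential */
    else
        st->seqcount = 0;   /* jumped elsewhere: restart the run */

    st->next_offset = offset + (off_t) len;
    return st->seqcount;
}

/* readers would call sequential_score(&fs->rd, ...), writers &fs->wr */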

The obvious question is whether this is something which should be left
as the OS's problem, or whether it would be worth having pg do some
special handling of file opens to distinguish read and write accesses,
or sequential from random accesses when both are likely to be happening
at the same time. I've so far had conflicting answers about how well
Linux handles this case (and not being a Linux user I have no easy way
to test it myself).
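
One shape such special handling could take, purely as a sketch (the
names here are invented for illustration, and this is not a worked-out
proposal): open the relation with two descriptors, so a kernel that
keeps per-descriptor sequential state sees a clean read stream on one
fd and the trailing writes on the other:

/*
 * Sketch only: separate descriptors for the scan's reads and for
 * dirty-buffer writeback on the same relation file, so per-fd
 * sequential heuristics in the kernel are not confused by the mix.
 */
#include <fcntl.h>

struct rel_fds
{
    int     read_fd;    /* used only by the sequential heap-scan reads */
    int     write_fd;   /* used only when writing back dirty buffers */
};

int
rel_open(struct rel_fds *r, const char *path)
{
    r->read_fd = open(path, O_RDWR);
    r->write_fd = open(path, O_RDWR);
    if (r->read_fd < 0 || r->write_fd < 0)
        return -1;
    return 0;
}

(The read side could also be hinted directly, e.g. with
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL) on platforms that have
it, but that does nothing to keep the writes out of the heuristic.)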

-- 
Andrew (irc:RhodiumToad)


Re: Hint bits vs. OS readahead

From:
Decibel!
Date:
On Dec 5, 2008, at 7:50 PM, Andrew Gierth wrote:
> While waiting for a large restore to complete (and investigating why
> parts of it were so slow), I came across this scenario. This isn't
> quite the same as some previous discussion of hint bits, but I thought
> it was something that could probably be taken into account in future.
> This also may be relevant to the tuplestore discussion.
>
> The environment: pg 8.3.5 on FreeBSD/amd64 7.1-prerelease; 32GB RAM;
> 16 cores of 2.93GHz Xeon 7350; 4x300GB 15krpm SAS data drives in
> software RAID10. shared_buffers=1700MB, maintenance_work_mem=1GB
>
> The scenario: pg_restore of a dump containing a large partitioned table
> (a dozen partitions of ~10GB each). The actual loading of the data
> proceeds as expected, the interesting part is the creation of indexes
> afterwards.
>
> Watching the progress of the backend, a large proportion of the time
> is taken up by the heap scan to retrieve the data. The problem is, of
> course, that the backend settles down into an access pattern like this:
>
>   lseek(0x64,0x3ef7c000,SEEK_SET)
>   read(0x64,0x864123340,0x2000) => 8192/0x2000
>   lseek(0x64,0x3ef3e000,SEEK_SET)
>   write(0x64,0x864125340,0x2000) => 8192/0x2000
>
> where fd 0x64 is the table heap file; the read is obvious, the write is
> caused by writing a previously hinted page back to disk when the backend
> wants to reuse the buffer. Notice that this write is happening in the
> same backend (and on the same fd).
>
> At least on unpatched FreeBSD this access pattern destroys OS-level
> readahead, though lower-level readahead on the actual disk drives
> themselves hides this fact to a significant extent (each read() call
> forces a SCSI transaction, but this transaction completes quite
> quickly due to read caching on the drive).
>
> In order to test how bad the effect was, I patched FreeBSD to use
> separate sequential-behaviour tracking for reads and writes (this
> patch turns out to be trivial, affecting only a couple of dozen
> lines). The effect was fairly dramatic; the total time taken for
> CREATE INDEX was cut by a factor of slightly better than 2 (typically
> from ~700 seconds per partition to ~320 seconds on my data).
>
> [for the patch see http://www.rhodiumtoad.org.uk/junk/seq.patch.txt ]
>
> The obvious question is whether this is something which should be left
> as the OS's problem, or whether it would be worth having pg do some
> special handling of file opens to distinguish read and write accesses,
> or sequential from random accesses when both are likely to be happening
> at the same time. I've so far had conflicting answers about how well
> Linux handles this case (and not being a Linux user I have no easy way
> to test it myself).

We don't do restores very often, but we have noticed that recovery
mode is painfully slow for us, either from a crash or to bring up a
PITR snapshot. We're running on 16-core IA64 machines with 96GB of RAM,
hitting iSCSI SANs (some SATA, some SAS). Under ideal conditions, PG
can read or write at 100+MB/s. Typically, we're pushing ~5MB/s, but
during recovery we'll only do 600-700kB/s. I've never straced a
backend to see exactly what's going on.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828