Re: Parallel Seq Scan

From: Jeff Janes
Subject: Re: Parallel Seq Scan
Date:
Msg-id: CAMkU=1zq17cb-FgXnuRpzTwAL2aoaG8Gqs=AodUh1BTmfH5X9Q@mail.gmail.com
In reply to: Re: Parallel Seq Scan  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses: Re: Parallel Seq Scan  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Tue, Jan 27, 2015 at 11:08 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/28/2015 04:16 AM, Robert Haas wrote:
On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Now, when you did what I understand to be the same test on the same
machine, you got times ranging from 9.1 seconds to 35.4 seconds.
Clearly, there is some difference between our test setups.  Moreover,
I'm kind of suspicious about whether your results are actually
physically possible.  Even in the best case where you somehow had the
maximum possible amount of data - 64 GB on a 64 GB machine - cached,
leaving no space for cache duplication between PG and the OS and no
space for the operating system or postgres itself - the table is 120
GB, so you've got to read *at least* 56 GB from disk.  Reading 56 GB
from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
there could be some speedup from issuing I/O requests in parallel
instead of serially, but that is a 15x speedup over dd, so I am a
little suspicious that there is some problem with the test setup,
especially because I cannot reproduce the results.
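(Spelling that arithmetic out: 120 GB table - 64 GB of possible cache leaves at least 56 GB that must come off disk, and 56 GB / 9 s is roughly 6.2 GB/s sustained.)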

So I thought about this a little more, and I realized after some
poking around that hydra's disk subsystem is actually six disks
configured in a software RAID5[1].  So one advantage of the
chunk-by-chunk approach you are proposing is that you might be able to
get all of the disks chugging away at once, because the data is
presumably striped across all of them.  Reading one block at a time,
you'll never have more than 1 or 2 disks going, but if you do
sequential reads from a bunch of different places in the relation, you
might manage to get all 6.  So that's something to think about.

One could imagine an algorithm like this: as long as there are more
1GB segments remaining than there are workers, each worker tries to
chug through a separate 1GB segment.  When there are not enough 1GB
segments remaining for that to work, then they start ganging up on the
same segments.  That way, you get the benefit of spreading out the I/O
across multiple files (and thus hopefully multiple members of the RAID
group) when the data is coming from disk, but you can still keep
everyone busy until the end, which will be important when the data is
all in-memory and you're just limited by CPU bandwidth.
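
A rough sketch of that assignment policy, with made-up names (this is only an illustration of the idea, not code from any patch; assume the caller holds whatever lock protects the shared state):

typedef struct SegmentAssignSketch
{
    int     nworkers;        /* cooperating workers */
    int     nsegments;       /* 1GB segments in the relation */
    int     next_untouched;  /* lowest segment nobody has claimed yet */
} SegmentAssignSketch;

static int
choose_segment(SegmentAssignSketch *st, int worker_id)
{
    int remaining = st->nsegments - st->next_untouched;

    if (remaining <= 0)
        return -1;                      /* every segment has been claimed */

    if (remaining > st->nworkers)
        return st->next_untouched++;    /* give this worker its own segment */

    /*
     * Tail phase: fewer unclaimed segments than workers, so spread the
     * workers across whatever is left and let them share those segments.
     */
    return st->next_untouched + (worker_id % remaining);
}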

OTOH, spreading the I/O across multiple files is not a good thing, if you don't have a RAID setup like that. With a single spindle, you'll just induce more seeks.

Perhaps the OS is smart enough to read in large-enough chunks that the occasional seek doesn't hurt much. But then again, why isn't the OS smart enough to read in large-enough chunks to take advantage of the RAID even when you read just a single file?

In my experience with RAID, it is smart enough to take advantage of that.  If the RAID controller detects a sequential read access pattern, it initiates a read-ahead on each disk to pre-position the data it will need (or at least, the behavior I observe is as if it did that).  But maybe if the sequential read is really a bunch of "random" reads from different processes which just happen to add up to sequential, that confuses the algorithm?
 
Cheers,

Jeff
