Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Thomas Munro
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CAEepm=1iGXvKo--UkktA1JM3EAfk9EyBEdNFWXpLKCY3HDwSzw@mail.gmail.com
In reply to: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Sat, Feb 4, 2017 at 11:58 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
>> show speed-up relative to that time):
>>
>>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
>> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>>    1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x
>
> I think that this presorted case doesn't improve much because the
> sorting itself is so cheap, as explained in my last mail. However, the
> improvements as workers are added are still smaller than expected. I
> think that this indicates that there isn't enough I/O capacity
> available here to truly show the full potential of the patch -- I've
> certainly seen better scalability for cases like this when there is a
> lot of I/O bandwidth available, and I/O parallelism is there to be
> taken advantage of. Say, when using a system with a large RAID array
> (I used a RAID0 array with 12 HDDs for my own tests). Another issue is
> that you probably don't have enough data here to really show off the
> patch. I don't want to dismiss the benchmark, which is still quite
> informative, but it's worth pointing out that the feature is going to
> be most compelling for very large indexes, that will take at least
> several minutes to build under any circumstances. (Having a
> reproducible case is also important, which what you have here has
> going for it, on the other hand.)

Right.  My main reason for starting smallish was to allow me to search
a space with several variables without waiting eons.  Next I would
like to run a small subset of those tests with, say, 10, 20 or even
100 times more data loaded, so the tables would be ~20GB, ~40GB or
~200GB.
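
(Roughly, a scaled-up 'asc' run would look something like the sketch
below; the table/column names, memory setting and row count are
illustrative only, not the actual harness.)

  # Illustrative sketch only, not the real test harness: "t"/"x" are
  # placeholder names and the row count just stands in for "10x/20x/100x
  # more than the small runs", whatever that works out to.
  psql <<'SQL'
  DROP TABLE IF EXISTS t;
  CREATE TABLE t AS                        -- 'asc' case: pre-sorted input
    SELECT g AS x FROM generate_series(1, 100000000) g;
  SET maintenance_work_mem = '512MB';
  \timing on
  CREATE INDEX t_x_idx ON t (x);           -- worker count set via whatever
                                           -- knob the patch exposes
  SQL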

About read bandwidth:  It shouldn't have been touching the disk at all
for reads.  I did a dummy run of the index build before the measured
runs, so a 2GB table being sorted in ~2 minutes would certainly have
been read entirely from the OS page cache, since the machine has oodles
of RAM.

About write bandwidth:  The WAL, the index and the temp files all went
to an SSD array, though I don't have its characteristics to hand.  I
should also be able to test on a multi-spindle HDD array.  I doubt
either can touch your 12-way RAID0 array, but I will look into that.
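
(If it turns out to matter, something like the following would let me
pin the different pieces to particular arrays; the tablespace names and
paths are made up, and WAL placement is an initdb-time decision rather
than something settable per index build.)

  # Made-up tablespace names/paths, just to show the idea:
  psql <<'SQL'
  CREATE TABLESPACE ssd LOCATION '/mnt/ssd/pg_tblspc';
  CREATE TABLESPACE hdd LOCATION '/mnt/hdd/pg_tblspc';
  SET temp_tablespaces = 'hdd';                  -- where the sort's temp files go
  CREATE INDEX t_x_idx ON t (x) TABLESPACE ssd;  -- where the index itself goes
  SQL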

> I suspect that this system isn't particularly well balanced for the
> task of benchmarking the patch. You would probably see notably better
> scalability than any you've shown in any test if you could add
> additional sequential I/O bandwidth, which is probably an economical,
> practical choice for many users. I suspect that you aren't actually
> saturating available CPUs to the greatest extent that the
> implementation makes possible.

I will look into what I/O options I can access before running larger
tests.  I will also look into running the tests with both cold and warm
caches (i.e. "echo 1 > /proc/sys/vm/drop_caches") so that read
bandwidth enters the picture.
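
(Roughly what I have in mind, with placeholder index names; dropping
the page cache needs root.)

  # Warm run first (data already in the OS page cache), then a cold run:
  psql <<'SQL'
  \timing on
  CREATE INDEX t_x_warm ON t (x);
  DROP INDEX t_x_warm;
  SQL
  sudo sh -c 'sync; echo 1 > /proc/sys/vm/drop_caches'
  psql <<'SQL'
  \timing on
  CREATE INDEX t_x_cold ON t (x);   -- now the reads have to come from disk
  SQL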

> Another thing I want to point out is that with 1MB of
> maintenance_work_mem, the patch appears to do very well, but that
> isn't terribly meaningful. I would suggest that we avoid testing this
> patch with such a low amount of memory -- it doesn't seem important.
> This is skewed by the fact that you're using replacement selection in
> the serial case only. I think what this actually demonstrates is that
> replacement selection is very slow, even with its putative best case.
> I believe that commit 2459833 was the final nail in the coffin of
> replacement selection. I certainly don't want to relitigate the
> discussion on replacement_sort_tuples, and am not going to push too
> hard, but ISTM that we should fully remove replacement selection from
> tuplesort.c and be done with it.

Interesting.  I haven't grokked this but will go and read about it.

Based on your earlier comments about banana skin effects, I'm
wondering if it would be interesting to add a couple more heap
distributions to the test set that are almost completely sorted except
for a few entries out of order.
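
Something like this is the kind of generator I have in mind (the names
and the 0.01% displacement fraction are arbitrary placeholders):

  # Ascending values, except for a small fraction displaced to random positions:
  psql <<'SQL'
  CREATE TABLE t_nearly_sorted AS
    SELECT CASE WHEN random() < 0.0001
                THEN (random() * 10000000)::int   -- ~0.01% of entries out of order
                ELSE g
           END AS x
    FROM generate_series(1, 10000000) g;
  SQL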

-- 
Thomas Munro
http://www.enterprisedb.com


