Re: [HACKERS] Parallel Hash take II

From: Thomas Munro
Subject: Re: [HACKERS] Parallel Hash take II
Msg-id: CAEepm=2158-7OJ-kHe+p4bqFkwmx1wOr7BiwqHBHCEBMtLbjbQ@mail.gmail.com
In reply to: Re: [HACKERS] Parallel Hash take II  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Aug 1, 2017 at 9:28 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-07-26 20:12:56 +1200, Thomas Munro wrote:
>> I'll report on performance separately.
>
> Looking forward to that ;)

Here are some experimental results from a Xeon E5-2695 v3 with a ton
of RAM and spinning disks (EDB lab server "scylla").  I used TPC-H
dbgen scale 10 with the additional indexes suggested by Tomas
Vondra[1].  10GB of source data (= 23GB pgdata dir) is obviously quite
small as these things go, and I hope we'll run some of these with a
much larger scale soon (it's a big job), but it's enough to run
queries for tens of seconds to minutes so it's definitely in parallel
query territory and shows some pretty interesting effects IMHO.

First, here's a stupid self-join as a warm-up.  The query is SELECT
COUNT(*) FROM lineitem r JOIN lineitem s USING (l_orderkey,
l_linenumber), where lineitem is ~60 million rows.
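
(For concreteness, a single cell in the tables below corresponds to a
run along these lines; the exact work_mem value shown is illustrative,
and "w" maps to max_parallel_workers_per_gather:)

    -- sketch of one measurement; settings vary per experiment as described
    SET enable_parallel_hash = on;             -- or off, for the baseline
    SET max_parallel_workers_per_gather = 6;   -- the "w" column
    SET work_mem = '20GB';                     -- "sky-high" for experiment (1)
    SELECT COUNT(*)
    FROM lineitem r
    JOIN lineitem s USING (l_orderkey, l_linenumber);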

(1) With work_mem set sky-high so no batching is required, how much
speed-up does each worker contribute with this feature off (= same as
unpatched master) and on?  In this table, each cell shows the speed-up
compared to w=0 (no workers):
 parallel_hash |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
---------------+-------+-------+-------+-------+-------+-------+-------
 off           | 1.00x | 1.42x | 1.66x | 1.76x | 1.66x | 1.68x | 1.69x
 on            | 1.00x | 1.76x | 2.54x | 3.21x | 3.80x | 4.30x | 4.79x

(2) With work_mem set to 128MB this query needs 32 batches.  Again,
how much speed-up does each worker contribute with the feature off and
on?
 parallel_hash |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
---------------+-------+-------+-------+-------+-------+-------+-------
 off           | 1.00x | 1.37x | 1.49x | 1.32x | 1.67x | 1.63x | 1.64x
 on            | 1.00x | 1.94x | 2.72x | 2.82x | 3.02x | 4.64x | 5.98x
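
(If you want to verify the batch count for a given work_mem, it shows
up on the Hash node in EXPLAIN ANALYZE output; a sketch, not output
from this run:)

    SET work_mem = '128MB';
    EXPLAIN (ANALYZE, COSTS OFF)
    SELECT COUNT(*)
    FROM lineitem r
    JOIN lineitem s USING (l_orderkey, l_linenumber);
    -- the (Parallel) Hash node reports something like
    -- "Buckets: ...  Batches: 32  Memory Usage: ..."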
 

I haven't tried to grok the shape of that curve yet.  Interestingly
(not shown here) the 32 batch parallel hash actually starts to beat
the single-batch parallel hash somewhere around 5-6 workers, and at 15
workers it achieves 9.53x speed-up compared to w=0 and is still
gaining as you add more workers, whereas the single-batch version tops
out around 8 workers.  This may be in part due to the trade-offs
discussed in "Design and Evaluation of Main Memory Hash Join
Algorithms for Multi-core CPUs" (short version: partitioning up front
can pay off by reducing cache misses at various levels and some
research databases would consider that), but I would think we're
probably pretty far away from that frontier, and there are probably
other more basic problems.  Investigation/profiling required.

Next, here are some numbers from the TPC-H queries.  I included only
queries where a Parallel Hash was selected by the planner.  I stopped
at w=6 because that's the highest number of workers the planner would
pick by default at that scale.  (If I'm getting the maths right, TPC-H
scale 300's lineitem table should inspire about 10 workers to get out
of bed; you get an extra worker each time a table triples in size.)
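
(That back-of-envelope can be sanity-checked with something like the
following, assuming the default min_parallel_table_scan_size of 8MB and
ignoring the max_parallel_workers_per_gather cap; it's only a rough
restatement of the planner's tripling rule, not the exact code path:)

    SELECT 1 + floor(log(3::numeric,
                         (pg_relation_size('lineitem') /
                          (8 * 1024 * 1024))::numeric))::int
           AS expected_workers;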

(3) What is the speed-up with enable_parallel_hash = on vs
enable_parallel_hash = off?  Here is the result table for various
numbers of workers, with work_mem set to 1GB.
 query |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
-------+-------+-------+-------+-------+-------+-------+-------
     3 | 1.02x | 1.16x | 1.37x | 1.79x | 1.95x | 2.29x | 2.44x
     5 | 1.03x | 1.15x | 1.20x | 1.44x | 1.95x | 2.05x | 1.34x
     7 | 1.02x | 1.26x | 1.54x | 2.18x | 2.57x | 1.25x | 1.32x
     8 | 1.00x | 1.56x | 1.49x | 1.47x | 1.40x | 0.55x | 0.55x
     9 | 1.02x | 1.24x | 1.35x | 1.50x | 1.59x | 1.82x | 1.82x
    10 | 1.02x | 1.16x | 1.19x | 1.44x | 1.51x | 1.75x | 1.83x
    12 | 1.01x | 1.22x | 1.53x | 0.72x | 0.74x | 0.73x | 0.99x
    14 | 1.00x | 1.08x | 1.18x | 1.33x | 1.41x | 1.54x | 1.52x
    16 | 1.01x | 1.22x | 1.10x | 1.10x | 1.10x | 1.11x | 1.10x
    18 | 0.99x | 1.07x | 1.05x | 0.99x | 0.99x | 0.99x | 1.03x
    21 | 1.00x | 1.28x | 1.24x | 1.34x | 0.18x | 0.19x | 0.23x
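
(Each cell in that table compares a pair of runs that differ only in
the GUC; roughly like this, where the query file name is hypothetical:)

    SET work_mem = '1GB';
    SET max_parallel_workers_per_gather = 4;   -- the "w" column
    SET enable_parallel_hash = off;            -- same plan as unpatched master
    \i q03.sql                                 -- hypothetical per-query file
    SET enable_parallel_hash = on;
    \i q03.sql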
 

Some commentary on the cases where performance was apparently hurt by
the feature: for Q21, with w=3 workers and above and
enable_parallel_hash = off, the planner switched from a hash join to a
nested loop, and that turned out to be better; with
enable_parallel_hash = on it didn't give up on the hash join until
there were 6 workers.  Something similar happened with Q8 around 5 workers.
Q21 has some major cardinality estimation problems as discussed
elsewhere, and on this run I didn't think to apply the patch that
fixes (?) that.  In other words, as far as I can tell, all of those
are cases where there is possibly room for general planner improvement
outside this project: the point at which we flip from one plan type to
another moves around, not necessarily indicating a problem with
Parallel Hash as an executor node.  That isn't to say I'm not
interested in understanding the causes better and trying to fix them
if I can.

(4) The same comparison, with work_mem set to 128MB, resulting in more batching:
 query |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
-------+-------+-------+-------+-------+-------+-------+-------
     3 | 1.03x | 1.23x | 1.44x | 1.76x | 1.97x | 2.23x | 2.55x
     5 | 1.01x | 1.07x | 1.25x | 1.44x | 1.79x | 2.05x | 1.31x
     7 | 1.02x | 1.42x | 1.73x | 1.26x | 1.20x | 1.28x | 1.33x
     8 | 1.01x | 1.57x | 1.51x | 1.49x | 1.41x | 0.55x | 0.52x
     9 | 0.99x | 1.14x | 1.43x | 1.56x | 1.82x | 1.96x | 2.06x
    10 | 1.02x | 1.08x | 1.24x | 1.38x | 1.51x | 1.54x | 1.65x
    12 | 1.02x | 1.02x | 0.71x | 0.73x | 0.73x | 0.99x | 0.99x
    14 | 1.03x | 1.06x | 1.19x | 1.37x | 1.59x | 1.58x | 1.59x
    16 | 1.00x | 1.21x | 1.10x | 1.09x | 1.13x | 1.12x | 1.12x
    18 | 0.98x | 1.22x | 1.28x | 1.21x | 1.10x | 0.98x | 0.95x
    21 | 0.96x | 1.25x | 1.56x | 0.41x | 0.41x | 0.87x | 1.18x

Similar, with the inflection points moving around a bit.

(5) Another way to look at the data is to see how much speed-up each
new worker gives you with and without this feature, as I did for the
self-join above.  In this table, there are two lines for each query.
The first line shows the speed-up as we add more workers with
enable_parallel_hash = off (should be same as unpatched master), and
the second line shows the speed-up as we add more workers, with
enable_parallel_hash = on.
 query |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
-------+-------+-------+-------+-------+-------+-------+-------
     3 | 1.00x | 1.60x | 2.00x | 2.07x | 2.27x | 2.23x | 2.39x
     3 | 1.00x | 1.83x | 2.68x | 3.64x | 4.35x | 5.02x | 5.72x
-------+-------+-------+-------+-------+-------+-------+-------
     5 | 1.00x | 1.58x | 2.14x | 2.36x | 2.30x | 2.57x | 8.68x
     5 | 1.00x | 1.75x | 2.49x | 3.29x | 4.34x | 5.09x | 11.28x
-------+-------+-------+-------+-------+-------+-------+-------
     7 | 1.00x | 1.44x | 1.75x | 1.61x | 1.67x | 4.02x | 4.35x
     7 | 1.00x | 1.78x | 2.66x | 3.44x | 4.24x | 4.93x | 5.64x
-------+-------+-------+-------+-------+-------+-------+-------
     8 | 1.00x | 1.19x | 1.28x | 1.31x | 1.36x | 3.30x | 3.34x
     8 | 1.00x | 1.85x | 1.90x | 1.93x | 1.91x | 1.82x | 1.85x
-------+-------+-------+-------+-------+-------+-------+-------
     9 | 1.00x | 1.59x | 2.19x | 2.52x | 2.81x | 2.76x | 2.74x
     9 | 1.00x | 1.94x | 2.88x | 3.69x | 4.38x | 4.92x | 4.89x
-------+-------+-------+-------+-------+-------+-------+-------
    10 | 1.00x | 1.45x | 1.92x | 2.19x | 2.36x | 2.28x | 2.49x
    10 | 1.00x | 1.65x | 2.25x | 3.10x | 3.48x | 3.91x | 4.48x
-------+-------+-------+-------+-------+-------+-------+-------
    12 | 1.00x | 1.50x | 1.76x | 4.71x | 5.66x | 6.61x | 7.61x
    12 | 1.00x | 1.81x | 2.65x | 3.36x | 4.14x | 4.78x | 7.43x
-------+-------+-------+-------+-------+-------+-------+-------
    14 | 1.00x | 1.40x | 1.68x | 1.86x | 1.97x | 1.95x | 1.95x
    14 | 1.00x | 1.50x | 1.98x | 2.47x | 2.76x | 2.98x | 2.95x
-------+-------+-------+-------+-------+-------+-------+-------
    16 | 1.00x | 1.01x | 1.25x | 1.31x | 1.35x | 1.38x | 1.39x
    16 | 1.00x | 1.22x | 1.36x | 1.43x | 1.47x | 1.53x | 1.53x
-------+-------+-------+-------+-------+-------+-------+-------
    18 | 1.00x | 0.86x | 0.93x | 1.08x | 1.11x | 1.22x | 1.15x
    18 | 1.00x | 0.93x | 0.98x | 1.08x | 1.11x | 1.22x | 1.20x
-------+-------+-------+-------+-------+-------+-------+-------
    21 | 1.00x | 1.12x | 0.49x | 0.59x | 5.10x | 5.67x | 5.18x
    21 | 1.00x | 1.44x | 0.62x | 0.80x | 0.95x | 1.08x | 1.22x

[1] https://github.com/tvondra/pg_tpch/blob/master/dss/tpch-index.sql

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
