Re: [HACKERS] Parallel Hash take II
From: Thomas Munro
Subject: Re: [HACKERS] Parallel Hash take II
Date:
Msg-id: CAEepm=2158-7OJ-kHe+p4bqFkwmx1wOr7BiwqHBHCEBMtLbjbQ@mail.gmail.com
In reply to: Re: [HACKERS] Parallel Hash take II (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Aug 1, 2017 at 9:28 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-07-26 20:12:56 +1200, Thomas Munro wrote:
>> I'll report on performance separately.
>
> Looking forward to that ;)

Here are some experimental results from a Xeon E5-2695 v3 with a ton of RAM and spinning disks (EDB lab server "scylla"). I used TPC-H dbgen scale 10 with the additional indexes suggested by Tomas Vondra[1]. 10GB of source data (= 23GB pgdata dir) is obviously quite small as these things go, and I hope we'll run some of these at a much larger scale soon (it's a big job), but it's enough to run queries for tens of seconds to minutes, so it's definitely in parallel query territory and shows some pretty interesting effects IMHO.

First, here's a stupid self-join as a warm-up. The query is SELECT COUNT(*) FROM lineitem r JOIN lineitem s USING (l_orderkey, l_linenumber), where lineitem is ~60 million rows.
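(For concreteness, here is roughly what one cell of the grids below looks like as a psql session. enable_parallel_hash is the new GUC added by this patch set and work_mem is as described below; max_parallel_workers_per_gather is the stock PostgreSQL knob, and the exact values shown here are illustrative rather than the precise settings used:)

    -- one cell of the grid: feature on, 6 workers, no batching
    SET enable_parallel_hash = on;             -- off = same plan shapes as unpatched master
    SET max_parallel_workers_per_gather = 6;   -- the w=6 column; 0 gives the w=0 column
    SET work_mem = '20GB';                     -- the "sky-high" case; '128MB' forces batching

    EXPLAIN ANALYZE
    SELECT COUNT(*)
    FROM lineitem r
    JOIN lineitem s USING (l_orderkey, l_linenumber);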
(1) With work_mem set sky-high so that no batching is required, how much speed-up does each worker contribute with this feature off (= same as unpatched master) and on? In this table, each cell shows the speed-up compared to w=0 (no workers):

 parallel_hash |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
---------------+-------+-------+-------+-------+-------+-------+-------
 off           | 1.00x | 1.42x | 1.66x | 1.76x | 1.66x | 1.68x | 1.69x
 on            | 1.00x | 1.76x | 2.54x | 3.21x | 3.80x | 4.30x | 4.79x

(2) With work_mem set to 128MB, this query needs 32 batches. Again, how much speed-up does each worker contribute with the feature off and on?

 parallel_hash |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
---------------+-------+-------+-------+-------+-------+-------+-------
 off           | 1.00x | 1.37x | 1.49x | 1.32x | 1.67x | 1.63x | 1.64x
 on            | 1.00x | 1.94x | 2.72x | 2.82x | 3.02x | 4.64x | 5.98x

I haven't tried to grok the shape of that curve yet. Interestingly (not shown here), the 32-batch parallel hash actually starts to beat the single-batch parallel hash somewhere around 5-6 workers, and at 15 workers it achieves a 9.53x speed-up compared to w=0 and is still gaining as you add more workers, whereas the single-batch version tops out around 8 workers. This may be due in part to the trade-offs discussed in "Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs" (short version: partitioning up front can pay off by reducing cache misses at various levels, and some research databases would consider that), but I would think we're probably pretty far away from that frontier and there are probably other more basic problems. Investigation/profiling required.

Next, here are some numbers from the TPC-H queries. I included only queries where a Parallel Hash was selected by the planner. I stopped at w=6 because that's the highest number of workers the planner would pick by default at that scale. (If I'm getting the maths right, TPC-H scale 300's lineitem table should inspire about 10 workers to get out of bed; you get an extra worker each time a table triples in size.)
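(Sanity-checking that estimate using only the numbers above: the planner picked at most 6 workers at scale 10, and scale 300 is 30x bigger, so one extra worker per tripling suggests roughly:)

    -- back-of-the-envelope arithmetic only: 6 workers at scale 10,
    -- plus one more per tripling of table size (log3 of the 30x growth)
    SELECT 6 + log(3, 30.0) AS approx_workers_at_scale_300;  -- ~9.1, i.e. about 10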
(3) What is the speed-up with enable_parallel_hash = on vs enable_parallel_hash = off? Here is the result table for various numbers of workers, with work_mem set to 1GB:

 query |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
-------+-------+-------+-------+-------+-------+-------+-------
     3 | 1.02x | 1.16x | 1.37x | 1.79x | 1.95x | 2.29x | 2.44x
     5 | 1.03x | 1.15x | 1.20x | 1.44x | 1.95x | 2.05x | 1.34x
     7 | 1.02x | 1.26x | 1.54x | 2.18x | 2.57x | 1.25x | 1.32x
     8 | 1.00x | 1.56x | 1.49x | 1.47x | 1.40x | 0.55x | 0.55x
     9 | 1.02x | 1.24x | 1.35x | 1.50x | 1.59x | 1.82x | 1.82x
    10 | 1.02x | 1.16x | 1.19x | 1.44x | 1.51x | 1.75x | 1.83x
    12 | 1.01x | 1.22x | 1.53x | 0.72x | 0.74x | 0.73x | 0.99x
    14 | 1.00x | 1.08x | 1.18x | 1.33x | 1.41x | 1.54x | 1.52x
    16 | 1.01x | 1.22x | 1.10x | 1.10x | 1.10x | 1.11x | 1.10x
    18 | 0.99x | 1.07x | 1.05x | 0.99x | 0.99x | 0.99x | 1.03x
    21 | 1.00x | 1.28x | 1.24x | 1.34x | 0.18x | 0.19x | 0.23x

Some commentary on the cases where performance was apparently hurt by the feature: for Q21 at w=3 and above, the planner with enable_parallel_hash = off switched from a hash join to a nested loop, and that turned out to be better; with enable_parallel_hash = on it didn't give up on the hash join until there were 6 workers. Something similar happened with Q8 around 5 workers. Q21 has some major cardinality estimation problems as discussed elsewhere, and on this run I didn't think to apply the patch that fixes (?) that. In other words, as far as I can tell, all of those are cases where there is possibly room for general planner improvement outside this project: the point at which we flip from one plan type to another moves around, not necessarily indicating a problem with Parallel Hash as an executor node. That isn't to say I'm not interested in understanding the causes better and trying to fix them if I can.

(4) The same comparison, with work_mem set to 128MB, resulting in more batching:

 query |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
-------+-------+-------+-------+-------+-------+-------+-------
     3 | 1.03x | 1.23x | 1.44x | 1.76x | 1.97x | 2.23x | 2.55x
     5 | 1.01x | 1.07x | 1.25x | 1.44x | 1.79x | 2.05x | 1.31x
     7 | 1.02x | 1.42x | 1.73x | 1.26x | 1.20x | 1.28x | 1.33x
     8 | 1.01x | 1.57x | 1.51x | 1.49x | 1.41x | 0.55x | 0.52x
     9 | 0.99x | 1.14x | 1.43x | 1.56x | 1.82x | 1.96x | 2.06x
    10 | 1.02x | 1.08x | 1.24x | 1.38x | 1.51x | 1.54x | 1.65x
    12 | 1.02x | 1.02x | 0.71x | 0.73x | 0.73x | 0.99x | 0.99x
    14 | 1.03x | 1.06x | 1.19x | 1.37x | 1.59x | 1.58x | 1.59x
    16 | 1.00x | 1.21x | 1.10x | 1.09x | 1.13x | 1.12x | 1.12x
    18 | 0.98x | 1.22x | 1.28x | 1.21x | 1.10x | 0.98x | 0.95x
    21 | 0.96x | 1.25x | 1.56x | 0.41x | 0.41x | 0.87x | 1.18x

Similar, with the inflection points moving around a bit.

(5) Another way to look at the data is to see how much speed-up each new worker gives you with and without this feature, as I did for the self-join above. In this table there are two lines for each query: the first shows the speed-up as we add more workers with enable_parallel_hash = off (should be the same as unpatched master), and the second shows the speed-up as we add more workers with enable_parallel_hash = on.

 query |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
-------+-------+-------+-------+-------+-------+-------+-------
     3 | 1.00x | 1.60x | 2.00x | 2.07x | 2.27x | 2.23x | 2.39x
     3 | 1.00x | 1.83x | 2.68x | 3.64x | 4.35x | 5.02x | 5.72x
-------+-------+-------+-------+-------+-------+-------+-------
     5 | 1.00x | 1.58x | 2.14x | 2.36x | 2.30x | 2.57x | 8.68x
     5 | 1.00x | 1.75x | 2.49x | 3.29x | 4.34x | 5.09x | 11.28x
-------+-------+-------+-------+-------+-------+-------+-------
     7 | 1.00x | 1.44x | 1.75x | 1.61x | 1.67x | 4.02x | 4.35x
     7 | 1.00x | 1.78x | 2.66x | 3.44x | 4.24x | 4.93x | 5.64x
-------+-------+-------+-------+-------+-------+-------+-------
     8 | 1.00x | 1.19x | 1.28x | 1.31x | 1.36x | 3.30x | 3.34x
     8 | 1.00x | 1.85x | 1.90x | 1.93x | 1.91x | 1.82x | 1.85x
-------+-------+-------+-------+-------+-------+-------+-------
     9 | 1.00x | 1.59x | 2.19x | 2.52x | 2.81x | 2.76x | 2.74x
     9 | 1.00x | 1.94x | 2.88x | 3.69x | 4.38x | 4.92x | 4.89x
-------+-------+-------+-------+-------+-------+-------+-------
    10 | 1.00x | 1.45x | 1.92x | 2.19x | 2.36x | 2.28x | 2.49x
    10 | 1.00x | 1.65x | 2.25x | 3.10x | 3.48x | 3.91x | 4.48x
-------+-------+-------+-------+-------+-------+-------+-------
    12 | 1.00x | 1.50x | 1.76x | 4.71x | 5.66x | 6.61x | 7.61x
    12 | 1.00x | 1.81x | 2.65x | 3.36x | 4.14x | 4.78x | 7.43x
-------+-------+-------+-------+-------+-------+-------+-------
    14 | 1.00x | 1.40x | 1.68x | 1.86x | 1.97x | 1.95x | 1.95x
    14 | 1.00x | 1.50x | 1.98x | 2.47x | 2.76x | 2.98x | 2.95x
-------+-------+-------+-------+-------+-------+-------+-------
    16 | 1.00x | 1.01x | 1.25x | 1.31x | 1.35x | 1.38x | 1.39x
    16 | 1.00x | 1.22x | 1.36x | 1.43x | 1.47x | 1.53x | 1.53x
-------+-------+-------+-------+-------+-------+-------+-------
    18 | 1.00x | 0.86x | 0.93x | 1.08x | 1.11x | 1.22x | 1.15x
    18 | 1.00x | 0.93x | 0.98x | 1.08x | 1.11x | 1.22x | 1.20x
-------+-------+-------+-------+-------+-------+-------+-------
    21 | 1.00x | 1.12x | 0.49x | 0.59x | 5.10x | 5.67x | 5.18x
    21 | 1.00x | 1.44x | 0.62x | 0.80x | 0.95x | 1.08x | 1.22x

[1] https://github.com/tvondra/pg_tpch/blob/master/dss/tpch-index.sql

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers