Re: Experimenting with hash join prefetch

From: Thomas Munro
Subject: Re: Experimenting with hash join prefetch
Date:
Msg-id: CA+hUKGKMNN75cu=QSV4y_DxZgKvi-SR-QMq70NT8_wdfbLoZeA@mail.gmail.com
In reply to: Re: Experimenting with hash join prefetch (Robert Haas <robertmhaas@gmail.com>)
Replies: Re: Experimenting with hash join prefetch (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On Fri, Apr 12, 2019 at 1:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
> It would be interesting to see how this does with moderately-long text
> keys, say 32 or 64 byte strings, and with actually-long text keys, say
> several kB, and then with gigantic text keys, say several MB. At some
> point the key probably gets large enough that computing the hash value
> for the next key evicts the current key from the relevant CPU cache,
> and if I had to guess, at that point prefetching will become a loser.
> But I don't know where that point is. If it turns out for example
> that this technique is a winner for pass-by-value datatypes and a
> loser for pass-by-reference datatypes, or that it's a winner always,
> or some sort of simple rule like that, awesome! But if it turns out
> that there's no simple rule that we can use to know whether it wins or
> loses, then that might make things a little tricky.

Ok, I ran the attached script on an E5-2695 v3 @ 2.30GHz with 32K of L1, 256K of L2, 30M of L3. I used shared_buffers=16GB and prewarmed all tables. The short version: very mixed results. For small hash tables it clearly hurts, for large ones it looks promising. Much more digging required to draw any conclusions.

It'd be nice to understand exactly how the hidden parameters work. Hash bucket array size vs hardware cache sizes must be a factor. Another is surely timing; there is an optimal time to begin prefetching (too soon and your line might be replaced by the time you need it, too late and you won't avoid stalling). Here, I am doing a ham-fisted prefetch of t+1's bucket header and not yet trying to prefetch the tuple itself, but the Intel/CMU paper[1] of course shows a deeper pipeline and talks about calibrating the prefetch distance for the hardware. As far as I can see, that's quite tricky in general, and complicated by our executor design: the number of cycles before the node runs again depends on the higher-level plan!
Previously I have heard icache-based arguments for why we should be able to emit more than one tuple at a time, and I suppose this is a new argument for that: it makes it actually plausible to calibrate the prefetch distance for "software-pipelined prefetch" techniques.

Anyway, here are the results for what they are worth: Table r has 50,000,000 rows. Table s was tested with 3 different sizes. Both tables have the same layout: key (various types and sizes) + 0, 2 or 4 extra columns. I ran each test 3 times, and compared the worst (w) and median (m) times and computed the speed-up provided by the patch. The absolute times (not shown) were all in the range 9-40 seconds depending on the parameters.

             s=1,000                s=100,000              s=10,000,000
             ====================   ====================   ====================
int          w=-10.86%, m=-11.03%   w=-8.56%,  m=-9.17%    w=+16.79%, m=+19.63%
 + 2 cols    w=-14.19%, m=-16.97%   w=-6.59%,  m=-7.89%    w=+15.88%, m=+16.81%
 + 4 cols    w=-17.14%, m=-14.34%   w=-10.01%, m=-8.85%    w=+37.38%, m=+24.01%
text(8)      w=-12.42%, m=-12.32%   w=-4.04%,  m=-1.96%    w=+13.52%, m=+16.17%
 + 2 cols    w=-13.50%, m=-14.98%   w=-4.29%,  m=-3.40%    w=+15.98%, m=+19.45%
 + 4 cols    w=-11.53%, m=-9.61%    w=-3.70%,  m=-5.91%    w=+46.66%, m=+51.06%
text(16)     w=-11.46%, m=-9.71%    w=-4.85%,  m=-3.86%    w=+16.47%, m=+22.10%
 + 2 cols    w=-13.97%, m=-14.55%   w=-7.08%,  m=-6.07%    w=+20.50%, m=+21.77%
 + 4 cols    w=-9.72%,  m=-11.31%   w=-1.03%,  m=-2.55%    w=+8.25%,  m=+12.21%
text(32)     w=-14.86%, m=-15.48%   w=-9.36%,  m=-9.27%    w=+19.86%, m=+15.34%
 + 2 cols    w=-12.35%, m=-11.71%   w=-10.61%, m=-9.87%    w=+98.29%, m=+97.40%
 + 4 cols    w=-10.71%, m=-10.40%   w=-2.42%,  m=-1.25%    w=-8.34%,  m=-10.44%
text(64)     w=-9.45%,  m=-11.36%   w=-13.94%, m=-11.42%   w=+9.44%,  m=+9.57%
 + 2 cols    w=-12.47%, m=-13.17%   w=-9.60%,  m=-6.61%    w=-4.69%,  m=+10.06%
 + 4 cols    w=-9.47%,  m=-12.20%   w=-5.60%,  m=-3.55%    w=-15.91%, m=-23.29%

I'd expect the right-hand column to get a further speed-up (or slow-down) of around 1/5 the given numbers (= s/r), if we also prefetch during the build phase. Maybe a 2-stage pipeline would help, though I'm starting to see the complexity of organising a perfectly primed memory pipeline ... ie what this topic is really all about.

Well, that's enough learning-basic-facts-about-computers by trying to whack PostgreSQL with database literature for now. Looks like I'll have to work quite a bit harder to make something useful out of this. I think I see where some of the problems lie. I think being able to store multiple tuples in a slot (via TTS-implementation-specific means, as we see with the heap scan in my patch, and as I think hash join would want to do to emit multiple tuples before relinquishing control) and then look at them via ExecScrollSlot() and perhaps also consume them via ExecNextSlot() are promising ideas, but I've only scratched the surface.

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf

--
Thomas Munro
https://enterprisedb.com
Attachments