Re: Experimenting with hash join prefetch

From: Thomas Munro
Subject: Re: Experimenting with hash join prefetch
Date:
Msg-id: CA+hUKGKMNN75cu=QSV4y_DxZgKvi-SR-QMq70NT8_wdfbLoZeA@mail.gmail.com
In response to: Re: Experimenting with hash join prefetch  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Experimenting with hash join prefetch  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On Fri, Apr 12, 2019 at 1:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
> It would be interesting to see how this does with moderately-long text
> keys, say 32 or 64 byte strings, and with actually-long text keys, say
> several kB, and then with gigantic text keys, say several MB.  At some
> point the key probably gets large enough that computing the hash value
> for the next key evicts the current key from the relevant CPU cache,
> and if I had to guess, at that point prefetching will become a loser.
> But I don't know where that point is.  If it turns out for example
> that this technique is a winner for pass-by-value datatypes and a
> loser for pass-by-reference datatypes, or that it's a winner always,
> or some sort of simple rule like that, awesome!  But if it turns out
> that there's no simple rule that we can use to know whether it wins or
> loses, then that might make things a little tricky.

Ok, I ran the attached script on an E5-2695 v3 @ 2.30GHz with 32K of
L1, 256K of L2, 30M of L3.  I used shared_buffers=16GB and prewarmed
all tables.

The short version: very mixed results.  For small hash tables it
clearly hurts, for large ones it looks promising.  Much more digging
required to draw any conclusions.

It'd be nice to understand exactly how the hidden parameters work.
Hash bucket array size vs hardware cache sizes must be a factor.
Another is surely timing; there is an optimal time to begin
prefetching (too soon and your line might be replaced by the time you
need it, too late and you won't avoid stalling).  Here, I am doing a
ham-fisted prefetch of t+1's bucket header and not yet trying to
prefetch the tuple itself, but the Intel/CMU paper[1] of course shows
a deeper pipeline and talks about calibrating the prefetch distance
for the hardware.  As far as I can see, that's quite tricky in
general, and complicated by our executor design: the number of cycles
before the node runs again depends on the higher-level plan!
Previously I have heard icache-based arguments for why we should be
able to emit more than one tuple at a time, and I suppose this is a
new argument for that: it makes it actually plausible to calibrate the
prefetch distance for "software-pipelined prefetch" techniques.
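
To make the probe-side experiment concrete, here's a minimal sketch
of the technique (not the actual patch: HashTuple, HashTable,
probe_with_prefetch and PREFETCH_DISTANCE are names made up for
illustration, __builtin_prefetch is the GCC/Clang intrinsic, and it
assumes a batch of precomputed hash values, which is exactly what our
one-tuple-at-a-time executor doesn't give you today):

#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DISTANCE 1     /* would need per-hardware calibration */

typedef struct HashTuple
{
    struct HashTuple *next;
    uint32_t    hash;
    /* ... key and payload ... */
} HashTuple;

typedef struct HashTable
{
    HashTuple **buckets;
    size_t      nbuckets;       /* power of two */
} HashTable;

void
probe_with_prefetch(HashTable *ht, const uint32_t *hashes, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        /* Start pulling in the bucket header we'll need later. */
        if (i + PREFETCH_DISTANCE < n)
        {
            size_t  next = hashes[i + PREFETCH_DISTANCE] & (ht->nbuckets - 1);

            __builtin_prefetch(&ht->buckets[next], 0 /* read */, 3);
        }

        /* Walk the current tuple's chain; its bucket header was
         * (hopefully) prefetched PREFETCH_DISTANCE iterations ago. */
        for (HashTuple *t = ht->buckets[hashes[i] & (ht->nbuckets - 1)];
             t != NULL;
             t = t->next)
        {
            if (t->hash == hashes[i])
            {
                /* ... compare keys, emit a match ... */
            }
        }
    }
}

With PREFETCH_DISTANCE = 1 that's the ham-fisted version described
above; the deeper pipelines in the Intel/CMU paper amount to raising
that constant and prefetching the tuples themselves in a later stage.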

Anyway, here are the results for what they are worth:

Table r has 50,000,000 rows.  Table s was tested with 3 different
sizes.  Both tables have the same layout: key (various types and
sizes) + 0, 2 or 4 extra columns.  I ran each test 3 times, and
compared the worst (w) and median (m) times and computed the speed-up
provided by the patch.  The absolute times (not shown) were all in the
range 9-40 seconds, depending on the parameters.

            s=1,000              s=100,000            s=10,000,000
            ==================== ==================== ====================
int         w=-10.86%, m=-11.03% w=-8.56%, m=-9.17%   w=+16.79%, m=+19.63%
 + 2 cols   w=-14.19%, m=-16.97% w=-6.59%, m=-7.89%   w=+15.88%, m=+16.81%
 + 4 cols   w=-17.14%, m=-14.34% w=-10.01%, m=-8.85%  w=+37.38%, m=+24.01%
text(8)     w=-12.42%, m=-12.32% w=-4.04%, m=-1.96%   w=+13.52%, m=+16.17%
 + 2 cols   w=-13.50%, m=-14.98% w=-4.29%, m=-3.40%   w=+15.98%, m=+19.45%
 + 4 cols   w=-11.53%, m=-9.61%  w=-3.70%, m=-5.91%   w=+46.66%, m=+51.06%
text(16)    w=-11.46%, m=-9.71%  w=-4.85%, m=-3.86%   w=+16.47%, m=+22.10%
 + 2 cols   w=-13.97%, m=-14.55% w=-7.08%, m=-6.07%   w=+20.50%, m=+21.77%
 + 4 cols   w=-9.72%, m=-11.31%  w=-1.03%, m=-2.55%   w=+8.25%, m=+12.21%
text(32)    w=-14.86%, m=-15.48% w=-9.36%, m=-9.27%   w=+19.86%, m=+15.34%
 + 2 cols   w=-12.35%, m=-11.71% w=-10.61%, m=-9.87%  w=+98.29%, m=+97.40%
 + 4 cols   w=-10.71%, m=-10.40% w=-2.42%, m=-1.25%   w=-8.34%, m=-10.44%
text(64)    w=-9.45%, m=-11.36%  w=-13.94%, m=-11.42% w=+9.44%, m=+9.57%
 + 2 cols   w=-12.47%, m=-13.17% w=-9.60%, m=-6.61%   w=-4.69%, m=+10.06%
 + 4 cols   w=-9.47%, m=-12.20%  w=-5.60%, m=-3.55%   w=-15.91%, m=-23.29%

I'd expect the right-hand column to get a further speed-up (or
slow-down) of around 1/5 the given numbers, if we also prefetch during
the build phase (= s/r).  Maybe a 2-stage pipeline would help, though
I'm starting to see the complexity of organising a perfectly primed
memory pipeline ... i.e. what this topic is really all about.
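
For the build phase the sketch is nearly identical (reusing the
made-up HashTuple/HashTable types from the sketch above), except the
prefetch uses a write hint because we're about to link a tuple into
the bucket:

void
build_with_prefetch(HashTable *ht, HashTuple **tuples, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        /* Start pulling in the bucket header tuple i + 1 will modify. */
        if (i + 1 < n)
        {
            size_t  next = tuples[i + 1]->hash & (ht->nbuckets - 1);

            __builtin_prefetch(&ht->buckets[next], 1 /* write */, 3);
        }

        /* Push tuple i onto the front of its bucket's chain. */
        size_t  cur = tuples[i]->hash & (ht->nbuckets - 1);

        tuples[i]->next = ht->buckets[cur];
        ht->buckets[cur] = tuples[i];
    }
}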

Well, that's enough learning-basic-facts-about-computers by trying to
whack PostgreSQL with database literature for now.  Looks like I'll
have to work quite a bit harder to make something useful out of this.
I think I see where some of the problems lie.  Being able to store
multiple tuples in a slot (via TTS-implementation-specific means, as
the heap scan does in my patch, and as I think hash join would want
to do in order to emit multiple tuples before relinquishing control),
and then to peek at them via ExecScrollSlot() and perhaps consume
them via ExecNextSlot(), seems promising, but I've only scratched the
surface.
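
To show roughly what I have in mind (purely hypothetical -- these
functions don't exist, and a real version would live behind the
TupleTableSlot abstraction rather than use bare void pointers), a
slot that buffers a small batch of tuples could let a consumer peek
ahead, hash and prefetch early, and only then consume:

#include <stddef.h>

#define SLOT_CAPACITY 8

typedef struct MultiTupleSlot
{
    void   *tuples[SLOT_CAPACITY];  /* ring buffer of buffered tuples */
    int     head;                   /* index of next tuple to consume */
    int     count;                  /* number of tuples currently held */
} MultiTupleSlot;

/* Peek at the tuple 'offset' positions past the consume point, e.g.
 * to compute its hash and prefetch its bucket; NULL if unavailable. */
void *
ExecScrollSlot(MultiTupleSlot *slot, int offset)
{
    if (offset >= slot->count)
        return NULL;
    return slot->tuples[(slot->head + offset) % SLOT_CAPACITY];
}

/* Consume and return the next tuple, or NULL if the slot is empty. */
void *
ExecNextSlot(MultiTupleSlot *slot)
{
    void   *tuple;

    if (slot->count == 0)
        return NULL;
    tuple = slot->tuples[slot->head];
    slot->head = (slot->head + 1) % SLOT_CAPACITY;
    slot->count--;
    return tuple;
}

Hash join's probe side could then run ExecScrollSlot() ahead of
ExecNextSlot() by exactly the prefetch distance.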

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf


--
Thomas Munro
https://enterprisedb.com
