Обсуждение: Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

Поиск
Список
Период
Сортировка

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

От
Robert Haas
Дата:
On Tue, Aug 29, 2017 at 10:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Indeed.  Thank you for working on this!  To summarise a couple of
> ideas that Peter and I discussed off-list a while back:  (1) While
> building the hash table for a hash join we could build a Bloom filter
> per future batch and keep it in memory, and then while reading from
> the outer plan we could skip writing tuples out to future batches if
> there is no chance they'll find a match when read back in later (works
> only for inner joins and only pays off in inverse proportion to the
> join's selectivity);

You could try to use this with LEFT JOINs also, I think.  If you find
that a row can't match later, don't write it to the batch file;
instead, return a null-extended row immediately.

> (2) We could push a Bloom filter down to scans
> (many other databases do this, and at least one person has tried this
> with PostgreSQL and found it to pay off[1]).

I think the hard part is going to be figuring out a query planner
framework for this, because pushing down the Bloom filter down to the
scan changes the cost and the row-count of the scan.  This is
essentially a less-tractable version of the problem of deciding how
many parallel workers to use; in that case, at least, the number of
works must be an integer, so we could resort to trying them all, if
the costing model were otherwise good enough.  But here the thing that
matters is selectivity, which is a continuous rather than discrete.
Consider:

SELECT * FROM A LEFT JOIN (B INNER JOIN C ON B.x = C.x) ON A.y = B.y
LEFT JOIN D ON A.z > D.z;

Since the A-D join is not an equijoin, we'll have to implement it as a
nested loop; a case could also arise where a nested loop is the only
efficient strategy even for an equijoin, for example if D is huge.

Now, suppose the join to B-C is going to knock out 90% of the rows and
the join to D is going to knock out 50% of the rows, and that these
two percentages are independent of each other.  Normally, it would be
a slam-dunk to perform the B-C join first, but the Bloom filter
muddies the waters, because it might well be the case that what we
should really do is build the Bloom filter, apply it, then do the A-D
join, and do the A-(B-C) join last.  However, in order for the planner
to find that plan, it's going to need to be able to cost the nestloop
with an outer path that corresponds to a scan of A with the Bloom
filter applied.  And where is that path going to come from?

Note that it can't use that Bloom-filtered path for A unconditionally
because we might not opt to implement the A-(B-C) join as a hash join
at all; we have to build paths for NestLoop(A,D) and
NestLoop(Bloom(A),D) and we have to keep track of the fact that the
latter path has a dependency on a future hash join.  Things get
rapidly more complicated if you've got 10 or so different joins.  We
have to consider generate scan paths for the driving table with every
combination of Bloom filters that could be generated by subsequent
joins, and then all of those scan paths have to be carried through all
of the possible joinrels that can be generated later.  I think it's
not crazy to think that this could increase the cost of planning a
factor by 2^(# of joins).

Presumably some other method needs to be found, but I'm not clear what
that other method is.  Our planner is fundamentally bottom-up, and
it's not at all clear how to make it capable of changing its mind
later about something like the number of rows that a scan of A will
return, or how much that scan will cost.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheck heapam verification)

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 29, 2017 at 10:22 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> (2) We could push a Bloom filter down to scans
>> (many other databases do this, and at least one person has tried this
>> with PostgreSQL and found it to pay off[1]).

> I think the hard part is going to be figuring out a query planner
> framework for this, because pushing down the Bloom filter down to the
> scan changes the cost and the row-count of the scan.

Uh, why does the planner need to be involved at all?  This seems like
strictly an execution-time optimization.  Even if you wanted to try
to account for it in costing, I think the reliability of the estimate
would be nil, never mind any questions about whether the planner's
structure makes it easy to apply such an adjustment.

Personally though I would not bother with (2); I think (1) would
capture most of the win for a very small fraction of the complication.
Just for starters, I do not think (2) works for batched hashes.
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

От
Peter Geoghegan
Дата:
On Mon, Sep 18, 2017 at 10:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Aug 29, 2017 at 10:22 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> (2) We could push a Bloom filter down to scans
>>> (many other databases do this, and at least one person has tried this
>>> with PostgreSQL and found it to pay off[1]).
>
>> I think the hard part is going to be figuring out a query planner
>> framework for this, because pushing down the Bloom filter down to the
>> scan changes the cost and the row-count of the scan.
>
> Uh, why does the planner need to be involved at all?  This seems like
> strictly an execution-time optimization.  Even if you wanted to try
> to account for it in costing, I think the reliability of the estimate
> would be nil, never mind any questions about whether the planner's
> structure makes it easy to apply such an adjustment.

That is how I think about it too, though I'm not at all confident of
being on the right track. (I'm pretty confident that the general idea
of using a Bloom filter for parallel hash join is a good idea,
though.)

I should point out that Bloom filters are quite insensitive to
misestimations in the total number of elements, an estimate of which
must be provided up-front (I suppose that a cardinality estimate might
make more sense for hash join bloom filters than an estimate of the
total number of elements in a batch). "Bloom Filters in Probabilistic
Verification" gives this as a key advantage for bloom filters over
other approaches to formal model verification [1]. If you can afford
to have significant misestimations and still not lose much benefit,
and if the additional overhead of having a Bloom filter is fixed and
reasonably low cost, then maybe it makes sense to apply bloom filters
as an opportunistic or optimistic optimization. Perhaps they can be
applied when there is little to lose but much to gain.

[1] http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

От
Robert Haas
Дата:
On Mon, Sep 18, 2017 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Uh, why does the planner need to be involved at all?

Because it loses if the Bloom filter fails to filter anything.  That's
not at all far-fetched; consider SELECT * FROM a.x, b.x WHERE a.x =
b.x given a foreign key on a.x referencing b(x).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

От
Peter Geoghegan
Дата:
On Mon, Sep 18, 2017 at 2:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Sep 18, 2017 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Uh, why does the planner need to be involved at all?
>
> Because it loses if the Bloom filter fails to filter anything.  That's
> not at all far-fetched; consider SELECT * FROM a.x, b.x WHERE a.x =
> b.x given a foreign key on a.x referencing b(x).

Wouldn't a merge join be a lot more likely in this case anyway? Low
selectivity hash joins with multiple batches are inherently slow; the
wasted overhead of using a bloom filter may not matter.

Obviously this is all pretty speculative. I suspect that this could be
true, and it seems worth investigating that framing of the problem
first.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

От
Robert Haas
Дата:
On Mon, Sep 18, 2017 at 5:13 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Sep 18, 2017 at 2:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Sep 18, 2017 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Uh, why does the planner need to be involved at all?
>>
>> Because it loses if the Bloom filter fails to filter anything.  That's
>> not at all far-fetched; consider SELECT * FROM a.x, b.x WHERE a.x =
>> b.x given a foreign key on a.x referencing b(x).
>
> Wouldn't a merge join be a lot more likely in this case anyway? Low
> selectivity hash joins with multiple batches are inherently slow; the
> wasted overhead of using a bloom filter may not matter.
>
> Obviously this is all pretty speculative. I suspect that this could be
> true, and it seems worth investigating that framing of the problem
> first.

ISTR Tomas Vondra doing some experiments with this a few years ago and
finding that it was, in fact, a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

От
Tomas Vondra
Дата:
Hi,

On 09/19/2017 02:55 AM, Robert Haas wrote:
> On Mon, Sep 18, 2017 at 5:13 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> On Mon, Sep 18, 2017 at 2:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Mon, Sep 18, 2017 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Uh, why does the planner need to be involved at all?
>>>
>>> Because it loses if the Bloom filter fails to filter anything.  That's
>>> not at all far-fetched; consider SELECT * FROM a.x, b.x WHERE a.x =
>>> b.x given a foreign key on a.x referencing b(x).
>>
>> Wouldn't a merge join be a lot more likely in this case anyway? Low
>> selectivity hash joins with multiple batches are inherently slow; the
>> wasted overhead of using a bloom filter may not matter.
>>
>> Obviously this is all pretty speculative. I suspect that this could be
>> true, and it seems worth investigating that framing of the problem
>> first.
> 
> ISTR Tomas Vondra doing some experiments with this a few years ago and
> finding that it was, in fact, a problem.
> 

You seem to have better memory than me, but you're right - I did some
experiments with this in 2015, the WIP patch and discussion is here:
 https://www.postgresql.org/message-id/5670946E.8070705@2ndquadrant.com

The whole idea was that with a bloom filter we can reduce the amount of
tuples (from the outer relation) written to batches.

The patch is fairly simple, and did not try to push the bloom filters to
scan nodes or anything like that. It might be a meaningful first step,
though, particularly for selective joins (where only small number of
rows from the outer relation has a match in the hash table).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

От
Peter Geoghegan
Дата:
On Tue, Sep 19, 2017 at 6:28 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> The patch is fairly simple, and did not try to push the bloom filters to
> scan nodes or anything like that. It might be a meaningful first step,
> though, particularly for selective joins (where only small number of
> rows from the outer relation has a match in the hash table).

I believe that parallelism makes the use of Bloom filter a lot more
compelling, too. Obviously this is something that wasn't taken into
consideration in 2015.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

От
Tomas Vondra
Дата:
On 09/19/2017 06:03 PM, Peter Geoghegan wrote:
> On Tue, Sep 19, 2017 at 6:28 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> The patch is fairly simple, and did not try to push the bloom filters to
>> scan nodes or anything like that. It might be a meaningful first step,
>> though, particularly for selective joins (where only small number of
>> rows from the outer relation has a match in the hash table).
> 
> I believe that parallelism makes the use of Bloom filter a lot more 
> compelling, too. Obviously this is something that wasn't taken into 
> consideration in 2015.
> 

I haven't thought about it from that point of view. Can you elaborate
why that would be the case? Sorry if this was explained earlier in this
thread (I don't see it in the history, though).

I can't quite remember why I haven't pursued the patch in 2015, but it
was probably clear it wouldn't get in in the last CF, and I never got
back to it.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

От
Robert Haas
Дата:
On Tue, Sep 19, 2017 at 4:25 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I haven't thought about it from that point of view. Can you elaborate
> why that would be the case? Sorry if this was explained earlier in this
> thread (I don't see it in the history, though).
>
> I can't quite remember why I haven't pursued the patch in 2015, but it
> was probably clear it wouldn't get in in the last CF, and I never got
> back to it.

IIRC, it was a clear loser performance-wise in the case where the
Bloom filter didn't end up helping, and we didn't have a way to avoid
doing it when it didn't help.  That may or may not be why you didn't
pursue it, but I'm fairly sure it was my motivation for being
unexcited about the whole idea.  I think if we can solve that problem
somehow, we have a winner.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Boom filters for hash joins (was: A design for amcheckheapam verification)

От
Peter Geoghegan
Дата:
On Tue, Sep 19, 2017 at 1:25 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/19/2017 06:03 PM, Peter Geoghegan wrote:
>> I believe that parallelism makes the use of Bloom filter a lot more
>> compelling, too. Obviously this is something that wasn't taken into
>> consideration in 2015.
>>
>
> I haven't thought about it from that point of view. Can you elaborate
> why that would be the case? Sorry if this was explained earlier in this
> thread (I don't see it in the history, though).

Well, IPC and locking shared state to protect the state's structure is
potentially a big bottleneck for parallel hash join. I think that
Bloom filters were first used in distributed databases in the 1980s,
where a network round trip could be saved, which this is a little
like. That's why my guess is that Bloom filtering will be more
valuable when parallelism is used.

I think that right deep hash joins make this really compelling, if and
when they allow you to build multiple Bloom filters that can be
combined from multiple right deep hash table builds. I think you can
do fancy things like reduce the amount of I/O against a star schema
fact table considerably. You can use one Bloom filter (built against
some dimension table) to drive a bitmap index scan on a fact table
index, and then another Bloom filter (built against some other
dimension table) to drive another bitmap index scan. The executor then
does a bitmap AND to combine the two for a bitmap heap scan on the
fact table.

(Maybe this technique doesn't necessarily use a Bloom filter; it could
be some other type of bitmap.)

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers