Re: [HACKERS] Bloom filters for hash joins (was: A design for amcheck heapam verification)

From:    Robert Haas
Date:
Msg-id:  CA+TgmoaqvJCkeUbhh0x3GLHJ5uY8xmqLbncjSB3UKavxvAzjnQ@mail.gmail.com
Replies: Re: [HACKERS] Bloom filters for hash joins (was: A design for amcheck heapam verification)
         (Tom Lane <tgl@sss.pgh.pa.us>)
List:    pgsql-hackers
On Tue, Aug 29, 2017 at 10:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Indeed.  Thank you for working on this!  To summarise a couple of
> ideas that Peter and I discussed off-list a while back: (1) While
> building the hash table for a hash join we could build a Bloom filter
> per future batch and keep it in memory, and then while reading from
> the outer plan we could skip writing tuples out to future batches if
> there is no chance they'll find a match when read back in later (works
> only for inner joins and only pays off in inverse proportion to the
> join's selectivity);

You could try to use this with LEFT JOINs also, I think.  If you find
that a row can't match later, don't write it to the batch file;
instead, return a null-extended row immediately.  (A toy sketch of this
batch-skipping idea appears at the end of this mail.)

> (2) We could push a Bloom filter down to scans
> (many other databases do this, and at least one person has tried this
> with PostgreSQL and found it to pay off[1]).

I think the hard part is going to be figuring out a query planner
framework for this, because pushing the Bloom filter down to the scan
changes both the cost and the row count of the scan.  This is
essentially a less-tractable version of the problem of deciding how
many parallel workers to use; in that case, at least, the number of
workers must be an integer, so we could resort to trying them all, if
the costing model were otherwise good enough.  But here the thing that
matters is selectivity, which is continuous rather than discrete.
Consider:

SELECT *
FROM A
LEFT JOIN (B INNER JOIN C ON B.x = C.x) ON A.y = B.y
LEFT JOIN D ON A.z > D.z;

Since the A-D join is not an equijoin, we'll have to implement it as a
nested loop; a case could also arise where a nested loop is the only
efficient strategy even for an equijoin, for example if D is huge.
Now, suppose the join to B-C is going to knock out 90% of the rows and
the join to D is going to knock out 50% of the rows, and that these
two percentages are independent of each other.  Normally, it would be
a slam-dunk to perform the B-C join first, but the Bloom filter muddies
the waters, because it might well be the case that what we should
really do is build the Bloom filter, apply it, then do the A-D join,
and do the A-(B-C) join last.  (Worked numbers for this example appear
at the end of this mail.)  However, in order for the planner to find
that plan, it's going to need to be able to cost the nested loop with
an outer path that corresponds to a scan of A with the Bloom filter
applied.  And where is that path going to come from?  Note that it
can't use that Bloom-filtered path for A unconditionally, because we
might not opt to implement the A-(B-C) join as a hash join at all; we
have to build paths for NestLoop(A, D) and NestLoop(Bloom(A), D), and
we have to keep track of the fact that the latter path has a dependency
on a future hash join.

Things get rapidly more complicated if you've got 10 or so different
joins.  We'd have to generate scan paths for the driving table with
every combination of Bloom filters that could be generated by
subsequent joins, and then all of those scan paths have to be carried
through all of the possible joinrels that can be generated later.  I
think it's not crazy to think that this could increase the cost of
planning by a factor of 2^(# of joins).

Presumably some other method needs to be found, but I'm not clear what
that other method is.  Our planner is fundamentally bottom-up, and it's
not at all clear how to make it capable of changing its mind later
about something like the number of rows that a scan of A will return,
or how much that scan will cost.
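To show the per-batch filter idea end to end, here is a self-contained
toy sketch in C.  It is not executor code, and every name in it
(bloom_add, bloom_test, NBATCH, and so on) is invented for the
illustration; a real patch would size the filters from the planner's
estimates and use the join's actual hash and batching logic.

/*
 * Toy sketch of idea (1): one Bloom filter per future batch, built
 * from the inner side, consulted before spilling outer tuples.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBATCH      4
#define BLOOM_BITS  4096    /* bits per per-batch filter */
#define BLOOM_K     2       /* bit probes per key */

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom;

/* Cheap 64-bit mixer standing in for a real hash function. */
static uint64_t hash64(uint64_t x)
{
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

static void bloom_add(bloom *b, uint64_t key)
{
    uint64_t h = hash64(key);
    for (int i = 0; i < BLOOM_K; i++)
    {
        uint32_t bit = (uint32_t) (h >> (i * 16)) % BLOOM_BITS;
        b->bits[bit / 8] |= 1 << (bit % 8);
    }
}

/* May return a false positive, but never a false negative. */
static int bloom_test(const bloom *b, uint64_t key)
{
    uint64_t h = hash64(key);
    for (int i = 0; i < BLOOM_K; i++)
    {
        uint32_t bit = (uint32_t) (h >> (i * 16)) % BLOOM_BITS;
        if (!(b->bits[bit / 8] & (1 << (bit % 8))))
            return 0;
    }
    return 1;
}

int main(void)
{
    bloom batch_filter[NBATCH];
    int spilled = 0, skipped = 0;

    memset(batch_filter, 0, sizeof(batch_filter));

    /* Build side: route each inner key to its batch and remember it
     * in that batch's in-memory filter. */
    for (uint64_t key = 0; key < 1000; key += 3)
        bloom_add(&batch_filter[hash64(key) % NBATCH], key);

    /* Outer side: before spilling a tuple to a future batch, test the
     * filter.  A miss means it cannot match when read back later. */
    for (uint64_t key = 0; key < 1000; key++)
    {
        int batchno = (int) (hash64(key) % NBATCH);

        if (batchno == 0)
            continue;       /* probed in memory, never spilled */
        if (bloom_test(&batch_filter[batchno], key))
            spilled++;      /* write to the batch file as usual */
        else
            skipped++;      /* inner join: drop the tuple; LEFT
                             * JOIN: emit it null-extended now */
    }
    printf("spilled=%d skipped=%d\n", spilled, skipped);
    return 0;
}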
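And to put rough numbers on the join-ordering example, taking the
stated selectivities at face value (all figures invented for
illustration: 1,000,000 rows in A, a 1% Bloom false-positive rate):

  A                                 1,000,000 rows
  after Bloom filter for A-(B-C)     ~109,000 rows  (100,000 matches
                                                     + ~9,000 false
                                                     positives)
  after NestLoop(Bloom(A), D)         ~54,500 rows
  after the real A-(B-C) hash join    ~50,000 rows  (false positives
                                                     removed)

The full hash join, with its batching and spilling, now only has to
process ~54,500 rows instead of 1,000,000; that is exactly the plan
whose Bloom-filtered scan path the bottom-up planner has no way to
generate and cost today.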
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company