Обсуждение: [HACKERS] Merge join for GiST
Hello, hackers! ==Spatial joins== Scientific papers from the dawn of R-trees and multidimensional indexes feature a lot of algorithms for spatial joins. I.e. you have two sets of geometries s1 and s2, you need to produce all colliding pairs (p1,p2) where p1 in s1 and p2 in s2. For 2 R-trees of equal heights with roots R and S most straightforward[1] algorithm look like: SpatialJoin (R,S: Node) { FOR (all E in S) FOR (all ER in R with ER.rect intersecting E.rect) IF (R is a leaf page) { OutputER,ES } ELSE { SpatialJoin (ER.ref, ES.ref) } } ==Motivation== I’m mostly interested in OLAP-style aggregates like: SELECT rows.Id, sum(data.attribute1), sum(data.attribute2), sum(data.attribute3) FROM rows LEFT JOIN data ON rows.IndexedAttribute && data.IndexedAttribute GROUP BY 1 When we have two GiST for rows.IndexedAttribute and data.IndexedAttribute. Currently, this query produces plans with Nested Loop Join, so for every row from “rows” there is one GiST scan into “data”. ==Proposal== Obviously, for this scenario, we can use GiST-based join algorithm identical to the SpatialJoin algorithm above. This algorithm will work iif (key1 && key2) ==> (union(key1,key3) && key2 ) and (union(key2,key3) && key1 ) for any key3. This is quite straightforward for && operator, while I’m not sure this will work for <@ and @> and others. I can try to implement this kind of join as an extension or try to embed this into the planner. I think this idea is somewhat related to this patch [2], but as for now cannot describe how exactly GiST merge and Range Merge features relate. How do you think: 1. Has this idea been considered before? I had not found anything on actual spatial join of GiSTS. 2. What is the easiest way to implement this feature? 3. What kind of operators may be spatial-joined without intervention to existing GiST scan strategies API? Many thanks for reading. Best regards, Andrey Borodin. [1] Brinkhoff, Thomas, Hans-Peter Kriegel, and Bernhard Seeger. Efficient processing of spatial joins using R-trees. Vol. 22. No. 2. ACM, 1993. [2] https://www.postgresql.org/message-id/flat/CAMp0ubfwAFFW3O_NgKqpRPmm56M4weTEXjprb2gP_NrDaEC4Eg%40mail.gmail.com#CAMp0ubfwAFFW3O_NgKqpRPmm56M4weTEXjprb2gP_NrDaEC4Eg@mail.gmail.com
On Mon, Apr 10, 2017 at 7:53 AM, Andrew Borodin <borodin@octonica.com> wrote: > I think this idea is somewhat related to this patch [2], but as for > now cannot describe how exactly GiST merge and Range Merge features > relate. It also seems somewhat related to Peter Moser's work on ALIGN and NORMALIZE. It would be nice if the various groups of people interested in improving PostgreSQL's spatial stuff got together and reviewed each others' patches. As a non-spatial guy myself, it's pretty hard to decide on the relative merits of different proposed approaches. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2017-04-10 20:38 GMT+05:00 Robert Haas <robertmhaas@gmail.com>: > On Mon, Apr 10, 2017 at 7:53 AM, Andrew Borodin <borodin@octonica.com> wrote: >> I think this idea is somewhat related to this patch [2], but as for >> now cannot describe how exactly GiST merge and Range Merge features >> relate. > > It also seems somewhat related to Peter Moser's work on ALIGN and > NORMALIZE. It would be nice if the various groups of people > interested in improving PostgreSQL's spatial stuff got together and > reviewed each others' patches. As a non-spatial guy myself, it's > pretty hard to decide on the relative merits of different proposed > approaches. Hi, Robert! Thank you for the pointer. Temporal features are not exactly within my scope, but you are right, topics are close to each other. I'll look into the patch with temporal features and assess whether I can provide a meaningful review. I do not know how to gather the attention of all how is interested in this kind of features. I read hackers@ digest regularly, used search a lot, but that temporal work slipped away from my attention. Best regards, Andrey Borodin.
On Mon, Apr 10, 2017 at 2:53 PM, Andrew Borodin <borodin@octonica.com> wrote:
==Spatial joins==
Scientific papers from the dawn of R-trees and multidimensional
indexes feature a lot of algorithms for spatial joins.
I.e. you have two sets of geometries s1 and s2, you need to produce
all colliding pairs (p1,p2) where p1 in s1 and p2 in s2. For 2 R-trees
of equal heights with roots R and S most straightforward[1] algorithm
look like:
SpatialJoin (R,S: Node)
{
FOR (all E in S)
FOR (all ER in R with ER.rect intersecting E.rect)
IF (R is a leaf page)
{
Output ER,ES
}
ELSE
{
SpatialJoin (ER.ref, ES.ref)
}
}
FYI, I've implemented this algorithm for pgsphere. See following branch.
https://github.com/akorotkov/pgsphere/tree/experimental
It's implemented as crossmatch() function which takes as arguments names of two indexes over spoint and maximum distance (it checks not overlapping but proximity of points). This function returns setof pairs of TIDs.
It's implemented as crossmatch() function which takes as arguments names of two indexes over spoint and maximum distance (it checks not overlapping but proximity of points). This function returns setof pairs of TIDs.
Later, Dmitry Ivanov made it a custom scan node.
You also can find some experimental evaluation here:
Feel free to experiment with this code and/or ask any question.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Apr 11, 2017 at 2:17 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > FYI, I've implemented this algorithm for pgsphere. See following branch. > https://github.com/akorotkov/pgsphere/tree/experimental > It's implemented as crossmatch() function which takes as arguments names of > two indexes over spoint and maximum distance (it checks not overlapping but > proximity of points). This function returns setof pairs of TIDs. > > Later, Dmitry Ivanov made it a custom scan node. > https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode > > You also can find some experimental evaluation here: > http://www.adass2016.inaf.it/images/presentations/10_Korotkov.pdf Do you have a sense of how this might compare with range merge join? Regards, Jeff Davis
On Tue, Apr 11, 2017 at 5:46 PM, Jeff Davis <pgsql@j-davis.com> wrote:
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Apr 11, 2017 at 2:17 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> FYI, I've implemented this algorithm for pgsphere. See following branch.
> https://github.com/akorotkov/pgsphere/tree/experimental
> It's implemented as crossmatch() function which takes as arguments names of
> two indexes over spoint and maximum distance (it checks not overlapping but
> proximity of points). This function returns setof pairs of TIDs.
>
> Later, Dmitry Ivanov made it a custom scan node.
> https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode
>
> You also can find some experimental evaluation here:
> http://www.adass2016.inaf.it/images/presentations/10_ Korotkov.pdf
Do you have a sense of how this might compare with range merge join?
If you have GiST indexes over ranges for both sides of join, then this method could be used for range join. Hence, it could be compared with range merge join.
However, particular implementation in pgsphere uses hardcoded datatypes and operations.
Thus, for range join we need either generalized version of GiST-based join or special implementation for ranges.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
2017-04-11 14:17 GMT+05:00 Alexander Korotkov <a.korotkov@postgrespro.ru>: > FYI, I've implemented this algorithm for pgsphere. See following branch. > https://github.com/akorotkov/pgsphere/tree/experimental > It's implemented as crossmatch() function which takes as arguments names of > two indexes over spoint and maximum distance (it checks not overlapping but > proximity of points). This function returns setof pairs of TIDs. > > Later, Dmitry Ivanov made it a custom scan node. > https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode > > You also can find some experimental evaluation here: > http://www.adass2016.inaf.it/images/presentations/10_Korotkov.pdf > > Feel free to experiment with this code and/or ask any question. Wow, that's cool! Thanks! That code actually answers a lot of questions. I'll try to generalize that for && operator Best regards, Andrey Borodin.
Thank you, Alexander!
This is definitely the example we are looking for!On Tue, Apr 11, 2017 at 2:17 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Mon, Apr 10, 2017 at 2:53 PM, Andrew Borodin <borodin@octonica.com> wrote:==Spatial joins==
Scientific papers from the dawn of R-trees and multidimensional
indexes feature a lot of algorithms for spatial joins.
I.e. you have two sets of geometries s1 and s2, you need to produce
all colliding pairs (p1,p2) where p1 in s1 and p2 in s2. For 2 R-trees
of equal heights with roots R and S most straightforward[1] algorithm
look like:
SpatialJoin (R,S: Node)
{
FOR (all E in S)
FOR (all ER in R with ER.rect intersecting E.rect)
IF (R is a leaf page)
{
Output ER,ES
}
ELSE
{
SpatialJoin (ER.ref, ES.ref)
}
}FYI, I've implemented this algorithm for pgsphere. See following branch.https://github.com/akorotkov/pgsphere/tree/experimental
It's implemented as crossmatch() function which takes as arguments names of two indexes over spoint and maximum distance (it checks not overlapping but proximity of points). This function returns setof pairs of TIDs.Later, Dmitry Ivanov made it a custom scan node.You also can find some experimental evaluation here:Feel free to experiment with this code and/or ask any question.------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
--
--Regards, Sergey Mirvoda
On Tue, Apr 11, 2017 at 8:35 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > On Tue, Apr 11, 2017 at 5:46 PM, Jeff Davis <pgsql@j-davis.com> wrote: >> Do you have a sense of how this might compare with range merge join? > > > If you have GiST indexes over ranges for both sides of join, then this > method could be used for range join. Hence, it could be compared with range > merge join. > However, particular implementation in pgsphere uses hardcoded datatypes and > operations. > Thus, for range join we need either generalized version of GiST-based join > or special implementation for ranges. Alexander, Andrew, How do you think we should proceed? Which projects do you think should eventually be in core, versus which are fine as extensions? Regards, Jeff Davis
2017-04-13 7:01 GMT+05:00 Jeff Davis <pgsql@j-davis.com>: > On Tue, Apr 11, 2017 at 8:35 AM, Alexander Korotkov > <a.korotkov@postgrespro.ru> wrote: >> On Tue, Apr 11, 2017 at 5:46 PM, Jeff Davis <pgsql@j-davis.com> wrote: >>> Do you have a sense of how this might compare with range merge join? >> >> >> If you have GiST indexes over ranges for both sides of join, then this >> method could be used for range join. Hence, it could be compared with range >> merge join. >> However, particular implementation in pgsphere uses hardcoded datatypes and >> operations. >> Thus, for range join we need either generalized version of GiST-based join >> or special implementation for ranges. > > Alexander, Andrew, > > How do you think we should proceed? Which projects do you think should > eventually be in core, versus which are fine as extensions? Some points in favor of Range joins via nbtree: 1. It's more efficient than B-tree over GiST 2. It is already in a patch form Point against: 1. Particular implementation is somewhat leaked abstraction. Inside the planner, you check not for capabilities of operator and type, but for specific op and type. But I do not know how to fix this. So, here is my opinion: if we can inside the planner understand that join condition involves range specifics (for all available ranges) and do Range Merge Join, then this is preferred solution. Yes, Spatial Join, when available, will provide roughly the same scan performance. But B-trees are more well known to users new to PostgreSQL, and B-trees are faster. Best regards, Andrey Borodin.
On Wed, Apr 12, 2017 at 10:44 PM, Andrew Borodin <borodin@octonica.com> wrote: >> How do you think we should proceed? Which projects do you think should >> eventually be in core, versus which are fine as extensions? > > Some points in favor of Range joins via nbtree: My patch doesn't require indexes, it can sort the input (and the 5X improvement that I got included the effort to sort). In fact, I expect using sort will be more common than maintaining btree indexes on a range column. > 1. It's more efficient than B-tree over GiST > 2. It is already in a patch form > > Point against: > 1. Particular implementation is somewhat leaked abstraction. Inside > the planner, you check not for capabilities of operator and type, but > for specific op and type. But I do not know how to fix this. It's not a specific type, it's the "anyrange" type, so you can define new range types to take advantage of range merge join. I can drive the planner changes from the catalog rather than hard-coded OIDs if we think range merge join is a good solution. > So, here is my opinion: if we can inside the planner understand that > join condition involves range specifics (for all available ranges) and > do Range Merge Join, then this is preferred solution. Yes, I can do that. > Yes, Spatial Join, when available, will provide roughly the same scan > performance. But B-trees are more well known to users new to > PostgreSQL, and B-trees are faster. I don't quite follow. I don't think any of these proposals uses btree, right? Range merge join doesn't need any index, your proposal uses gist, and PgSphere's crossmatch uses gist. Regards, Jeff Davis
2017-04-13 11:30 GMT+05:00 Jeff Davis <pgsql@j-davis.com>: > I don't quite follow. I don't think any of these proposals uses btree, > right? Range merge join doesn't need any index, your proposal uses > gist, and PgSphere's crossmatch uses gist. Merge join will use presorted data, B-tree provides sorted data. Merge Join cannot join non-sorted data, can it? Indeed, B-tree is not the only sorted data provider. In this sight, Range Merge join is even more generic. Best regards, Andrey Borodin.
Hi, hackers! I've adapted crossmatch join from pgSphere to cube for performance tests. I've placed spatial join code here https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/spatialjoin.c and node code here https://github.com/Octonica/postgres/blob/spatialjoin/contrib/cube/joinnode.c If you have an idea of improving the performance of this code, please do not hesitate to express them. One of the performance bottlenecks is code nearby heap_getattr() in fetch_next_pair(). ==Performance Tests== I've tested performance on queries which aggregate result of the spatial join. See cube_test.sql attached. On 3d data, Spatial Join performs 3x faster than Nested Loop Join + GiST Scan Nested Loop + Index Scan plan: HashAggregate (cost=36841568.00..36841570.00 rows=200 width=40) (actual time=206565.869..206738.307 rows=298731 loops=1) Group Key: r.nx -> Nested Loop (cost=0.41..25591568.00 rows=900000000 width=40) (actual time=0.357..200838.416 rows=8464147 loops=1) -> Seq Scan on regions r (cost=0.00..6410.00 rows=300000 width=40) (actual time=0.015..324.436 rows=300000 loops=1) -> Index Scan using idx on datatable a (cost=0.41..55.28 rows=3000 width=64) (actual time=0.174..0.648 rows=28 loops=300000) Index Cond: (r.c @> c) Planning time: 17.175 ms Execution time: 206806.926 ms Time: 206852,635 ms (03:26,853) Spatial Join plan: HashAggregate (cost=56250001.00..56250003.00 rows=200 width=40) (actual time=67373.644..67553.118 rows=298731 loops=1) Group Key: r.nx -> Custom Scan (SpatialJoin) (cost=0.00..1.00 rows=4500000000 width=40) (actual time=0.151..61718.804 rows=8464147 loops=1) Outer index: idx Inner index: idx1 Planning time: 0.182 ms Execution time: 67630.742 ms Time: 67631,557 ms (01:07,632) But on more realistic 7D data with queries emulating OLAP system performance of Spatial Join is 2 times worse than Nested Loop Join + GiST Scan. Which comes as a complete surprise to me. I do not see any algorithmic reason for Spatial Join to be slower. Thus I strongly suspect that my implementation is not efficient, but as for now I have no ideas how to improve it. Here are plans for 7D Nested Loop + Index Scan HashAggregate (cost=3425143.00..3425743.00 rows=60000 width=72) (actual time=122794.715..122822.893 rows=60000 loops=1) Group Key: r.nx -> Nested Loop (cost=0.41..2075143.00 rows=60000000 width=72) (actual time=0.311..100478.710 rows=39817008 loops=1) -> Seq Scan on r1 r (cost=0.00..2419.00 rows=60000 width=128) (actual time=0.043..60.579 rows=60000 loops=1) -> Index Scan using ix_a1_cube on a1 a (cost=0.41..24.55 rows=1000 width=128) (actual time=0.110..1.266 rows=664 loops=60000) Index Cond: (c <@ r.c) Planning time: 0.349 ms Execution time: 122831.353 ms (8 rows) Spatial Join HashAggregate (cost=6750001.00..6750601.00 rows=60000 width=72) (actual time=241832.855..241889.360 rows=60000 loops=1) Group Key: r.nx -> Custom Scan (SpatialJoin) (cost=0.00..1.00 rows=300000000 width=72) (actual time=0.140..216187.111 rows=39817008 loops=1) Outer index: ix_r1_cube Inner index: ix_a1_cube Planning time: 0.533 ms Execution time: 241907.569 ms (7 rows) Time: 241910,440 ms (04:01,910) Any ideas will be highly appreciated. Best regards, Andrey Borodin. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers