Обсуждение: impact join syntax ?? and gist index ??

Поиск
Список
Период
Сортировка

impact join syntax ?? and gist index ??

От
Marc Millas
Дата:
Hi,

postgres 12, postgis 3.0

I have a small table A, 11 rows with a varchar column x and a geometry column y. 
gist index on the geometry column. 
the geometry do contains multipolygons (regions on a map)
I have a second table B , same structure, around 420 000 rows. 
no index, 
the geometry do contains points.
all geometries are on 4326 srid.

 If i ask to count points in each multipolygons:

select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;
it takes 11 seconds  (everything in shared buffers).
If I do the very same thing as:
select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by A.x;
same result, but 85 seconds (every thing in shared buffers, again)
if I redo asking with explain analyze, buffers, the plan is very different.


if I do create a gist index on geometry column of the big table, both syntax takes 21 seconds.

I get the feeling I am missing something.. (at least 2 things...)
can someone shed some light ??

thanks


Marc MILLAS
Senior Architect
+33607850334

Re: impact join syntax ?? and gist index ??

От
Marc Millas
Дата:
Yes, I know the 2 syntax provide a different result: one provides the 6 meaningful lines, the left join do add 5 lines with a count of 0...
...

Marc MILLAS
Senior Architect
+33607850334



On Sat, Jan 7, 2023 at 8:46 PM Marc Millas <marc.millas@mokadb.com> wrote:
Hi,

postgres 12, postgis 3.0

I have a small table A, 11 rows with a varchar column x and a geometry column y. 
gist index on the geometry column. 
the geometry do contains multipolygons (regions on a map)
I have a second table B , same structure, around 420 000 rows. 
no index, 
the geometry do contains points.
all geometries are on 4326 srid.

 If i ask to count points in each multipolygons:

select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;
it takes 11 seconds  (everything in shared buffers).
If I do the very same thing as:
select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by A.x;
same result, but 85 seconds (every thing in shared buffers, again)
if I redo asking with explain analyze, buffers, the plan is very different.


if I do create a gist index on geometry column of the big table, both syntax takes 21 seconds.

I get the feeling I am missing something.. (at least 2 things...)
can someone shed some light ??

thanks


Marc MILLAS
Senior Architect
+33607850334

Re: impact join syntax ?? and gist index ??

От
Marc Millas
Дата:
on postgres 15 and postgis 3.3, with the very same dataset, 
without gist index on the 420k rows table, the syntax with the left join takes 25 seconds, and without 770 ms.
so to get 5 empty lines its 30 times slower.
if I add the gist index, both syntaxes takes 770 ms...

at least, this close the discussion about the versions my project will use  :-)


Marc MILLAS
Senior Architect
+33607850334



On Sat, Jan 7, 2023 at 8:46 PM Marc Millas <marc.millas@mokadb.com> wrote:
Hi,

postgres 12, postgis 3.0

I have a small table A, 11 rows with a varchar column x and a geometry column y. 
gist index on the geometry column. 
the geometry do contains multipolygons (regions on a map)
I have a second table B , same structure, around 420 000 rows. 
no index, 
the geometry do contains points.
all geometries are on 4326 srid.

 If i ask to count points in each multipolygons:

select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;
it takes 11 seconds  (everything in shared buffers).
If I do the very same thing as:
select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by A.x;
same result, but 85 seconds (every thing in shared buffers, again)
if I redo asking with explain analyze, buffers, the plan is very different.


if I do create a gist index on geometry column of the big table, both syntax takes 21 seconds.

I get the feeling I am missing something.. (at least 2 things...)
can someone shed some light ??

thanks


Marc MILLAS
Senior Architect
+33607850334

Re: impact join syntax ?? and gist index ??

От
Erik Wienhold
Дата:
> On 07/01/2023 20:46 CET Marc Millas <marc.millas@mokadb.com> wrote:
>
> Hi,
>
> postgres 12, postgis 3.0
>
> I have a small table A, 11 rows with a varchar column x and a geometry column y.
> gist index on the geometry column.
> the geometry do contains multipolygons (regions on a map)
> I have a second table B , same structure, around 420 000 rows.
> no index,
> the geometry do contains points.
> all geometries are on 4326 srid.
>
> If i ask to count points in each multipolygons:
>
> select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;
> it takes 11 seconds (everything in shared buffers).
> If I do the very same thing as:
> select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by A.x;
> same result, but 85 seconds (every thing in shared buffers, again)
> if I redo asking with explain analyze, buffers, the plan is very different.
>
>
> if I do create a gist index on geometry column of the big table, both syntax takes 21 seconds.
>
> I get the feeling I am missing something.. (at least 2 things...)
> can someone shed some light ??

Please provide the executions plans for both queries with and without the index on B.y.

--
Erik



Re: impact join syntax ?? and gist index ??

От
Marc Millas
Дата:
here they are: (I replace the column and table names) also I post 2 more remarks, one on left join, and one on the test I did on postgres 15 postgis 3.3...

2023=# explain (analyze, buffers) select A.x, count(B.x) from A left join B on st_within(B.geom, A.geom) group by A.x;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=212638398.98..212701792.16 rows=20 width=16) (actual time=86717.857..86757.820 rows=11 loops=1)
   Group Key: A.x
   Buffers: shared hit=4243867
   ->  Sort  (cost=212638398.98..212659529.97 rows=8452398 width=16) (actual time=86717.851..86727.334 rows=421307 loops=1)
         Sort Key: A.x
         Sort Method: quicksort  Memory: 37963kB
         Buffers: shared hit=4243867
         ->  Nested Loop Left Join  (cost=0.00..211521459.31 rows=8452398 width=16) (actual time=17.473..86642.332 rows=421307 loops=1)
               Join Filter: st_within(B.geom, A.geom)
               Rows Removed by Join Filter: 4229377
               Buffers: shared hit=4243867
               ->  Seq Scan on A  (cost=0.00..9.20 rows=20 width=17752) (actual time=0.009..0.043 rows=11 loops=1)
                     Buffers: shared hit=9
               ->  Materialize  (cost=0.00..22309.83 rows=422789 width=40) (actual time=0.001..23.392 rows=422789 loops=11)
                     Buffers: shared hit=15968
                     ->  Seq Scan on B  (cost=0.00..20195.89 rows=422789 width=40) (actual time=0.006..57.651 rows=422789 loops=1)
                           Buffers: shared hit=15968
 Planning Time: 0.693 ms
 Execution Time: 86763.087 ms
(19 lignes)


2023=# explain (analyze, buffers) select A.x, count(B.x) from A, B where st_within(B.geom, A.geom) group by A.x;
                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=6301606.00..6301608.60 rows=20 width=16) (actual time=11857.363..11863.212 rows=6 loops=1)
   Group Key: A.x
   Buffers: shared hit=2128836
   ->  Gather Merge  (cost=6301606.00..6301608.30 rows=20 width=16) (actual time=11857.359..11863.207 rows=12 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         Buffers: shared hit=2128836
         ->  Sort  (cost=6300605.99..6300606.04 rows=20 width=16) (actual time=11840.355..11840.356 rows=6 loops=2)
               Sort Key: A.x
               Sort Method: quicksort  Memory: 25kB
               Worker 0:  Sort Method: quicksort  Memory: 25kB
               Buffers: shared hit=2128836
               ->  Partial HashAggregate  (cost=6300605.36..6300605.56 rows=20 width=16) (actual time=11840.331..11840.332 rows=6 loops=2)
                     Group Key: A.x
                     Buffers: shared hit=2128825
                     ->  Nested Loop  (cost=0.13..6275745.36 rows=4971999 width=16) (actual time=0.505..11781.817 rows=210651 loops=2)
                           Buffers: shared hit=2128825
                           ->  Parallel Seq Scan on B  (cost=0.00..18454.99 rows=248699 width=40) (actual time=0.005..22.859 rows=211395 loops=2)
                                 Buffers: shared hit=15968
                           ->  Index Scan using A_geom_idx on A  (cost=0.13..25.15 rows=1 width=17752) (actual time=0.054..0.055 rows=1 loops=422789)
                                 Index Cond: (geom ~ B.geom)
                                 Filter: st_within(B.geom, geom)
                                 Rows Removed by Filter: 0
                                 Buffers: shared hit=2112857
 Planning Time: 0.252 ms
 Execution Time: 11863.357 ms
(26 lignes)




Marc MILLAS
Senior Architect
+33607850334



On Sat, Jan 7, 2023 at 9:40 PM Erik Wienhold <ewie@ewie.name> wrote:
> On 07/01/2023 20:46 CET Marc Millas <marc.millas@mokadb.com> wrote:
>
> Hi,
>
> postgres 12, postgis 3.0
>
> I have a small table A, 11 rows with a varchar column x and a geometry column y.
> gist index on the geometry column.
> the geometry do contains multipolygons (regions on a map)
> I have a second table B , same structure, around 420 000 rows.
> no index,
> the geometry do contains points.
> all geometries are on 4326 srid.
>
> If i ask to count points in each multipolygons:
>
> select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;
> it takes 11 seconds (everything in shared buffers).
> If I do the very same thing as:
> select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by A.x;
> same result, but 85 seconds (every thing in shared buffers, again)
> if I redo asking with explain analyze, buffers, the plan is very different.
>
>
> if I do create a gist index on geometry column of the big table, both syntax takes 21 seconds.
>
> I get the feeling I am missing something.. (at least 2 things...)
> can someone shed some light ??

Please provide the executions plans for both queries with and without the index on B.y.

--
Erik

Re: impact join syntax ?? and gist index ??

От
Tom Lane
Дата:
Marc Millas <marc.millas@mokadb.com> writes:
> 2023=# explain (analyze, buffers) select A.x, count(B.x) from A left join B
> on st_within(B.geom, A.geom) group by A.x;

So the problem with this is that the only decently-performant way to
do the join is like

>                      ->  Nested Loop  (cost=0.13..6275745.36 rows=4971999 width=16) (actual time=0.505..11781.817
rows=210651loops=2) 
>                            ->  Parallel Seq Scan on B  (cost=0.00..18454.99 rows=248699 width=40) (actual
time=0.005..22.859rows=211395 loops=2) 
>                            ->  Index Scan using A_geom_idx on A (cost=0.13..25.15 rows=1 width=17752) (actual
time=0.054..0.055rows=1 loops=422789) 
>                                  Index Cond: (geom ~ B.geom)
>                                  Filter: st_within(B.geom, geom)

(Ignore the parallelism, it's not very relevant here.)  There's no
chance for merge or hash join because those require simple equality
join conditions.  The only way to avoid a stupid
compare-every-row-of-A-to-every-row-of-B nestloop is to use a
parameterized inner indexscan, as this plan does.  But that only works
if the join is inner or has the indexed table on the nullable side.
We have no support for nestloop right join, which is what would be
needed to make things run fast with no index on B.

            regards, tom lane



Re: impact join syntax ?? and gist index ??

От
Marc Millas
Дата:
I read your answer, Tom, but I cannot connect it to my measurements: why adding the index did slow the request twice ??

Marc MILLAS
Senior Architect
+33607850334



On Sat, Jan 7, 2023 at 10:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Marc Millas <marc.millas@mokadb.com> writes:
> 2023=# explain (analyze, buffers) select A.x, count(B.x) from A left join B
> on st_within(B.geom, A.geom) group by A.x;

So the problem with this is that the only decently-performant way to
do the join is like

>                      ->  Nested Loop  (cost=0.13..6275745.36 rows=4971999 width=16) (actual time=0.505..11781.817 rows=210651 loops=2)
>                            ->  Parallel Seq Scan on B  (cost=0.00..18454.99 rows=248699 width=40) (actual time=0.005..22.859 rows=211395 loops=2)
>                            ->  Index Scan using A_geom_idx on A (cost=0.13..25.15 rows=1 width=17752) (actual time=0.054..0.055 rows=1 loops=422789)
>                                  Index Cond: (geom ~ B.geom)
>                                  Filter: st_within(B.geom, geom)

(Ignore the parallelism, it's not very relevant here.)  There's no
chance for merge or hash join because those require simple equality
join conditions.  The only way to avoid a stupid
compare-every-row-of-A-to-every-row-of-B nestloop is to use a
parameterized inner indexscan, as this plan does.  But that only works
if the join is inner or has the indexed table on the nullable side.
We have no support for nestloop right join, which is what would be
needed to make things run fast with no index on B.

                        regards, tom lane

Re: impact join syntax ?? and gist index ??

От
Tom Lane
Дата:
Marc Millas <marc.millas@mokadb.com> writes:
> I read your answer, Tom, but I cannot connect it to my measurements: why
> adding the index did slow the request twice ??

Are you referring to

>>> if I do create a gist index on geometry column of the big table,
>>> both syntax takes 21 seconds.

?  That result is pretty much what I'd expect.  If the planner has
to form a nestloop-with-inner-indexscan between a small table and
a big one, it's pretty much always going to prefer to put the small
table on the outside of the loop if it can.  The cost of such a
loop is going to be more or less number of outer rows times the
log of the number of inner rows (assuming at great risk of
oversimplification that an index probe into a table of size N
requires about O(log N) work), and it's not hard to see that
S * log(B) is less than B * log(S).  You seem to have lucked into a
case where the other way comes out faster despite that, which perhaps
can be explained by buffering effects, but it's not something to bet
on across-the-board.

(If you see this holding consistently, maybe it'd be advisable to
reduce the planner's effective_cache_size parameter to something
closer to shared_buffers, as it seems to indicate that fetches
from kernel space are pretty expensive on your platform.)

            regards, tom lane