Thread: Table with many NULLS for indexed column yields strange query plan


Table with many NULLS for indexed column yields strange query plan

From:
greigwise
Date:
I have a query like this:

SELECT  "table1".* FROM "table1"
INNER JOIN "table2" ON "table2"."table1_id" = "table1"."id"
INNER JOIN "table3" ON "table3"."id" = "table2"."table3_id" WHERE
"table3"."number" = 'xxxx'
AND ("table2"."type") IN ('Standard') ;

table2 has a large number of NULLS in the column table3_id.  There is an
index on this column.  Here is the result of explain analyze:

Merge Join  (cost=1001.20..4076.67 rows=17278 width=167) (actual time=284.918..300.167 rows=2244 loops=1)
   Merge Cond: (table2.table3_id = table3.id)
   ->  Gather Merge  (cost=1000.93..787825.78 rows=621995 width=175) (actual time=5.786..283.269 rows=64397 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         ->  Nested Loop  (cost=0.87..712740.12 rows=155499 width=175) (actual time=0.091..102.708 rows=13107 loops=5)
               ->  Parallel Index Scan using index_table2_on_table3_id on table2  (cost=0.43..489653.08 rows=155499 width=16) (actual time=0.027..22.327 rows=13107 loops=5)
                     Filter: ((type)::text = 'Standard'::text)
               ->  Index Scan using table1_pk on table1  (cost=0.44..1.43 rows=1 width=167) (actual time=0.005..0.005 rows=1 loops=65535)
                     Index Cond: (id = table2.table1_id)
   ->  Index Scan using table3_pkey on table3  (cost=0.27..53.40 rows=1 width=8) (actual time=0.041..0.048 rows=1 loops=1)
         Filter: ((number)::text = 'xxxx'::text)
         Rows Removed by Filter: 35
 Planning time: 0.450 ms
 Execution time: 310.230 ms

You can see the row estimate there is way off on the Parallel Index Scan.
I suspect that this is because it's including the rows with NULL in the
selectivity estimate, even though table3_id can't possibly be NULL here
due to the inner join.
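A rough back-of-the-envelope sketch of what I mean (every number below is
hypothetical, made up just to illustrate the shape of the problem, not taken
from the real table):

```python
# Hypothetical illustration only -- the table size, null_frac, and
# n_distinct here are invented.  The point: NULLs can never satisfy an
# inner join condition, so a per-value estimate based on the full row
# count is inflated when almost all rows are NULL.
total_rows = 16_800_000   # assumed size of table2
null_frac = 0.996         # assumed fraction of NULL table3_id values
n_distinct = 39           # assumed distinct non-NULL table3_id values

# Rows that can actually participate in the join:
joinable = total_rows * (1 - null_frac)

naive_per_value = total_rows / n_distinct    # ignores the NULLs entirely
null_aware_per_value = joinable / n_distinct

print(round(joinable))              # ~67,000 joinable rows
print(round(naive_per_value))       # ~430,000 -- a wild overestimate
print(round(null_aware_per_value))  # ~1,700 -- the right ballpark
```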

If I modify the query like this:

SELECT  "table1".* FROM "table1"
INNER JOIN "table2" ON "table2"."table1_id" = "table1"."id"
INNER JOIN "table3" ON "table3"."id" = "table2"."table3_id" WHERE
"table3"."number" = 'xxxx'
AND ("table2"."type") IN ('Standard') and table3_id is not null;

Just adding in table3_id is not null at the end there, I get a much better
plan.

Nested Loop  (cost=1.14..290.04 rows=66 width=167) (actual time=0.058..11.258 rows=2244 loops=1)
   ->  Nested Loop  (cost=0.70..64.46 rows=66 width=8) (actual time=0.049..2.873 rows=2244 loops=1)
         ->  Index Scan using table3_pkey on table3  (cost=0.27..53.40 rows=1 width=8) (actual time=0.030..0.035 rows=1 loops=1)
               Filter: ((number)::text = 'xxxx'::text)
               Rows Removed by Filter: 35
         ->  Index Scan using index_table2_on_table3_id on table2  (cost=0.43..11.05 rows=1 width=16) (actual time=0.017..2.102 rows=2244 loops=1)
               Index Cond: ((table3_id = table3.id) AND (table3_id IS NOT NULL))
               Filter: ((type)::text = 'Standard'::text)
   ->  Index Scan using table1_pk on table1  (cost=0.44..3.42 rows=1 width=167) (actual time=0.003..0.003 rows=1 loops=2244)
         Index Cond: (id = table2.table1_id)
 Planning time: 0.403 ms
 Execution time: 11.672 ms

Can I do anything statistics-wise so that I get a better plan here, or do I
have to modify the query?  It seems kinda hacky that I would have to specify
IS NOT NULL on that column since, like I said, it can't possibly be null.

Thanks,

Greig Wise



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html



Re: Table with many NULLS for indexed column yields strange query plan

From:
"Peter J. Holzer"
Date:
On 2020-03-05 18:08:53 -0700, greigwise wrote:
> I have a query like this:
>
> SELECT  "table1".* FROM "table1"
> INNER JOIN "table2" ON "table2"."table1_id" = "table1"."id"
> INNER JOIN "table3" ON "table3"."id" = "table2"."table3_id"
> WHERE "table3"."number" = 'xxxx'
> AND ("table2"."type") IN ('Standard') ;
>
> table2 has a large number of NULLS in the column table3_id.  There is an
> index on this column.  Here is the result of explain analyze:
>
> Merge Join  (cost=1001.20..4076.67 rows=17278 width=167) (actual time=284.918..300.167 rows=2244 loops=1)
>    Merge Cond: (table2.table3_id = table3.id)
>    ->  Gather Merge  (cost=1000.93..787825.78 rows=621995 width=175) (actual time=5.786..283.269 rows=64397 loops=1)
>          Workers Planned: 4
>          Workers Launched: 4
>          ->  Nested Loop  (cost=0.87..712740.12 rows=155499 width=175) (actual time=0.091..102.708 rows=13107 loops=5)
>                ->  Parallel Index Scan using index_table2_on_table3_id on table2  (cost=0.43..489653.08 rows=155499 width=16) (actual time=0.027..22.327 rows=13107 loops=5)
>                      Filter: ((type)::text = 'Standard'::text)

It looks like PostgreSQL is scanning the index here to get the
entries in the right order for the merge join. It's strange that it
thinks this is a good strategy even though it has to visit every row in
the table (no index cond).

How selective is "type"? Would an index on that column help?

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: Table with many NULLS for indexed column yields strange query plan

From:
Tom Lane
Date:
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> How is the selectivity of "type"? Would an index on that column help?

The EXPLAIN results say that the "type = 'Standard'" condition is
completely not selective: in both plans, there is no "Rows Removed by
Filter" indication where it's applied, indicating that it did not
filter out any rows.  Which is odd, because if that isn't removing
any rows, why is the planner overestimating the number of rows
retrieved from table2 by circa 10x?  Maybe a lack of stats for the
"type" column?

I also find it odd that the second plan is uselessly using an index for
table3 (I say "useless" because there's no index condition and no
apparent need for the result to be sorted).  I suspect the OP has been
putting his thumb on the scales in ways he hasn't bothered to tell us
about, like fooling with the cost parameters and/or disabling seqscans.

            regards, tom lane



Re: Table with many NULLS for indexed column yields strange query plan

From:
greigwise
Date:
Seqscans are not disabled.  Also, this is PostgreSQL 10.11 if that helps.

Costs are as follows:

 seq_page_cost
---------------
 1

 random_page_cost
------------------
 1.5

It is odd that it does not just do a seqscan on table3.   It's a very small
table... only like 36 rows.   I'd think the plan *should* seq scan table3,
get the id where number = 'xxxx', then use the index
index_table2_on_table3_id on table2 to get the matching rows for that id. 
It does use that index when I specify that table3_id is not null, but not
otherwise.  

table3_id is very selective into table2 for any non-null value, so I don't
know why it would choose to scan that entire index in the case of the first
query where the table3_id clearly can't be null due to the inner join.  
Check out this:

select tablename, attname, inherited, null_frac, avg_width, n_distinct,
most_common_vals, most_common_freqs
from pg_stats where tablename = 'table2' and attname = 'table3_id';

 tablename |  attname  | inherited | null_frac | avg_width | n_distinct
-----------+-----------+-----------+-----------+-----------+------------
 table2    | table3_id | f         |  0.996167 |         8 |         39

most_common_vals:
{985,363,990,991,992,45,81,8,126,307,378,739,855,993,994,190,338,366,369,537,663,805,846,155,277,803,870,988}

most_common_freqs:
{0.000233333,0.0002,0.0002,0.0002,0.0002,0.000166667,0.000166667,0.000133333,0.000133333,0.000133333,0.000133333,0.000133333,0.000133333,0.000133333,0.000133333,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
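Summing those most_common_freqs against null_frac makes the point (the
freqs and null_frac are straight from the pg_stats row above; the arithmetic
is mine):

```python
# Consistency check on the pg_stats row above: the 28 most_common_freqs,
# summed, vs. the non-NULL fraction implied by null_frac.
null_frac = 0.996167
mcv_freqs = (
    [0.000233333]
    + [0.0002] * 4
    + [0.000166667] * 2
    + [0.000133333] * 8
    + [0.0001] * 8
    + [6.66667e-05] * 5
)
assert len(mcv_freqs) == 28

non_null_frac = 1 - null_frac   # ~0.003833 of rows have a table3_id
mcv_coverage = sum(mcv_freqs)   # ~0.003567 of rows are in the MCV list

# The MCV list accounts for roughly 93% of all non-NULL table3_id values,
# so any single value matches only a tiny slice of the table.
print(f"{mcv_coverage / non_null_frac:.0%}")
```

So the stats already say that any one non-NULL table3_id value is extremely
selective; the planner just doesn't seem to be using that in the first plan.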

Thanks again for any help.

Greig


