Discussion: row estimate very wrong for array type


row estimate very wrong for array type

From:
Denis de Bernardy
Date:
Hi,

I'm running into erroneous row estimates when using an array type, and I'm running out of ideas on how to steer
postgres in the right direction... I've tried setting statistics to 1000, vacuuming and analyzing over and over,
rewriting the query differently... to no avail.

The table looks like this:

create table test (
    id serial primary key,
    sortcol float unique,
    intarr1 int[] not null default '{}',
    intarr2 int[] not null default '{}'
);

It contains 40k rows of random data which, for the sake of the queries that follow, aren't too far from what they'd
contain in production.
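
The post doesn't include the data-generation step; judging by the group-by output below, something along these lines would reproduce it (a sketch only — the exact values used are assumed):

```sql
-- sketch: 40k rows matching the group-by output shown below (values assumed)
insert into test (sortcol, intarr1, intarr2)
select i::float + random(), '{0}', '{2,3}'   -- i + random() keeps sortcol unique
from generate_series(1, 40000) as i;
analyze test;
```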


# select intarr1, intarr2, count(*) from test group by intarr1, intarr2;


 intarr1 | intarr2 | count 
--------------+-------------+-------
 {0}          | {2,3}       | 40018


The stats seem right:

# select * from pg_stats where tablename = 'test' and attname = 'intarr1';

 schemaname | tablename | attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+-----------+---------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 test       | test      | intarr1 | f         |         0 |        25 |          1 | {"{0}"}          | {1}               |                  |           1


A query without any condition on the array results in reasonable estimates and the proper use of the index on the sort
column:

# explain analyze select * from test order by sortcol limit 10;

                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.00 rows=10 width=217) (actual time=0.098..0.109 rows=10 loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..12019.08 rows=40018 width=217) (actual time=0.096..0.105 rows=10 loops=1)
 Total runtime: 0.200 ms


After adding a condition on the array, however, the row estimate is completely off:


# explain analyze select * from test where intarr1 && '{0,1}' order by sortcol limit 10;

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..605.96 rows=10 width=217) (actual time=0.131..0.175 rows=10 loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..12119.13 rows=200 width=217) (actual time=0.129..0.169 rows=10 loops=1)
         Filter: (intarr1 && '{0,1}'::integer[])


When there's a condition on both arrays, this then leads to a seq scan:

# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 && '{2,4}' order by sortcol limit 10;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3241.28..3241.29 rows=1 width=217) (actual time=88.260..88.265 rows=10 loops=1)
   ->  Sort  (cost=3241.28..3241.29 rows=1 width=217) (actual time=88.258..88.260 rows=10 loops=1)
         Sort Key: sortcol
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Seq Scan on test  (cost=0.00..3241.27 rows=1 width=217) (actual time=0.169..68.785 rows=40018 loops=1)
               Filter: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && '{2,4}'::integer[]))
 Total runtime: 88.339 ms


Adding a GIN index on the two arrays results in similar ugliness:

# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 && '{2,4}' order by sortcol limit 10;
                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.29..8.29 rows=1 width=217) (actual time=56.122..56.127 rows=10 loops=1)
   ->  Sort  (cost=8.29..8.29 rows=1 width=217) (actual time=56.120..56.122 rows=10 loops=1)
         Sort Key: sortcol
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Bitmap Heap Scan on test  (cost=0.00..3241.27 rows=1 width=217) (actual time=19.635..39.824 rows=40018 loops=1)
               Recheck Cond: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && '{2,4}'::integer[]))
               ->  Bitmap Index Scan on test_intarr1_intarr2_idx  (cost=0.00..4.26 rows=1 width=0) (actual time=19.387..19.387 rows=40018 loops=1)
                     Index Cond: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && '{2,4}'::integer[]))
 Total runtime: 56.210 ms
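
For reference, the post doesn't show the DDL for the index named in the plan above; presumably it was a multicolumn GIN index over both array columns, along these lines:

```sql
-- assumed definition of test_intarr1_intarr2_idx (not shown in the post)
create index test_intarr1_intarr2_idx on test using gin (intarr1, intarr2);
```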


Might this be a bug in the operator's selectivity, or am I doing something wrong?

Thanks in advance,
Denis


Re: row estimate very wrong for array type

From:
Tom Lane
Date:
Denis de Bernardy <ddebernardy@yahoo.com> writes:
> [ estimates for array && suck ]
> Might this be a bug in the operator's selectivity, or am I doing something wrong?

Array && uses areasel() which is only a stub :-(

In the particular case here it'd be possible to get decent answers
just by trying the operator against all the MCV-list entries, but it's
unlikely that that would fix things for enough people to be worth the
trouble.  Really you'd need to maintain statistics about the element
values appearing in the array column in order to get useful estimates
for && queries.  Jan Urbanski did something similar for tsvector columns
a year or two ago, but nobody's gotten around to doing it for array
columns.

            regards, tom lane
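
[To make the MCV suggestion above concrete: trying && against each most_common_vals entry and summing the frequencies of the matching entries would look roughly like this — an illustrative Python sketch, not PostgreSQL source:]

```python
def overlap_selectivity(mcv_vals, mcv_freqs, query_elems):
    """Estimate selectivity of `col && query` by trying the overlap
    operator against every most-common-value entry and summing the
    frequencies of the entries that match."""
    query = set(query_elems)
    return sum(freq for val, freq in zip(mcv_vals, mcv_freqs)
               if query & set(val))  # && means "shares any element"

# With the stats shown earlier: a single MCV {0} with frequency 1.0,
# so intarr1 && '{0,1}' should select (close to) all 40018 rows.
print(overlap_selectivity([[0]], [1.0], [0, 1]))  # → 1.0
```

[In this thread's data that would give 1.0 rather than the flat 0.5% default the stub returns (hence rows=200 out of 40018 in the plans above).]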

Re: row estimate very wrong for array type

From:
Denis de Bernardy
Date:
That kind of limits the usefulness of aggregating hierarchical dependencies into array columns to avoid enormous join
statements. :-|


Re your todo item you mention in this thread:

http://archives.postgresql.org/pgsql-hackers/2010-05/msg01864.php

My C is rusty, but I might have enough understanding of the PG internals to massage pre-existing code... Feel free to
message me off list with pointers if you think I might be able to help.



----- Original Message -----
> From: Tom Lane <tgl@sss.pgh.pa.us>
> To: Denis de Bernardy <ddebernardy@yahoo.com>
> Cc: "pgsql-performance@postgresql.org" <pgsql-performance@postgresql.org>
> Sent: Wednesday, May 4, 2011 4:12 PM
> Subject: Re: [PERFORM] row estimate very wrong for array type
>
> Denis de Bernardy <ddebernardy@yahoo.com> writes:
>>  [ estimates for array && suck ]
>>  Might this be a bug in the operator's selectivity, or am I doing
> something wrong?
>
> Array && uses areasel() which is only a stub :-(
>
> In the particular case here it'd be possible to get decent answers
> just by trying the operator against all the MCV-list entries, but it's
> unlikely that that would fix things for enough people to be worth the
> trouble.  Really you'd need to maintain statistics about the element
> values appearing in the array column in order to get useful estimates
> for && queries.  Jan Urbanski did something similar for tsvector columns
> a year or two ago, but nobody's gotten around to doing it for array
> columns.
>
>             regards, tom lane
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

Re: row estimate very wrong for array type

From:
Denis de Bernardy
Date:
> ----- Original Message -----

>>  From: Tom Lane <tgl@sss.pgh.pa.us>
>>  To: Denis de Bernardy <ddebernardy@yahoo.com>
>>  Cc: "pgsql-performance@postgresql.org"
> <pgsql-performance@postgresql.org>
>>  Sent: Wednesday, May 4, 2011 4:12 PM
>>  Subject: Re: [PERFORM] row estimate very wrong for array type
>>
>>  Array && uses areasel() which is only a stub :-(


On a separate note, in case this ever gets found via google, I managed to force the use of the correct index in the
meanwhile:

# explain analyze select * from test where (0 = any(intarr1) or 1 = any(intarr1)) and (2 = any(intarr2) or 4 = any(intarr2)) order by sortcol limit 10;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..385.16 rows=10 width=217) (actual time=0.107..0.151 rows=10 loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..14019.98 rows=364 width=217) (actual time=0.106..0.146 rows=10 loops=1)
         Filter: (((0 = ANY (intarr1)) OR (1 = ANY (intarr1))) AND ((2 = ANY (intarr2)) OR (4 = ANY (intarr2))))
 Total runtime: 0.214 ms


I guess I'm in for maintaining counts and rewriting queries as needed. :-(

D