Discussion: Query with order by and limit is very slow - wrong index used

Query with order by and limit is very slow - wrong index used

From: Nowak Michał
Date:
Since this is my first post on this list, I'd like to say "Hi guys" :)

Here is definition of my table:
a9-dev=> \d records;
                             Table "public.records"
                Column                |            Type             | Modifiers
--------------------------------------+-----------------------------+-----------
 id                                   | bigint                      | not null
 checksum                             | character varying(32)       | not null
 data                                 | text                        | not null
 delete_date                          | timestamp without time zone |
 last_processing_date                 | timestamp without time zone |
 object_id                            | character varying(255)      | not null
 processing_path                      | character varying(255)      | not null
 schema_id                            | character varying(255)      | not null
 source_id                            | character varying(255)      | not null
 source_object_last_modification_date | timestamp without time zone | not null
Indexes:
    "records_pkey" PRIMARY KEY, btree (id)
    "unq_records_0" UNIQUE, btree (object_id, schema_id, source_id, processing_path)
    "length_processing_path_id_idx" btree (length(processing_path::text), id)
    "length_processing_path_idx" btree (length(processing_path::text))
    "object_id_id_idx" btree (object_id, id)
    "schema_id_id_idx" btree (schema_id, id)
    "schema_id_idx" btree (schema_id)
    "source_id_id_idx" btree (source_id, id)
    "source_id_idx" btree (source_id)
    "source_object_last_modification_date_id_idx" btree (source_object_last_modification_date, id)
    "source_object_last_modification_date_idx" btree (source_object_last_modification_date)

Average length of value of "data" column =  2991.7947061626100466

When I perform a query such as this: "select * from records where source_id = 'XXX' order by id limit 200;" I expect
the DB to use index source_id_id_idx with XXX as filter. It is true for all but one value of XXX - when I ask for records
with the most common source_id, the records_pkey index is used instead and performance is terrible! Explain analyze results
below.

a9-dev=> explain analyze select * from records where source_id ='http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml'
order by id limit 200;

                                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..755.61 rows=200 width=1127) (actual time=75.292..684.582 rows=200 loops=1)
   ->  Index Scan using source_id_id_idx on records  (cost=0.00..1563542.89 rows=413849 width=1127) (actual time=75.289..684.495 rows=200 loops=1)
         Index Cond: ((source_id)::text = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml'::text)
 Total runtime: 690.358 ms
(4 rows)

a9-dev=> explain analyze select * from records where source_id ='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'
order by id limit 200;

                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..466.22 rows=200 width=1127) (actual time=124093.485..124095.540 rows=200 loops=1)
   ->  Index Scan using records_pkey on records  (cost=0.00..2333280.84 rows=1000937 width=1127) (actual time=124093.484..124095.501 rows=200 loops=1)
         Filter: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
 Total runtime: 124130.247 ms
(4 rows)
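If I read those Limit costs right, they line up with a simple linear proration of the index-scan cost: the planner assumes matching rows are spread uniformly through the scan order, so fetching the first 200 of the estimated matches costs a proportional slice of the full scan. A rough sketch of just that arithmetic (my reading of the plans above, not the planner's actual code):

```python
# Sketch: how a LIMIT node's cost appears to be prorated from its input scan.
# Assumes matches are uniformly spread through the scan order (the planner's
# default assumption, which is exactly what goes wrong here).

def prorated_limit_cost(total_cost: float, est_rows: int, limit: int) -> float:
    return total_cost * limit / est_rows

# Fast query: source_id_id_idx scan (cost=0.00..1563542.89 rows=413849)
print(round(prorated_limit_cost(1563542.89, 413849, 200), 2))   # ~755.61, matches the Limit node

# Slow query: records_pkey scan (cost=0.00..2333280.84 rows=1000937)
print(round(prorated_limit_cost(2333280.84, 1000937, 200), 2))  # ~466.22, matches the Limit node

# 466 < 756, so the pkey plan *looks* cheaper -- but only because the
# uniform-spread assumption does not hold for this source_id.
```

So the planner is not malfunctioning; it is choosing the plan its own estimates say is cheaper under LIMIT.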


Some info about data distribution:

a9-dev=> select min(id) from records;
  min
--------
 190830
(1 row)

a9-dev=> select min(id), max(id) from records where source_id='http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml';
   min   |   max
---------+---------
 1105217 | 3811326
(1 row)
a9-dev=> select min(id), max(id) from records where source_id='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml';
   min   |   max
---------+---------
 1544991 | 3811413
(1 row)

a9-dev=> select min(id), max(id) from (select id from records where source_id =
'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' order by id limit 200) as a;
   min   |   max
---------+---------
 1105217 | 1105416
(1 row)

a9-dev=> select min(id), max(id) from (select id from records where source_id =
'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id limit 200) as a;
   min   |   max
---------+---------
 1544991 | 1545190
(1 row)



a9-dev=> select source_id, count(*) from records where source_id =
'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' or source_id =
'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' group by source_id;
                       source_id                        | count
--------------------------------------------------------+--------
 http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml    | 427254
 http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml | 989184
(2 rows)

a9-dev=> select count(*) from records;
  count
---------
 3620311
(1 row)


DB settings:

a9-dev=>  SELECT
a9-dev->    'version'::text AS "name",
a9-dev->    version() AS "current_setting"
a9-dev->  UNION ALL
a9-dev->  SELECT
a9-dev->    name,current_setting(name)
a9-dev->  FROM pg_settings
a9-dev->  WHERE NOT source='default' AND NOT name IN
a9-dev->    ('config_file','data_directory','hba_file','ident_file',
a9-dev(>    'log_timezone','DateStyle','lc_messages','lc_monetary',
a9-dev(>    'lc_numeric','lc_time','timezone_abbreviations',
a9-dev(>    'default_text_search_config','application_name',
a9-dev(>    'transaction_deferrable','transaction_isolation',
a9-dev(>    'transaction_read_only');
           name           |                                                 current_setting
--------------------------+-----------------------------------------------------------------------------------------------------------------
 version                  | PostgreSQL 9.0.4 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.6.0 20110530 (Red Hat 4.6.0-9), 64-bit
 lc_collate               | en_US.UTF-8
 lc_ctype                 | en_US.UTF-8
 listen_addresses         | *
 log_rotation_age         | 1d
 log_rotation_size        | 0
 log_truncate_on_rotation | on
 logging_collector        | on
 max_connections          | 100
 max_stack_depth          | 2MB
 port                     | 5432
 server_encoding          | UTF8
 shared_buffers           | 24MB
 TimeZone                 | Poland
(14 rows)


This query was always slow. Autovacuum is on, and I ran VACUUM ANALYZE manually a few minutes before writing this email.

Please help me with my problem. I'll be happy to provide any additional information if needed.
Michal Nowak


Re: Query with order by and limit is very slow - wrong index used

From: Gregg Jaskiewicz
Date:
How many rows do you have in that table?

I think the planner assumes that the value you are looking for is
so common that it would be too expensive to use the index to fetch it.
Perhaps try increasing default_statistics_target, and re-vacuuming the table.

You could also try changing it just for the column:

ALTER TABLE records ALTER id SET source_id 1000; vacuum analyze verbose records;

Re: Query with order by and limit is very slow - wrong index used

From: Nowak Michał
Date:
> How many rows do you have in that table?

a9-dev=> select count(*) from records;
 count
---------
3620311
(1 row)

a9-dev=> select source_id, count(*) from records where source_id =
'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' or source_id =
'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' group by source_id;
                      source_id                        | count
--------------------------------------------------------+--------
http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml    | 427254
http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml | 989184
(2 rows)


> ALTER TABLE records ALTER id SET source_id 1000; vacuum analyze verbose records;
Did you mean ALTER TABLE records ALTER id SET STATISTICS 1000;?




On 3 Oct 2011, at 12:02, Gregg Jaskiewicz wrote:

> How many rows do you have in that table?
>
> I think the planner assumes that the value you are looking for is
> so common that it would be too expensive to use the index to fetch it.
> Perhaps try increasing default_statistics_target, and re-vacuuming the table.
>
> You could also try changing it just for the column:
>
> ALTER TABLE records ALTER id SET source_id 1000; vacuum analyze verbose records;
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


Re: Query with order by and limit is very slow - wrong index used

From: Gregg Jaskiewicz
Date:
2011/10/3 Nowak Michał <michal.nowak@me.com>:
>> How many rows do you have in that table?
>
> a9-dev=> select count(*) from records;
>  count
> ---------
> 3620311
> (1 row)


>
> a9-dev=> select source_id, count(*) from records where source_id =
> 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' or source_id =
> 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' group by source_id;
>                      source_id                        | count
> --------------------------------------------------------+--------
> http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml    | 427254
> http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml | 989184
> (2 rows)

So the second one is roughly 27% of the table.  I don't know the
actual threshold at which the planner switches over to a seqscan, but
that value is quite common.
The other things the planner is going to look at are the correlation,
most common values, and most common frequencies.
In other words, if the value is 27% of all values, but is evenly
spread across the table - I think the planner will go for a seq scan regardless.
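The selectivity figures come straight from the counts quoted above; a quick sketch of the arithmetic:

```python
# Selectivity of the two source_id values discussed, from the counts above.
total = 3620311
counts = {
    "ebuw.uw.edu.pl": 427254,
    "www.wbc.poznan.pl": 989184,
}
for src, n in counts.items():
    print(f"{src}: {n / total:.1%} of the table")
# ebuw ~11.8%, wbc ~27.3% -- common enough that the planner may prefer
# scanning in id order (or a seqscan) over the source_id index.
```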

At the end of the day (afaik), an index scan really only picks pages
for a narrowed-down seqscan. So imagine if your index scan returned all
the pages - you would still have to do a seqscan on all of them. The planner
is trying to avoid that by weighing the costs of both operations.
If the current queries are too slow, you could try
normalizing the table by splitting source_id into a separate table and
referencing it by an id. Very often what you'd find is that doing so
lowers the I/O required, and hence saves a lot of time in queries. The
downside is that it is a bit harder to add/update the tables. But that's
where triggers and procedures come in handy.


>
>> ALTER TABLE records ALTER id SET source_id 1000; vacuum analyze verbose records;
> Did you mean ALTER TABLE records ALTER id SET STATISTICS 1000;?

Yup, that's what I meant. Sorry.


--
GJ

Re: Query with order by and limit is very slow - wrong index used

From: Nowak Michał
Date:
Please compare costs and actual times in those queries:

a9-dev=> explain analyze select * from records where source_id
='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id limit 200;
                                                                    QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..489.57 rows=200 width=1126) (actual time=99701.773..99703.858 rows=200 loops=1)
  ->  Index Scan using records_pkey on records  (cost=0.00..2441698.81 rows=997489 width=1126) (actual time=99684.878..99686.936 rows=200 loops=1)
        Filter: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
 Total runtime: 99705.916 ms
(4 rows)

a9-dev=> explain analyze select * from records2 where source_id
='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id2 limit 200;
                                                                    QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..770.01 rows=200 width=1124) (actual time=0.076..0.205 rows=200 loops=1)
  ->  Index Scan using source_id2_id2_id2x on records2  (cost=0.00..3735751.15 rows=970308 width=1124) (actual time=0.074..0.180 rows=200 loops=1)
        Index Cond: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
 Total runtime: 0.235 ms
(4 rows)


The first one uses records_pkey, and with an estimated cost of 2441698.81 runs in over 1.5 minutes.
The second uses the index on (source_id, id) and with an estimated cost of 3735751.15 runs in 235 milliseconds.

IMHO, the problem lies neither in record distribution nor normalization, but in the planner's wrong cost estimation. I don't
know how to tell/force it to use the proper index.


On 3 Oct 2011, at 15:12, Gregg Jaskiewicz wrote:

> You forgot to do 'reply all' mate.
> If what I said about correlation is true (record spread between pages)
> - then increasing stats won't help you.
>
> As a test, try clustering the table by the source_id column. Vacuum it
> again, and retry. Unfortunately even if that helps, it won't actually
> fix it permanently.
> you probably need to normalize the table.
>
>
> 2011/10/3 Nowak Michał <michal.nowak@me.com>:
>> Setting statistics to 1000 on id and source_id didn't solve my problem:
>>
>> a9-dev=> explain analyze select * from records where source_id
>> ='http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' order by id limit 200;
>>                                                                    QUERY PLAN
>>
---------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.00..757.51 rows=200 width=1126) (actual time=43.648..564.798 rows=200 loops=1)
>>   ->  Index Scan using source_id_id_idx on records  (cost=0.00..1590267.66 rows=419868 width=1126) (actual time=43.631..564.700 rows=200 loops=1)
>>         Index Cond: ((source_id)::text = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml'::text)
>>  Total runtime: 564.895 ms
>> (4 rows)
>>
>> a9-dev=> explain analyze select * from records where source_id
>> ='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id limit 200;
>>                                                                     QUERY PLAN
>>
----------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.00..489.57 rows=200 width=1126) (actual time=99701.773..99703.858 rows=200 loops=1)
>>   ->  Index Scan using records_pkey on records  (cost=0.00..2441698.81 rows=997489 width=1126) (actual time=99684.878..99686.936 rows=200 loops=1)
>>         Filter: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
>>  Total runtime: 99705.916 ms
>> (4 rows)
>>
>>
>> As you can see, query looking for records with most common source_id is still by many magnitudes slower.
>>
>>> In other words, if the value is 27% of all values, but is evenly
>>> spread across - I think planner will go for seq scan regardless.
>>
>> Seq scan is not used - the problem is that the planner chooses the records_pkey index and checks every record's source_id
>> until it finds 200 of them. I think that if the source_id_id_idx index were used, the query would execute as fast as for
>> every other value of source_id.
>> I even made an experiment: I created table records2 as a copy of records, and added an additional column id2 with the
>> same values as id. I created the same indexes on records2 as on records. The difference is that there is no index on id2.
>> Here are the results of problematic queries:
>> a9-dev=> explain analyze select * from records2 where source_id
>> ='http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' order by id2 limit 200;
>>                                                                     QUERY PLAN
>>
----------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.00..770.15 rows=200 width=1124) (actual time=0.071..0.220 rows=200 loops=1)
>>   ->  Index Scan using source_id2_id2_id2x on records2  (cost=0.00..1585807.44 rows=411820 width=1124) (actual time=0.070..0.196 rows=200 loops=1)
>>         Index Cond: ((source_id)::text = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml'::text)
>>  Total runtime: 0.255 ms
>> (4 rows)
>>
>> a9-dev=> explain analyze select * from records2 where source_id
>> ='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id2 limit 200;
>>                                                                     QUERY PLAN
>>
----------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.00..770.01 rows=200 width=1124) (actual time=0.076..0.205 rows=200 loops=1)
>>   ->  Index Scan using source_id2_id2_id2x on records2  (cost=0.00..3735751.15 rows=970308 width=1124) (actual time=0.074..0.180 rows=200 loops=1)
>>         Index Cond: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
>>  Total runtime: 0.235 ms
>> (4 rows)
>>
>> As you can see, even for the most common (~27%) values of source_id, the planner chooses to use the source_id2_id2_id2x
>> index (I replaced id with id2 when creating indexes on records2 :]) and the query executes as fast as for other values.
>>
>> So, the question is: why does the planner choose records_pkey over source_id_id_idx for the most common value of source_id?
>>
>>
>>
>> On 3 Oct 2011, at 13:20, Gregg Jaskiewicz wrote:
>>
>>> 2011/10/3 Nowak Michał <michal.nowak@me.com>:
>>>>> How many rows do you have in that table?
>>>>
>>>> a9-dev=> select count(*) from records;
>>>>  count
>>>> ---------
>>>> 3620311
>>>> (1 row)
>>>
>>>
>>>>
>>>> a9-dev=> select source_id, count(*) from records where source_id =
>>>> 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' or source_id =
>>>> 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' group by source_id;
>>>>                      source_id                        | count
>>>> --------------------------------------------------------+--------
>>>> http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml    | 427254
>>>> http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml | 989184
>>>> (2 rows)
>>>
>>> So the second one is roughly 27% of the table.  I don't know the
>>> actual threshold at which the planner switches over to a seqscan, but
>>> that value is quite common.
>>> The other things the planner is going to look at are the correlation,
>>> most common values, and most common frequencies.
>>> In other words, if the value is 27% of all values, but is evenly
>>> spread across the table - I think the planner will go for a seq scan regardless.
>>>
>>> At the end of the day (afaik), an index scan really only picks pages
>>> for a narrowed-down seqscan. So imagine if your index scan returned all
>>> the pages - you would still have to do a seqscan on all of them. The planner
>>> is trying to avoid that by weighing the costs of both operations.
>>> If the current queries are too slow, you could try
>>> normalizing the table by splitting source_id into a separate table and
>>> referencing it by an id. Very often what you'd find is that doing so
>>> lowers the I/O required, and hence saves a lot of time in queries. The
>>> downside is that it is a bit harder to add/update the tables. But that's
>>> where triggers and procedures come in handy.
>>>
>>>
>>>>
>>>>> ALTER TABLE records ALTER id SET source_id 1000; vacuum analyze verbose records;
>>>> Did you mean ALTER TABLE records ALTER id SET STATISTICS 1000;?
>>>
>>> Yup, that's what I meant. Sorry.
>>>
>>>
>>> --
>>> GJ
>>>
>>
>>
>
>
>
> --
> GJ


Re: Query with order by and limit is very slow - wrong index used

From: Cédric Villemain
Date:
> a9-dev=> explain analyze select * from records where source_id
> ='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id limit 200;
>                                                                    QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..489.57 rows=200 width=1126) (actual time=99701.773..99703.858 rows=200 loops=1)
>  ->  Index Scan using records_pkey on records  (cost=0.00..2441698.81 rows=997489 width=1126) (actual time=99684.878..99686.936 rows=200 loops=1)
>        Filter: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
>  Total runtime: 99705.916 ms
> (4 rows)
>
> a9-dev=> explain analyze select * from records2 where source_id
> ='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id2 limit 200;
>                                                                    QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..770.01 rows=200 width=1124) (actual time=0.076..0.205 rows=200 loops=1)
>  ->  Index Scan using source_id2_id2_id2x on records2  (cost=0.00..3735751.15 rows=970308 width=1124) (actual time=0.074..0.180 rows=200 loops=1)
>        Index Cond: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
>  Total runtime: 0.235 ms
> (4 rows)
>
>
> The first one uses records_pkey, and with an estimated cost of 2441698.81 runs in over 1.5 minutes.
> The second uses the index on (source_id, id) and with an estimated cost of 3735751.15 runs in 235 milliseconds.
>
> IMHO, the problem lies neither in record distribution nor normalization, but in the planner's wrong cost estimation. I don't
> know how to tell/force it to use the proper index.

Getting information on your current configuration should help.
Please see http://wiki.postgresql.org/wiki/Slow_Query_Questions

You should take care of the cache effect of your queries between your
tests. Here it is not a problem, but this explain was way slower for
this same query earlier in the thread.

a9-dev=> explain analyze select * from records where source_id
='http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' order by id
limit 200;
                                                                  QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..757.51 rows=200 width=1126) (actual
time=43.648..564.798 rows=200 loops=1)
 ->  Index Scan using source_id_id_idx on records
(cost=0.00..1590267.66 rows=419868 width=1126) (actual
time=43.631..564.700 rows=200 loops=1)
       Index Cond: ((source_id)::text =
'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml'::text)
Total runtime: 564.895 ms



--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

Re: Query with order by and limit is very slow - wrong index used

From: Tom Lane
Date:
Nowak Michał <michal.nowak@me.com> writes:
> When I perform a query such as this: "select * from records where source_id = 'XXX' order by id limit 200;" I expect
> the DB to use index source_id_id_idx with XXX as filter. It is true for all but one value of XXX - when I ask for records
> with the most common source_id, the records_pkey index is used instead and performance is terrible! Explain analyze results
> below.

I'm thinking it probably sees the pkey index as cheaper because that's
highly correlated with the physical order of the table.  (It would be
useful to see pg_stats.correlation for these columns.)  With a
sufficiently unselective filter, scanning in pkey order looks cheaper
than scanning in source_id order.

If so, what you probably need to do to get the estimates more in line
with reality is to reduce random_page_cost.  That will reduce the
assumed penalty for non-physical-order scanning.

            regards, tom lane

Re: Query with order by and limit is very slow - wrong index used

From: Marcin Mańk
Date:
2011/10/3 Nowak Michał <michal.nowak@me.com>:

> Some info about data distrubution:
>
> a9-dev=> select min(id) from records;
>  min
> --------
>  190830
> (1 row)
>
> a9-dev=> select min(id), max(id) from records where source_id='http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml';
>   min   |   max
> ---------+---------
>  1105217 | 3811326
> (1 row)
> a9-dev=> select min(id), max(id) from records where
> source_id='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml';
>   min   |   max
> ---------+---------
>  1544991 | 3811413
> (1 row)

PG assumes that the "wbc.poznan.pl" rows are all over the range of
ids, which seems not to be the case. There is no sense of cross-column
correlation in the planner currently.

You are going to have to resort to some more or less cute hacks, like
making an index on (source_id, id - 1) and doing "... order by
source_id, id - 1" .
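The mismatch can be quantified from the figures quoted above (a back-of-envelope sketch of the planner's uniform-spread expectation, not its actual cost model):

```python
# Why the uniform-spread assumption fails here (numbers from this thread).
# The planner expects the wbc.poznan.pl matches to be spread evenly over id
# order, so a pkey-order scan should find 200 of them almost immediately.

total_rows = 3620311
match_rows = 989184    # count(*) for the wbc.poznan.pl source_id
limit = 200

# Planner's expectation: rows scanned before collecting `limit` matches.
expected_scan = total_rows * limit / match_rows
print(round(expected_scan))   # ~732 rows -- looks very cheap

# Reality: min(id) in the table is 190830, but the first matching id is
# 1544991, so the pkey scan wades through a huge prefix with zero matches
# before the expectation above even starts to apply.
```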

Greetings
Marcin Mańk

Re: Query with order by and limit is very slow - wrong index used

From: Nowak Michał
Date:
On 3 Oct 2011, at 17:12, Tom Lane wrote:

> I'm thinking it probably sees the pkey index as cheaper because that's
> highly correlated with the physical order of the table.  (It would be
> useful to see pg_stats.correlation for these columns.)  With a
> sufficiently unselective filter, scanning in pkey order looks cheaper
> than scanning in source_id order.

a9-dev=> select  attname, null_frac, avg_width, n_distinct, correlation from pg_stats where tablename = 'records';
               attname                | null_frac | avg_width | n_distinct | correlation
--------------------------------------+-----------+-----------+------------+-------------
 id                                   |         0 |         8 |         -1 |    0.932887
 last_processing_date                 |  0.886093 |         8 |      38085 |    0.427959
 object_id                            |         0 |        27 |  -0.174273 |    0.227186
 processing_path                      |         0 |        14 |         14 |    0.970166
 schema_id                            |         0 |        17 |         68 |    0.166175
 delete_date                          |  0.999897 |         8 |         29 |     0.63629
 data                                 |         0 |       949 |  -0.267811 |    0.158279
 checksum                             |         0 |        33 |  -0.267495 |   0.0269071
 source_id                            |         0 |        54 |         69 |    0.303059
 source_object_last_modification_date |         0 |         8 |     205183 |    0.137143
(10 rows)


> If so, what you probably need to do to get the estimates more in line
> with reality is to reduce random_page_cost.  That will reduce the
> assumed penalty for non-physical-order scanning.

I'll try that.

Regards,
Michal Nowak

Re: Query with order by and limit is very slow - wrong index used

From: Gregg Jaskiewicz
Date:
2011/10/4 Nowak Michał <michal.nowak@me.com>:
>
> a9-dev=> select  attname, null_frac, avg_width, n_distinct, correlation from pg_stats where tablename = 'records';
>               attname                | null_frac | avg_width | n_distinct | correlation
> --------------------------------------+-----------+-----------+------------+-------------
>  source_id                            |         0 |        54 |         69 |    0.303059

http://www.postgresql.org/docs/9.0/interactive/view-pg-stats.html

"Statistical correlation between physical row ordering and logical
ordering of the column values. This ranges from -1 to +1. When the
value is near -1 or +1, an index scan on the column will be estimated
to be cheaper than when it is near zero, due to reduction of random
access to the disk. (This column is null if the column data type does
not have a < operator.)"

Kind of like Tom and I said, the 0.3 correlation there sounds like the cause.
Seriously, try normalisation as well, before discarding it.
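To make that doc wording concrete, here is a toy version of the statistic (a sketch, not ANALYZE's actual sampling code; it assumes all-distinct values for simplicity):

```python
# Toy version of pg_stats.correlation: the Pearson correlation between a
# column's logical (sorted) order and the physical (heap) row order.

def order_correlation(values):
    n = len(values)
    rank = {v: i for i, v in enumerate(sorted(values))}  # assumes distinct values
    xs = list(range(n))                  # physical positions
    ys = [rank[v] for v in values]       # logical positions
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

print(order_correlation([10, 20, 30, 40, 50]))            # 1.0  -> perfectly clustered
print(order_correlation([50, 40, 30, 20, 10]))            # -1.0 -> clustered, backward
print(round(order_correlation([30, 10, 50, 20, 40]), 2))  # 0.3  -> mostly random access
```

A value near +/-1 means an index scan in that column's order touches heap pages sequentially; near 0 (like source_id's 0.303) it means lots of random page fetches, which is exactly the penalty the planner is weighing.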


--
GJ

Re: Query with order by and limit is very slow - wrong index used

From: Nowak Michał
Date:
Lowering random_page_cost didn't help -- I've tried values 2.0 and 1.5.

Then I tried the "order by id - 1" hack Marcin Mańk proposed...

a9-dev=> create index foo on records(source_id, (id - 1));
CREATE INDEX
a9-dev=>  explain analyze select * from records where source_id
='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by (id - 1) limit 200;
                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..379.42 rows=200 width=1124) (actual time=0.137..255.283 rows=200 loops=1)
  ->  Index Scan using foo on records  (cost=0.00..1864617.14 rows=982887 width=1124) (actual time=0.137..255.237 rows=200 loops=1)
        Index Cond: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
Total runtime: 255.347 ms
(4 rows)

Significant improvement :)

As we can see, it is possible to query records fast without changing the table structure. The question is: can I do it
without "hacks"?

Michal Nowak



Re: Query with order by and limit is very slow - wrong index used

From: "Kevin Grittner"
Date:
Nowak Michał <michal.nowak@me.com> wrote:

> Lowering random_page_cost didn't help -- I've tried values 2.0 and
> 1.5.

First off, I don't remember you saying how much RAM is on the
system, but be sure to set effective_cache_size to the sum of your
shared_buffers and OS cache.  I've often found that the optimizer
undervalues cpu_tuple_cost; try setting that to 0.05.  Then,
depending on how well cached the active portion of your database is,
you may want to drop your random_page_cost down close to or equal to
seq_page_cost.  If your cache hit rate is high enough, you might
want to drop *both* seq_page_cost and random_page_cost to something
as low as 0.1 or even 0.05.

The objective is to model the actual costs of your workload against
your data on your hardware.  Sometimes that takes a bit of
tinkering.

-Kevin