Re: Query with order by and limit is very slow - wrong index used

From: Nowak Michał
Subject: Re: Query with order by and limit is very slow - wrong index used
Date:
Msg-id: B1B4689B-6F4B-4932-BD79-790B1D4B1414@me.com
In response to: Query with order by and limit is very slow - wrong index used  (Nowak Michał <michal.nowak@me.com>)
Responses: Re: Query with order by and limit is very slow - wrong index used
List: pgsql-performance
Please compare costs and actual times in those queries:

a9-dev=> explain analyze select * from records where source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id limit 200;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..489.57 rows=200 width=1126) (actual time=99701.773..99703.858 rows=200 loops=1)
   ->  Index Scan using records_pkey on records  (cost=0.00..2441698.81 rows=997489 width=1126) (actual time=99684.878..99686.936 rows=200 loops=1)
         Filter: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
 Total runtime: 99705.916 ms
(4 rows)

a9-dev=> explain analyze select * from records2 where source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id2 limit 200;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..770.01 rows=200 width=1124) (actual time=0.076..0.205 rows=200 loops=1)
   ->  Index Scan using source_id2_id2_id2x on records2  (cost=0.00..3735751.15 rows=970308 width=1124) (actual time=0.074..0.180 rows=200 loops=1)
         Index Cond: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
 Total runtime: 0.235 ms
(4 rows)


The first one uses records_pkey and, with an estimated cost of 2441698.81, runs in over 1.5 minutes.
The second one uses the index on (source_id, id2) and, with an estimated cost of 3735751.15, runs in 0.235 milliseconds.

IMHO, the problem lies not in the distribution of the records or in normalization, but in the planner's incorrect cost estimation. I don't know how to tell/force it to use the proper index.
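
Two workarounds that are sometimes used to steer the planner away from the primary-key scan in cases like this (sketches only, not verified against this data set) are ordering by an expression that records_pkey cannot satisfy, or hiding the filter behind an OFFSET 0 subquery, which acts as an optimization fence in PostgreSQL of this vintage:

-- Sketch 1: "id + 0" cannot be read in order from records_pkey, so the
-- planner has to fetch the matching rows (e.g. via source_id_id_idx)
-- and sort them itself before applying the limit.
explain analyze select * from records
where source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'
order by id + 0 limit 200;

-- Sketch 2: OFFSET 0 keeps the filter inside the subquery, separating it
-- from the ORDER BY / LIMIT applied on the outside.
explain analyze select * from
  (select * from records
   where source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'
   offset 0) sub
order by id limit 200;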


On 3 Oct 2011, at 15:12, Gregg Jaskiewicz wrote:

> You forgot to do 'reply all' mate.
> If what I said about correlation is true (records spread between pages)
> - then increasing stats won't help you.
>
> As a test, try clustering the table by the source_id column. Vacuum it
> again, and retry. Unfortunately, even if that helps, it won't actually
> fix it permanently. [A sketch of this test follows the quoted thread below.]
> You probably need to normalize the table.
>
>
> 2011/10/3 Nowak Michał <michal.nowak@me.com>:
>> Setting statistics to 1000 on id and source_id didn't solve my problem:
>>
>> a9-dev=> explain analyze select * from records where source_id = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' order by id limit 200;
>>                                                                     QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.00..757.51 rows=200 width=1126) (actual time=43.648..564.798 rows=200 loops=1)
>>    ->  Index Scan using source_id_id_idx on records  (cost=0.00..1590267.66 rows=419868 width=1126) (actual time=43.631..564.700 rows=200 loops=1)
>>          Index Cond: ((source_id)::text = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml'::text)
>>  Total runtime: 564.895 ms
>> (4 rows)
>>
>> a9-dev=> explain analyze select * from records where source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id limit 200;
>>                                                                      QUERY PLAN
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.00..489.57 rows=200 width=1126) (actual time=99701.773..99703.858 rows=200 loops=1)
>>    ->  Index Scan using records_pkey on records  (cost=0.00..2441698.81 rows=997489 width=1126) (actual time=99684.878..99686.936 rows=200 loops=1)
>>          Filter: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
>>  Total runtime: 99705.916 ms
>> (4 rows)
>>
>>
>> As you can see, the query looking for records with the most common source_id is still slower by several orders of magnitude.
>>
>>> In other words, if the value is 27% of all values but is evenly
>>> spread across the table - I think the planner will go for a seq scan regardless.
>>
>> A seq scan is not used - the problem is that the planner chooses the records_pkey index
>> and checks every record's source_id until it finds 200 of them. I think that if the
>> source_id_id_idx index were used, the query would execute as fast as it does for every other value of source_id.
>> I even ran an experiment: I created table records2 as a copy of records and added an
>> additional column id2 with the same values as id. I created the same indexes on records2
>> as on records; the difference is that there is no index on id2 alone. [A reconstruction sketch follows the quoted thread below.]
>> Here are the results of the problematic queries:
>> a9-dev=> explain analyze select * from records2 where source_id = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' order by id2 limit 200;
>>                                                                      QUERY PLAN
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.00..770.15 rows=200 width=1124) (actual time=0.071..0.220 rows=200 loops=1)
>>    ->  Index Scan using source_id2_id2_id2x on records2  (cost=0.00..1585807.44 rows=411820 width=1124) (actual time=0.070..0.196 rows=200 loops=1)
>>          Index Cond: ((source_id)::text = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml'::text)
>>  Total runtime: 0.255 ms
>> (4 rows)
>>
>> a9-dev=> explain analyze select * from records2 where source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id2 limit 200;
>>                                                                      QUERY PLAN
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.00..770.01 rows=200 width=1124) (actual time=0.076..0.205 rows=200 loops=1)
>>    ->  Index Scan using source_id2_id2_id2x on records2  (cost=0.00..3735751.15 rows=970308 width=1124) (actual time=0.074..0.180 rows=200 loops=1)
>>          Index Cond: ((source_id)::text = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
>>  Total runtime: 0.235 ms
>> (4 rows)
>>
>> As you can see, even for the most common (~27%) value of source_id, the planner chooses
>> the source_id2_id2_id2x index (I replaced id with id2 when creating the indexes on records2 :]) and the query executes as fast as it does for the other values.
>>
>> So, the question is: why does the planner choose records_pkey over source_id_id_idx for the most common value of source_id?
>>
>>
>>
>> On 3 Oct 2011, at 13:20, Gregg Jaskiewicz wrote:
>>
>>> 2011/10/3 Nowak Michał <michal.nowak@me.com>:
>>>>> How many rows do you have in that table?
>>>>
>>>> a9-dev=> select count(*) from records;
>>>>  count
>>>> ---------
>>>> 3620311
>>>> (1 row)
>>>
>>>
>>>>
>>>> a9-dev=> select source_id, count(*) from records where source_id = 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' or source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' group by source_id;
>>>>                      source_id                        | count
>>>> --------------------------------------------------------+--------
>>>> http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml    | 427254
>>>> http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml | 989184
>>>> (2 rows)
>>>
>>> So the second one is roughly 27% of the table. I don't know the
>>> actual conditions under which the planner switches over to a seq scan,
>>> but that value seems quite common.
>>> The other things the planner is going to look at are the correlation,
>>> the most common values, and the most common frequencies.
>>> In other words, if the value is 27% of all values but is evenly
>>> spread across the table - I think the planner will go for a seq scan regardless.
>>>
>>> At the end of the day (afaik), an index scan really only picks pages
>>> for a narrowed-down seq scan. So imagine your index scan returned all
>>> the pages: you would still have to do a seq scan on all of them. The
>>> planner is trying to avoid that by weighing the costs of both operations.
>>> If the current queries are too slow, you could try normalizing the
>>> table by splitting source_id out into a separate table and referencing
>>> it by an id. Very often what you'd find is that doing so lowers the
>>> I/O required and hence saves a lot of time in queries. The downside is
>>> that it is a bit harder to add/update the tables, but that's where
>>> triggers and procedures come in handy. [A sketch of this normalization
>>> follows the quoted thread below.]
>>>
>>>
>>>>
>>>>> ALTER TABLE records ALTER id SET source_id 1000; vacuum analyze verbose records;
>>>> Did you mean ALTER TABLE records ALTER id SET STATISTICS 1000;?
>>>
>>> Yup, that's what I meant. Sorry.
>>>
>>>
>>> --
>>> GJ
>>>
>>
>>
>
>
>
> --
> GJ
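
For reference, the clustering test suggested in the quoted thread could look roughly like this (a sketch only; CLUSTER rewrites the table and takes an exclusive lock, so it is best tried on a copy or during a maintenance window, and it assumes the existing source_id_id_idx index is used as the clustering index):

-- Physically reorder the table by the existing (source_id, id) index,
-- then refresh the planner statistics.
cluster records using source_id_id_idx;
vacuum analyze records;

-- Re-run the problematic query to see whether the plan or runtime changes.
explain analyze select * from records
where source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'
order by id limit 200;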
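
The records2 experiment described in the quoted thread could be reconstructed roughly as follows (hypothetical DDL - the real table definition is not shown in this thread, only the index name source_id2_id2_id2x is):

-- Copy of records with an extra id2 column duplicating id.
create table records2 as
select r.*, r.id as id2 from records r;

-- Composite index analogous to source_id_id_idx, but on id2.
-- The other indexes from records would be recreated similarly; crucially,
-- there is deliberately no standalone index on id2 alone.
create index source_id2_id2_id2x on records2 (source_id, id2);

analyze records2;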
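
Finally, the normalization Gregg suggests could be sketched like this (hypothetical table and column names - sources, url, source_ref - chosen purely for illustration):

-- Move the long source URLs into a lookup table keyed by a small integer.
create table sources (
    id  serial primary key,
    url text not null unique
);

insert into sources (url)
select distinct source_id from records;

alter table records add column source_ref integer references sources (id);

update records r
set source_ref = s.id
from sources s
where s.url = r.source_id;

create index records_source_ref_id_idx on records (source_ref, id);

-- Queries would then join through the lookup table:
select r.*
from records r
join sources s on s.id = r.source_ref
where s.url = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'
order by r.id limit 200;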

