Re: Window functions and index usage

Поиск
Список
Период
Сортировка
От Robert Klemme
Тема Re: Window functions and index usage
Дата
Msg-id CAM9pMnPds50eKum7LWyxxXFGyiK=XX8cdm=2z53KZcsUKopp2g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Window functions and index usage  (Anssi Kääriäinen <anssi.kaariainen@thl.fi>)
Список pgsql-performance
On Tue, Oct 4, 2011 at 4:06 PM, Anssi Kääriäinen
<anssi.kaariainen@thl.fi> wrote:
> On 10/04/2011 04:27 PM, Robert Klemme wrote:
>>
>> On Tue, Oct 4, 2011 at 11:39 AM, Anssi Kääriäinen
>> <anssi.kaariainen@thl.fi>  wrote:
>>>
>>> I have the following setup:
>>>
>>> create table test(id integer, seq integer);
>>> insert into test select generate_series(0, 100), generate_series(0,
>>> 1000);
>>> create unique index test_idx on test(id, seq);
>>> analyze test;
>>>
>>> Now I try to fetch the latest 5 values per id, ordered by seq from the
>>> table:
>>>
>>> select * from (
>>> select id, seq, row_number() over (partition by id order by seq)
>>>  from test
>>>  where id in (1, 2, 3)
>>> ) where row_number()<= 5;
>>
>> It seems this fetches the *first* 5 values per id - and not the
>> latest.  For that you would need to "order by seq desc" in the window
>> and probably also in the index.
>>
> Yeah. Sorry, wrong order. And the last line is wrong, it should be ") tmp
> where row_number <= 5;".
>>>
>>> This does not use the index on id, seq for sorting the data. It uses a
>>> bitmap index scan, and sequential scan when issued SET enable_bitmapscan
>>> to
>>> false. Tested both on git head and 8.4.8. See end of post for plans. It
>>> seems it would be possible to fetch the first values per id using the
>>> index,
>>> or at least skip the sorting.
>>
>> Just guessing: since row_number is an analytic function and it can be
>> combined with any type of windowing only in "rare" cases do the
>> ordering of index columns and the ordering in the window align.  So
>> while your particular use case could benefit from this optimization
>> the overall judgement might be that it's not worthwhile to make the
>> optimizer more complex to cover this case - or I fail to see the more
>> general pattern here. :-)
>>
> I think there are common use cases for this - see end of message for an
> example.
>>>
>>> Is there some other way to run the query so that it would use the index?
>>> Is
>>> there plans to support the index usage for the above query assuming that
>>> it
>>> is possible to use the index for that query?
>>>
>>> The real world use case would be to fetch latest 5 threads per discussion
>>> forum in one query. Or fetch 3 latest comments for all patches in given
>>> commit fest in single query.
>>
>> Is it really that realistic that someone wants the latest n entries
>> for *all* threads / patches?  It seems since this can result in a very
>> large data set this is probably not the type of query which is done
>> often.
>
> The idea is that the dataset isn't that large.

But then why do require using the second index column in the first
place?  If the data set is small then the query is likely fast if the
selection via id can use any index.

> And you don't have to fetch
> them for all threads / patches. You might fetch them only for patches in
> currently viewed commit fest. See
> https://commitfest.postgresql.org/action/commitfest_view?id=12 for one such
> use. What I have in mind is fetching first all the patches in the commit
> fest in one go. Then issue query which would look something like:
>  select * from
>    (select comment_data, row_number() over (partition by patch_id order by
> comment_date desc)
>       from patch_comments
>     where patch_id in (list of patch_ids fetched in first query)
>   ) tmp where row_number <= 3;

Interesting: I notice that I the query cannot successfully be simplified on 8.4:

rklemme=> select *,
row_number() over (partition by id order by seq desc) as rn
from test
where id in (1,2,3)
and rn <= 3
;
ERROR:  column "rn" does not exist
LINE 5: and rn <= 3
            ^
rklemme=> select *,
row_number() over (partition by id order by seq desc) as rn
from test
where id in (1,2,3)
and row_number() <= 3;
ERROR:  window function call requires an OVER clause
LINE 5: and row_number() <= 3;
            ^
rklemme=> select *,
row_number() over (partition by id order by seq desc) as rn
from test
where id in (1,2,3)
and row_number() over (partition by id order by seq desc) <= 3;
ERROR:  window functions not allowed in WHERE clause
LINE 5: and row_number() over (partition by id order by seq desc) <=...
            ^
rklemme=>

I think I need to switch to 9.1 soon. :-)

> Now you have all the data needed for the first column in the above mentioned
> page.
>
> I guess the commit fest application is fetching all the comments for the
> patches in the commit fest in one query, and then doing the limit in
> application code. This is fine because there aren't that many comments per
> patch. But if you have dozens of forums and thousands of threads per forum
> you can't do that.
>
> This is useful in any situation where you want to show n latest entries
> instead of the last entry in a list view. Latest modifications to an object,
> latest commits to a file, latest messages to a discussion thread or latest
> payments per project. Or 5 most popular videos per category, 10 highest paid
> employees per department and so on.

Again, what is easy for you as a human will likely be quite complex
for the optimizer (knowing that the order by and the row_number output
align).

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

В списке pgsql-performance по дате отправления:

Предыдущее
От: Anssi Kääriäinen
Дата:
Сообщение: Re: Window functions and index usage
Следующее
От: Merlin Moncure
Дата:
Сообщение: Re: array_except -- Find elements that are not common to both arrays