Re: huge disparities in =/IN/BETWEEN performance

Поиск
Список
Период
Сортировка
От George Pavlov
Тема Re: huge disparities in =/IN/BETWEEN performance
Дата
Msg-id 8C5B026B51B6854CBE88121DBF097A868160D3@ehost010-33.exch010.intermedia.net
обсуждение исходный текст
Ответ на Re: huge disparities in =/IN/BETWEEN performance  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: huge disparities in =/IN/BETWEEN performance  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-sql
Thanks all for the various useful thoughts. Let me backtrack a bit and
state my real underlying issue a bit with actual examples. Hope not to
bore you with the length of this. Looks to me like an optimizer issue
unless I am missing something.

So, suppose I have a query:

select *
from stuff
inner join  ( -- just getting the distinct   -- user-stuff associations   -- since there may be multiple;   --
ultimatelyI need to query by user  select stuff_id, user_id   from stuff_user  group by 1,2 -- GROUP BY outperforms
DISTINCT) su using (stuff_id) 
left join ( -- this obtains summary statistics   -- about each stuff item  select stuff_id, count(*)  from stuff_events
group by 1 ) se 
using (stuff_id)
where user_id = 41

This is a very pared down version of what I have. And yes this specific
query can be rewritten as a single GROUP BY, but in the real world I am
gathering the aggregate statistics from several tables, so I actually
have several sub-recordsets similar to the one called "se" above.
Rewriting ALL those as a single GROUP BY is not feasible. I know, all
this cries for a single summarized rollup table, but let's not go there
(for now).

So running the above is inefficient. This particular user_id has only
one associated stuff_id and does not even have much data for that in
stuff_events. The query runs in ~4600ms. Were I to query by stuff_id
instead, things look great (if I change the where clause to the stuff_id
it runs in 25ms). When I query based on stuff_id the optimizer uses an
index on stuff_events.stuff_id. However, when I query by user_id it does
a Seq Scan on stuff_events. I somehow wish I could tell the optimizer to
first figure out which stuff_ids are related to the user_id that is
being asked for and then look ONLY those up in the stuff_events table
using the index on stuff_id.

It would seem (and this is where we get back to my original question)
that one should be able to just say:

select *
from stuff
left join (select stuff_id, count(*)  from stuff_events  group by 1 ) se
using (stuff_id)
where stuff_id in  (select distinct stuff_id  from stuff_user  where user_id = 41 )

You'd think that the subquery in the IN would be (very quickly) resolved
to a list of stuff_ids and then stuff_events would be accessed via its
stuff_id index. Instead, the Seq Scan on stuff_events still happens and
the query actually is even slower than the original, running in ~5500ms.

So one (very ugly) way to optimize the first query is to add an extra
join to stuff_user INSIDE the "se" subquery:

select *
from stuff
inner join  (select stuff_id, user_id   from stuff_user  group by 1,2 ) su using (stuff_id)
left join (select stuff_id, user_id, count(*)  from stuff_events  inner join     ( -- same subquery as above     select
stuff_id,user_id      from stuff_user     group by 1,2      ) su2    using (stuff_id)  group by 1,2 ) se 
using (stuff_id)
where user_id = 41;

This does improve things a lot, bringing the execution time for this
particular user to 3ms (!), but it is quite ugly and not fast enough for
me for a user_id with lots of associated stuff_ids.

George


В списке pgsql-sql по дате отправления:

Предыдущее
От: "Ezequias Rodrigues da Rocha"
Дата:
Сообщение: Have anyone this man e-mail ?
Следующее
От: Tom Lane
Дата:
Сообщение: Re: huge disparities in =/IN/BETWEEN performance