Re: BUG #17964: Missed query planner optimization

Поиск
Список
Период
Сортировка
От Mathias Kunter
Тема Re: BUG #17964: Missed query planner optimization
Дата
Msg-id a29936a0-e446-d4d1-6bd1-42d7212c26a0@gmail.com
обсуждение исходный текст
Ответ на Re: BUG #17964: Missed query planner optimization  (David Rowley <dgrowleyml@gmail.com>)
Список pgsql-bugs
> We'd need a way to determine which is cheaper based on estimated costs

Yes, that decision must of course be based on estimated costs, and also 
the cost estimation must be reasonably accurate. In all examples which 
I've seen so far, the cost estimations were adequate.

> that might be, in my opinion, fairly tricky to do based on the order of
> how things currently are done in the query planner.

I don't know the details about the current planner implementation, but 
the planner obviously can already correctly estimate costs for both the 
IN and ANY(ARRAY()) variants.

> I was just trying to show that it's quite a tricky problem to
> fix properly.

Yes, the tricky part obviously is to decide whether or not an IN clause 
should be converted to a semi-join, because that decision depends on the 
other IN clauses of the query as well. So, it's unfortunately not 
possible to decide independently for each IN clause.

As you've already pointed out, doing an exhaustive search of all 
possible combinations is obviously problematic, since we'd have to check 
2^N possible query plans (with N being the number of IN clauses of the 
query). So, this is certainly NOT the way to go, unless maybe N<4 or so.

> Having the planner consider the costs of converting the IN to a semi-
> join and converting or not based on cost seems like a worthy goal.

It obviously is. As shown, the performance improvement can be gigantic 
for certain queries.

> If you think it's quite an easy project, then we'll welcome
> contributions to improve this.

I'm not familiar with the current implementation, but my understanding 
is that this should be quite easy to implement for someone who is 
currently working on planner-related issues.

After all, a reasonable heuristic solution can be as simple as this:


1) Create the query plan for the original query, using the current 
planner implementation. (That's the status quo.)

2) Create a modified version of the original input query where all of 
the IN clauses are replaced with ANY(ARRAY()). Create the query plan for 
that modified query version, using the current planner implementation.

3) Compare the two previously created query plans and execute the one 
with the lower estimated total cost. If you don't trust the planner's 
cost estimations for some reason, then only execute the modified version 
if its estimated cost is SIGNIFICANTLY lower.


A simple solution like this would already be good enough to avoid many 
of the horrifying worst case scenarios, where a query takes minutes when 
it should actually only take milliseconds.

It would of course be possible (and desirable) to come up with a more 
sophisticated heuristic than the one given above. For example, create a 
third version of the original query where only those IN clauses are 
replaced with ANY(ARRAY()) where the estimated number of rows returned 
by the subselect is below a certain threshold, and estimate costs for 
that modified query as well.

> The pgsql-hackers as a good emailing list to bring up the topic

Thank you for the hint. I'll continue the discussion over there.

Mathias



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Junwang Zhao
Дата:
Сообщение: Re: No -d option
Следующее
От: Jelte Fennema
Дата:
Сообщение: Re: Internal error with types changes and prepared statements