Re: [HACKERS] Linear vs. nonlinear planner cost estimates

Поиск

Список

Период

Сортировка

От	Tom Lane
Тема	Re: [HACKERS] Linear vs. nonlinear planner cost estimates
Дата	14 декабря 2016 г. 23:46:47
Msg-id	2651.1481748407@sss.pgh.pa.us обсуждение
Ответ на	Re: [HACKERS] Linear vs. nonlinear planner cost estimates (Robert Haas <robertmhaas@gmail.com>)
Ответы	Re: [HACKERS] Linear vs. nonlinear planner cost estimates
Список	pgsql-hackers

Дерево обсуждения

Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Dec 14, 2016 at 2:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't have a concrete proposal right now about how to fix this.  The
>> most expansive response would be to decorate every path with an explicitly
>> nonlinear cost function, which would need to be able to report the cost
>> to fetch the first N tuples rather than the total scan cost.  That would
>> be a lot of work though.

> And it would have some significant overhead, I bet.

I think the extra cost of running such a function would only come in when
we're actually considering a query with LIMIT and are ready to choose
which path gets the LIMIT stuck on top of it.  That seems okay to me.
My real pain point with extra planner work is when it happens in queries
that get no benefit.

The really *critical* aspect of this, which could cost far more than a
few extra executions of a costing function, comes in if it damages
add_path's ability to prune inferior paths early.  So we would not want
to simply drop the startup_cost field; and we certainly don't want
add_path calling this function during every path comparison, either.
So my thought is that we'd need to keep startup_cost for add_path to
use, though we could (and likely should) redefine it as below.  We
should still be able to assume that if path A dominates path B at
first-tuple and last-tuple costs, it dominates everywhere in between.

A thought that occurred to me after more study of the problem example
is that the "good" index is a bit bigger than the "bad" one, meaning
it will get charged a bit more in index descent costs.  With our current
definition of startup_cost as being only the index descent cost, that
means the good index looks worse on startup_cost than the bad one.  In
this example, the good index should survive the add_path tournament
anyway because it has better total_cost --- but it's easy to imagine
cases where that wouldn't be true.  So I'm thinking that redefining
startup_cost as time to fetch the first tuple would be a good thing
in terms of making sure that desirable plans don't lose out in add_path.

>> Maybe we could make this work just by setting the startup cost of an
>> indexscan differently.  We could redefine "startup cost" as "cost to fetch
>> the first tuple and then stop", and compute that independently of the
>> total cost for plan types where it'd matter.  That would fix the problem
>> for LIMIT 1 cases, and even for cases with a larger limit, scaling
>> linearly from that to the total cost would produce a better estimate than
>> what we get now.  But it feels like it's still a band-aid.

> I think that would be a good place to start.  I bet it would help
> materially, and it's a lot less work than anything more sophisticated.
> There are plenty of cases (semi joins, and many nested loops that are
> not semi joins) where the number of rows that we expect to fetch is 1
> or very close to 1; LIMIT 100 or LIMIT 1000 happens, but it's a lot
> less common.

Yeah.  Also, per the further thoughts above, we'd probably still end up
redefining startup_cost like this even if we then plastered on a function
to do more complex interpolation in between first and last tuple.  So
it'd make sense to do this as a first step even if we add the additional
mechanism later.

I'll put this on my to-do list...
        regards, tom lane

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [HACKERS] Linear vs. nonlinear planner cost estimates