Re: Risk Estimation WAS: Planner hints in Postgresql

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: Risk Estimation WAS: Planner hints in Postgresql
Дата
Msg-id CA+TgmoYWM4xOW4V9-bHMiBdrmhC7ArBEED8jsUAR55u5Tzn84g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Risk Estimation WAS: Planner hints in Postgresql  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Risk Estimation WAS: Planner hints in Postgresql  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Risk Estimation WAS: Planner hints in Postgresql  (Atri Sharma <atri.jiit@gmail.com>)
Список pgsql-hackers
On Tue, Mar 18, 2014 at 2:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Atri Sharma <atri.jiit@gmail.com> writes:
>> One of the factors that leads to bad estimates is that the histogram of the
>> values of a column maintained by the planner gets old by time and the data
>> in the column changes. So, the histogram is no longer a quite accurate view
>> of the data and it leads to bad selectivity.
>
> TBH, this is so far down the list of problems that it'll be a long time
> before we need to worry about it.  It's certainly not the number one
> priority for any project to model risk in the planner.
>
> The thing that I think is probably the number one problem is estimates
> that depend on an assumption of uniform distribution of sought-after rows
> among those encountered by a scan.  This is usually where bad plans for
> LIMIT queries are coming from.  We could certainly add some sort of fudge
> factor to those costs, but I'd like to have a more-or-less principled
> framework for doing so.

I think the problem is, in some sense, more basic than that.  I think
the kind of query we're talking about here is:

SELECT * FROM foo WHERE unlikely ORDER BY indexed_column LIMIT 1

Assume for the sake of argument that there are 100 rows that would be
returned in the absence of the limit.  Let SC and TC be the startup
cost and total cost of the index scan.  As a matter of general policy,
we're going to say that the cost of this is SC + 0.01 * (TC - SC).
What makes this path look appealing to the planner is that SC is small
relative to TC.  If we knew, for example, that we weren't going to
find the first match until 90% of the way through the index scan, then
we could set SC = 90% * TC and, all else being equal, the planner
would make the right decision.

So you might think that the problem here is that we're assuming
uniform density.  Let's say there are a million rows in the table, and
there are 100 that match our criteria, so the first one is going to
happen 1/10,000'th of the way through the table.  Thus we set SC =
0.0001 * TC, and that turns out to be an underestimate if the
distribution isn't as favorable as we're hoping.  However, that is NOT
what we are doing.  What we are doing is setting SC = 0.  I mean, not
quite 0, but yeah, effectively 0. Essentially we're assuming that no
matter how selective the filter condition may be, we assume that it
will match *the very first row*.

So we're not assuming the average case and getting hosed when things
come out worse than average.  We're assuming the *best* case.  So
unless things happen to really swing in our favor, we got hosed.

Now it might be that a fudge factor of 2 or 1.5 or 10 or 3 or 17 is
appropriate, so that we actually assume we're going to have to scan a
little more of the index than we expect.  That can perhaps be
justified by the possibility that there may actually be NO rows
matching the filter condition, and we'll have to try scanning the
entire index to get off the ground.  We could also try to come up with
a mathematical model for that.  But that fudge factor would presumably
be a multiplier on the effort of finding the first tuple.  And right
now we assume that finding the first tuple will be trivial.  So I
think we should fix THAT problem first, and then if that turns out to
be insufficient, we can worry about what further fudging is required.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: Archive recovery won't be completed on some situation.
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Risk Estimation WAS: Planner hints in Postgresql