Re: index problems (again)

Поиск
Список
Период
Сортировка
От Geoff Winkless
Тема Re: index problems (again)
Дата
Msg-id CAEzk6fe2U16yJExFpwwSpSfrEzWHswW=0Z3w13GwByUp7gwUUA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: index problems (again)  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
Ответы Re: index problems (again)  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
Re: index problems (again)  (Jeff Janes <jeff.janes@gmail.com>)
Список pgsql-general
On 12 March 2016 at 18:43, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> On 2016-03-08 10:16:57 +0000, Geoff Winkless wrote:
>> On 7 March 2016 at 20:40, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>> > As Tom wrote, the estimate of having to read only about 140 rows is only
>> > valid if sc_id and sc_date are uncorrelated. In reality your query has
>> > to read a lot more than 140 rows, so it is much slower.
>>
>> But as I've said previously, even if I do select from scdate values
>> that I know to be in the first 1% of the data (supposedly the perfect
>> condition) the scan method is insignificantly quicker than the index
>> (scdate,scid) method.
>
> Actually the planner expects find a match within the first 0.0035 %, so
> to find out how fast that would be you would have to use a value from
> that range.

Fair point, although given my data I can't create a query that does
that since the first 0.3% or so all have the same value (although I
suppose I could modify the data to do so); however the thing I was
getting at was more that it doesn't have to be very far off perfect
distribution for the second method to be "better".

> The question is what can be done to improve the situation.
>
> Tom thinks that correlation statistics would help. That seems plausible
> to me.

I'm sure that correlation statistics would help in this one instance
(if Tom thinks so, I certainly wouldn't argue!). I was more looking at
the planner choices in general because I assumed (perhaps incorrectly)
that correlated columns aren't the only time this sort of best/worst
scenario hits it.

> You claim that no statistics are needed.

Well that's a bit confrontational.

> That may or may not be true: You haven't proposed an alternate method
> yet.

You could make an assumption that perfect distribution isn't true:
that actually the distribution is within a certain _deviation_ of that
perfect distribution. It wouldn't have to have been very much to make
the index-only scan win here and would still keep the planner from
choosing less optimal queries most of the time (and where it did end
up making the "wrong" choice it's not going to be far off anyway).

But I'm making assumptions here, I'm aware of that. Chances are that
actually most people's data _does_ fit into this perfect distribution
set. Is there any research that shows that real-world data usually
does?

> I feel fairly certain that using the worst case (the cost for scanning
> the whole table) would be just as bad in and would cause inferior plans
> to be used in many instances.

Oh I certainly agree with that.

As Jeff points out I'd have a much larger win in this instance by
someone spending the time implementing skip index scans rather than
messing with the planner :)

In the end I bit the bullet and converted the code to use
LEAST((SELECT MIN..),(SELECT MIN...),(SELECT MIN...)) which (being an
equality test) happily uses the (scdate,scid) index. Since the query
ends up producing an equivalent set I could (mostly) do a
search-replace, which made it a much simpler change than (for example)
the ,COUNT (*) version.

Geoff


В списке pgsql-general по дате отправления:

Предыдущее
От: Karsten Hilbert
Дата:
Сообщение: Re: Q: extract database name from directory dump
Следующее
От: Karsten Hilbert
Дата:
Сообщение: Re: Q: extract database name from directory dump