Re: Use of additional index columns in rows filtering

From: Tomas Vondra
Subject: Re: Use of additional index columns in rows filtering
Date:
Msg-id: eda1037d-8b10-3c85-6849-112895eefb02@enterprisedb.com
In response to: Re: Use of additional index columns in rows filtering  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
Hi,

It took me a while but I finally got back to working on this patch. I
reread the summary of principles Peter shared in August, and I do agree
with all the main points - the overall qual hierarchy and the various
examples and counter-examples.

I'll respond to a couple things in this sub-thread, and then I plan to
post an updated version of the patch with a discussion of a couple
problems I still haven't managed to solve (not necessarily new ones).

On 8/19/23 00:19, Peter Geoghegan wrote:
> On Thu, Aug 17, 2023 at 4:29 PM Jeff Davis <pgsql@j-davis.com> wrote:
>> On Wed, 2023-08-09 at 17:14 -0700, Peter Geoghegan wrote:
>>> + Index quals are better than equivalent index filters because bitmap
>>> index scans can only use index quals
>>
>> It seems there's consensus that:
>>
>>   * Index Filters (Tomas's patch and the topic of this thread) are more
>> general, because they can work on any expression, e.g. 1/x, which can
>> throw an error if x=0.
> 
> Agreed, but with one small caveat: they're not more general when it
> comes to bitmap index scans, which seem like they'll only ever be able
> to support index quals. But AFAICT that's the one and only exception.
> 

Yeah. Some conditions are never going to be translated into proper index
quals, either because it's not really possible or simply because no one
has implemented that yet.

FWIW it's not immediately obvious to me whether bitmap index scans are
unable to support index filters because of some inherent limitation, or
whether it's something we could implement. Some index AMs simply can't
support index filters, because they don't know the "full" index tuple
(e.g. GIN), but maybe other AMs could support that ...

>>   * Index quals are more optimizable, because such operators are not
>> supposed to throw errors or have side effects, so we can evaluate them
>> before determining visibility.
> 
> Right. Though there is a second reason: the index AM can use them to
> navigate the index more efficiently with true index quals. At least in
> the case of SAOPs, and perhaps in other cases in the future.
> 
>> I wouldn't describe one as "better" than the other, but I assume you
>> meant "more optimizable".
> 
> The use of the term "better" is actually descriptive of Tomas' patch.
> It is structured in a way that conditions the use of index filters on
> the absence of equivalent index quals for eligible/indexed clauses. So
> it really does make sense to think of it as a "qual hierarchy" (to use
> Tomas' term).
> 
> It's also true that it will always be fundamentally impossible to use
> index quals for a significant proportion of all clause types, so
> "better when feasible at all" is slightly more accurate.
> 

Yeah, I agree with all of this too. I think that in all practical cases
an index qual is strictly better than an index filter, for exactly these
reasons.

>> Is there any part of the design here that's preventing this patch from
>> moving forward? If not, what are the TODOs for the current patch, or is
>> it just waiting on more review?
> 
> Practically speaking, I have no objections to moving ahead with this.
> I believe that the general structure of the patch makes sense, and I
> don't expect Tomas to do anything that wasn't already expected before
> I chimed in.
> 

I agree the general idea is sound. The main "problem" in the original
patch was about costing changes and the implied risk of picking the
wrong plan (instead of e.g. a bitmap index scan). That's pretty much
what the "risk" in example (4) in the principles.txt is about, AFAICS.

I think the consensus is that if we don't change the cost, this issue
mostly goes away - we won't pick index scan in cases where we'd pick a
different plan without the patch, and for index scans it's (almost)
always correct to replace "table filter" with "index filter".

If that issue goes away, I think there are two smaller ones - picking
which conditions to promote to index filters, and actually tweaking the
index scan code to do that.

The first issue seems similar to the costing issue - in fact, it really
is a costing problem. The goal is to minimize the amount of work needed
to evaluate the conditions on all rows, and if we knew selectivity+cost
for each condition, it'd be a fairly simple matter. But in practice we
don't know this - the selectivity estimates are rather shaky (and doubly
so for correlated conditions), and the procedure cost estimates are
quite crude too ...
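
Just to make the "fairly simple matter" concrete: with trustworthy
per-clause estimates, the thing to minimize would be the expected
per-row cost of evaluating the AND-ed clause list, where a clause is
only reached if all the earlier ones passed. A minimal sketch (the names
ClauseEstimate / expected_eval_cost are made up for illustration, not
from the patch):

    /* Hypothetical per-clause estimates, not the patch's structures. */
    typedef struct ClauseEstimate
    {
        double      cost;           /* per-row evaluation cost (cf. procost) */
        double      selectivity;    /* fraction of rows passing the clause */
    } ClauseEstimate;

    /*
     * Expected per-row cost of evaluating the clauses in the given order,
     * assuming the list short-circuits on the first false result and the
     * clauses are independent.
     */
    static double
    expected_eval_cost(const ClauseEstimate *clauses, int nclauses)
    {
        double      total = 0.0;
        double      reach = 1.0;    /* fraction of rows reaching clause i */

        for (int i = 0; i < nclauses; i++)
        {
            total += reach * clauses[i].cost;
            reach *= clauses[i].selectivity;
        }

        return total;
    }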

The risk is that we might move a very expensive condition "forward".
Imagine a condition with procost=1000000, which would normally be
evaluated last, after all other clauses. But if we promote it to an
index filter, that'd no longer be true.

What Jeff proposed last week was roughly this (a sketch follows the
list):

- find min(procost) for clauses that can't be index filters
- consider all clauses with procost <= min(procost) as index filters
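
A minimal sketch of that heuristic, with made-up names (ClauseInfo,
can_be_index_filter) standing in for whatever the patch actually tracks:

    #include <float.h>
    #include <stdbool.h>

    /* Hypothetical per-clause info, not the actual patch structures. */
    typedef struct ClauseInfo
    {
        bool        can_be_index_filter;    /* evaluable on the index tuple? */
        double      procost;                /* cost of the clause's function */
    } ClauseInfo;

    /*
     * Promote to index filters only the clauses that are no more expensive
     * than the cheapest clause that has to remain a table filter.
     */
    static void
    choose_index_filters(const ClauseInfo *clauses, int nclauses,
                         bool *use_as_index_filter)
    {
        double      min_table_cost = DBL_MAX;

        for (int i = 0; i < nclauses; i++)
        {
            if (!clauses[i].can_be_index_filter &&
                clauses[i].procost < min_table_cost)
                min_table_cost = clauses[i].procost;
        }

        for (int i = 0; i < nclauses; i++)
            use_as_index_filter[i] = clauses[i].can_be_index_filter &&
                clauses[i].procost <= min_table_cost;
    }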

But after looking at this I realized order_qual_clauses() does a very
similar thing for regular clauses, and the proposed logic contradicts
the heuristics in order_qual_clauses() a bit.

In particular, order_qual_clauses() orders the clauses by procost, but
then it's careful not to reorder the clauses with the same procost. With
the proposed heuristics, that'd change - the clauses might get reordered
in various ways, possibly with procost values [1, 10, 100, 1, 10, 100]
or something like that.
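
For comparison, the ordering order_qual_clauses() aims for can be
sketched as a sort on cost that keeps equal-cost clauses in their
original order (a simplification for illustration; the real function
also has to deal with security levels):

    #include <stdlib.h>

    typedef struct QualItem
    {
        double      cost;           /* estimated per-row cost, cf. procost */
        int         orig_index;     /* position in the original qual list */
        /* the clause itself would live here too */
    } QualItem;

    /* Order by cost, but keep the original order for equal-cost clauses. */
    static int
    qual_item_cmp(const void *a, const void *b)
    {
        const QualItem *qa = (const QualItem *) a;
        const QualItem *qb = (const QualItem *) b;

        if (qa->cost != qb->cost)
            return (qa->cost < qb->cost) ? -1 : 1;
        if (qa->orig_index != qb->orig_index)
            return (qa->orig_index < qb->orig_index) ? -1 : 1;
        return 0;
    }

    /*
     * qsort(items, n, sizeof(QualItem), qual_item_cmp) then yields an
     * order where ties are broken by the original position, i.e. no
     * reordering of clauses with the same procost.
     */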

Not sure if we want to relax the heuristics like this. There's also the
practical issue that order_qual_clauses() gets called quite late, long
after the index filters were built. And it also considers security
levels, which I ignored until now.

But now that I think about it, with the original patch this had to be
decided when building the path, because that's when the costing happens.
With the costing changes removed, we can probably do that much later,
while creating the plan, after order_qual_clauses() does all the work.

We could walk the qpqual list, and stop on the first element that can't
be treated as index filter. That'd also handle the security levels, and
it'd *almost* do what Jeff proposed, I think.
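
A sketch of that walk using the backend's List API; note that
clause_can_be_index_filter() is just a placeholder for whatever test the
patch ends up using, not an existing function:

    #include "postgres.h"
    #include "nodes/pathnodes.h"

    /*
     * Split the already-ordered qual list into a leading run of index
     * filters and the remaining quals.  We stop promoting at the first
     * clause that can't be an index filter, so the ordering (and the
     * security levels) established by order_qual_clauses() is preserved.
     */
    static void
    split_index_filters(List *ordered_quals, IndexOptInfo *index,
                        List **index_filters, List **table_quals)
    {
        bool        promoting = true;
        ListCell   *lc;

        *index_filters = NIL;
        *table_quals = NIL;

        foreach(lc, ordered_quals)
        {
            Node       *clause = (Node *) lfirst(lc);

            /* placeholder check, not a real API */
            if (promoting && clause_can_be_index_filter(clause, index))
                *index_filters = lappend(*index_filters, clause);
            else
            {
                promoting = false;
                *table_quals = lappend(*table_quals, clause);
            }
        }
    }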

The main difference is that it'd not reorder clauses with the same
procost. With multiple clauses with the same procost, it'd depend on
which one happens to be first in the list, which AFAIK depends on the
order of conditions in the query. That may be a bit surprising, I guess,
because it seems natural that in a declarative language this should not
really matter. Yes, we already depend on the query order like that, but
it probably does not have a huge impact. If it affects whether a
condition will be treated as an index filter, the differences are likely
to be much more significant.

(FWIW this reminds me of the issues we had with the GROUP BY
optimization, which ultimately got reverted because the reordering was
considered too risky. I'm afraid we might end up in the same
situation ...)


As for the other issue, that's more about how to deal with the various
quals in the generic scan code. Imagine we have multiple conditions,
and some of them can be treated as index filters. So for example we may
end up with this:

  quals = [A, B, C, D]
  index-filters = [A, B]

And now a tuple can fall into three categories:

  a) all-visible=false, i.e. can't use index filter => quals

  b) all-visible=true, and index-filters evaluates to false

  c) all-visible=true, and index-filters evaluates to true

The first two cases are fine, but for (c) we need to also evaluate the
remaining non-filter quals [C, D]. We may build such a list, and the 0002
patch does that, and shows the list in explain. But where/how would we
evaluate it? The code in execScan.c uses ScanState->ps.qual, which is set
to [A,B,C,D]. That is correct, but it does more work than strictly
necessary - the index filters are evaluated again. We can't easily change
that to [C,D] - that would break the (a) case. Which quals have to be
evaluated on the heap tuple depends on the visibility map and on the
index-filter result.
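
Schematically, the per-tuple work we'd like to end up with looks
something like this (pseudo-C with stand-in helper names, not actual
executor code; the trouble is that execScan.c only has the single
ps.qual list to express this with):

    /*
     *   index_filters = [A, B]      -- evaluable on the index tuple
     *   other_quals   = [C, D]      -- must be checked on the heap tuple
     *   all_quals     = [A, B, C, D]
     */
    if (!all_visible)
    {
        /* (a) can't trust the index tuple, check everything on the heap */
        fetch_heap_tuple();
        keep = evaluate_quals(all_quals, heap_tuple);
    }
    else if (!evaluate_quals(index_filters, index_tuple))
    {
        /* (b) index filters reject the tuple, no heap access needed */
        keep = false;
    }
    else
    {
        /* (c) index filters pass, fetch the heap tuple, check only [C, D] */
        fetch_heap_tuple();
        keep = evaluate_quals(other_quals, heap_tuple);
    }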

I think there are two options. First, we may accept that some of the
index filters may get evaluated twice. That's not great, because it
increases the risk of regressions. Second, we could disable
ScanState->ps.qual if there are index filters, and just do all the work
in nodeIndexscan.

But that seems quite ugly - in a way, the code already does that, which
is where the two loops

    while (true)
    {
        for (;;)
        {
            ...
        }
    }

come from. I was hoping to simplify / get rid of this, not make it do
even more stuff ... :-(


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
