Обсуждение: Okay to tighten definition of oprcanhash?

Поиск
Список
Период
Сортировка

Okay to tighten definition of oprcanhash?

От
Tom Lane
Дата:
I have been looking into the possibility of using a hashtable to speed
up "x IN (SELECT y FROM ...)" operations.  Basically the idea is to run
the subselect once, loading its "y" outputs into an in-memory hashtable
(any duplicates can be discarded); then for each outer row, probe into
the hashtable to see if there's a match for the outer "x" value.

This relatively simple idea gets a lot messier as soon as you start to
consider the fine points of SQL92 "IN" semantics, however.  In
particular, if x is null or any particular y is null, the result of the
IN should be null ("unknown"), not false.  What the spec really says is:
1. If "x = y" yields TRUE for any row of the sub-select,   the IN yields TRUE.
2. If "x = y" yields FALSE for all rows of the sub-select,   or the sub-select contains no rows, the IN yields FALSE.
3. Otherwise (which by elimination means "x = y" yielded UNKNOWN   for at least one row, and TRUE nowhere), the IN
yieldsUNKNOWN.
 

What we need to know about, therefore, is not so much whether x or y is
null as whether the "=" operator yields null.

Also, IN supports multi-column comparisons --- eg,(x1, x2) IN (SELECT y1, y2 FROM ...)
So the x or y value could be partially null as well as completely null.

I think I see how to do this in what will usually be a fairly efficient
fashion, but it requires assuming a good deal about the behavior of the
"=" operator(s) involved.  The actual implementation has to be:

1. Store all-not-null subquery rows in hashtable (combining equal rows
into a single entry).  Store partially or completely null rows in a
separate linear list, combining non-distinct rows into a single entry.

2a. For LHS row all null: result is FALSE if table is empty, else UNKNOWN.

2b. For LHS row partially-null: must compare against every element of
both hashtable and linear list, using same logic as for scanning
original subquery output (three-valued combination of OR of ANDs).
We know that we will not get a TRUE result, so we can stop as soon as
we find an UNKNOWN result for any row; this says it's worthwhile to scan
the linear list first, as that's most likely to give us an UNKNOWN.

2c. For LHS all-non-null: probe into hashtable for equality; result is
TRUE if match found.  Otherwise, must scan list of partially-null rows
to see if any non-distinct rows are found; if so return UNKNOWN, else
return FALSE.

This should be reasonably efficient because the slow cases only occur
when there are partially-null LHS or RHS rows, which should be
relatively uncommon.  (In particular, when there is only one column to
compare, case 2b doesn't arise at all, and there can never be more than
one entry in the linear list, namely RHS = null.)  Also, the number of
distinct partially null rows should usually be much less than the number
of distinct no-nulls rows, so the linear list should be much smaller
than the hashtable.  (I tried to think of ways to hash this list, but it
doesn't seem to work...)

Now, the assumptions this makes about the behavior of the "="
operator(s) are:

1. The operators are hashable (ie, only values producing the same
hashkey could possibly be equal).  Else we can't use a hashtable at all.

2. The operators are strict (must yield NULL out for NULL in).  This is
what lets us avoid a table scan in case 2a.

3. The operators do not yield NULL for non-null inputs.  Otherwise,
after case 2c fails to find a TRUE result, we'd have to scan the whole
hashtable, not only the partially-null list, to see if there are rows
that must cause us to return UNKNOWN.

We already have markers in pg_operator and pg_proc to indicate
hashability and strictness, but there's no marker to indicate "reverse
strictness" as per assumption #3 (what's the proper term for that
property, anyway?)

I am leaning towards documenting that the oprcanhash marker implies
"reverse strictness" as well.  All the currently hashable operators meet
that requirement.  Although you could imagine cases where you might like
"=" to return null for non-null inputs (for example, a rigorous
interpretation of IEEE float math would probably say that a NaN compared
to anything else yields UNKNOWN), it's not really a very practical thing
to do in SQL --- one counterexample is that such a datatype could not be
btree-indexable, since btree assumes a total ordering is possible.

The other thing we could do is put a separate "reverse strict" marker
column into pg_proc, but this seems like much more work than is
justified.

Comments?

(Oh BTW: I'm aware that most of the problem goes away for IN appearing
at the top level of WHERE, since there we don't actually need to
distinguish FALSE from UNKNOWN results.  But I'm trying to understand
how to optimize the general case, too.)
        regards, tom lane


Re: Okay to tighten definition of oprcanhash?

От
Greg Stark
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I have been looking into the possibility of using a hashtable to speed
> up "x IN (SELECT y FROM ...)" operations.  

That's certainly one of the join types that Oracle can perform, and it's
frequently by far the fastest.

I'm not sure but I think the way Oracle optimizes subselects is by
transforming them into the equivalent join. Or rather, probably by
transforming both joins and subselects into an equivalent internal
representation.

This applies equally to things likeWHERE x=(select..)
as well as things likeWHERE x IN (select...)

The former is exactly equivalent to a join with an assertion of uniqueness.
The latter is a little different but still equivalent to a join with special
behaviour in case of duplicates. In many cases the database will have a
constraint telling it that no duplicates will appear in which cases it should
be able to optimize out any extra work anyways.

I know when I guided less experienced SQL programmers I urged them not to
worry about which form would perform better, only which form more clearly
expressed the result set they wanted. I promised them the database should
produce the same query plan for equivalent queries regardless of the form
chosen to express that in.

That was almost always true for Oracle. I generally found it impossible to
optimize a query merely by changing it from one equivalent form to another.
Oracle nearly always produced exactly the same plan. I always had to either
find a non-equivalent form that I knew would produce the same results only
because of extra information I had about the data, or add optimizer hints.

So is there some more general internal representation that can represent all
three of these cases in a consistent manner? It seems more powerful to
implement hash joins in a way that helps normal join queries as well as
particular subselect forms of queries.

For what it's worth my limited experience so far with postgres is that what
you're talking about is sorely needed. I'm having trouble getting queries that
I would write without thinking twice about on Oracle to perform reasonably on
postgres even after trying all kinds of contortions. And they're precisely the
types of queries that would be helped by hash joins.

-- 
greg



Re: Okay to tighten definition of oprcanhash?

От
Tom Lane
Дата:
Greg Stark <gsstark@mit.edu> writes:
> I'm not sure but I think the way Oracle optimizes subselects is by
> transforming them into the equivalent join.

The point here is that there is no exactly equivalent join operation.

(Of course, given Oracle's known lack of standards-compliance on NULL
semantics, I wouldn't be overly surprised if they've misimplemented IN
in a way that doesn't preserve the spec's semantics ...)

It does get a lot simpler when the IN appears as a top-level WHERE
clause, because *in that context* you can ignore the difference between
FALSE and UNKNOWN results from IN.  I have some other plans for
implementing IN in a join-like fashion in that special case.  But what
I'm looking at right now is the general case ...
        regards, tom lane


Re: Okay to tighten definition of oprcanhash?

От
Bruce Momjian
Дата:
I like the idea, and see need to keep a separate list of NULL values.

---------------------------------------------------------------------------

Tom Lane wrote:
> I have been looking into the possibility of using a hashtable to speed
> up "x IN (SELECT y FROM ...)" operations.  Basically the idea is to run
> the subselect once, loading its "y" outputs into an in-memory hashtable
> (any duplicates can be discarded); then for each outer row, probe into
> the hashtable to see if there's a match for the outer "x" value.
> 
> This relatively simple idea gets a lot messier as soon as you start to
> consider the fine points of SQL92 "IN" semantics, however.  In
> particular, if x is null or any particular y is null, the result of the
> IN should be null ("unknown"), not false.  What the spec really says is:
> 
>     1. If "x = y" yields TRUE for any row of the sub-select,
>        the IN yields TRUE.
> 
>     2. If "x = y" yields FALSE for all rows of the sub-select,
>        or the sub-select contains no rows, the IN yields FALSE.
> 
>     3. Otherwise (which by elimination means "x = y" yielded UNKNOWN
>        for at least one row, and TRUE nowhere), the IN yields UNKNOWN.
> 
> What we need to know about, therefore, is not so much whether x or y is
> null as whether the "=" operator yields null.
> 
> Also, IN supports multi-column comparisons --- eg,
>     (x1, x2) IN (SELECT y1, y2 FROM ...)
> So the x or y value could be partially null as well as completely null.
> 
> I think I see how to do this in what will usually be a fairly efficient
> fashion, but it requires assuming a good deal about the behavior of the
> "=" operator(s) involved.  The actual implementation has to be:
> 
> 1. Store all-not-null subquery rows in hashtable (combining equal rows
> into a single entry).  Store partially or completely null rows in a
> separate linear list, combining non-distinct rows into a single entry.
> 
> 2a. For LHS row all null: result is FALSE if table is empty, else UNKNOWN.
> 
> 2b. For LHS row partially-null: must compare against every element of
> both hashtable and linear list, using same logic as for scanning
> original subquery output (three-valued combination of OR of ANDs).
> We know that we will not get a TRUE result, so we can stop as soon as
> we find an UNKNOWN result for any row; this says it's worthwhile to scan
> the linear list first, as that's most likely to give us an UNKNOWN.
> 
> 2c. For LHS all-non-null: probe into hashtable for equality; result is
> TRUE if match found.  Otherwise, must scan list of partially-null rows
> to see if any non-distinct rows are found; if so return UNKNOWN, else
> return FALSE.
> 
> This should be reasonably efficient because the slow cases only occur
> when there are partially-null LHS or RHS rows, which should be
> relatively uncommon.  (In particular, when there is only one column to
> compare, case 2b doesn't arise at all, and there can never be more than
> one entry in the linear list, namely RHS = null.)  Also, the number of
> distinct partially null rows should usually be much less than the number
> of distinct no-nulls rows, so the linear list should be much smaller
> than the hashtable.  (I tried to think of ways to hash this list, but it
> doesn't seem to work...)
> 
> Now, the assumptions this makes about the behavior of the "="
> operator(s) are:
> 
> 1. The operators are hashable (ie, only values producing the same
> hashkey could possibly be equal).  Else we can't use a hashtable at all.
> 
> 2. The operators are strict (must yield NULL out for NULL in).  This is
> what lets us avoid a table scan in case 2a.
> 
> 3. The operators do not yield NULL for non-null inputs.  Otherwise,
> after case 2c fails to find a TRUE result, we'd have to scan the whole
> hashtable, not only the partially-null list, to see if there are rows
> that must cause us to return UNKNOWN.
> 
> We already have markers in pg_operator and pg_proc to indicate
> hashability and strictness, but there's no marker to indicate "reverse
> strictness" as per assumption #3 (what's the proper term for that
> property, anyway?)
> 
> I am leaning towards documenting that the oprcanhash marker implies
> "reverse strictness" as well.  All the currently hashable operators meet
> that requirement.  Although you could imagine cases where you might like
> "=" to return null for non-null inputs (for example, a rigorous
> interpretation of IEEE float math would probably say that a NaN compared
> to anything else yields UNKNOWN), it's not really a very practical thing
> to do in SQL --- one counterexample is that such a datatype could not be
> btree-indexable, since btree assumes a total ordering is possible.
> 
> The other thing we could do is put a separate "reverse strict" marker
> column into pg_proc, but this seems like much more work than is
> justified.
> 
> Comments?
> 
> (Oh BTW: I'm aware that most of the problem goes away for IN appearing
> at the top level of WHERE, since there we don't actually need to
> distinguish FALSE from UNKNOWN results.  But I'm trying to understand
> how to optimize the general case, too.)
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
> http://archives.postgresql.org
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073