Re: Okay to tighten definition of oprcanhash?

Поиск
Список
Период
Сортировка
От Bruce Momjian
Тема Re: Okay to tighten definition of oprcanhash?
Дата
Msg-id 200212202209.gBKM9Bk22937@candle.pha.pa.us
обсуждение исходный текст
Ответ на Okay to tighten definition of oprcanhash?  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
I like the idea, and see need to keep a separate list of NULL values.

---------------------------------------------------------------------------

Tom Lane wrote:
> I have been looking into the possibility of using a hashtable to speed
> up "x IN (SELECT y FROM ...)" operations.  Basically the idea is to run
> the subselect once, loading its "y" outputs into an in-memory hashtable
> (any duplicates can be discarded); then for each outer row, probe into
> the hashtable to see if there's a match for the outer "x" value.
> 
> This relatively simple idea gets a lot messier as soon as you start to
> consider the fine points of SQL92 "IN" semantics, however.  In
> particular, if x is null or any particular y is null, the result of the
> IN should be null ("unknown"), not false.  What the spec really says is:
> 
>     1. If "x = y" yields TRUE for any row of the sub-select,
>        the IN yields TRUE.
> 
>     2. If "x = y" yields FALSE for all rows of the sub-select,
>        or the sub-select contains no rows, the IN yields FALSE.
> 
>     3. Otherwise (which by elimination means "x = y" yielded UNKNOWN
>        for at least one row, and TRUE nowhere), the IN yields UNKNOWN.
> 
> What we need to know about, therefore, is not so much whether x or y is
> null as whether the "=" operator yields null.
> 
> Also, IN supports multi-column comparisons --- eg,
>     (x1, x2) IN (SELECT y1, y2 FROM ...)
> So the x or y value could be partially null as well as completely null.
> 
> I think I see how to do this in what will usually be a fairly efficient
> fashion, but it requires assuming a good deal about the behavior of the
> "=" operator(s) involved.  The actual implementation has to be:
> 
> 1. Store all-not-null subquery rows in hashtable (combining equal rows
> into a single entry).  Store partially or completely null rows in a
> separate linear list, combining non-distinct rows into a single entry.
> 
> 2a. For LHS row all null: result is FALSE if table is empty, else UNKNOWN.
> 
> 2b. For LHS row partially-null: must compare against every element of
> both hashtable and linear list, using same logic as for scanning
> original subquery output (three-valued combination of OR of ANDs).
> We know that we will not get a TRUE result, so we can stop as soon as
> we find an UNKNOWN result for any row; this says it's worthwhile to scan
> the linear list first, as that's most likely to give us an UNKNOWN.
> 
> 2c. For LHS all-non-null: probe into hashtable for equality; result is
> TRUE if match found.  Otherwise, must scan list of partially-null rows
> to see if any non-distinct rows are found; if so return UNKNOWN, else
> return FALSE.
> 
> This should be reasonably efficient because the slow cases only occur
> when there are partially-null LHS or RHS rows, which should be
> relatively uncommon.  (In particular, when there is only one column to
> compare, case 2b doesn't arise at all, and there can never be more than
> one entry in the linear list, namely RHS = null.)  Also, the number of
> distinct partially null rows should usually be much less than the number
> of distinct no-nulls rows, so the linear list should be much smaller
> than the hashtable.  (I tried to think of ways to hash this list, but it
> doesn't seem to work...)
> 
> Now, the assumptions this makes about the behavior of the "="
> operator(s) are:
> 
> 1. The operators are hashable (ie, only values producing the same
> hashkey could possibly be equal).  Else we can't use a hashtable at all.
> 
> 2. The operators are strict (must yield NULL out for NULL in).  This is
> what lets us avoid a table scan in case 2a.
> 
> 3. The operators do not yield NULL for non-null inputs.  Otherwise,
> after case 2c fails to find a TRUE result, we'd have to scan the whole
> hashtable, not only the partially-null list, to see if there are rows
> that must cause us to return UNKNOWN.
> 
> We already have markers in pg_operator and pg_proc to indicate
> hashability and strictness, but there's no marker to indicate "reverse
> strictness" as per assumption #3 (what's the proper term for that
> property, anyway?)
> 
> I am leaning towards documenting that the oprcanhash marker implies
> "reverse strictness" as well.  All the currently hashable operators meet
> that requirement.  Although you could imagine cases where you might like
> "=" to return null for non-null inputs (for example, a rigorous
> interpretation of IEEE float math would probably say that a NaN compared
> to anything else yields UNKNOWN), it's not really a very practical thing
> to do in SQL --- one counterexample is that such a datatype could not be
> btree-indexable, since btree assumes a total ordering is possible.
> 
> The other thing we could do is put a separate "reverse strict" marker
> column into pg_proc, but this seems like much more work than is
> justified.
> 
> Comments?
> 
> (Oh BTW: I'm aware that most of the problem goes away for IN appearing
> at the top level of WHERE, since there we don't actually need to
> distinguish FALSE from UNKNOWN results.  But I'm trying to understand
> how to optimize the general case, too.)
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
> http://archives.postgresql.org
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Okay to tighten definition of oprcanhash?
Следующее
От: Ryan Mahoney
Дата:
Сообщение: plpgsql and index usage