Okay to tighten definition of oprcanhash?
От | Tom Lane |
---|---|
Тема | Okay to tighten definition of oprcanhash? |
Дата | |
Msg-id | 24613.1040416125@sss.pgh.pa.us обсуждение исходный текст |
Ответы |
Re: Okay to tighten definition of oprcanhash?
(Greg Stark <gsstark@mit.edu>)
Re: Okay to tighten definition of oprcanhash? (Bruce Momjian <pgman@candle.pha.pa.us>) |
Список | pgsql-hackers |
I have been looking into the possibility of using a hashtable to speed up "x IN (SELECT y FROM ...)" operations. Basically the idea is to run the subselect once, loading its "y" outputs into an in-memory hashtable (any duplicates can be discarded); then for each outer row, probe into the hashtable to see if there's a match for the outer "x" value. This relatively simple idea gets a lot messier as soon as you start to consider the fine points of SQL92 "IN" semantics, however. In particular, if x is null or any particular y is null, the result of the IN should be null ("unknown"), not false. What the spec really says is: 1. If "x = y" yields TRUE for any row of the sub-select, the IN yields TRUE. 2. If "x = y" yields FALSE for all rows of the sub-select, or the sub-select contains no rows, the IN yields FALSE. 3. Otherwise (which by elimination means "x = y" yielded UNKNOWN for at least one row, and TRUE nowhere), the IN yieldsUNKNOWN. What we need to know about, therefore, is not so much whether x or y is null as whether the "=" operator yields null. Also, IN supports multi-column comparisons --- eg,(x1, x2) IN (SELECT y1, y2 FROM ...) So the x or y value could be partially null as well as completely null. I think I see how to do this in what will usually be a fairly efficient fashion, but it requires assuming a good deal about the behavior of the "=" operator(s) involved. The actual implementation has to be: 1. Store all-not-null subquery rows in hashtable (combining equal rows into a single entry). Store partially or completely null rows in a separate linear list, combining non-distinct rows into a single entry. 2a. For LHS row all null: result is FALSE if table is empty, else UNKNOWN. 2b. For LHS row partially-null: must compare against every element of both hashtable and linear list, using same logic as for scanning original subquery output (three-valued combination of OR of ANDs). We know that we will not get a TRUE result, so we can stop as soon as we find an UNKNOWN result for any row; this says it's worthwhile to scan the linear list first, as that's most likely to give us an UNKNOWN. 2c. For LHS all-non-null: probe into hashtable for equality; result is TRUE if match found. Otherwise, must scan list of partially-null rows to see if any non-distinct rows are found; if so return UNKNOWN, else return FALSE. This should be reasonably efficient because the slow cases only occur when there are partially-null LHS or RHS rows, which should be relatively uncommon. (In particular, when there is only one column to compare, case 2b doesn't arise at all, and there can never be more than one entry in the linear list, namely RHS = null.) Also, the number of distinct partially null rows should usually be much less than the number of distinct no-nulls rows, so the linear list should be much smaller than the hashtable. (I tried to think of ways to hash this list, but it doesn't seem to work...) Now, the assumptions this makes about the behavior of the "=" operator(s) are: 1. The operators are hashable (ie, only values producing the same hashkey could possibly be equal). Else we can't use a hashtable at all. 2. The operators are strict (must yield NULL out for NULL in). This is what lets us avoid a table scan in case 2a. 3. The operators do not yield NULL for non-null inputs. Otherwise, after case 2c fails to find a TRUE result, we'd have to scan the whole hashtable, not only the partially-null list, to see if there are rows that must cause us to return UNKNOWN. We already have markers in pg_operator and pg_proc to indicate hashability and strictness, but there's no marker to indicate "reverse strictness" as per assumption #3 (what's the proper term for that property, anyway?) I am leaning towards documenting that the oprcanhash marker implies "reverse strictness" as well. All the currently hashable operators meet that requirement. Although you could imagine cases where you might like "=" to return null for non-null inputs (for example, a rigorous interpretation of IEEE float math would probably say that a NaN compared to anything else yields UNKNOWN), it's not really a very practical thing to do in SQL --- one counterexample is that such a datatype could not be btree-indexable, since btree assumes a total ordering is possible. The other thing we could do is put a separate "reverse strict" marker column into pg_proc, but this seems like much more work than is justified. Comments? (Oh BTW: I'm aware that most of the problem goes away for IN appearing at the top level of WHERE, since there we don't actually need to distinguish FALSE from UNKNOWN results. But I'm trying to understand how to optimize the general case, too.) regards, tom lane
В списке pgsql-hackers по дате отправления: