Re: Okay to tighten definition of oprcanhash?
От | Bruce Momjian |
---|---|
Тема | Re: Okay to tighten definition of oprcanhash? |
Дата | |
Msg-id | 200212202209.gBKM9Bk22937@candle.pha.pa.us обсуждение исходный текст |
Ответ на | Okay to tighten definition of oprcanhash? (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-hackers |
I like the idea, and see need to keep a separate list of NULL values. --------------------------------------------------------------------------- Tom Lane wrote: > I have been looking into the possibility of using a hashtable to speed > up "x IN (SELECT y FROM ...)" operations. Basically the idea is to run > the subselect once, loading its "y" outputs into an in-memory hashtable > (any duplicates can be discarded); then for each outer row, probe into > the hashtable to see if there's a match for the outer "x" value. > > This relatively simple idea gets a lot messier as soon as you start to > consider the fine points of SQL92 "IN" semantics, however. In > particular, if x is null or any particular y is null, the result of the > IN should be null ("unknown"), not false. What the spec really says is: > > 1. If "x = y" yields TRUE for any row of the sub-select, > the IN yields TRUE. > > 2. If "x = y" yields FALSE for all rows of the sub-select, > or the sub-select contains no rows, the IN yields FALSE. > > 3. Otherwise (which by elimination means "x = y" yielded UNKNOWN > for at least one row, and TRUE nowhere), the IN yields UNKNOWN. > > What we need to know about, therefore, is not so much whether x or y is > null as whether the "=" operator yields null. > > Also, IN supports multi-column comparisons --- eg, > (x1, x2) IN (SELECT y1, y2 FROM ...) > So the x or y value could be partially null as well as completely null. > > I think I see how to do this in what will usually be a fairly efficient > fashion, but it requires assuming a good deal about the behavior of the > "=" operator(s) involved. The actual implementation has to be: > > 1. Store all-not-null subquery rows in hashtable (combining equal rows > into a single entry). Store partially or completely null rows in a > separate linear list, combining non-distinct rows into a single entry. > > 2a. For LHS row all null: result is FALSE if table is empty, else UNKNOWN. > > 2b. For LHS row partially-null: must compare against every element of > both hashtable and linear list, using same logic as for scanning > original subquery output (three-valued combination of OR of ANDs). > We know that we will not get a TRUE result, so we can stop as soon as > we find an UNKNOWN result for any row; this says it's worthwhile to scan > the linear list first, as that's most likely to give us an UNKNOWN. > > 2c. For LHS all-non-null: probe into hashtable for equality; result is > TRUE if match found. Otherwise, must scan list of partially-null rows > to see if any non-distinct rows are found; if so return UNKNOWN, else > return FALSE. > > This should be reasonably efficient because the slow cases only occur > when there are partially-null LHS or RHS rows, which should be > relatively uncommon. (In particular, when there is only one column to > compare, case 2b doesn't arise at all, and there can never be more than > one entry in the linear list, namely RHS = null.) Also, the number of > distinct partially null rows should usually be much less than the number > of distinct no-nulls rows, so the linear list should be much smaller > than the hashtable. (I tried to think of ways to hash this list, but it > doesn't seem to work...) > > Now, the assumptions this makes about the behavior of the "=" > operator(s) are: > > 1. The operators are hashable (ie, only values producing the same > hashkey could possibly be equal). Else we can't use a hashtable at all. > > 2. The operators are strict (must yield NULL out for NULL in). This is > what lets us avoid a table scan in case 2a. > > 3. The operators do not yield NULL for non-null inputs. Otherwise, > after case 2c fails to find a TRUE result, we'd have to scan the whole > hashtable, not only the partially-null list, to see if there are rows > that must cause us to return UNKNOWN. > > We already have markers in pg_operator and pg_proc to indicate > hashability and strictness, but there's no marker to indicate "reverse > strictness" as per assumption #3 (what's the proper term for that > property, anyway?) > > I am leaning towards documenting that the oprcanhash marker implies > "reverse strictness" as well. All the currently hashable operators meet > that requirement. Although you could imagine cases where you might like > "=" to return null for non-null inputs (for example, a rigorous > interpretation of IEEE float math would probably say that a NaN compared > to anything else yields UNKNOWN), it's not really a very practical thing > to do in SQL --- one counterexample is that such a datatype could not be > btree-indexable, since btree assumes a total ordering is possible. > > The other thing we could do is put a separate "reverse strict" marker > column into pg_proc, but this seems like much more work than is > justified. > > Comments? > > (Oh BTW: I'm aware that most of the problem goes away for IN appearing > at the top level of WHERE, since there we don't actually need to > distinguish FALSE from UNKNOWN results. But I'm trying to understand > how to optimize the general case, too.) > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
В списке pgsql-hackers по дате отправления: