Re: Optimization rules for semi and anti joins

Поиск
Список
Период
Сортировка
От Kevin Grittner
Тема Re: Optimization rules for semi and anti joins
Дата
Msg-id 4993CA7E.EE98.0025.0@wicourts.gov
обсуждение исходный текст
Ответ на Re: Optimization rules for semi and anti joins  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Optimization rules for semi and anti joins
Список pgsql-hackers
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> (A semijoin B on (Pab)) antijoin C on (Pbc)
>> = A semijoin (B antijoin C on (Pbc)) on (Pab)
>  
>> I think this one is true, and it doesn't seem to be mentioned,
>> unless I'm missing something.  It seems potentially useful.
> 
> Hmm, it doesn't seem terribly well-defined --- the values of B are
> indeterminate above the semijoin in the first case, so having Pbc
> refer to them doesn't seem like a good idea.  In particular, it
> seems like in the first case the semijoin could randomly choose a B
> row that has a join partner in C, causing the A row to disappear
> from the result, when the same A row has another B partner that does
> not join to C --- and the second form would find that B partner and
> allow the A row to be output.
I was looking at it from the abstraction that A semijoin B could be
treated as the equivalent of an inner join with duplicate A rows from
the result removed before the final result of the enclosing query.  It
seems you've been interpreting it as meaning the inner join of A to
the first (arbitrarily chosen) row of B found.  It appears that these
two views of it generate the same results for the other identities,
but not this one.
The first case here could be implemented as an inner join which
included (in addition to any columns needed for other purposes) a row
identifier for A and all the columns of B which are needed for the Pbc
predicate.  The antijoin could be performed on that result, after
which duplicate A rows would be eliminated, as well as the row
identifier and the B columns.  A simplified (and only slightly
artificial) example of where this could buy orders of magnitude
improvement in run time follows.
Imagine that A is a Party table with ten million rows.  Imagine that B
and C are sets of rows within a hundred million row CaseHist table
which records events on cases, some of which are associated with
parties.  B represents warrant issuing events, each of which is
related to a party.  C represents warrant disposition events, each of
which is related to a warrant issuing event.  Both tables are indexed
on case number and a sequence number.  Party has a name index.
You've got a name, and you want a list of outstanding warrants for
parties with a matching name.
The second case above would be the natural way to write the query. 
The first case, implemented as I describe, would be orders of
magnitude faster.
-Kevin


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Magnus Hagander
Дата:
Сообщение: Re: mingw check hung
Следующее
От: "BogDan Vatra"
Дата:
Сообщение: Re: SE-PostgreSQL and row level security