Обсуждение: subselect

Поиск
Список
Период
Сортировка

subselect

От
Bruce Momjian
Дата:
I was thinking about subselects, and how to attach the two queries.

What if the subquery makes a range table entry in the outer query, and
the query is set up like the UNION queries where we put the scans in a
row, but in the case we put them over/under each other.

And we push a temp table into the catalog cache that represents the
result of the subquery, then we could join to it in the outer query as
though it was a real table.

Also, can't we do the correlated subqueries by adding the proper
target/output columns to the subquery, and have the outer query
reference those columns in the subquery range table entry.

Maybe I can write up a sample of this?  Vadim, would this help?  Is this
the point we are stuck at?

--
Bruce Momjian
maillist@candle.pha.pa.us

Re: [HACKERS] subselect

От
"Vadim B. Mikheev"
Дата:
Bruce Momjian wrote:
>
> I was thinking about subselects, and how to attach the two queries.
>
> What if the subquery makes a range table entry in the outer query, and
> the query is set up like the UNION queries where we put the scans in a
> row, but in the case we put them over/under each other.
>
> And we push a temp table into the catalog cache that represents the
> result of the subquery, then we could join to it in the outer query as
> though it was a real table.
>
> Also, can't we do the correlated subqueries by adding the proper
> target/output columns to the subquery, and have the outer query
> reference those columns in the subquery range table entry.

Yes, this is a way to handle subqueries by joining to temp table.
After getting plan we could change temp table access path to
node material. On the other hand, it could be useful to let optimizer
know about cost of temp table creation (have to think more about it)...
Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN
is one example of this - joining by <> will give us invalid results.
Setting special NOT EQUAL flag is not enough: subquery plan must be
always inner one in this case. The same for handling ALL modifier.
Note, that we generaly can't use aggregates here: we can't add MAX to
subquery in the case of > ALL (subquery), because of > ALL should return FALSE
if subquery returns NULL(s) but aggregates don't take NULLs into account.

>
> Maybe I can write up a sample of this?  Vadim, would this help?  Is this
> the point we are stuck at?

Personally, I was stuck by holydays -:)
Now I can spend ~ 8 hours ~ each day for development...

Vadim

Re: [HACKERS] subselect

От
Goran Thyni
Дата:
Vadim,

   Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN
   is one example of this - joining by <> will give us invalid results.

What is you approach towards this problem?
I got an idea that one could reverse the order,
that is execute the outer first into a temptable
and delete from that according to the result of the
subquery and then return it.
Probably this is too raw and slow. ;-)

   Personally, I was stuck by holydays -:)
   Now I can spend ~ 8 hours ~ each day for development...

Oh, isn't it christmas eve right now in Russia?

    best regards,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna


Re: [HACKERS] subselect

От
Bruce Momjian
Дата:
> Yes, this is a way to handle subqueries by joining to temp table.
> After getting plan we could change temp table access path to
> node material. On the other hand, it could be useful to let optimizer
> know about cost of temp table creation (have to think more about it)...
> Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN
> is one example of this - joining by <> will give us invalid results.
> Setting special NOT EQUAL flag is not enough: subquery plan must be
> always inner one in this case. The same for handling ALL modifier.
> Note, that we generaly can't use aggregates here: we can't add MAX to
> subquery in the case of > ALL (subquery), because of > ALL should return FALSE
> if subquery returns NULL(s) but aggregates don't take NULLs into account.

OK, here are my ideas.  First, I think you have to handle subselects in
the outer node because a subquery could have its own subquery.  Also, we
now have a field in Aggreg to all us to 'usenulls'.

OK, here it is.  I recommend we pass the outer and subquery through
the parser and optimizer separately.

We parse the subquery first.  If the subquery is not correlated, it
should parse fine.  If it is correlated, any columns we find in the
subquery that are not already in the FROM list, we add the table to the
subquery FROM list, and add the referenced column to the target list of
the subquery.

When we are finished parsing the subquery, we create a catalog cache
entry for it called 'sub1' and make its fields match the target
list of the subquery.

In the outer query, we add 'sub1' to its target list, and change
the subquery reference to point to the new range table.  We also add
WHERE clauses to do any correlated joins.

Here is a simple example:

    select *
    from taba
    where col1 = (select col2
              from tabb)

This is not correlated, and the subquery parser easily.  We create a
'sub1' catalog cache entry, and add 'sub1' to the outer query FROM
clause.  We also replace 'col1 = (subquery)' with 'col1 = sub1.col2'.

Here is a more complex correlated subquery:

    select *
    from taba
    where col1 = (select col2
              from tabb
              where taba.col3 = tabb.col4)

Here we must add 'taba' to the subquery's FROM list, and add col3 to the
target list of the subquery.  After we parse the subquery, add 'sub1' to
the FROM list of the outer query, change 'col1 = (subquery)' to 'col1 =
sub1.col2', and add to the outer WHERE clause 'AND taba.col3 = sub1.col3'.
THe optimizer will do the correlation for us.

In the optimizer, we can parse the subquery first, then the outer query,
and then replace all 'sub1' references in the outer query to use the
subquery plan.

I realize making merging the two plans and doing IN and NOT IN is the
real challenge, but I hoped this would give us a start.

What do you think?

--
Bruce Momjian
maillist@candle.pha.pa.us

Re: [HACKERS] subselect

От
"Vadim B. Mikheev"
Дата:
Goran Thyni wrote:
>
> Vadim,
>
>    Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN
>    is one example of this - joining by <> will give us invalid results.
>
> What is you approach towards this problem?

Actually, this is problem of ALL modifier (NOT IN is _not_equal_ ALL)
and so, we have to have not just NOT EQUAL flag but some ALL node
with modified operator.

After that, one way is put subquery into inner plan of an join node
to be sure that for an outer tuple all corresponding subquery tuples
will be tested with modified operator (this will require either
changing code of all join nodes or addition of new plan type - we'll see)
and another way is ... suggested by you:

> I got an idea that one could reverse the order,
> that is execute the outer first into a temptable
> and delete from that according to the result of the
> subquery and then return it.
> Probably this is too raw and slow. ;-)

This will be faster in some cases (when subquery returns many results
and there are "not so many" results from outer query) - thanks for idea!

>
>    Personally, I was stuck by holydays -:)
>    Now I can spend ~ 8 hours ~ each day for development...
>
> Oh, isn't it christmas eve right now in Russia?

Due to historic reasons New Year is mu-u-u-uch popular
holiday in Russia -:)

Vadim

Re: [HACKERS] subselect

От
"Vadim B. Mikheev"
Дата:
Bruce Momjian wrote:
>
> > always inner one in this case. The same for handling ALL modifier.
> > Note, that we generaly can't use aggregates here: we can't add MAX to
> > subquery in the case of > ALL (subquery), because of > ALL should return FALSE
> > if subquery returns NULL(s) but aggregates don't take NULLs into account.
>
> OK, here are my ideas.  First, I think you have to handle subselects in
> the outer node because a subquery could have its own subquery.  Also, we

I hope that this is no matter: if results of subquery (with/without sub-subqueries)
will go into temp table then this table will be re-scanned for each outer tuple.

> now have a field in Aggreg to all us to 'usenulls'.
                                           ^^^^^^^^
 This can't help:

vac=> select * from x;
y
-
1
2
3
 <<< this is NULL
(4 rows)

vac=> select max(y) from x;
max
---
  3

==> we can't replace

select * from A where A.a > ALL (select y from x);
                                 ^^^^^^^^^^^^^^^
           (NULL will be returned and so A.a > ALL is FALSE - this is what
            Sybase does, is it right ?)
with

select * from A where A.a > (select max(y) from x);
                             ^^^^^^^^^^^^^^^^^^^^
just because of we lose knowledge about NULLs here.

Also, I would like to handle ANY and ALL modifiers for all bool
operators, either built-in or user-defined, for all data types -
isn't PostgreSQL OO-like RDBMS -:)

> OK, here it is.  I recommend we pass the outer and subquery through
> the parser and optimizer separately.

I don't like this. I would like to get parse-tree from parser for
entire query and let optimizer (on upper level) decide how to rewrite
parse-tree and what plans to produce and how these plans should be
merged. Note, that I don't object your methods below, but only where
to place handling of this. I don't understand why should we add
new part to the system which will do optimizer' work (parse-tree -->
execution plan) and deal with optimizer nodes. Imho, upper optimizer
level is nice place to do this.

>
> We parse the subquery first.  If the subquery is not correlated, it
> should parse fine.  If it is correlated, any columns we find in the
> subquery that are not already in the FROM list, we add the table to the
> subquery FROM list, and add the referenced column to the target list of
> the subquery.
>
> When we are finished parsing the subquery, we create a catalog cache
> entry for it called 'sub1' and make its fields match the target
> list of the subquery.
>
> In the outer query, we add 'sub1' to its target list, and change
> the subquery reference to point to the new range table.  We also add
> WHERE clauses to do any correlated joins.
...
> Here is a more complex correlated subquery:
>
>         select *
>         from taba
>         where col1 = (select col2
>                       from tabb
>                       where taba.col3 = tabb.col4)
>
> Here we must add 'taba' to the subquery's FROM list, and add col3 to the
> target list of the subquery.  After we parse the subquery, add 'sub1' to
> the FROM list of the outer query, change 'col1 = (subquery)' to 'col1 =
> sub1.col2', and add to the outer WHERE clause 'AND taba.col3 = sub1.col3'.
> THe optimizer will do the correlation for us.
>
> In the optimizer, we can parse the subquery first, then the outer query,
> and then replace all 'sub1' references in the outer query to use the
> subquery plan.
>
> I realize making merging the two plans and doing IN and NOT IN is the
                   ^^^^^^^^^^^^^^^^^^^^^
This is very easy to do! As I already said we have just change sub1
access path (SeqScan of sub1) with SeqScan of Material node with
subquery plan.

> real challenge, but I hoped this would give us a start.

Decision about how to record subquery stuff in to parse-tree
would be very good start -:)

BTW, note that for _expression_ subqueries (which are introduced without
IN, EXISTS, ALL, ANY - this follows Sybase' naming) - as in your examples -
we have to check that subquery returns single tuple...

Vadim

Re: [HACKERS] subselect

От
"Thomas G. Lockhart"
Дата:
> BTW, note that for _expression_ subqueries (which are introduced without
> IN, EXISTS, ALL, ANY - this follows Sybase' naming) - as in your examples -
> we have to check that subquery returns single tuple...

It might be nice to have a tuple-counting operation or query node (is this the right
terminology?) which could be used to help implement EXISTS. It might help to
re-implement the count(*) function also.

                                                    - Tom



Re: [HACKERS] subselect

От
Bruce Momjian
Дата:
>
> > BTW, note that for _expression_ subqueries (which are introduced without
> > IN, EXISTS, ALL, ANY - this follows Sybase' naming) - as in your examples -
> > we have to check that subquery returns single tuple...
>
> It might be nice to have a tuple-counting operation or query node (is this the right
> terminology?) which could be used to help implement EXISTS. It might help to
> re-implement the count(*) function also.

In the new code, count(*) picks a column from one of the tables to count
on.


--
Bruce Momjian
maillist@candle.pha.pa.us