Обсуждение: subselect
I was thinking about subselects, and how to attach the two queries. What if the subquery makes a range table entry in the outer query, and the query is set up like the UNION queries where we put the scans in a row, but in the case we put them over/under each other. And we push a temp table into the catalog cache that represents the result of the subquery, then we could join to it in the outer query as though it was a real table. Also, can't we do the correlated subqueries by adding the proper target/output columns to the subquery, and have the outer query reference those columns in the subquery range table entry. Maybe I can write up a sample of this? Vadim, would this help? Is this the point we are stuck at? -- Bruce Momjian maillist@candle.pha.pa.us
Bruce Momjian wrote: > > I was thinking about subselects, and how to attach the two queries. > > What if the subquery makes a range table entry in the outer query, and > the query is set up like the UNION queries where we put the scans in a > row, but in the case we put them over/under each other. > > And we push a temp table into the catalog cache that represents the > result of the subquery, then we could join to it in the outer query as > though it was a real table. > > Also, can't we do the correlated subqueries by adding the proper > target/output columns to the subquery, and have the outer query > reference those columns in the subquery range table entry. Yes, this is a way to handle subqueries by joining to temp table. After getting plan we could change temp table access path to node material. On the other hand, it could be useful to let optimizer know about cost of temp table creation (have to think more about it)... Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN is one example of this - joining by <> will give us invalid results. Setting special NOT EQUAL flag is not enough: subquery plan must be always inner one in this case. The same for handling ALL modifier. Note, that we generaly can't use aggregates here: we can't add MAX to subquery in the case of > ALL (subquery), because of > ALL should return FALSE if subquery returns NULL(s) but aggregates don't take NULLs into account. > > Maybe I can write up a sample of this? Vadim, would this help? Is this > the point we are stuck at? Personally, I was stuck by holydays -:) Now I can spend ~ 8 hours ~ each day for development... Vadim
Vadim, Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN is one example of this - joining by <> will give us invalid results. What is you approach towards this problem? I got an idea that one could reverse the order, that is execute the outer first into a temptable and delete from that according to the result of the subquery and then return it. Probably this is too raw and slow. ;-) Personally, I was stuck by holydays -:) Now I can spend ~ 8 hours ~ each day for development... Oh, isn't it christmas eve right now in Russia? best regards, -- --------------------------------------------- Göran Thyni, sysadm, JMS Bildbasen, Kiruna
> Yes, this is a way to handle subqueries by joining to temp table. > After getting plan we could change temp table access path to > node material. On the other hand, it could be useful to let optimizer > know about cost of temp table creation (have to think more about it)... > Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN > is one example of this - joining by <> will give us invalid results. > Setting special NOT EQUAL flag is not enough: subquery plan must be > always inner one in this case. The same for handling ALL modifier. > Note, that we generaly can't use aggregates here: we can't add MAX to > subquery in the case of > ALL (subquery), because of > ALL should return FALSE > if subquery returns NULL(s) but aggregates don't take NULLs into account. OK, here are my ideas. First, I think you have to handle subselects in the outer node because a subquery could have its own subquery. Also, we now have a field in Aggreg to all us to 'usenulls'. OK, here it is. I recommend we pass the outer and subquery through the parser and optimizer separately. We parse the subquery first. If the subquery is not correlated, it should parse fine. If it is correlated, any columns we find in the subquery that are not already in the FROM list, we add the table to the subquery FROM list, and add the referenced column to the target list of the subquery. When we are finished parsing the subquery, we create a catalog cache entry for it called 'sub1' and make its fields match the target list of the subquery. In the outer query, we add 'sub1' to its target list, and change the subquery reference to point to the new range table. We also add WHERE clauses to do any correlated joins. Here is a simple example: select * from taba where col1 = (select col2 from tabb) This is not correlated, and the subquery parser easily. We create a 'sub1' catalog cache entry, and add 'sub1' to the outer query FROM clause. We also replace 'col1 = (subquery)' with 'col1 = sub1.col2'. Here is a more complex correlated subquery: select * from taba where col1 = (select col2 from tabb where taba.col3 = tabb.col4) Here we must add 'taba' to the subquery's FROM list, and add col3 to the target list of the subquery. After we parse the subquery, add 'sub1' to the FROM list of the outer query, change 'col1 = (subquery)' to 'col1 = sub1.col2', and add to the outer WHERE clause 'AND taba.col3 = sub1.col3'. THe optimizer will do the correlation for us. In the optimizer, we can parse the subquery first, then the outer query, and then replace all 'sub1' references in the outer query to use the subquery plan. I realize making merging the two plans and doing IN and NOT IN is the real challenge, but I hoped this would give us a start. What do you think? -- Bruce Momjian maillist@candle.pha.pa.us
Goran Thyni wrote: > > Vadim, > > Unfortunately, not all subqueries can be handled by "normal" joins: NOT IN > is one example of this - joining by <> will give us invalid results. > > What is you approach towards this problem? Actually, this is problem of ALL modifier (NOT IN is _not_equal_ ALL) and so, we have to have not just NOT EQUAL flag but some ALL node with modified operator. After that, one way is put subquery into inner plan of an join node to be sure that for an outer tuple all corresponding subquery tuples will be tested with modified operator (this will require either changing code of all join nodes or addition of new plan type - we'll see) and another way is ... suggested by you: > I got an idea that one could reverse the order, > that is execute the outer first into a temptable > and delete from that according to the result of the > subquery and then return it. > Probably this is too raw and slow. ;-) This will be faster in some cases (when subquery returns many results and there are "not so many" results from outer query) - thanks for idea! > > Personally, I was stuck by holydays -:) > Now I can spend ~ 8 hours ~ each day for development... > > Oh, isn't it christmas eve right now in Russia? Due to historic reasons New Year is mu-u-u-uch popular holiday in Russia -:) Vadim
Bruce Momjian wrote: > > > always inner one in this case. The same for handling ALL modifier. > > Note, that we generaly can't use aggregates here: we can't add MAX to > > subquery in the case of > ALL (subquery), because of > ALL should return FALSE > > if subquery returns NULL(s) but aggregates don't take NULLs into account. > > OK, here are my ideas. First, I think you have to handle subselects in > the outer node because a subquery could have its own subquery. Also, we I hope that this is no matter: if results of subquery (with/without sub-subqueries) will go into temp table then this table will be re-scanned for each outer tuple. > now have a field in Aggreg to all us to 'usenulls'. ^^^^^^^^ This can't help: vac=> select * from x; y - 1 2 3 <<< this is NULL (4 rows) vac=> select max(y) from x; max --- 3 ==> we can't replace select * from A where A.a > ALL (select y from x); ^^^^^^^^^^^^^^^ (NULL will be returned and so A.a > ALL is FALSE - this is what Sybase does, is it right ?) with select * from A where A.a > (select max(y) from x); ^^^^^^^^^^^^^^^^^^^^ just because of we lose knowledge about NULLs here. Also, I would like to handle ANY and ALL modifiers for all bool operators, either built-in or user-defined, for all data types - isn't PostgreSQL OO-like RDBMS -:) > OK, here it is. I recommend we pass the outer and subquery through > the parser and optimizer separately. I don't like this. I would like to get parse-tree from parser for entire query and let optimizer (on upper level) decide how to rewrite parse-tree and what plans to produce and how these plans should be merged. Note, that I don't object your methods below, but only where to place handling of this. I don't understand why should we add new part to the system which will do optimizer' work (parse-tree --> execution plan) and deal with optimizer nodes. Imho, upper optimizer level is nice place to do this. > > We parse the subquery first. If the subquery is not correlated, it > should parse fine. If it is correlated, any columns we find in the > subquery that are not already in the FROM list, we add the table to the > subquery FROM list, and add the referenced column to the target list of > the subquery. > > When we are finished parsing the subquery, we create a catalog cache > entry for it called 'sub1' and make its fields match the target > list of the subquery. > > In the outer query, we add 'sub1' to its target list, and change > the subquery reference to point to the new range table. We also add > WHERE clauses to do any correlated joins. ... > Here is a more complex correlated subquery: > > select * > from taba > where col1 = (select col2 > from tabb > where taba.col3 = tabb.col4) > > Here we must add 'taba' to the subquery's FROM list, and add col3 to the > target list of the subquery. After we parse the subquery, add 'sub1' to > the FROM list of the outer query, change 'col1 = (subquery)' to 'col1 = > sub1.col2', and add to the outer WHERE clause 'AND taba.col3 = sub1.col3'. > THe optimizer will do the correlation for us. > > In the optimizer, we can parse the subquery first, then the outer query, > and then replace all 'sub1' references in the outer query to use the > subquery plan. > > I realize making merging the two plans and doing IN and NOT IN is the ^^^^^^^^^^^^^^^^^^^^^ This is very easy to do! As I already said we have just change sub1 access path (SeqScan of sub1) with SeqScan of Material node with subquery plan. > real challenge, but I hoped this would give us a start. Decision about how to record subquery stuff in to parse-tree would be very good start -:) BTW, note that for _expression_ subqueries (which are introduced without IN, EXISTS, ALL, ANY - this follows Sybase' naming) - as in your examples - we have to check that subquery returns single tuple... Vadim
> BTW, note that for _expression_ subqueries (which are introduced without > IN, EXISTS, ALL, ANY - this follows Sybase' naming) - as in your examples - > we have to check that subquery returns single tuple... It might be nice to have a tuple-counting operation or query node (is this the right terminology?) which could be used to help implement EXISTS. It might help to re-implement the count(*) function also. - Tom
> > > BTW, note that for _expression_ subqueries (which are introduced without > > IN, EXISTS, ALL, ANY - this follows Sybase' naming) - as in your examples - > > we have to check that subquery returns single tuple... > > It might be nice to have a tuple-counting operation or query node (is this the right > terminology?) which could be used to help implement EXISTS. It might help to > re-implement the count(*) function also. In the new code, count(*) picks a column from one of the tables to count on. -- Bruce Momjian maillist@candle.pha.pa.us