Обсуждение: Tracking notnull attributes inside Var

Поиск
Список
Период
Сортировка

Tracking notnull attributes inside Var

От
Andy Fan
Дата:
notnulls discussion is forked from UniqueKey stuff, you can see the attachment
for the UnqiueKey introduction. Tom raised his opinion to track the nullability
inside Var[1][2][3], this thread would start from there based on my understanding.

Generally tracking the null attributes inside Var would have something like:

struct Var
{
...;
      int nullable;  // -1 unknown,  0 - not nullable.  1 - nullable
};

and then semantics of Var->nullable must be attached to a RelOptInfo. For
example:

CREATE TABLE t1(a int, b int);

SELECT abs(a) FROM t1 WHERE a > -100;

The var in RelOptInfo->reltarget should have nullable = 0 but the var in
RelOptInfo->baserestrictinfo should have nullable = 1;  The beauty of this
are: a). It can distinguish the two situations perfectly b). Whenever we want
to know the nullable attribute of a Var for an expression, it is super easy to
know. In summary, we need to maintain the nullable attribute at 2 different
places. one is the before the filters are executed(baserestrictinfo, joininfo,
ec_list at least).  one is after the filters are executed (RelOptInfo.reltarget
only?)

Come to JoinRel, we still need to maintain the 2 different cases as well.

As for the joinrel.reltarget, currently it looks up the inputrel's reltarget to
get the Var, so it is easy to inherit from Var->nullable from inputrel, but
we need to consider the new  changes introduced by current join,
Like new NOT nullable attributes because of join clauses OR new nullable
attributes because of outer join.  Everything looks good for now.

The hard part is RelOptInfo.joininfo & root->eq_classes. All of them uses
the shared RestrictInfo, and it is unclear which Var->nullable should be used in
them. To not provide a wrong answer, I think we can assume nullable=-1 (unknown)
and let the upper layer decides what to do (do we have known use cases to use
the nullable attribute here?).

More considerations about this strategy:
1. We might use more memory for different var copies, the only known cases
   RelOptInfo->reltarget for now.
2. _equalVar() has more complex semantics: shall we consider nulls or not.

My recent experience reminds me of another interesting use case of UniqueKey
which may reduce the planning time a lot IIUC (Value 3 in then attachment). Since
PG15 has just been released, I wonder if more people have time to discuss this topic
again. Do I think the way in the right direction?

[1] https://www.postgresql.org/message-id/1551312.1613142245%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/CAApHDvrRwhWCPKUD5H-EQoezHf%3DfnUUsPgTAnXsEOV8f8SF7XQ%40mail.gmail.com
[3] https://www.postgresql.org/message-id/1664320.1625577290%40sss.pgh.pa.us

--
Best Regards
Andy Fan
Вложения

Re: Tracking notnull attributes inside Var

От
Ashutosh Bapat
Дата:
On Sun, May 15, 2022 at 8:41 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

>
> The var in RelOptInfo->reltarget should have nullable = 0 but the var in
> RelOptInfo->baserestrictinfo should have nullable = 1;  The beauty of this
> are: a). It can distinguish the two situations perfectly b). Whenever we want
> to know the nullable attribute of a Var for an expression, it is super easy to
> know. In summary, we need to maintain the nullable attribute at 2 different
> places. one is the before the filters are executed(baserestrictinfo, joininfo,
> ec_list at least).  one is after the filters are executed (RelOptInfo.reltarget
> only?)

Thanks for identifying this. What you have written makes sense and it
might open a few optimization opportunities. But let me put down some
other thoughts here. You might want to take those into consideration
when designing your solution.

Do we want to just track nullable and non-nullable. May be we want
expand this class to nullable (var may be null), non-nullable (Var is
definitely non-NULL), null (Var will be always NULL).

But the other way to look at this is along the lines of equivalence
classes. Equivalence classes record the expressions which are equal in
the final result of the query. The equivalence class members are not
equal at all the stages of query execution.  But because they are
equal in the final result, we can impose that restriction on the lower
levels as well. Can we think of nullable in that fashion? If a Var is
non-nullable in the final result, we can impose that restriction on
the intermediate stages since rows with NULL values for that Var will
be filtered out somewhere. Similarly we could argue for null Var. But
knowledge that a Var is nullable in the final result does not impose a
NULL, non-NULL restriction on the intermediate stages. If we follow
this thought process, we don't need to differentiate Var at different
stages in query.

>
> Come to JoinRel, we still need to maintain the 2 different cases as well.
>
> As for the joinrel.reltarget, currently it looks up the inputrel's reltarget to
> get the Var, so it is easy to inherit from Var->nullable from inputrel, but
> we need to consider the new  changes introduced by current join,
> Like new NOT nullable attributes because of join clauses OR new nullable
> attributes because of outer join.  Everything looks good for now.

Yes, if we want to maintain nullness at different stages in the query.

>
> The hard part is RelOptInfo.joininfo & root->eq_classes. All of them uses
> the shared RestrictInfo, and it is unclear which Var->nullable should be used in
> them. To not provide a wrong answer, I think we can assume nullable=-1 (unknown)
> and let the upper layer decides what to do (do we have known use cases to use
> the nullable attribute here?).

I think what applies to baserestrictinfo and reltarget also applies to
joininfo and join's reltarget. There will be three stages - join
clauses, join quals and reltarget.

In EQs the Vars in RestrictInfo will come from joininfo but EQ member
Vars will derive their nullable-ness from corresponding reltarget. I
can be wrong though.

>
> More considerations about this strategy:
> 1. We might use more memory for different var copies, the only known cases
>    RelOptInfo->reltarget for now.

When a Var is copied the whole expression tree needs to be copied.
That might be more memory than just copies of Var nodes.

> 2. _equalVar() has more complex semantics: shall we consider nulls or not.

This is interesting. It might have impact on set_plan_references and
planner's ability to search and match expressions.

But if we take the approach I have suggested earlier, this question
will not arise.

-- 
Best Wishes,
Ashutosh Bapat



Re: Tracking notnull attributes inside Var

От
Tom Lane
Дата:
Andy Fan <zhihui.fan1213@gmail.com> writes:
> notnulls discussion is forked from UniqueKey stuff, you can see the
> attachment
> for the UnqiueKey introduction. Tom raised his opinion to track the
> nullability
> inside Var[1][2][3], this thread would start from there based on my
> understanding.

I'm pretty certain that I never suggested this:

> struct Var
> {
> ...;
>       int nullable;  // -1 unknown,  0 - not nullable.  1 - nullable
> };

You're free to pursue it if you like, but I think it will be a dead end.
The fundamental problem as you note is that equalVar() cannot do anything
sane with a field defined that way.  Also, we'd have to generate Vars
initially with nullable = unknown (else, for example, ALTER SET/DROP NOT
NULL breaks stored views referring to the column).  It'd be on the planner
to run through the tree and replace that with "nullable" or "not
nullable".  It's hard to see how that's more advantageous than just
keeping the info in the associated RelOptInfo.

Also, I think you're confusing two related but distinct issues.  For
certain optimization issues, we'd like to keep track of whether a column
stored in a table is known NOT NULL.  However, that's not the same thing
as the question that I've muttered about, which is how to treat a Var
that's been possibly forced to null due to null-extension of an outer
join.  That is a different value from the Var as read from the table,
but we currently represent it the same within the planner, which causes
various sorts of undesirable complication.  We cannot fix that by setting
Var.nullable = true in above-the-join instances, because it might also
be true in below-the-join instances.  "Known not null in the table" is
not the inverse of "potentially nulled by an outer join".  Moreover, we
probably need to know *which* join is the one potentially nulling the Var,
so a bool is not likely enough anyway.

The schemes I've been toying with tend to look more like putting a
PlaceHolderVar-ish wrapper around the Var or expression that represents
the below-the-join value.  The wrapper node could carry any join ID
info that we find necessary.  The thing that I'm kind of stalled on is
how to define this struct so that it's not a big headache for join
strength reduction (which could remove the need for a wrapper altogether)
or outer-join reordering (which makes it a bit harder to define which
join we think is the one nulling the value).

            regards, tom lane



Re: Tracking notnull attributes inside Var

От
Andy Fan
Дата:
Hi Tom: 

Thanks for your attention!

On Wed, May 18, 2022 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andy Fan <zhihui.fan1213@gmail.com> writes:
> notnulls discussion is forked from UniqueKey stuff, you can see the
> attachment
> for the UnqiueKey introduction. Tom raised his opinion to track the
> nullability
> inside Var[1][2][3], this thread would start from there based on my
> understanding.

I'm pretty certain that I never suggested this:

> struct Var
> {
> ...;
>       int nullable;  // -1 unknown,  0 - not nullable.  1 - nullable
> };

You're free to pursue it if you like, but I think it will be a dead end.

OK, Here is a huge misunderstanding.  I have my own solution at the
beginning and then I think you want to go with this direction and I think
it is really hard to understand,  so I started this thread to make things
clear.  It is so great that the gap is filled now. 

The fundamental problem as you note is that equalVar() cannot do anything
sane with a field defined that way.  Also, we'd have to generate Vars
initially with nullable = unknown (else, for example, ALTER SET/DROP NOT
NULL breaks stored views referring to the column).  It'd be on the planner
to run through the tree and replace that with "nullable" or "not
nullable".  It's hard to see how that's more advantageous than just
keeping the info in the associated RelOptInfo.

Agreed. 
 

Also, I think you're confusing two related but distinct issues.  For
certain optimization issues, we'd like to keep track of whether a column
stored in a table is known NOT NULL.  However, that's not the same thing
as the question that I've muttered about, which is how to treat a Var
that's been possibly forced to null due to null-extension of an outer
join.  That is a different value from the Var as read from the table,
but we currently represent it the same within the planner, which causes
various sorts of undesirable complication. We cannot fix that by setting
Var.nullable = true in above-the-join instances, because it might also
be true in below-the-join instances.  "Known not null in the table" is
not the inverse of "potentially nulled by an outer join".  Moreover, we
probably need to know *which* join is the one potentially nulling the Var,
so a bool is not likely enough anyway.

I read the above graph several times, but *I think probably my code can
express better than my words*.  It would be great that you can have a
look at them.  Just one point to mention now:  Seems you didn't mention the
case where the NULL values are filtered by qual,  not sure it is negligible
or by mistake.  

CREATE TABLE t(a int);
SELECT * FROM t WHERE a > 1; 

My patch is my previous solution not the Inside Var one. 


The schemes I've been toying with tend to look more like putting a
PlaceHolderVar-ish wrapper around the Var or expression that represents
the below-the-join value.  The wrapper node could carry any join ID
info that we find necessary.  The thing that I'm kind of stalled on is
how to define this struct so that it's not a big headache for join
strength reduction (which could remove the need for a wrapper altogether)
or outer-join reordering (which makes it a bit harder to define which
join we think is the one nulling the value).


Not sure if the "NULL values are filtered by qual '' matters in this solution,
and I'm pretty open for direction. But to avoid further misunderstanding 
from me,  I would like to fill more gaps first by raising my patch now 
and continue talking in this direction. 

--
Best Regards
Andy Fan
Вложения

Re: Tracking notnull attributes inside Var

От
Andy Fan
Дата:
Hi Ashutosh:

   Nice to see you again!

On Tue, May 17, 2022 at 8:50 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Sun, May 15, 2022 at 8:41 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

>
> The var in RelOptInfo->reltarget should have nullable = 0 but the var in
> RelOptInfo->baserestrictinfo should have nullable = 1;  The beauty of this
> are: a). It can distinguish the two situations perfectly b). Whenever we want
> to know the nullable attribute of a Var for an expression, it is super easy to
> know. In summary, we need to maintain the nullable attribute at 2 different
> places. one is the before the filters are executed(baserestrictinfo, joininfo,
> ec_list at least).  one is after the filters are executed (RelOptInfo.reltarget
> only?)

Thanks for identifying this. What you have written makes sense and it
might open a few optimization opportunities. But let me put down some
other thoughts here. You might want to take those into consideration
when designing your solution.

Thanks. 
 

Do we want to just track nullable and non-nullable. May be we want
expand this class to nullable (var may be null), non-nullable (Var is
definitely non-NULL), null (Var will be always NULL).


Currently it doesn't support "Var will be always NULL" .  Do you have any
use cases for this? and I can't think of too many cases where we can get
such information except something like "SELECT a FROM t WHERE a 
IS NULL". 

But the other way to look at this is along the lines of equivalence
classes. Equivalence classes record the expressions which are equal in
the final result of the query. The equivalence class members are not
equal at all the stages of query execution.  But because they are
equal in the final result, we can impose that restriction on the lower
levels as well. Can we think of nullable in that fashion? If a Var is
non-nullable in the final result, we can impose that restriction on
the intermediate stages since rows with NULL values for that Var will
be filtered out somewhere. Similarly we could argue for null Var. But
knowledge that a Var is nullable in the final result does not impose a
NULL, non-NULL restriction on the intermediate stages. If we follow
this thought process, we don't need to differentiate Var at different
stages in query.

I agree this is an option.  If so we need to track it under the PlannerInfo 
struct but it would not be as fine-grained as my previous. Without
intermediate information,  We can't know if a UnqiueKey contains multiple
NULLs, this would not be an issue for the "MARK Distinct as no-op" case,
but I'm not sure it is OK for other UniqueKey user cases.  So my current idea
is I still prefer to maintain the intermediate information, unless we are sure it 
costs too much or it is too complex to implement which I don't think so for now
at least.  So if you have time to look at the attached patch, that would be super
great as well.

--
Best Regards
Andy Fan

Re: Tracking notnull attributes inside Var

От
Andy Fan
Дата:

I thought about the strategy below in the past few days,  and think it
is better because it uses less cycles to get the same answer.  IIUC, the
related structs should be created during / after deconstruct_jointree rather
than join_search_xx stage.
 
The schemes I've been toying with tend to look more like putting a
PlaceHolderVar-ish wrapper around the Var or expression that represents
the below-the-join value.  The wrapper node could carry any join ID
info that we find necessary. 

Just to confirm my understanding,  the wrapper node should answer some
questions like this. 

/*
 * rel_is_nullable_side
 *
 * For the given join ID joinrelids, return if the relid is in the nullable
 * side.
 */
static bool
rel_is_nullable_side(PlannerInfo *root, Relids joinrelids, Index relid)
{
Assert(bms_is_member(relid, joinrelids));
...
}
 
The thing that I'm kind of stalled on is
how to define this struct so that it's not a big headache for join
strength reduction (which could remove the need for a wrapper altogether) 
or outer-join reordering (which makes it a bit harder to define which
join we think is the one nulling the value).

I think about the outer-join reorder case,  can we just rely on
SpecialJoinInfo.min_lefthands & min_righthands to get the answer?
The attached patch is based on that.  and I did some test in the patch
as well,  looks the answer is correct. 

What's more, if the above is correct and the calls of rel_is_nullable_side
is small,  do we still need think about more effective data struct? 

Thanks! 

--
Best Regards
Andy Fan
Вложения