Обсуждение: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

Поиск
Список
Период
Сортировка

Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
I believe I've narrowed down the cause of this failure [1]:

regression=# explain SELECT min(col) FROM enumtest;
ERROR:  type matched to anyenum is not an enum type: anyenum

The core reason for this seems to be that apply_partialaggref_adjustment
fails to consider the possibility that it's dealing with a polymorphic
transtype.  It will happily label the partial aggregate with a polymorphic
aggoutputtype, which is broken in itself (since we won't know the
properties of the output type for purposes such as datum-copying).  But
the above error comes out before we have any opportunity to crash from
that, because we then end up with the upper-level combining aggregate
having an input that is shown as being of polymorphic type, which means
it can't resolve what its transtype is supposed to be.

The business about having a (newly added) separate aggserialtype makes
this doubly dangerous, because if that applies we would be showing the
input to the combining aggregate as being of the serial type, which might
have *nothing* to do with the original input type, thereby breaking
polymorphic transtype resolution even further.  But whether we're using
that or just the transtype, showing the input to the combining aggregate
as being of some type other than the original input is likely to break any
agg implementation function that uses get_call_expr_argtype or similar to
figure out what type it's supposed to be working on.

We might need to fix this by adding the resolved transtype to Aggref
nodes, rather than trying to re-resolve it at executor start.  Some
tracing I did on the way to identifying this bug suggests that that
would save a fair amount of planner and executor-start overhead,
because we are calling get_aggregate_argtypes and thereby
enforce_generic_type_consistency half a dozen times per aggregate,
not to mention resolve_aggregate_transtype itself.

However, before trying to fix any of that, I'd like to demand an
explanation as to why aggoutputtype exists at all.  It seems incredibly
confusing to have both that and aggtype, and certainly the code comments
give no hint of what the semantics are supposed to be when those fields
are different.  I think aggoutputtype needs to go away again.
        regards, tom lane

[1] https://www.postgresql.org/message-id/22782.1466100870@sss.pgh.pa.us



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Robert Haas
Дата:
On Thu, Jun 16, 2016 at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, before trying to fix any of that, I'd like to demand an
> explanation as to why aggoutputtype exists at all.  It seems incredibly
> confusing to have both that and aggtype, and certainly the code comments
> give no hint of what the semantics are supposed to be when those fields
> are different.  I think aggoutputtype needs to go away again.

fix_combine_agg_expr, or more search_indexed_tlist_for_partial_aggref,
needs to be able to match a finalize-aggregate node type to a
partial-aggregate node type beneath it.  aggtype is one of the fields
that gets compared as part of that process, and it won't be the same
if you make aggtype mean the result of that particular node rather
than the result of the aggregate.  nodeAgg.c also uses aggtype for
matching purposes; not sure if there is a similar problem there or
not.

On the other hand, exprType() needs to return the actual output type
of that particular node, which is given by aggoutputtype.  I think
that requirement is non-negotiable, so I suspect the only way to get
by without two fields is if all the stuff for which aggtype is being
used is inessential; in that case, we could remove it and then rename
aggoutputtype back to aggtype.

Thanks for looking at this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 16, 2016 at 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, before trying to fix any of that, I'd like to demand an
>> explanation as to why aggoutputtype exists at all.  It seems incredibly
>> confusing to have both that and aggtype, and certainly the code comments
>> give no hint of what the semantics are supposed to be when those fields
>> are different.  I think aggoutputtype needs to go away again.

> fix_combine_agg_expr, or more search_indexed_tlist_for_partial_aggref,
> needs to be able to match a finalize-aggregate node type to a
> partial-aggregate node type beneath it.

Meh.  I think this is probably telling us that trying to repurpose Aggref
as the representation of a partial aggregate may have been mistaken.
Or maybe just that fix_combine_agg_expr was a bad idea.  It seems likely
to me that that could have been dispensed with altogether in return for
slightly more work in create_grouping_paths, for instance transforming
the upper-level Aggrefs into something that looks more likeAggref(PartialAggref(original-arguments))
whereupon the pre-existing match logic should be sufficient to replace
the "PartialAggref(original-arguments)" subtree with a suitable Var
in the upper aggregation plan node.

> aggtype is one of the fields
> that gets compared as part of that process, and it won't be the same
> if you make aggtype mean the result of that particular node rather
> than the result of the aggregate.  nodeAgg.c also uses aggtype for
> matching purposes; not sure if there is a similar problem there or
> not.

AFAICS, nodeAgg.c's check on that is not logically necessary: if you have
identical inputs and the same aggregate function then you should certainly
expect the same output type.  The only reason we're making it is that the
code originally used equal() to detect identical aggregates, and somebody
slavishly copied all the comparisons when splitting up that test so as to
distinguish "identical inputs" from "identical aggregates".  I'll reserve
judgement on whether search_indexed_tlist_for_partial_aggref has any idea
what it's doing in this regard.
        regards, tom lane



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Robert Haas
Дата:
On Thu, Jun 16, 2016 at 6:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Meh.  I think this is probably telling us that trying to repurpose Aggref
> as the representation of a partial aggregate may have been mistaken.
> Or maybe just that fix_combine_agg_expr was a bad idea.  It seems likely
> to me that that could have been dispensed with altogether in return for
> slightly more work in create_grouping_paths, for instance transforming
> the upper-level Aggrefs into something that looks more like
>         Aggref(PartialAggref(original-arguments))
> whereupon the pre-existing match logic should be sufficient to replace
> the "PartialAggref(original-arguments)" subtree with a suitable Var
> in the upper aggregation plan node.

I don't mind if you feel the need to refactor this.  In David's
original patch, he had a PartialAggref node that basically acted as a
wrapper for an Aggref node, effectively behaving like a flag on the
underlying Aggref.  I thought that was ugly and argued for including
the required information in the Aggref itself.  Now what you are
proposing here is a bit different and it may work better, but there
are tricky things like how it all deparses in the EXPLAIN output - you
may recall that I fixed a problem in that area a while back.  We're
trying to represent a concept that doesn't exist at the SQL level, and
that makes things a bit complicated: if the Aggref's input is a
PartialAggref, will the deparse code be able to match that up to the
correct catalog entry?  But if you've got a good idea, have at it.

>> aggtype is one of the fields
>> that gets compared as part of that process, and it won't be the same
>> if you make aggtype mean the result of that particular node rather
>> than the result of the aggregate.  nodeAgg.c also uses aggtype for
>> matching purposes; not sure if there is a similar problem there or
>> not.
>
> AFAICS, nodeAgg.c's check on that is not logically necessary: if you have
> identical inputs and the same aggregate function then you should certainly
> expect the same output type.  The only reason we're making it is that the
> code originally used equal() to detect identical aggregates, and somebody
> slavishly copied all the comparisons when splitting up that test so as to
> distinguish "identical inputs" from "identical aggregates".  I'll reserve
> judgement on whether search_indexed_tlist_for_partial_aggref has any idea
> what it's doing in this regard.

Sure, it's all cargo-cult programming.  I don't want to presume to
read David's mind, but I think we were both thinking that minimizing
the amount of tinkering that we did would cut down on the likelihood
of bugs.  Now, you've found a bug that got through, so we can either
double down on the design we've already picked, or we can change it.
I'm not set on one approach or the other; I thought what we ended up
was fairly reasonable given the bloated mess that is struct Aggref,
but I've got no deep attachment to it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 16, 2016 at 6:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Meh.  I think this is probably telling us that trying to repurpose Aggref
>> as the representation of a partial aggregate may have been mistaken.

> I don't mind if you feel the need to refactor this.

I'm not sure yet.  What I think we need to work out first is exactly
how polymorphic parallel aggregates ought to identify which concrete
data types they are using.  There were well-defined rules before,
but how do we adapt those to two levels of aggregate evaluation?
Are we hard-wiring an assumption that the aggregate transtype is the
same at both levels?  What assumptions do we want to make about the
relationship of the data transfer type (the lower level's output type)
to the polymorphic input and trans types?

A key point here is that the way that polymorphic aggregate support
functions used to find out what they were doing is that we'd set up
dummy FuncExprs that would satisfy get_call_expr_argtype().  That
relied on the fact that the transfn and finalfn calls were specified
in a way that corresponded to legal SQL function calls.  It's not
clear to me whether that statement still holds for parallelized
aggregates.
        regards, tom lane



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I don't mind if you feel the need to refactor this.

> I'm not sure yet.  What I think we need to work out first is exactly
> how polymorphic parallel aggregates ought to identify which concrete
> data types they are using.  There were well-defined rules before,
> but how do we adapt those to two levels of aggregate evaluation?

After reviewing things a bit, it seems like the first design-level issue
there is what about polymorphic combine functions.  So far as the type
system is concerned, there's no issue: the declared signature is always
going to be "combine(foo, foo) returns foo" and it doesn't matter whether
foo is an ordinary type, a polymorphic one, or INTERNAL.  The other
question is whether the function itself knows what it's operating on in
the current invocation.  For regular polymorphic types this seems no
different from any other usage.  If the transtype is INTERNAL, then the
type system provides no help; but we could reasonably assume that the
internal representation itself contains as much information as the combine
func needs.  We could add extra arguments like the "finalfunc_extra"
option does for the finalfunc, but I don't really see a need --- that hack
is mainly to satisfy the type system that the finalfunc's signature is
sane, not to tell the finalfunc something it has no other way to find out.
So I think we're probably OK as far as the combine function is concerned.

The situation is much direr as far as serialize/deserialize are concerned.
These basically don't work at all for polymorphic transtypes: if you try
to declare "deserialize(bytea) returns anyarray", the type system won't
let you.  Perhaps that's not an issue because you shouldn't really need
serialize/deserialize for anything except INTERNAL transtype.  However,
there's a far bigger problem which is that "deserialize(bytea) returns
internal" is a blatant security hole, which I see you ignored the defense
against in opr_sanity.sql.  The claim in the comment there that it's okay
if we check for aggregate context is a joke --- or haven't you heard that
we allow users to create aggregates?  That's not nearly good enough to
prevent unsafe usage of such functions.  Not to mention that CREATE
FUNCTION won't allow creation of such functions, so extensions are locked
out of using this feature.

This has to be redesigned or else reverted entirely.  I'm not going to
take no for an answer.

A possible solution is to give deserialize an extra dummy argument, along
the lines of "deserialize(bytea, internal) returns internal", thereby
ensuring it can't be called in any non-system-originated contexts.  This
is still rather dangerous if the other argument is variable, as somebody
might be able to abuse an internal-taking function by naming it as the
deserialize function for a maliciously-designed aggregate.  What I'm
inclined to do to lock it down further is to drop the "serialtype"
argument to CREATE AGGREGATE, which seems rather pointless (what else
would you ever use besides bytea?).  Instead, insist that
serialize/deserialize apply *only* when the transtype is INTERNAL, and
their signatures are exactly "serialize(internal) returns bytea" and
"deserialize(bytea, internal) returns internal", never anything else.

A different way of locking things down, which might be cleaner in the
long run, is to invent a new pseudo-type for the sole purpose of being
the serialization type, that is we'd insist on the signatures being
"serialize(internal) returns serialized_internal" and
"deserialize(serialized_internal) returns internal", where
serialized_internal has the representation properties of bytea but is
usable for no other purpose than this.  Not sure it's worth the trouble
though.

Anyway, setting aside the security angle, it doesn't seem like there is
anything wrong with the high-level design for polymorphic cases.  I'm now
thinking the problem I saw is just a garden variety implementation bug
(or bugs).  I'm still not very happy with the confusion around aggtype
though, not least because I suspect it contributed directly to this bug.
I also feel that both the code comments and the user-facing documentation
for this feature are well below project standards, eg where's the
discussion in section 35.10?
        regards, tom lane



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Robert Haas
Дата:
On Fri, Jun 17, 2016 at 1:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> I don't mind if you feel the need to refactor this.
>
>> I'm not sure yet.  What I think we need to work out first is exactly
>> how polymorphic parallel aggregates ought to identify which concrete
>> data types they are using.  There were well-defined rules before,
>> but how do we adapt those to two levels of aggregate evaluation?
>
> After reviewing things a bit, it seems like the first design-level issue
> there is what about polymorphic combine functions.  So far as the type
> system is concerned, there's no issue: the declared signature is always
> going to be "combine(foo, foo) returns foo" and it doesn't matter whether
> foo is an ordinary type, a polymorphic one, or INTERNAL.  The other
> question is whether the function itself knows what it's operating on in
> the current invocation.  For regular polymorphic types this seems no
> different from any other usage.  If the transtype is INTERNAL, then the
> type system provides no help; but we could reasonably assume that the
> internal representation itself contains as much information as the combine
> func needs.  We could add extra arguments like the "finalfunc_extra"
> option does for the finalfunc, but I don't really see a need --- that hack
> is mainly to satisfy the type system that the finalfunc's signature is
> sane, not to tell the finalfunc something it has no other way to find out.
> So I think we're probably OK as far as the combine function is concerned.

OK....

> The situation is much direr as far as serialize/deserialize are concerned.
> These basically don't work at all for polymorphic transtypes: if you try
> to declare "deserialize(bytea) returns anyarray", the type system won't
> let you.  Perhaps that's not an issue because you shouldn't really need
> serialize/deserialize for anything except INTERNAL transtype.

In fact, I think that it won't even let you declare a serialize or
deserialize function for anything other than an INTERNAL transtype
right now.  There was a PostGIS use case that made it seem like we
might want to relax that, but currently I believe that's the law of
the land.

> However,
> there's a far bigger problem which is that "deserialize(bytea) returns
> internal" is a blatant security hole, which I see you ignored the defense
> against in opr_sanity.sql.  The claim in the comment there that it's okay
> if we check for aggregate context is a joke --- or haven't you heard that
> we allow users to create aggregates?  That's not nearly good enough to
> prevent unsafe usage of such functions.

That may be true, but it isn't obvious to me.  The user can't simply
say SELECT numeric_avg_deserialize(...) or whatever because the
function won't execute when called that way.  How would you arrange to
call such a function in a context where it wouldn't fail outright but
would return a value that you could then hand to some other function
that expects a type-internal input?

> Not to mention that CREATE
> FUNCTION won't allow creation of such functions, so extensions are locked
> out of using this feature.

Oops.

> This has to be redesigned or else reverted entirely.  I'm not going to
> take no for an answer.

Please don't act as if I've been refusing to fix bugs.  I've been
doing practically nothing else.

> A possible solution is to give deserialize an extra dummy argument, along
> the lines of "deserialize(bytea, internal) returns internal", thereby
> ensuring it can't be called in any non-system-originated contexts.

Seems promising.

> This
> is still rather dangerous if the other argument is variable, as somebody
> might be able to abuse an internal-taking function by naming it as the
> deserialize function for a maliciously-designed aggregate.

The complex and, as far as I can tell, largely undocumented set of
rules about what sorts of internal-taking-and-returning functions we
allow to exist strikes me as a house of cards just waiting for the
next stiff breeze to come along.

> What I'm
> inclined to do to lock it down further is to drop the "serialtype"
> argument to CREATE AGGREGATE, which seems rather pointless (what else
> would you ever use besides bytea?).  Instead, insist that
> serialize/deserialize apply *only* when the transtype is INTERNAL, and
> their signatures are exactly "serialize(internal) returns bytea" and
> "deserialize(bytea, internal) returns internal", never anything else.

I originally had the thought that you might want to make the
serialized representation something that could be human-readable.  I'd
actually like to have an SQL syntax that outputs the transition state
or, if it's internal, the serialized representation of it.  That would
allow us to do partitionwise aggregation given a query like SELECT
avg(foo) FROM parent_table_for_several_foreign_tables.  You'd need to
be able to ship something like SELECT PARTIAL avg(foo) FROM t1 to each
child.  And it might be nicer if the results came back in some
human-readable format rather than as a bytea, but it's not essential,
of course.

> A different way of locking things down, which might be cleaner in the
> long run, is to invent a new pseudo-type for the sole purpose of being
> the serialization type, that is we'd insist on the signatures being
> "serialize(internal) returns serialized_internal" and
> "deserialize(serialized_internal) returns internal", where
> serialized_internal has the representation properties of bytea but is
> usable for no other purpose than this.  Not sure it's worth the trouble
> though.

I think we should break up internal into various kinds of internal
depending on what kind of a thing we've actually got a pointer to.  We
don't need more synonyms for bytea, IMO.

> Anyway, setting aside the security angle, it doesn't seem like there is
> anything wrong with the high-level design for polymorphic cases.  I'm now
> thinking the problem I saw is just a garden variety implementation bug
> (or bugs).  I'm still not very happy with the confusion around aggtype
> though, not least because I suspect it contributed directly to this bug.
> I also feel that both the code comments and the user-facing documentation
> for this feature are well below project standards, eg where's the
> discussion in section 35.10?

Good point.  That needs to be addressed -- ideally by David, but if he
doesn't then I guess I'll have to do it.  If you have other specific
comments, please send them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jun 17, 2016 at 1:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> there's a far bigger problem which is that "deserialize(bytea) returns
>> internal" is a blatant security hole, which I see you ignored the defense
>> against in opr_sanity.sql.  The claim in the comment there that it's okay
>> if we check for aggregate context is a joke --- or haven't you heard that
>> we allow users to create aggregates?  That's not nearly good enough to
>> prevent unsafe usage of such functions.

> That may be true, but it isn't obvious to me.  The user can't simply
> say SELECT numeric_avg_deserialize(...) or whatever because the
> function won't execute when called that way.  How would you arrange to
> call such a function in a context where it wouldn't fail outright but
> would return a value that you could then hand to some other function
> that expects a type-internal input?

The concern I have is that you could stick it into an aggregate that isn't
the one it's expecting to be used in, or into a slot in that aggregate
that isn't deserialize(), and the run-time test can't detect either of
those things.  Now maybe this is locked down sufficiently by the fact
that we don't let non-superusers create aggregates with transtype
INTERNAL, but that seems pretty shaky to me given the number of moving
parts in aggregates these days and the fact that we keep adding more.
But whether there's a live security bug there today is really moot,
because of:

>> Not to mention that CREATE
>> FUNCTION won't allow creation of such functions, so extensions are locked
>> out of using this feature.

> Oops.

I think that means we *have* to change this.

>> A possible solution is to give deserialize an extra dummy argument, along
>> the lines of "deserialize(bytea, internal) returns internal", thereby
>> ensuring it can't be called in any non-system-originated contexts.

> Seems promising.

If nobody has a better idea, I'll pursue that next week or so.  It doesn't
seem like a must-fix-for-beta2.  (Letting it slide means we're locked into
an initdb or pg_upgrade again for beta3, but I don't have a lot of hope of
avoiding that anyway.)

Meanwhile, I'll go see if I can work out exactly what's broken about the
polymorphic type-slinging.  That I would like to see working in beta2.

> I originally had the thought that you might want to make the
> serialized representation something that could be human-readable.

Perhaps, but on efficiency grounds I doubt that people would want that
to be the same API that's used for partial aggregation.  I could see
adding an additional function slot "prettyprint(internal) returns text"
to aggregates for this purpose.  In any case, even if we lock down the
declared type to be bytea, there's nothing stopping someone from defining
their serialization code to output an ASCII string.

> I think we should break up internal into various kinds of internal
> depending on what kind of a thing we've actually got a pointer to.

Not a bad long-term project, but it's not happening in this cycle.
I'm not very sure how we'd go about it anyway --- for examples
like this, every new user-defined aggregate potentially wants its
own flavor of "internal", so how do we manage that?
        regards, tom lane



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:

> > I think we should break up internal into various kinds of internal
> > depending on what kind of a thing we've actually got a pointer to.
> 
> Not a bad long-term project, but it's not happening in this cycle.
> I'm not very sure how we'd go about it anyway --- for examples
> like this, every new user-defined aggregate potentially wants its
> own flavor of "internal", so how do we manage that?

Can't we have them be ExtensibleNode?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Robert Haas
Дата:
On Fri, Jun 17, 2016 at 3:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The concern I have is that you could stick it into an aggregate that isn't
> the one it's expecting to be used in, or into a slot in that aggregate
> that isn't deserialize(), and the run-time test can't detect either of
> those things.  Now maybe this is locked down sufficiently by the fact
> that we don't let non-superusers create aggregates with transtype
> INTERNAL, but that seems pretty shaky to me given the number of moving
> parts in aggregates these days and the fact that we keep adding more.

Well, I'm not averse to changing it for more security, but "there
could be a bug there in somewhere" is a bit different from "the claim
in the comment there that it's okay if we check for aggregate context
is a joke".

>>> Not to mention that CREATE
>>> FUNCTION won't allow creation of such functions, so extensions are locked
>>> out of using this feature.
>
>> Oops.
>
> I think that means we *have* to change this.

Well, we don't *have* to change things for this reason, but it's
certainly not at all desirable for user-defined aggregates to be
locked out of this functionality.  So I'm in favor of changing it.

>> I think we should break up internal into various kinds of internal
>> depending on what kind of a thing we've actually got a pointer to.
>
> Not a bad long-term project, but it's not happening in this cycle.
> I'm not very sure how we'd go about it anyway --- for examples
> like this, every new user-defined aggregate potentially wants its
> own flavor of "internal", so how do we manage that?

I think we'd want some way to easily spin up new internal-ish types.
CREATE TYPE myinternalthingy AS INTERNAL, or something like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> I think we should break up internal into various kinds of internal
>>> depending on what kind of a thing we've actually got a pointer to.

>> Not a bad long-term project, but it's not happening in this cycle.
>> I'm not very sure how we'd go about it anyway --- for examples
>> like this, every new user-defined aggregate potentially wants its
>> own flavor of "internal", so how do we manage that?

> Can't we have them be ExtensibleNode?

No, these are data types and values of data types, not parsetree nodes.
The core problem is to teach the type system to prevent you from
sticking foo(internal) into a context where the actual value passed
will be some other kind of "internal" than it's expecting.

Having said that, another possible solution is to establish a convention
that every sort of "internal" thingy should have an ID word (with a
different magic number for each kind of thingy) that all functions check.
But I'm not convinced that that's going to be convenient to do everywhere,
and it's certainly not likely to be easier to bolt on than a bunch of new
internal-ish type OIDs.
        regards, tom lane



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:

> > Can't we have them be ExtensibleNode?
> 
> No, these are data types and values of data types, not parsetree nodes.

Ah, right.

> The core problem is to teach the type system to prevent you from
> sticking foo(internal) into a context where the actual value passed
> will be some other kind of "internal" than it's expecting.

Makes sense as a safety measure.

> Having said that, another possible solution is to establish a convention
> that every sort of "internal" thingy should have an ID word (with a
> different magic number for each kind of thingy) that all functions check.
> But I'm not convinced that that's going to be convenient to do everywhere,
> and it's certainly not likely to be easier to bolt on than a bunch of new
> internal-ish type OIDs.

I think the magic number is a lot more convenient than new type OIDs.
As the number of internal types grows, it will become a nuisance,
considering that you need to add boilerplate for in/out/send/recv
functions, document it as a pseudotype, yadda yadda yadda.  Some central
place to define the internal numbers to ensure uniqueness, akin to
duplicate_oids, is probably a necessity, though, which may be a bit too
much scripting to be indulging in, this late in the cycle.

I suppose we could have done pg_ddl_command this way, but it's already
in 9.5 so it's probably too late for that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
I wrote:
> Meanwhile, I'll go see if I can work out exactly what's broken about the
> polymorphic type-slinging.  That I would like to see working in beta2.

OK, I've got at least the core of the problem.  fix_combine_agg_expr
changes the upper aggregate's Aggref node so that its args list is a
single Var of the transtype.  This has nothing whatsoever to do with the
original user-supplied aggregate arguments, not as to datatype nor even
number of arguments.  Nonetheless, nodeAgg.c uses that args list as input
to get_aggregate_argtypes(), and thence to resolve_aggregate_transtype()
and build_aggregate_finalfn_expr(), as well as some miscellaneous checks
in build_pertrans_for_aggref (though by luck it appears that those checks
aren't reachable for a combining Aggref).

It's basically just luck that this ever works; we'd have noticed it long
ago if it weren't that so few of the aggregates that currently have
combinefns are either polymorphic or multiple-argument.  Likely the
only polymorphic one that ever got tested is anyarray_agg, and that
accidentally fails to fail because (1) enforce_generic_type_consistency
has special-case exceptions for ANYARRAY, and (2) the array-mashing code
doesn't actually need to know which array type it's dealing with
since it finds that out by looking into the array values.

We could fix the resolve_aggregate_transtype aspect of this by storing
the resolved transtype into Aggref nodes sometime before whacking them
into partial format, rather than re-deducing it during nodeAgg startup.
I'm inclined to do that just on efficiency grounds; but it's not enough
to fix the problem here, because we'd still be passing the wrong argument
types to build_aggregate_finalfn_expr(), and any finalfunc that actually
used that information would fail.

It seems like the best way to fix this is to add another field to Aggref
that is an OID list of the original argument types, which we don't change
when replacing the argument list for a combining agg, and to look to that
rather than the physical args list when extracting the argument types in
nodeAgg.c.  This is kind of bulky and grotty, but it should work reliably.

I also thought about leaving the original args list in place, and having
a secondary args list used only for combining Aggrefs that contains just
the Var referencing the subsidiary transtype result.  An attraction of
that is that we'd not need that Rube-Goldberg-worthy method for printing
combining Aggrefs in ruleutils.c.  However, the cruft would still leak out
somewhere, because if we just did that much then setrefs.c would want to
find all the aggregate input values in the output of the subsidiary plan
node, where of course they aren't and shouldn't be.  Maybe we could skip
resolving the original args list into subplan references in this case,
but that might well break ruleutils, so it just seems like a mess.

Yet another idea is to have nodeAgg do something similar to what ruleutils
does and drill down through the subsidiary-transtype-result Var to find
the partial Aggref and grab its argument list.  But that's just doubling
down on a mechanism that we should be looking to get rid of not emulate.

So at this point my proposal is:

1. Add an OID-list field to Aggref holding the data types of the
user-supplied arguments.  This can be filled in parse analysis since it
won't change thereafter.  Replace calls to get_aggregate_argtypes() with
use of the OID-list field, or maybe keep the function but have it look
at the OID list not the physical args list.

2. Add an OID field to Aggref holding the resolved (non polymorphic)
transition data type's OID.  For the moment we could just set this
during parse analysis, since we do not support changing the transtype
of an existing aggregate.  If we ever decide we want to allow that,
the lookup could be postponed into the planner.  Replace calls to
resolve_aggregate_transtype with use of this field.
(resolve_aggregate_transtype() wouldn't disappear, but it would be
invoked only once during parse analysis, not multiple times per query.)

#2 isn't necessary to fix the bug, but as long as we are doing #1
we might as well do #2 too, to buy back some of the planner overhead
added by parallel query.

Barring objections I'll make this happen by tomorrow.

I still don't like anything about the way that the matching logic in
fix_combine_agg_expr works, but at the moment I can't point to any
observable bug from that, so I'll not try to change it before beta2.
        regards, tom lane



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
David Rowley
Дата:
On 18 June 2016 at 05:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The situation is much direr as far as serialize/deserialize are concerned.
> These basically don't work at all for polymorphic transtypes: if you try
> to declare "deserialize(bytea) returns anyarray", the type system won't
> let you.  Perhaps that's not an issue because you shouldn't really need
> serialize/deserialize for anything except INTERNAL transtype.  However,
> there's a far bigger problem which is that "deserialize(bytea) returns
> internal" is a blatant security hole, which I see you ignored the defense
> against in opr_sanity.sql.  The claim in the comment there that it's okay
> if we check for aggregate context is a joke --- or haven't you heard that
> we allow users to create aggregates?  That's not nearly good enough to
> prevent unsafe usage of such functions.  Not to mention that CREATE
> FUNCTION won't allow creation of such functions, so extensions are locked
> out of using this feature.

This is an unfortunate oversight, and it is my fault.

> This has to be redesigned or else reverted entirely.  I'm not going to
> take no for an answer.

I don't think anyone is against fixing any of the said issues. I
personally was committed to other non-work related things Friday -
Sunday and I was unable to do anything with this.

> A possible solution is to give deserialize an extra dummy argument, along
> the lines of "deserialize(bytea, internal) returns internal", thereby
> ensuring it can't be called in any non-system-originated contexts.  This
> is still rather dangerous if the other argument is variable, as somebody
> might be able to abuse an internal-taking function by naming it as the
> deserialize function for a maliciously-designed aggregate.  What I'm
> inclined to do to lock it down further is to drop the "serialtype"
> argument to CREATE AGGREGATE, which seems rather pointless (what else
> would you ever use besides bytea?).  Instead, insist that
> serialize/deserialize apply *only* when the transtype is INTERNAL, and
> their signatures are exactly "serialize(internal) returns bytea" and
> "deserialize(bytea, internal) returns internal", never anything else.

This is also the only way that I can think of to fix this issue. If we
can agree that the fix should be to insist that the deserialisation
function take an additional 2nd parameter of INTERNAL, then I can
write a patch to fix this, and include a patch for the document
section 35.10 to explain better about parallelising user defined
aggregates.

> A different way of locking things down, which might be cleaner in the
> long run, is to invent a new pseudo-type for the sole purpose of being
> the serialization type, that is we'd insist on the signatures being
> "serialize(internal) returns serialized_internal" and
> "deserialize(serialized_internal) returns internal", where
> serialized_internal has the representation properties of bytea but is
> usable for no other purpose than this.  Not sure it's worth the trouble
> though.

I agree this idea has merit, but seems like quite a large project for
9.6 at this stage. Adding the extra parameter to the deserialiation
function seems more reasonable at this stage, although once we get
that we're likely stuck with it even if we devise a better way to
handle all this later.

> Anyway, setting aside the security angle, it doesn't seem like there is
> anything wrong with the high-level design for polymorphic cases.  I'm now
> thinking the problem I saw is just a garden variety implementation bug
> (or bugs).  I'm still not very happy with the confusion around aggtype
> though, not least because I suspect it contributed directly to this bug.
> I also feel that both the code comments and the user-facing documentation
> for this feature are well below project standards, eg where's the
> discussion in section 35.10?

Thank you for committing the fix for the originally reported problem
with the polymorphic aggregates.

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
David Rowley
Дата:
On 18 June 2016 at 09:29, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So at this point my proposal is:
>
> 1. Add an OID-list field to Aggref holding the data types of the
> user-supplied arguments.  This can be filled in parse analysis since it
> won't change thereafter.  Replace calls to get_aggregate_argtypes() with
> use of the OID-list field, or maybe keep the function but have it look
> at the OID list not the physical args list.
>
> 2. Add an OID field to Aggref holding the resolved (non polymorphic)
> transition data type's OID.  For the moment we could just set this
> during parse analysis, since we do not support changing the transtype
> of an existing aggregate.  If we ever decide we want to allow that,
> the lookup could be postponed into the planner.  Replace calls to
> resolve_aggregate_transtype with use of this field.
> (resolve_aggregate_transtype() wouldn't disappear, but it would be
> invoked only once during parse analysis, not multiple times per query.)
>
> #2 isn't necessary to fix the bug, but as long as we are doing #1
> we might as well do #2 too, to buy back some of the planner overhead
> added by parallel query.
>
> Barring objections I'll make this happen by tomorrow.

Thanks for committing this change.

>From reading over the commit I see you've left;

+ * XXX need more documentation about partial aggregation here

Would the attached cover off what you had imagined might go here?

> I still don't like anything about the way that the matching logic in
> fix_combine_agg_expr works, but at the moment I can't point to any
> observable bug from that, so I'll not try to change it before beta2.

Right, I see more work is needed in that area as there's now an
out-of-date comment at the top of
search_indexed_tlist_for_partial_aggref. The comment claims we only
ignore aggoutputtype, but you added aggtranstype and aggargtypes to
that list.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
David Rowley
Дата:
On 20 June 2016 at 19:06, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 18 June 2016 at 05:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A possible solution is to give deserialize an extra dummy argument, along
>> the lines of "deserialize(bytea, internal) returns internal", thereby
>> ensuring it can't be called in any non-system-originated contexts.  This
>> is still rather dangerous if the other argument is variable, as somebody
>> might be able to abuse an internal-taking function by naming it as the
>> deserialize function for a maliciously-designed aggregate.  What I'm
>> inclined to do to lock it down further is to drop the "serialtype"
>> argument to CREATE AGGREGATE, which seems rather pointless (what else
>> would you ever use besides bytea?).  Instead, insist that
>> serialize/deserialize apply *only* when the transtype is INTERNAL, and
>> their signatures are exactly "serialize(internal) returns bytea" and
>> "deserialize(bytea, internal) returns internal", never anything else.
>
> This is also the only way that I can think of to fix this issue. If we
> can agree that the fix should be to insist that the deserialisation
> function take an additional 2nd parameter of INTERNAL, then I can
> write a patch to fix this, and include a patch for the document
> section 35.10 to explain better about parallelising user defined
> aggregates.

I've gone and implemented the dummy argument approach for
deserialization functions.

If we go with this, I can then write the docs for 35.10 which'll serve
to explain parallel user defined aggregates in detail.

Some notes about the patch;

I didn't remove the comments at the top of each deserial function
which mention something like:

 * numeric_avg_serialize(numeric_avg_deserialize(bytea)) must result in a value
 * which matches the original bytea value.

I'm thinking that perhaps these now make a little less sense, given
that numeric_avg_deserialize is now numeric_avg_deserialize(bytea,
internal).

Perhaps these should be updated or removed.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
David Rowley
Дата:
On 20 June 2016 at 22:19, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've gone and implemented the dummy argument approach for
> deserialization functions.
>
> If we go with this, I can then write the docs for 35.10 which'll serve
> to explain parallel user defined aggregates in detail.

I've attached my proposed patch for the user defined aggregate docs.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
David Rowley <david.rowley@2ndquadrant.com> writes:
> I've gone and implemented the dummy argument approach for
> deserialization functions.

How do you feel about the further idea of locking down the signatures
to be exactly "serialize(internal) returns bytea" and "deserialize(bytea,
internal) returns internal", and removing pg_aggregate.aggserialtype?
I don't see very much value in allowing any other nominal transmission
type besides bytea; and the less flexibility in these function signatures,
the less chance of confusion/misuse of other internal-related functions.
        regards, tom lane



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
I wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> I've gone and implemented the dummy argument approach for
>> deserialization functions.

> How do you feel about the further idea of locking down the signatures
> to be exactly "serialize(internal) returns bytea" and "deserialize(bytea,
> internal) returns internal", and removing pg_aggregate.aggserialtype?
> I don't see very much value in allowing any other nominal transmission
> type besides bytea; and the less flexibility in these function signatures,
> the less chance of confusion/misuse of other internal-related functions.

Not hearing any objections, pushed that way.
        regards, tom lane



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
David Rowley
Дата:
On 23 June 2016 at 08:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> David Rowley <david.rowley@2ndquadrant.com> writes:
>>> I've gone and implemented the dummy argument approach for
>>> deserialization functions.
>
>> How do you feel about the further idea of locking down the signatures
>> to be exactly "serialize(internal) returns bytea" and "deserialize(bytea,
>> internal) returns internal", and removing pg_aggregate.aggserialtype?
>> I don't see very much value in allowing any other nominal transmission
>> type besides bytea; and the less flexibility in these function signatures,
>> the less chance of confusion/misuse of other internal-related functions.

I don't object to removing it. If we find a use for non-bytea serial
types down the line, then we can add it again.

> Not hearing any objections, pushed that way.

Many thanks for committing those fixes.


-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
David Rowley <david.rowley@2ndquadrant.com> writes:
> I've attached my proposed patch for the user defined aggregate docs.

I whacked this around some more and pushed it.

While working on that, I noticed what seems to me to be a minor bug.
The behavior that I'd expect (and that I documented) for a deserialization
function is that it just allocates its result in the current, short-lived
memory context, since it will be the combine function's responsibility to
merge that into the long-lived transition state.  But it looks to me like
the deserialization functions in numeric.c are allocating their results
in the aggregate context, which will mean a leak.  (For example,
numeric_avg_deserialize creates its result using makeNumericAggState
which forces the result into the agg context.)

Now this leak isn't going to be real significant unless we accumulate a
whole lot of partial results in one query, which would be unusual and
maybe even impossible in the current parallel-query environment.  But it
still seems wrong.  Please check, and submit a patch if I'm right about
what's happening.
        regards, tom lane



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
David Rowley
Дата:
On 23 June 2016 at 11:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> While working on that, I noticed what seems to me to be a minor bug.
> The behavior that I'd expect (and that I documented) for a deserialization
> function is that it just allocates its result in the current, short-lived
> memory context, since it will be the combine function's responsibility to
> merge that into the long-lived transition state.  But it looks to me like
> the deserialization functions in numeric.c are allocating their results
> in the aggregate context, which will mean a leak.  (For example,
> numeric_avg_deserialize creates its result using makeNumericAggState
> which forces the result into the agg context.)

Yes, you're right.

In the end I decided to add a makeNumericAggStateCurrentContext()
function which does not perform any memory context switching at all.
It seems like this can be used for the combine functions too, since
they've already switched to the aggregate memory context. This should
save a few cycles during aggregate combine, and not expend any extra
as some alternatives, like adding a flag to makeNumericAggState().


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 23 June 2016 at 11:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The behavior that I'd expect (and that I documented) for a deserialization
>> function is that it just allocates its result in the current, short-lived
>> memory context, since it will be the combine function's responsibility to
>> merge that into the long-lived transition state.  But it looks to me like
>> the deserialization functions in numeric.c are allocating their results
>> in the aggregate context, which will mean a leak.  (For example,
>> numeric_avg_deserialize creates its result using makeNumericAggState
>> which forces the result into the agg context.)

> Yes, you're right.

> In the end I decided to add a makeNumericAggStateCurrentContext()
> function which does not perform any memory context switching at all.
> It seems like this can be used for the combine functions too, since
> they've already switched to the aggregate memory context. This should
> save a few cycles during aggregate combine, and not expend any extra
> as some alternatives, like adding a flag to makeNumericAggState().

You missed the ones using makePolyNumAggState --- I fixed that and
pushed it.
        regards, tom lane



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Noah Misch
Дата:
On Thu, Jun 23, 2016 at 10:57:26AM -0400, Tom Lane wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > On 23 June 2016 at 11:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> The behavior that I'd expect (and that I documented) for a deserialization
> >> function is that it just allocates its result in the current, short-lived
> >> memory context, since it will be the combine function's responsibility to
> >> merge that into the long-lived transition state.  But it looks to me like
> >> the deserialization functions in numeric.c are allocating their results
> >> in the aggregate context, which will mean a leak.  (For example,
> >> numeric_avg_deserialize creates its result using makeNumericAggState
> >> which forces the result into the agg context.)
> 
> > Yes, you're right.
> 
> > In the end I decided to add a makeNumericAggStateCurrentContext()
> > function which does not perform any memory context switching at all.
> > It seems like this can be used for the combine functions too, since
> > they've already switched to the aggregate memory context. This should
> > save a few cycles during aggregate combine, and not expend any extra
> > as some alternatives, like adding a flag to makeNumericAggState().
> 
> You missed the ones using makePolyNumAggState --- I fixed that and
> pushed it.

What, if anything, is yet required to close this 9.6 open item?



Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype

От
Tom Lane
Дата:
Noah Misch <noah@leadboat.com> writes:
> What, if anything, is yet required to close this 9.6 open item?

The original complaint about polymorphic aggs is fixed to my satisfaction.
The business about how non-text-format EXPLAIN reports parallel plans
(<16002.1466972724@sss.pgh.pa.us>) remains, but perhaps we should view
that as an independent issue.
        regards, tom lane