Обсуждение: PG 7.2b4 bug?
Apparently there's been a change in the way views are handled within PostreSQL. The following program works fine in earlier versions. It also works if the select on the view is replaced with a direct call to nextval(). We use this construct repeatedly in OpenACS so we can share queries that use sequences between Oracle and PostgreSQL unless they differ in other ways that can't be reconciled between the two RDBMS's. Obviously, the greater the number of queries we can share in this way, the lower our porting, new code development, and code maintenance costs. As it stands neither the older OpenACS 3.x toolkit nor the upcoming OpenACS 4.x toolkit will work with PG 7.2. Ouch. Sorry for the long delay in reporting this. I only recently decided to get off my rear end and test against PG 7.2 after Marc Fournier tried to do an install and ran into a few problems. Here's some code. The problem's caused by the fact that two rows are being inserted due to the reference to "multiple rows". Sequential inserts of single rows using the view work just fine: create sequence test_seq_x; create view test_seq as select nextval('test_seq_x') as nextval; create table data (i integer primary key); create table multiple_rows (i integer); insert into multiple_rows values (1); insert into multiple_rows values (2); insert into data select test_seq.nextval from multiple_rows; -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org
Don Baccus <dhogaza@pacifier.com> writes: > Apparently there's been a change in the way views are handled within > PostreSQL. The following program works fine in earlier versions. AFAICT, it was just pure, unadulterated luck that it "works" in prior versions. In 7.1 I get: regression=# select test_seq.nextval from multiple_rows; NOTICE: Adding missing FROM-clause entry for table "test_seq"nextval --------- 3 4 (2 rows) regression=# explain select test_seq.nextval from multiple_rows; NOTICE: Adding missing FROM-clause entry for table "test_seq" NOTICE: QUERY PLAN: Nested Loop (cost=0.00..30.00 rows=1000 width=4) -> Seq Scan on multiple_rows (cost=0.00..20.00 rows=1000 width=0) -> Subquery Scan test_seq (cost=0.00..0.00 rows=0 width=0) -> Result (cost=0.00..0.00 rows=0 width=0) EXPLAIN In 7.2 I get: regression=# select test_seq.nextval from multiple_rows; NOTICE: Adding missing FROM-clause entry for table "test_seq"nextval --------- 4 4 (2 rows) regression=# explain select test_seq.nextval from multiple_rows; NOTICE: Adding missing FROM-clause entry for table "test_seq" NOTICE: QUERY PLAN: Nested Loop (cost=0.00..30.01 rows=1000 width=8) -> Subquery Scan test_seq (cost=0.00..0.01 rows=1 width=0) -> Result (cost=0.00..0.01 rows=1 width=0) -> Seq Scan on multiple_rows (cost=0.00..20.00 rows=1000 width=0) EXPLAIN The reason it "works" in 7.1 is that the view is the inside of the nested loop, and so is re-evaluated for each tuple from the outer query. (The Result node is where the nextval call is actually being evaluated.) In 7.2 the view has been placed on the outside of the nested loop, so it's only evaluated once. The reason for the change is that the 7.2 planner makes the (much more realistic) assumption that evaluating the Result node isn't free, and so it considers that evaluating the view multiple times is more expensive than doing it only once. This can be demonstrated to be the cause by setting the Result cost to zero; then the behavior matches 7.1: regression=# show cpu_tuple_cost ; NOTICE: cpu_tuple_cost is 0.01 SHOW VARIABLE regression=# set cpu_tuple_cost to 0; SET VARIABLE regression=# explain select test_seq.nextval from multiple_rows; NOTICE: Adding missing FROM-clause entry for table "test_seq" NOTICE: QUERY PLAN: Nested Loop (cost=0.00..10.00 rows=1000 width=8) -> Seq Scan on multiple_rows (cost=0.00..10.00 rows=1000 width=0) -> Subquery Scan test_seq (cost=0.00..0.00 rows=1 width=0) -> Result (cost=0.00..0.00 rows=1 width=0) EXPLAIN regression=# select test_seq.nextval from multiple_rows; NOTICE: Adding missing FROM-clause entry for table "test_seq"nextval --------- 5 6 (2 rows) However, it's pure luck that you get the nested loop expressed this way and not the other way when the costs come out the same. I'm surprised that you consistently got the behavior you wanted in queries more complex than this test case. I'd have to say that I consider the code as given to be broken; it's not a bug for the planner to rearrange this query in any way it sees fit. It would be nice to accept the Oracle syntax for nextval, but I'm afraid this hack doesn't get the job done :-( regards, tom lane
On Mon, 17 Dec 2001, Don Baccus wrote: > insert into data > select test_seq.nextval > from multiple_rows; I'm not sure that's wrong though with that example. test_seq.nextval in the select list means to PostgreSQL a join with test_seq which is a view with one row and I'd expect it to only evaluate that one row once, if it did it more than once in the past, I'd say it was buggy. However, I'd think: "select (select nextval from test_seq) from multiple_rows;" should give you different values and doesn't, although "select (select nextval from test_seq where i IS NULL or i IS NOT NULL)from multiple_rows;" does give you different values.
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > However, I'd think: > "select (select nextval from test_seq) from multiple_rows;" > should give you different values and doesn't, although > "select (select nextval from test_seq where i IS NULL or i IS NOT NULL) > from multiple_rows;" does give you different values. In the first case, the subselect is visibly not dependent on the outer query, so it's evaluated only once; in the second case it has to be re-evaluated for each row using that row's value of i. You can see the difference (InitPlan vs. SubPlan) in the query's EXPLAIN output. regards, tom lane
Tom Lane wrote: > Don Baccus <dhogaza@pacifier.com> writes: > >>Apparently there's been a change in the way views are handled within >>PostreSQL. The following program works fine in earlier versions. >> > > AFAICT, it was just pure, unadulterated luck that it "works" in prior > versions. > > In 7.1 I get: > > regression=# select test_seq.nextval from multiple_rows; > NOTICE: Adding missing FROM-clause entry for table "test_seq" > nextval > --------- > 3 > 4 > (2 rows) Normally one expects a statement's sematics to depend only upon the source code and to be consistent, not to vary depending on the mood du jour of the processor ... this also fails (it's the same statement with manual substitution): test=# select (select nextval('test_seq_x') as nextval) as test_seq from multiple_rows; test_seq ---------- 2 2 (2 rows) test=# In other words the function's only called once (as I expected). I've looked at Date and Darwin's appendix on SQL3's PSMs but it's no help that I can see, it doesn't get into nitpicking semantic details regarding their use in queries, just their definition. Maybe the behavior's implementation defined ... if not, I'd presume SQL3 states that a function in the above context iscalled either once per row or once per query, not sometimes one or sometimes the other. So I think it's too early to write this off as not being a bug ... -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org
On Mon, 17 Dec 2001, Tom Lane wrote: > Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > > However, I'd think: > > "select (select nextval from test_seq) from multiple_rows;" > > should give you different values and doesn't, although > > "select (select nextval from test_seq where i IS NULL or i IS NOT NULL) > > from multiple_rows;" does give you different values. > > In the first case, the subselect is visibly not dependent on the outer > query, so it's evaluated only once; in the second case it has to be > re-evaluated for each row using that row's value of i. You can see the > difference (InitPlan vs. SubPlan) in the query's EXPLAIN output. I figured that, but I'm not sure whether or not that's a bug of over-optimizing since I think that it probably *should* give you different results since it is different each time it's evaluated. Without checking spec, I'd expect that conceptually the select list entries are evaluated per row, even if we can avoid that when the value is certain to be the same, which would mean the result is incorrect since if it was evaluated per row it would give different results each time.
Don Baccus <dhogaza@pacifier.com> writes: > Maybe the behavior's implementation defined ... if not, I'd presume SQL3 > states that a function in the above context is called either once per > row or once per query, not sometimes one or sometimes the other. AFAICT, the relevant concept in SQL99 is whether a function is "deterministic" or not: An SQL-invoked routine is either deterministic or possibly non- deterministic. An SQL-invoked function thatis deterministic always returns the same return value for a given list of SQL argument values. An SQL-invokedprocedure that is deterministic always returns the same values in its output and inout SQL parameters for a given list of SQL argument values. An SQL-invoked routine is possibly non-deterministic if, during invocationof that SQL- invoked routine, an SQL-implementation might, at two different times when the stateof the SQL-data is the same, produce unequal results due to General Rules that specify implementation-dependent behavior. It looks to me like the spec does NOT attempt to nail down the behavior of non-deterministic functions; in the places where they talk about non-deterministic functions at all, it's mostly to forbid their use in contexts where nondeterminism would affect the final result. Otherwise the results are implementation-defined. regards, tom lane
Don Baccus <dhogaza@pacifier.com> writes: > Most language standards - at least the ones I've worked > on - require compliant implementations to define and document > implementation-defined behavior ... SQL99 saith: g) implementation-defined: Possibly differing between SQL- implementations, but specified by the implementorfor each particular SQL-implementation. h) implementation-dependent: Possibly differing between SQL- implementations, but not specified by ISO/IEC9075, and not required to be specified by the implementor for any particular SQL-implementations. Behavior of nondeterministic functions falls in the second category ... regards, tom lane
Tom Lane wrote: > Don Baccus <dhogaza@pacifier.com> writes: > >>Maybe the behavior's implementation defined ... if not, I'd presume SQL3 >> states that a function in the above context is called either once per >>row or once per query, not sometimes one or sometimes the other. > It looks to me like the spec does NOT attempt to nail down the behavior > of non-deterministic functions; in the places where they talk about > non-deterministic functions at all, it's mostly to forbid their use in > contexts where nondeterminism would affect the final result. Otherwise > the results are implementation-defined. Thanks ... I wasn't trying to lobby for a change, I just wanted to make sure that the standard stated that the behavior is implementation defined or otherwise punted on the issue before my example was written off as a non-bug. At some point the non-deterministic behavior of non-deterministic functions called in subselects in the target list should probably be documented, no? Most language standards - at least the ones I've worked on - require compliant implementations to define and document implementation-defined behavior ... Maybe a warning would be appropriate, too? I realize both of the above would rank pretty low in priority on the todo list ... -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org
Tom Lane wrote: > Don Baccus <dhogaza@pacifier.com> writes: > >>Most language standards - at least the ones I've worked >>on - require compliant implementations to define and document >>implementation-defined behavior ... >> > > SQL99 saith: > > g) implementation-defined: Possibly differing between SQL- > implementations, but specified by the implementor for each > particular SQL-implementation. > > h) implementation-dependent: Possibly differing between SQL- > implementations, but not specified by ISO/IEC 9075, and not > required to be specified by the implementor for any particular > SQL-implementations. > > Behavior of nondeterministic functions falls in the second category ... Yep, those are the definitions I'm used to. OK, then, since this is implementation-dependent, not implementation-defined, PG's off the hook entirely! -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org
Tom Lane wrote: > Don Baccus <dhogaza@pacifier.com> writes: > >>Maybe the behavior's implementation defined ... if not, I'd presume SQL3 >> states that a function in the above context is called either once per >>row or once per query, not sometimes one or sometimes the other. This is still bothering me so I decided to plunge into the standard myself. First of all... > > AFAICT, the relevant concept in SQL99 is whether a function is > "deterministic" or not: Actually this argument may well apply to the function all within the subselect or view, but I fail to see any language in the standard that suggests that this trumps the following declaration about the execution of a <query specification> (what many of us informally refer to as a "SELECT"): (from section 7.12, Query Specification) a) If T is not a grouped table, then Case: (I deleted Case i, which refers to standard aggregates like COUNT) ii) If the <select list> does not include a <set function specification> that contains a referenceto T, then each <value expression> is applied to each row of T yielding a table of M rows, where M is the cardinality of T ... (FYI a <set function specification> is a standard aggregate like COUNT, i.e. Case ii pertains to those queries that don't fall into Case i) ISTM that this quite clearly states that a subselect in a target list should be applied to each row to be returned in M. I don't see any waffle-room here. I would have to dig more deeply into the standard's view regarding VIEW semantics but I would assume it would knit together in a consistent manner. For instance, earlier we saw the following exchange between Stephen and Tom: Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > > However, I'd think:> > "select (select nextval from test_seq) from multiple_rows;"> > should give you different valuesand doesn't, although> > "select (select nextval from test_seq where i IS NULL or i IS NOT NULL)> > from multiple_rows;"does give you different values. >> In the first case, the subselect is visibly not dependent on the outer> query, so it's evaluated only once; in the secondcase it has to be> re-evaluated for each row using that row's value of i. Note that the standard does not give you this freedom. It says the <value expression> (in this case the subselect, and yes subselects are valid <value expressions> in SQL3, at least in my draft) must be applied to each row. IMO this means that the optimizer can choose to evaluate the <value expression> once only if it knows for certain that multiple calls will return the same value. For example "my_cachable_function()", not "my_uncachable_function()" or "nextval()". Or IMO a view built using a non-cachable function. In other words it can only suppress evaluation if it can be certain that doing so doesn't change the result. Another nit-pick, the claim's not even strictly true. "i IS NULL OR i IS NOT NULL" can be folded to true, so the subselect's not "visibly dependent on i". In fact, it is quite visibly *not* dependent on the outer query. PG just isn't smart enough to fold the boolean expression into the known value "true". It's this kind of uncertainty that makes the current behavior so ... ugly. You get different answers depending on various optimization values, the complexity of the query, etc. ISTM that the standard is quite clearly worded to avoid this unpleasantness. ... > It looks to me like the spec does NOT attempt to nail down the behavior > of non-deterministic functions; in the places where they talk about > non-deterministic functions at all, it's mostly to forbid their use in > contexts where nondeterminism would affect the final result. Otherwise > the results are implementation-dependent. I've been looking at a few of the "non-deterministic" clauses in the General Rules, out of curiousity. They generally aren't involved with the execution or non-execution of expressions. Ordering of execution is in many cases non-deterministic and implementation-dependent. There are plenty of General Rules of this sort. We also have this big gaping black hole of non-determinism due to character set collation. In other words: select foo from bar order by foo; is non-deterministic (we don't know the order in which the rows will be returned) if foo is a character type. This can even be true within implementations, for instance in PG it changes with when you change locales (and have locale support enabled). However, it seems clear that: select foo, my_function() from bar order by foo; requires my_function() to be called for every row - we just can't depend on the order in which it will be applied to those rows in the case where foo is a character type. Of course, iscachable tells the optimizer that it's OK to just call it once but that's an extension outside SQL3's domain. Obviously if you run this query over and over again with the same collation order the "order by" is deterministic. The non-determinism is in respect to the portability of the query to implementationsbuilt on differing character sets. I'm just not seeing justification for claiming that Section 7.12 can be ignored if the subselect or view happens to contain a function that's not cachable. -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org
Some more bug-or-not-bug thoughts ... I thought I'd add a quote from Date that furthers my belief that the subselect example I posted does indeed expose a bug: (T1 is the table conceptually created by the various joins, etc) "[if] the select-item takes the form "scalar-expression [[AS] column]" ... For each row in T1 the scalar-expression is evaluated .." (page 151 Date & Darwin) SQL92 didn't support subselects in the select-item-list. SQL3 extends the expression to include one-row selects that return a single scalar value. It does NOT however add any wording that allows the subselect to be yanked and evaluated once rather than evaluated for each row. The standard uses the word "applied" not "evaluated". I interpret this to mean "evaluated" and it appears that Date does, too. On the other hand the view example is giving the proper result in PG 7.2, though only by luck, as Tom pointed out earler. For (given the view "create view foo as select nextval('foo_sequence') as nextval;") select foo.nextval from multiple_rows; isn't actually legal SQL. It must be stated as: select foo.nextval from foo, multiple_rows; (all PG does is add "foo" to the from clause for me if I leave it out). The semantics of this are obvious when you think about it - materialize "foo" then cross-join the resulting table with multiple_rows. Since "foo" returns a single row computed by "nextval('foo_sequence')" obviously the result seen with PG 7.2 is correct. Date is quite clear on the semantics of this and it makes tons of sense since views are meant to be treated like tables. So: 1. If an explicit scalar subselect appears in the target list, it should be executed for every row in the result set. 2. A view referenced in the target list is actually supposed to be materialized in the FROM clause (even if implictly addedto it for you) then joined to the other tables in the query, if any. Meaning it should always be executed onceand only once. The standard doesn't have PG-style rules, of course, but such tables are also should be in the FROMclause, evaluated and joined afterwards IMO. At least that's my reading and I've spent quite a bit of time on this now. Unfortunately PG currently doesn't use the form of the query to decide whether or not to execute the subselect or view once or for each row, but rather does so depending on the estimated cost of each approach. That's the real bug it seems. The form of the query, not the whim of the optimizer, is the determinant. Neither of these cases is likely to arise frequently in practice, so if I ruled Middle Earth I'd decree that: 1. It be filed as a bug 2. It not be assigned a high priority. However it's not merely of academic interest. The semantics of the view example is such that you should be able to force single-evaluation of a function by simply wrapping it in a view, regardless of whether or not it has side-effects. Meanwhile I get to go off and inspect the roughly 750 queries that use this particular style view and determine which ones incorrectly assume that the view's evaluated more than once per query! :) -- Don Baccus Portland, OR http://donb.photo.net, http://birdnotes.net, http://openacs.org