Обсуждение: behavior of GROUP BY with VOLATILE expressions

Поиск
Список
Период
Сортировка

behavior of GROUP BY with VOLATILE expressions

От
Paul George
Дата:
Hackers,

I wanted to surface a discussion in [1] regarding the expected behavior of GROUP BY with VOLATILE expressions. There seems to be a discrepancy between how volatile functions (RANDOM(), also confirmed with TIMEOFDAY()) and subqueries are evaluated in groups. In the examples below, volatile functions do not always appear to be evaluated per-call (evidenced by looking at EXPLAIN or results) whereas scalar subqueries always appear to be independently evaluated.

Based on the docs, "A query using a volatile function will re-evaluate the function at every row where its value is needed," it seems that the handling of subqueries is correct and that each call to RANDOM() should be evaluated (not the current behavior). But, what is correct/anticipated?

[version: PostgreSQL 16.3]

-- ### grouping volatile functions
-- single evaluation of RANDOM()
select random(), random(), random() group by 1;
       random       |       random       |       random      
--------------------+--------------------+--------------------
 0.5156775158087117 | 0.5156775158087117 | 0.5156775158087117
(1 row)

-- two evaluations of RANDOM()
select random(), random(), random() group by 1, 2;
       random        |       random        |       random        
---------------------+---------------------+---------------------
 0.36612763448670793 | 0.23423805164449374 | 0.36612763448670793
(1 row)

-- three evaluations of RANDOM()
select random(), random(), random() group by 1, 2, 3;
       random       |       random       |       random      
--------------------+--------------------+--------------------
 0.2292929455776751 | 0.6613628224046473 | 0.5367692073422399
(1 row)

-- single evaluation of RANDOM()
select random(), random(), random() group by random();
       random       |       random       |       random      
--------------------+--------------------+--------------------
 0.3069805404158834 | 0.3069805404158834 | 0.3069805404158834
(1 row)

-- single evaluation of RANDOM()
select random(), random(), random() group by random(), random();
       random       |       random       |       random      
--------------------+--------------------+--------------------
 0.2860459945718521 | 0.2860459945718521 | 0.2860459945718521
(1 row)

-- single evaluation of RANDOM()
select random(), random(), random() group by random(), random(), random();
       random       |       random       |       random      
--------------------+--------------------+--------------------
 0.3249129391658361 | 0.3249129391658361 | 0.3249129391658361
(1 row)


-- ### grouping scalar subqueries
-- each subquery evaluated
select (select random()), (select random()), (select random()) group by 1;
       random        |       random       |       random      
---------------------+--------------------+--------------------
 0.30149979064538757 | 0.7911979526441186 | 0.5251471322291046
(1 row)

-- each subquery evaluated
select (select random()), (select random()), (select random()) group by (select random());
       random       |       random       |        random        
--------------------+--------------------+----------------------
 0.3411533489925591 | 0.4359004781684166 | 0.018305770511828356
(1 row)


-- ### sample EXPLAINs
-- two evaluations of RANDOM()
explain (verbose, costs off) select random(), random(), random() group by 1, 2;
                  QUERY PLAN                  
----------------------------------------------
 HashAggregate
   Output: (random()), (random()), (random())
   Group Key: random(), random()
   ->  Result
         Output: random(), random()
(5 rows)

-- singe evaluation of RANDOM()
explain (verbose, costs off) select random(), random(), random() group by random(), random();
                  QUERY PLAN                  
----------------------------------------------
 HashAggregate
   Output: (random()), (random()), (random())
   Group Key: random()
   ->  Result
         Output: random()
(5 rows)


-Paul-

Re: behavior of GROUP BY with VOLATILE expressions

От
"David G. Johnston"
Дата:
On Fri, Jul 19, 2024 at 7:20 AM Paul George <p.a.george19@gmail.com> wrote:

I wanted to surface a discussion in [1] regarding the expected behavior of GROUP BY with VOLATILE expressions. There seems to be a discrepancy between how volatile functions (RANDOM(), also confirmed with TIMEOFDAY()) and subqueries are evaluated in groups. In the examples below, volatile functions do not always appear to be evaluated per-call (evidenced by looking at EXPLAIN or results) whereas scalar subqueries always appear to be independently evaluated.

Based on the docs, "A query using a volatile function will re-evaluate the function at every row where its value is needed," it seems that the handling of subqueries is correct and that each call to RANDOM() should be evaluated (not the current behavior). But, what is correct/anticipated?


The observed behaviors are unlikely to change.  Prior discussions can be found regarding this:


David J.

Re: behavior of GROUP BY with VOLATILE expressions

От
Paul George
Дата:
Great, thanks for the links and useful past discussions! I figured I wasn't the first to stumble across this, and it's interesting to see the issue arise with ORDER BY [VOLATILE FUNC] as well.

My question was not so much about changing behavior as it was about understanding what is desired, especially in light of the fact that subqueries behave differently. From my reading of the links you provided, it seems that even the notion of "desired" here is itself dubious and that there is a case for reevaluating RANDOM() everywhere and a case for not doing that. Given this murkiness, is it fair then to say that drawing parallels between how GROUP BY subquery is handled is moot?

-Paul-

On Fri, Jul 19, 2024 at 7:48 AM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Fri, Jul 19, 2024 at 7:20 AM Paul George <p.a.george19@gmail.com> wrote:

I wanted to surface a discussion in [1] regarding the expected behavior of GROUP BY with VOLATILE expressions. There seems to be a discrepancy between how volatile functions (RANDOM(), also confirmed with TIMEOFDAY()) and subqueries are evaluated in groups. In the examples below, volatile functions do not always appear to be evaluated per-call (evidenced by looking at EXPLAIN or results) whereas scalar subqueries always appear to be independently evaluated.

Based on the docs, "A query using a volatile function will re-evaluate the function at every row where its value is needed," it seems that the handling of subqueries is correct and that each call to RANDOM() should be evaluated (not the current behavior). But, what is correct/anticipated?


The observed behaviors are unlikely to change.  Prior discussions can be found regarding this:


David J.

Re: behavior of GROUP BY with VOLATILE expressions

От
"David G. Johnston"
Дата:
On Fri, Jul 19, 2024 at 2:21 PM Paul George <p.a.george19@gmail.com> wrote:
Great, thanks for the links and useful past discussions! I figured I wasn't the first to stumble across this, and it's interesting to see the issue arise with ORDER BY [VOLATILE FUNC] as well.

My question was not so much about changing behavior as it was about understanding what is desired, especially in light of the fact that subqueries behave differently. From my reading of the links you provided, it seems that even the notion of "desired" here is itself dubious and that there is a case for reevaluating RANDOM() everywhere and a case for not doing that. Given this murkiness, is it fair then to say that drawing parallels between how GROUP BY subquery is handled is moot?

Only now just grasping that you are trying to group something that is definitionally random.  That just doesn't make sense to me.  Grouping is for categorical data (loosely defined, something like Invoice# arguably counts as a category if you are looking at invoice details.)

I'll stick with: this whole area, implementation-wise, is going to remain status-quo.  If you've got ideas for documenting it better hopefully a patch goes in at some point.  Mostly that can be done black-box style - inputs and outputs, not code reading.

David J.

Re: behavior of GROUP BY with VOLATILE expressions

От
Paul George
Дата:
David:

>Only now just grasping that you are trying to group something that is definitionally random.  That just doesn't make sense to me.

Oh, sorry for the confusion. Yeah, totally. I didn't mean to draw specific attention to GROUP BY -- as you've pointed out elsewhere this issue also exists with ORDER BY.

To clean this up a bit, it's specifically the comparison of how volatile functions and expressions are evaluated differently here (covered in prior links you've provided),

postgres=# select random(), random() order by random();
      random       |      random      
-------------------+-------------------
 0.956989895473876 | 0.956989895473876
(1 row)

and, here,

postgres=# select (select random()), (select random()) order by (select random());
       random       |       random      
--------------------+--------------------
 0.2872914386383745 | 0.8976525075618966
(1 row)

Regarding documentation, I think those changes would be useful. There's this suggestion

"An expression or subexpression in
the SELECT list that matches an ORDER BY or GROUP BY item is taken to represent the same value that was sorted or grouped by, even when the
(sub)expression is volatile".

and this one,

"A side-effect of this feature is that ORDER BY expressions containing
volatile functions will execute the volatile function only once for the
entire row; thus any column expressions using the same function will reuse
the same function result."

But I don't think either cover the additional, albeit nuanced, case of volatile scalar subqueries.

-Paul-

On Fri, Jul 19, 2024 at 2:28 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Fri, Jul 19, 2024 at 2:21 PM Paul George <p.a.george19@gmail.com> wrote:
Great, thanks for the links and useful past discussions! I figured I wasn't the first to stumble across this, and it's interesting to see the issue arise with ORDER BY [VOLATILE FUNC] as well.

My question was not so much about changing behavior as it was about understanding what is desired, especially in light of the fact that subqueries behave differently. From my reading of the links you provided, it seems that even the notion of "desired" here is itself dubious and that there is a case for reevaluating RANDOM() everywhere and a case for not doing that. Given this murkiness, is it fair then to say that drawing parallels between how GROUP BY subquery is handled is moot?

Only now just grasping that you are trying to group something that is definitionally random.  That just doesn't make sense to me.  Grouping is for categorical data (loosely defined, something like Invoice# arguably counts as a category if you are looking at invoice details.)

I'll stick with: this whole area, implementation-wise, is going to remain status-quo.  If you've got ideas for documenting it better hopefully a patch goes in at some point.  Mostly that can be done black-box style - inputs and outputs, not code reading.

David J.