Обсуждение: MIN/MAX functions for a record

Поиск
Список
Период
Сортировка

MIN/MAX functions for a record

От
Viliam Ďurina
Дата:
In my queries I often need to do MIN/MAX for tuples, for example:

  SELECT MAX(row(year, month)) 
  FROM (VALUES(2025, 1), (2024,2)) x(year, month);

This query throws:

    ERROR: function max(record) does not exist

In this case you can replace it with `MAX((year||'-'||month||'-1')::date)`. However in my case I have an event table with `event_time` and `text` columns, I'm grouping that table by some key and want to have the text for the newest event. I would do `MAX(ROW(event_time, text)).text`. Workarounds for this are clumsy, e.g. with a subquery with LIMIT 1.

The lack of this feature is kind of unexpected, because the `>` operator or `GREATEST` function are defined for records:

    SELECT 
        GREATEST((2025, 1), (2024, 2)), 
        (2025, 1) > (2024, 2)

Was this ever discussed or is there something preventing the implementation?

Viliam

Re: MIN/MAX functions for a record

От
Aleksander Alekseev
Дата:
Hi,

> In my queries I often need to do MIN/MAX for tuples, for example:
>
>   SELECT MAX(row(year, month))
>   FROM (VALUES(2025, 1), (2024,2)) x(year, month);
>
> This query throws:
>
>     ERROR: function max(record) does not exist
>
> In this case you can replace it with `MAX((year||'-'||month||'-1')::date)`. However in my case I have an event table
with`event_time` and `text` columns, I'm grouping that table by some key and want to have the text for the newest
event.I would do `MAX(ROW(event_time, text)).text`. Workarounds for this are clumsy, e.g. with a subquery with LIMIT 1. 
>
> The lack of this feature is kind of unexpected, because the `>` operator or `GREATEST` function are defined for
records:
>
>     SELECT
>         GREATEST((2025, 1), (2024, 2)),
>         (2025, 1) > (2024, 2)
>
> Was this ever discussed or is there something preventing the implementation?

I believe it would be challenging to implement max(record) that would
work reasonably well in a general case.

What if, for instance, one of the columns is JOSNB or XML? Not to
mention the fact that Postgres supports user-defined types which don't
necessarily have a reasonable order. Take a point in a 2D or 3D space
as an example. On top of that I doubt that the proposed query will
perform well since I don't see how it could benefit from using
indexes. I don't claim that this is necessarily true in your case but
generally one could argue that the wrong schema is used here and
instead of (year, month) pair a table should have a date/timestamp(tz)
column.

Personally I would choose format() function [1] in cases like this in
order to play it safe. Assuming of course that the table is small and
the query is not executed often.

[1]: https://www.postgresql.org/docs/current/functions-string.html

--
Best regards,
Aleksander Alekseev



Re: MIN/MAX functions for a record

От
Tom Lane
Дата:
Aleksander Alekseev <aleksander@timescale.com> writes:
>> In my queries I often need to do MIN/MAX for tuples, for example:
>> SELECT MAX(row(year, month))
>> FROM (VALUES(2025, 1), (2024,2)) x(year, month);
>> This query throws:
>> ERROR: function max(record) does not exist
>> Was this ever discussed or is there something preventing the implementation?

> I believe it would be challenging to implement max(record) that would
> work reasonably well in a general case.

As long as you define it as "works the same way record comparison
does", ie base it on record_cmp(), I don't think it would be much
more than a finger exercise [*].  And why would you want it to act
any differently from record_cmp()?  Those semantics have been
established for a long time.

            regards, tom lane

[*] Although conceivably there are some challenges in getting
record_cmp's caching logic to work in the context of an aggregate.



Re: MIN/MAX functions for a record

От
Viliam Ďurina
Дата:
Exactly Tom, I see no fundamental problem for it not to be implemented, since comparison operator is already implemented. In fact, MIN/MAX should work for all types for which comparison operator is defined.

Regarding index support, there should not be an issue if the index is defined for the record (e.g. `CREATE INDEX ON my_table(ROW(field_a, field_b))`). However such indexes seem not to be supported. Whether a composite index is compatible with a record created on the indexed fields in every edge case I'm not sure...

Alexander, rewriting the year-month example is easy, but how would you rewrite this query?

CREATE TABLE events(event_time TIMESTAMP, message VARCHAR, user_id VARCHAR);

You want a newest message for each user. It's easy with MAX(record):

SELECT user_id, MAX(ROW(event_time, message)).message
FROM events
GROUP BY user_id;

One option is to rewrite to a subquery with LIMIT 1

SELECT user_id, (SELECT message FROM events e2 WHERE e1.user_id=e2.user_id ORDER BY event_time DESC LIMIT 1)
FROM events e1
GROUP BY user_id;

If your query already has multiple levels of grouping, multiple joins, UNIONs etc., it gets much more complex. I also wonder if the optimizer would pick the same plan as it would be if the MAX(record) is supported.

Viliam

On Fri, Mar 22, 2024 at 4:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Aleksander Alekseev <aleksander@timescale.com> writes:
>> In my queries I often need to do MIN/MAX for tuples, for example:
>> SELECT MAX(row(year, month))
>> FROM (VALUES(2025, 1), (2024,2)) x(year, month);
>> This query throws:
>> ERROR: function max(record) does not exist
>> Was this ever discussed or is there something preventing the implementation?

> I believe it would be challenging to implement max(record) that would
> work reasonably well in a general case.

As long as you define it as "works the same way record comparison
does", ie base it on record_cmp(), I don't think it would be much
more than a finger exercise [*].  And why would you want it to act
any differently from record_cmp()?  Those semantics have been
established for a long time.

                        regards, tom lane

[*] Although conceivably there are some challenges in getting
record_cmp's caching logic to work in the context of an aggregate.

Re: MIN/MAX functions for a record

От
Aleksander Alekseev
Дата:
Hi,

> Exactly Tom, I see no fundamental problem for it not to be implemented, since comparison operator is already
implemented.In fact, MIN/MAX should work for all types for which comparison operator is defined.
 

On second thought, this should work reasonably well.

PFA a WIP patch. At this point it implements only MAX(record), no MIN, no tests:

```
=# SELECT MAX(row(year, month)) FROM (VALUES(2025, 1), (2024,2)) x(year, month);
   max
----------
 (2025,1)
```

One thing I'm not 100% sure of is whether record_larger() should make
a copy of its arguments or the current implementation is safe.

-- 
Best regards,
Aleksander Alekseev

Вложения

Re: MIN/MAX functions for a record

От
Tom Lane
Дата:
Aleksander Alekseev <aleksander@timescale.com> writes:
> One thing I'm not 100% sure of is whether record_larger() should make
> a copy of its arguments or the current implementation is safe.

I don't see any copying happening in, say, text_larger or
numeric_larger, so this shouldn't need to either.

Personally I'd write "record_cmp(fcinfo) > 0" rather than indirecting
through record_gt.  The way you have it is not strictly correct anyhow:
you're cheating by not using DirectFunctionCall.

Also, given that you're passing the fcinfo, there's no need
to extract the arguments from it before that call.  So it
seems to me that code like

    if (record_cmp(fcinfo) > 0)
        PG_RETURN_HEAPTUPLEHEADER(PG_GETARG_HEAPTUPLEHEADER(0));
    else
        PG_RETURN_HEAPTUPLEHEADER(PG_GETARG_HEAPTUPLEHEADER(1));

should do, and possibly save one useless detoast step.  Or you could
do

    if (record_cmp(fcinfo) > 0)
        PG_RETURN_DATUM(PG_GETARG_DATUM(0));
    else
        PG_RETURN_DATUM(PG_GETARG_DATUM(1));

because really there's no point in detoasting at all.

            regards, tom lane



Re: MIN/MAX functions for a record

От
Aleksander Alekseev
Дата:
Hi,

> I don't see any copying happening in, say, text_larger or
> numeric_larger, so this shouldn't need to either.
>
> Personally I'd write "record_cmp(fcinfo) > 0" rather than indirecting
> through record_gt.  The way you have it is not strictly correct anyhow:
> you're cheating by not using DirectFunctionCall.
>
> Also, given that you're passing the fcinfo, there's no need
> to extract the arguments from it before that call.  So it
> seems to me that code like
>
>         if (record_cmp(fcinfo) > 0)
>                 PG_RETURN_HEAPTUPLEHEADER(PG_GETARG_HEAPTUPLEHEADER(0));
>         else
>                 PG_RETURN_HEAPTUPLEHEADER(PG_GETARG_HEAPTUPLEHEADER(1));
>
> should do, and possibly save one useless detoast step.  Or you could
> do
>
>         if (record_cmp(fcinfo) > 0)
>                 PG_RETURN_DATUM(PG_GETARG_DATUM(0));
>         else
>                 PG_RETURN_DATUM(PG_GETARG_DATUM(1));
>
> because really there's no point in detoasting at all.

Many thanks. Here is the corrected patch. Now it also includes MIN()
support and tests.

-- 
Best regards,
Aleksander Alekseev

Вложения

Re: MIN/MAX functions for a record

От
Aleksander Alekseev
Дата:
Hi,

> Many thanks. Here is the corrected patch. Now it also includes MIN()
> support and tests.

Michael Paquier (cc:'ed) commented offlist that I forgot to change the
documentation.

Here is the corrected patch.

-- 
Best regards,
Aleksander Alekseev

Вложения

Re: MIN/MAX functions for a record

От
Michael Paquier
Дата:
On Mon, Jul 08, 2024 at 12:20:30PM +0300, Aleksander Alekseev wrote:
> Here is the corrected patch.

313f87a17155 is one example of a similar change with pg_lsn, with four
entries added to pg_proc and two to pg_aggregate.  That's what this
patch is doing from what I can see.

-        and arrays of any of these types.
+        and also arrays and records of any of these types.

This update of the docs is incorrect, no?  Records could include much
more types than the ones currently supported for min()/max().

I am not sure to get the concerns of upthread regarding the type
caching in the context of an aggregate, which is the business with
lookup_type_cache(), especially since there is a btree operator
relying on record_cmp().  Tom, what were your concerns here?
--
Michael

Вложения

Re: MIN/MAX functions for a record

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> I am not sure to get the concerns of upthread regarding the type
> caching in the context of an aggregate, which is the business with
> lookup_type_cache(), especially since there is a btree operator
> relying on record_cmp().  Tom, what were your concerns here?

Don't recall right at this instant, but I've put myself down as
reviewer of this patch to remind me to look at it in the next
day or two.

            regards, tom lane



Re: MIN/MAX functions for a record

От
Michael Paquier
Дата:
On Mon, Jul 08, 2024 at 08:54:31PM -0400, Tom Lane wrote:
> Don't recall right at this instant, but I've put myself down as
> reviewer of this patch to remind me to look at it in the next
> day or two.

Thanks for the update.  WFM.
--
Michael

Вложения

Re: MIN/MAX functions for a record

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Jul 08, 2024 at 12:20:30PM +0300, Aleksander Alekseev wrote:
> -        and arrays of any of these types.
> +        and also arrays and records of any of these types.

> This update of the docs is incorrect, no?  Records could include much
> more types than the ones currently supported for min()/max().

Yeah, actually the contained data types could be anything with
btree sort support.  This is true for arrays too, so the text
was wrong already.  I changed it to

+        and also arrays and composite types containing sortable data types.

(Using "composite type" not "record" is a judgment call here, but
I don't see anyplace else in func.sgml preferring "record" for this
meaning.)

> I am not sure to get the concerns of upthread regarding the type
> caching in the context of an aggregate, which is the business with
> lookup_type_cache(), especially since there is a btree operator
> relying on record_cmp().  Tom, what were your concerns here?

Re-reading, I was just mentioning that as something to check,
not a major problem.  It isn't, because array min/max are already
relying on the ability to use fcinfo->flinfo->fn_extra as cache space
in an aggregate.  (Indeed, the array aggregate code is almost
identical to where we ended up.)

AFAICS this is good to go.  I made a couple of tiny cosmetic
adjustments, added a catversion bump, and pushed.

            regards, tom lane