Обсуждение: Extracting superlatives - SQL design philosophy

От:
Dave Crooke
Дата:

This is a generic SQL issue and not PG specific, but I'd like to get
an opinion from this list.

Consider the following data:

# \d bar
                Table "public.bar"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 city   | character varying(255)      |
 temp   | integer                     |
 date   | timestamp without time zone |

# select * from bar order by city, date;
   city    | temp |        date
-----------+------+---------------------
 Austin    |   75 | 2010-02-21 15:00:00
 Austin    |   35 | 2010-02-23 15:00:00
 Edinburgh |   42 | 2010-02-23 15:00:00
 New York  |   56 | 2010-02-23 15:00:00
 New York  |   78 | 2010-06-23 15:00:00
(5 rows)

If you want the highest recorded temperature for a city, that's easy
to do, since the selection criteria works on the same column that we
are extracing:

# select city, max(temp) from bar group by city order by 1;
   city    | max
-----------+-----
 Austin    |  75
 Edinburgh |  42
 New York  |  78
(3 rows)


However there is (AFAIK) no simple way in plain SQL to write a query
that performs such an aggregation where the aggregation criteria is on
one column and you want to return another, e.g. adding the the *date
of* that highest temperature to the output above, or doing a query to
get the most recent temperature reading for each city.

What I'd like to do is something like the below (and I'm inventing
mock syntax here, the following is not valid SQL):

-- Ugly implicit syntax but no worse than an Oracle outer join ;-)
select city, temp, date from bar where date=max(date) group by city,
temp order by city;

or perhaps

-- More explicit
select aggregate_using(max(date), city, temp, date) from bar group by
city, temp order by city;

Both of the above, if they existed, would be a single data access
followed by and sort-merge.

The only way I know how to do it involves doing two accesses to the data, e.g.

# select city, temp, date from bar a where date=(select max(b.date)
from bar b where a.city=b.city) order by 1;
   city    | temp |        date
-----------+------+---------------------
 Austin    |   35 | 2010-02-23 15:00:00
 Edinburgh |   42 | 2010-02-23 15:00:00
 New York  |   78 | 2010-06-23 15:00:00
(3 rows)


# explain select * from bar a where date=(select max(b.date) from bar
b where a.city=b.city) order by 1;
                                QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=1658.86..1658.87 rows=1 width=528)
   Sort Key: a.city
   ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
         Filter: (date = (subplan))
         SubPlan
           ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
                 ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1
width=8)     -- would be an index lookup in a real scenario
                       Filter: (($0)::text = (city)::text)
(8 rows)

От:
Mose
Дата:

Can you try using window functions?

Something like this:

select distinct
  city,
  first_value(temp) over w as max_temp,
  first_value(date) over w as max_temp_date
 from
  cities
  window w as (partition by city order by temp desc)


- Mose

On Wed, Feb 24, 2010 at 1:31 PM, Dave Crooke <> wrote:
This is a generic SQL issue and not PG specific, but I'd like to get
an opinion from this list.

Consider the following data:

# \d bar
                Table "public.bar"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 city   | character varying(255)      |
 temp   | integer                     |
 date   | timestamp without time zone |

# select * from bar order by city, date;
  city    | temp |        date
-----------+------+---------------------
 Austin    |   75 | 2010-02-21 15:00:00
 Austin    |   35 | 2010-02-23 15:00:00
 Edinburgh |   42 | 2010-02-23 15:00:00
 New York  |   56 | 2010-02-23 15:00:00
 New York  |   78 | 2010-06-23 15:00:00
(5 rows)

If you want the highest recorded temperature for a city, that's easy
to do, since the selection criteria works on the same column that we
are extracing:

# select city, max(temp) from bar group by city order by 1;
  city    | max
-----------+-----
 Austin    |  75
 Edinburgh |  42
 New York  |  78
(3 rows)


However there is (AFAIK) no simple way in plain SQL to write a query
that performs such an aggregation where the aggregation criteria is on
one column and you want to return another, e.g. adding the the *date
of* that highest temperature to the output above, or doing a query to
get the most recent temperature reading for each city.

What I'd like to do is something like the below (and I'm inventing
mock syntax here, the following is not valid SQL):

-- Ugly implicit syntax but no worse than an Oracle outer join ;-)
select city, temp, date from bar where date=max(date) group by city,
temp order by city;

or perhaps

-- More explicit
select aggregate_using(max(date), city, temp, date) from bar group by
city, temp order by city;

Both of the above, if they existed, would be a single data access
followed by and sort-merge.

The only way I know how to do it involves doing two accesses to the data, e.g.

# select city, temp, date from bar a where date=(select max(b.date)
from bar b where a.city=b.city) order by 1;
  city    | temp |        date
-----------+------+---------------------
 Austin    |   35 | 2010-02-23 15:00:00
 Edinburgh |   42 | 2010-02-23 15:00:00
 New York  |   78 | 2010-06-23 15:00:00
(3 rows)


# explain select * from bar a where date=(select max(b.date) from bar
b where a.city=b.city) order by 1;
                               QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=1658.86..1658.87 rows=1 width=528)
  Sort Key: a.city
  ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
        Filter: (date = (subplan))
        SubPlan
          ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
                ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1
width=8)     -- would be an index lookup in a real scenario
                      Filter: (($0)::text = (city)::text)
(8 rows)

--
Sent via pgsql-performance mailing list ()
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

От:
"Garrett Murphy"
Дата:

This looks to be a perfect use for SELECT DISTINCT ON:

SELECT DISTINCT ON (city)
* FROM bar
ORDER BY city, temp desc

Or am I misunderstanding the issue?

Garrett Murphy

-----Original Message-----
From:  [mailto:] On Behalf Of Dave Crooke
Sent: Wednesday, February 24, 2010 2:31 PM
To: pgsql-performance
Subject: [PERFORM] Extracting superlatives - SQL design philosophy

This is a generic SQL issue and not PG specific, but I'd like to get
an opinion from this list.

Consider the following data:

# \d bar
                Table "public.bar"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 city   | character varying(255)      |
 temp   | integer                     |
 date   | timestamp without time zone |

# select * from bar order by city, date;
   city    | temp |        date
-----------+------+---------------------
 Austin    |   75 | 2010-02-21 15:00:00
 Austin    |   35 | 2010-02-23 15:00:00
 Edinburgh |   42 | 2010-02-23 15:00:00
 New York  |   56 | 2010-02-23 15:00:00
 New York  |   78 | 2010-06-23 15:00:00
(5 rows)

If you want the highest recorded temperature for a city, that's easy
to do, since the selection criteria works on the same column that we
are extracing:

# select city, max(temp) from bar group by city order by 1;
   city    | max
-----------+-----
 Austin    |  75
 Edinburgh |  42
 New York  |  78
(3 rows)


However there is (AFAIK) no simple way in plain SQL to write a query
that performs such an aggregation where the aggregation criteria is on
one column and you want to return another, e.g. adding the the *date
of* that highest temperature to the output above, or doing a query to
get the most recent temperature reading for each city.

What I'd like to do is something like the below (and I'm inventing
mock syntax here, the following is not valid SQL):

-- Ugly implicit syntax but no worse than an Oracle outer join ;-)
select city, temp, date from bar where date=max(date) group by city,
temp order by city;

or perhaps

-- More explicit
select aggregate_using(max(date), city, temp, date) from bar group by
city, temp order by city;

Both of the above, if they existed, would be a single data access
followed by and sort-merge.

The only way I know how to do it involves doing two accesses to the data, e.g.

# select city, temp, date from bar a where date=(select max(b.date)
from bar b where a.city=b.city) order by 1;
   city    | temp |        date
-----------+------+---------------------
 Austin    |   35 | 2010-02-23 15:00:00
 Edinburgh |   42 | 2010-02-23 15:00:00
 New York  |   78 | 2010-06-23 15:00:00
(3 rows)


# explain select * from bar a where date=(select max(b.date) from bar
b where a.city=b.city) order by 1;
                                QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=1658.86..1658.87 rows=1 width=528)
   Sort Key: a.city
   ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
         Filter: (date = (subplan))
         SubPlan
           ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
                 ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1
width=8)     -- would be an index lookup in a real scenario
                       Filter: (($0)::text = (city)::text)
(8 rows)

--
Sent via pgsql-performance mailing list ()
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

От:
"George Sexton"
Дата:

You could do:

select
    B.City,
    MaxCityTemp.Temp,
    min(B.Date) as FirstMaxDate
from bar b
    INNER JOIN (select city,max(temp) as Temp from Bar group by City) as
MaxCityTemp
    ON B.City=MaxCityTemp.City
Group by
    B.City,
    MaxCityTemp.Temp

George Sexton
MH Software, Inc.
http://www.mhsoftware.com/
Voice: 303 438 9585


> -----Original Message-----
> From:  [mailto:pgsql-performance-
> ] On Behalf Of Dave Crooke
> Sent: Wednesday, February 24, 2010 2:31 PM
> To: pgsql-performance
> Subject: [PERFORM] Extracting superlatives - SQL design philosophy
>
> This is a generic SQL issue and not PG specific, but I'd like to get
> an opinion from this list.
>
> Consider the following data:
>
> # \d bar
>                 Table "public.bar"
>  Column |            Type             | Modifiers
> --------+-----------------------------+-----------
>  city   | character varying(255)      |
>  temp   | integer                     |
>  date   | timestamp without time zone |
>
> # select * from bar order by city, date;
>    city    | temp |        date
> -----------+------+---------------------
>  Austin    |   75 | 2010-02-21 15:00:00
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   56 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (5 rows)
>
> If you want the highest recorded temperature for a city, that's easy
> to do, since the selection criteria works on the same column that we
> are extracing:
>
> # select city, max(temp) from bar group by city order by 1;
>    city    | max
> -----------+-----
>  Austin    |  75
>  Edinburgh |  42
>  New York  |  78
> (3 rows)
>
>
> However there is (AFAIK) no simple way in plain SQL to write a query
> that performs such an aggregation where the aggregation criteria is on
> one column and you want to return another, e.g. adding the the *date
> of* that highest temperature to the output above, or doing a query to
> get the most recent temperature reading for each city.
>
> What I'd like to do is something like the below (and I'm inventing
> mock syntax here, the following is not valid SQL):
>
> -- Ugly implicit syntax but no worse than an Oracle outer join ;-)
> select city, temp, date from bar where date=max(date) group by city,
> temp order by city;
>
> or perhaps
>
> -- More explicit
> select aggregate_using(max(date), city, temp, date) from bar group by
> city, temp order by city;
>
> Both of the above, if they existed, would be a single data access
> followed by and sort-merge.
>
> The only way I know how to do it involves doing two accesses to the
> data, e.g.
>
> # select city, temp, date from bar a where date=(select max(b.date)
> from bar b where a.city=b.city) order by 1;
>    city    | temp |        date
> -----------+------+---------------------
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (3 rows)
>
>
> # explain select * from bar a where date=(select max(b.date) from bar
> b where a.city=b.city) order by 1;
>                                 QUERY PLAN
> -----------------------------------------------------------------------
> ---
>  Sort  (cost=1658.86..1658.87 rows=1 width=528)
>    Sort Key: a.city
>    ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
>          Filter: (date = (subplan))
>          SubPlan
>            ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
>                  ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1
> width=8)     -- would be an index lookup in a real scenario
>                        Filter: (($0)::text = (city)::text)
> (8 rows)
>
> --
> Sent via pgsql-performance mailing list (pgsql-
> )
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance



От:
"George Sexton"
Дата:

I missed something:

select
    B.City,
    MaxCityTemp.Temp,
    min(B.Date) as FirstMaxDate
from bar b
    INNER JOIN (select city,max(temp) as Temp from Bar group by City) as
MaxCityTemp
    ON B.City=MaxCityTemp.City AND B.Temp=MaxCityTemp.Temp
Group by
    B.City,
    MaxCityTemp.Temp

George Sexton
MH Software, Inc.
http://www.mhsoftware.com/
Voice: 303 438 9585


> -----Original Message-----
> From:  [mailto:pgsql-performance-
> ] On Behalf Of George Sexton
> Sent: Wednesday, February 24, 2010 2:58 PM
> To: 
> Subject: Re: [PERFORM] Extracting superlatives - SQL design philosophy
>
> You could do:
>
> select
>     B.City,
>     MaxCityTemp.Temp,
>     min(B.Date) as FirstMaxDate
> from bar b
>     INNER JOIN (select city,max(temp) as Temp from Bar group by City)
> as
> MaxCityTemp
>     ON B.City=MaxCityTemp.City
> Group by
>     B.City,
>     MaxCityTemp.Temp
>
> George Sexton
> MH Software, Inc.
> http://www.mhsoftware.com/
> Voice: 303 438 9585
>
>
> > -----Original Message-----
> > From:  [mailto:pgsql-
> performance-
> > ] On Behalf Of Dave Crooke
> > Sent: Wednesday, February 24, 2010 2:31 PM
> > To: pgsql-performance
> > Subject: [PERFORM] Extracting superlatives - SQL design philosophy
> >
> > This is a generic SQL issue and not PG specific, but I'd like to get
> > an opinion from this list.
> >
> > Consider the following data:
> >
> > # \d bar
> >                 Table "public.bar"
> >  Column |            Type             | Modifiers
> > --------+-----------------------------+-----------
> >  city   | character varying(255)      |
> >  temp   | integer                     |
> >  date   | timestamp without time zone |
> >
> > # select * from bar order by city, date;
> >    city    | temp |        date
> > -----------+------+---------------------
> >  Austin    |   75 | 2010-02-21 15:00:00
> >  Austin    |   35 | 2010-02-23 15:00:00
> >  Edinburgh |   42 | 2010-02-23 15:00:00
> >  New York  |   56 | 2010-02-23 15:00:00
> >  New York  |   78 | 2010-06-23 15:00:00
> > (5 rows)
> >
> > If you want the highest recorded temperature for a city, that's easy
> > to do, since the selection criteria works on the same column that we
> > are extracing:
> >
> > # select city, max(temp) from bar group by city order by 1;
> >    city    | max
> > -----------+-----
> >  Austin    |  75
> >  Edinburgh |  42
> >  New York  |  78
> > (3 rows)
> >
> >
> > However there is (AFAIK) no simple way in plain SQL to write a query
> > that performs such an aggregation where the aggregation criteria is
> on
> > one column and you want to return another, e.g. adding the the *date
> > of* that highest temperature to the output above, or doing a query to
> > get the most recent temperature reading for each city.
> >
> > What I'd like to do is something like the below (and I'm inventing
> > mock syntax here, the following is not valid SQL):
> >
> > -- Ugly implicit syntax but no worse than an Oracle outer join ;-)
> > select city, temp, date from bar where date=max(date) group by city,
> > temp order by city;
> >
> > or perhaps
> >
> > -- More explicit
> > select aggregate_using(max(date), city, temp, date) from bar group by
> > city, temp order by city;
> >
> > Both of the above, if they existed, would be a single data access
> > followed by and sort-merge.
> >
> > The only way I know how to do it involves doing two accesses to the
> > data, e.g.
> >
> > # select city, temp, date from bar a where date=(select max(b.date)
> > from bar b where a.city=b.city) order by 1;
> >    city    | temp |        date
> > -----------+------+---------------------
> >  Austin    |   35 | 2010-02-23 15:00:00
> >  Edinburgh |   42 | 2010-02-23 15:00:00
> >  New York  |   78 | 2010-06-23 15:00:00
> > (3 rows)
> >
> >
> > # explain select * from bar a where date=(select max(b.date) from bar
> > b where a.city=b.city) order by 1;
> >                                 QUERY PLAN
> > ---------------------------------------------------------------------
> --
> > ---
> >  Sort  (cost=1658.86..1658.87 rows=1 width=528)
> >    Sort Key: a.city
> >    ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
> >          Filter: (date = (subplan))
> >          SubPlan
> >            ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
> >                  ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1
> > width=8)     -- would be an index lookup in a real scenario
> >                        Filter: (($0)::text = (city)::text)
> > (8 rows)
> >
> > --
> > Sent via pgsql-performance mailing list (pgsql-
> > )
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-performance
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-
> )
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance



От:
Dave Crooke
Дата:

Garrett's is the best answer from the list .... the only fly in the
ointment here is that it performs a full sort of the records, which
isn't strictly necessary to the required output. This is functionally
equivalent to what I came up with for a MODE (most common value)
aggregation, but the syntax is a bit neater.

Craig / Geroge - there are lots of ways to do this with a subquery or
join back against the bar table. My goal was actually to avoid this
dual lookup and join, not for this contrived example but for my
real-world case where the "bar" table is not actually a table but is a
record set generated on the fly with a non-trivial subquery which I
don't want to repeat.

Mose - I think I get what you're aiming at here, but your query as
stated returns a syntax error.

What I'd like to have is a way in SQL to get an execution plan which
matches Java algorithm below, which just does one "table scan" down
the data and aggregates in place, i.e. it's O(n) with row lookups:

   HashMap<String,Row> winners = new HashMap<String,Row>();

   for (Row r : rows) {
        Row oldrow = winners.get(r.city);
        if (oldrow == null || r.temp > oldrow.temp) winnders.put(r.city, r);
    };
    for (String city : winners.keySet()) System.out.println(winners.get(city));

I'd imagine it would be possible to have a query planner optimization
that would convert Garrett's DISTINCT ON syntax to do what I was
trying to, by realizing that DISTINCT ON X ... ORDER BY Y DESC is
going to return the the one row for each X which has the highest value
of Y, and so use a MAX-structured accumulation instead of a sort.

Cheers
Dave

On Wed, Feb 24, 2010 at 3:43 PM, Garrett Murphy <> wrote:
> This looks to be a perfect use for SELECT DISTINCT ON:
>
> SELECT DISTINCT ON (city)
> * FROM bar
> ORDER BY city, temp desc
>
> Or am I misunderstanding the issue?
>
> Garrett Murphy
>
> -----Original Message-----
> From:  [mailto:] On Behalf Of Dave Crooke
> Sent: Wednesday, February 24, 2010 2:31 PM
> To: pgsql-performance
> Subject: [PERFORM] Extracting superlatives - SQL design philosophy
>
> This is a generic SQL issue and not PG specific, but I'd like to get
> an opinion from this list.
>
> Consider the following data:
>
> # \d bar
>                 Table "public.bar"
>  Column |            Type             | Modifiers
> --------+-----------------------------+-----------
>  city   | character varying(255)      |
>  temp   | integer                     |
>  date   | timestamp without time zone |
>
> # select * from bar order by city, date;
>   city    | temp |        date
> -----------+------+---------------------
>  Austin    |   75 | 2010-02-21 15:00:00
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   56 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (5 rows)
>
> If you want the highest recorded temperature for a city, that's easy
> to do, since the selection criteria works on the same column that we
> are extracing:
>
> # select city, max(temp) from bar group by city order by 1;
>   city    | max
> -----------+-----
>  Austin    |  75
>  Edinburgh |  42
>  New York  |  78
> (3 rows)
>
>
> However there is (AFAIK) no simple way in plain SQL to write a query
> that performs such an aggregation where the aggregation criteria is on
> one column and you want to return another, e.g. adding the the *date
> of* that highest temperature to the output above, or doing a query to
> get the most recent temperature reading for each city.
>
> What I'd like to do is something like the below (and I'm inventing
> mock syntax here, the following is not valid SQL):
>
> -- Ugly implicit syntax but no worse than an Oracle outer join ;-)
> select city, temp, date from bar where date=max(date) group by city,
> temp order by city;
>
> or perhaps
>
> -- More explicit
> select aggregate_using(max(date), city, temp, date) from bar group by
> city, temp order by city;
>
> Both of the above, if they existed, would be a single data access
> followed by and sort-merge.
>
> The only way I know how to do it involves doing two accesses to the data, e.g.
>
> # select city, temp, date from bar a where date=(select max(b.date)
> from bar b where a.city=b.city) order by 1;
>   city    | temp |        date
> -----------+------+---------------------
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (3 rows)
>
>
> # explain select * from bar a where date=(select max(b.date) from bar
> b where a.city=b.city) order by 1;
>                                QUERY PLAN
> --------------------------------------------------------------------------
>  Sort  (cost=1658.86..1658.87 rows=1 width=528)
>   Sort Key: a.city
>   ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
>         Filter: (date = (subplan))
>         SubPlan
>           ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
>                 ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1
> width=8)     -- would be an index lookup in a real scenario
>                       Filter: (($0)::text = (city)::text)
> (8 rows)
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

От:
Craig James
Дата:

Dave Crooke wrote:
> This is a generic SQL issue and not PG specific, but I'd like to get
> an opinion from this list.
>
> Consider the following data:
>
> # \d bar
>                 Table "public.bar"
>  Column |            Type             | Modifiers
> --------+-----------------------------+-----------
>  city   | character varying(255)      |
>  temp   | integer                     |
>  date   | timestamp without time zone |
>
> # select * from bar order by city, date;
>    city    | temp |        date
> -----------+------+---------------------
>  Austin    |   75 | 2010-02-21 15:00:00
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   56 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (5 rows)
>
> If you want the highest recorded temperature for a city, that's easy
> to do, since the selection criteria works on the same column that we
> are extracing:
>
> # select city, max(temp) from bar group by city order by 1;
>    city    | max
> -----------+-----
>  Austin    |  75
>  Edinburgh |  42
>  New York  |  78
> (3 rows)
>
>
> However there is (AFAIK) no simple way in plain SQL to write a query
> that performs such an aggregation where the aggregation criteria is on
> one column and you want to return another, e.g. adding the the *date
> of* that highest temperature to the output above, or doing a query to
> get the most recent temperature reading for each city.

If you add a unique-id column to your table that's filled in from a sequence, it becomes easy:

  select city, temp, date from bar where id in
    (select id from bar where ... whatever you like ...);

Craig

От:
Richard Huxton
Дата:

On 24/02/10 22:47, Dave Crooke wrote:
> I'd imagine it would be possible to have a query planner optimization
> that would convert Garrett's DISTINCT ON syntax to do what I was
> trying to, by realizing that DISTINCT ON X ... ORDER BY Y DESC is
> going to return the the one row for each X which has the highest value
> of Y, and so use a MAX-structured accumulation instead of a sort.

Why is there only one row? For city temperatures, that seems unlikely.

In the event of more than one row does your algorithm give repeatable
results?

--
   Richard Huxton
   Archonet Ltd

От:
Dave Crooke
Дата:

1. The city temps table is a toy example, not meant to be realistic :-)

2. Yes, my (Java) algorithm is deterministic ... it will return
exactly one row per city, and that will be the row (or strictly, *a*
row) containing the highest temp. Temp value ties will break in favour
of earlier rows in Guinness Book of Records tradition :-) It's
equivalent to a HashAggregate implementation.


The following two query plans (from my real schema) illustrate the
itch I am trying to scratch .... I want the functionality of the 2nd
one, but with the execution plan structure of the first:

# explain analyse select a, max(b) from perf_raw_2010_02_23 group by a;
                                                              QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=117953.09..117961.07 rows=639 width=8) (actual
time=10861.845..10863.008 rows=1023 loops=1)
   ->  Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39
rows=5276139 width=8) (actual time=0.038..4459.222 rows=5276139
loops=1)
 Total runtime: 10863.856 ms
(3 rows)

Time: 10864.817 ms
# explain analyse select distinct on (a) * from perf_raw_2010_02_23
order by a, b desc ;
                                                                 QUERY
PLAN

---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1059395.04..1085775.73 rows=639 width=28) (actual
time=46011.204..58428.210 rows=1023 loops=1)
   ->  Sort  (cost=1059395.04..1072585.39 rows=5276139 width=28)
(actual time=46011.200..53561.112 rows=5276139 loops=1)
         Sort Key: a, b
         Sort Method:  external merge  Disk: 247584kB
-- actually OS RAM buffers
         ->  Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39
rows=5276139 width=28) (actual time=0.047..6491.036 rows=5276139
loops=1)
 Total runtime: 58516.185 ms
(6 rows)

Time: 58517.233 ms

The only difference between these two is that the second query returns
the whole row. The *ratio* in cost between these two plans increases
in proportion to log(n) of the table size ... at 5.5m rows its
livable, at 500m it's probably not :-!

Cheers
Dave

On Wed, Feb 24, 2010 at 5:12 PM, Richard Huxton <> wrote:
> On 24/02/10 22:47, Dave Crooke wrote:
>>
>> I'd imagine it would be possible to have a query planner optimization
>> that would convert Garrett's DISTINCT ON syntax to do what I was
>> trying to, by realizing that DISTINCT ON X ... ORDER BY Y DESC is
>> going to return the the one row for each X which has the highest value
>> of Y, and so use a MAX-structured accumulation instead of a sort.
>
> Why is there only one row? For city temperatures, that seems unlikely.
>
> In the event of more than one row does your algorithm give repeatable
> results?
>
> --
>  Richard Huxton
>  Archonet Ltd
>

От:
Richard Huxton
Дата:

On 24/02/10 23:37, Dave Crooke wrote:
> 1. The city temps table is a toy example, not meant to be realistic :-)

You knew that and I guessed it, but it's worth stating these things for
people who read the archives a year from now.

> 2. Yes, my (Java) algorithm is deterministic ... it will return
> exactly one row per city, and that will be the row (or strictly, *a*
> row) containing the highest temp. Temp value ties will break in favour
> of earlier rows in Guinness Book of Records tradition :-) It's
> equivalent to a HashAggregate implementation.

But not when you add in other columns (which is what you're trying to do).

> The following two query plans (from my real schema) illustrate the
> itch I am trying to scratch .... I want the functionality of the 2nd
> one, but with the execution plan structure of the first:
>
> # explain analyse select a, max(b) from perf_raw_2010_02_23 group by a;
>                                                                QUERY
> PLAN
>
--------------------------------------------------------------------------------------------------------------------------------------
>   HashAggregate  (cost=117953.09..117961.07 rows=639 width=8) (actual
> time=10861.845..10863.008 rows=1023 loops=1)
>     ->   Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39
> rows=5276139 width=8) (actual time=0.038..4459.222 rows=5276139
> loops=1)
>   Total runtime: 10863.856 ms
> (3 rows)
>
> Time: 10864.817 ms
> # explain analyse select distinct on (a) * from perf_raw_2010_02_23
> order by a, b desc ;

One big bit of the cost difference is going to be the ordering you need
to get a repeatable result.

>                                                                   QUERY
> PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------------
>   Unique  (cost=1059395.04..1085775.73 rows=639 width=28) (actual
> time=46011.204..58428.210 rows=1023 loops=1)
>     ->   Sort  (cost=1059395.04..1072585.39 rows=5276139 width=28)
> (actual time=46011.200..53561.112 rows=5276139 loops=1)
>           Sort Key: a, b
>           Sort Method:  external merge  Disk: 247584kB
> -- actually OS RAM buffers

Even if the sort never actually reaches a physical disk, you should
still see an increase by increasing sort_mem for the duration of the one
query. It's not going to be the magnitude you want, but probably worth
doing.

>           ->   Seq Scan on perf_raw_2010_02_23  (cost=0.00..91572.39
> rows=5276139 width=28) (actual time=0.047..6491.036 rows=5276139
> loops=1)
>   Total runtime: 58516.185 ms
> (6 rows)
>
> Time: 58517.233 ms
>
> The only difference between these two is that the second query returns
> the whole row. The *ratio* in cost between these two plans increases
> in proportion to log(n) of the table size ... at 5.5m rows its
> livable, at 500m it's probably not :-!

If performance on this query is vital to you, and the table doesn't
change after initial population (which I'm guessing is true) then try an
index on (a asc, b desc) and CLUSTER that index. Depending on the ratio
of distinct a:b values that could be what you're after.

--
   Richard Huxton
   Archonet Ltd

От:
"Pierre C"
Дата:

> -- More explicit
> select aggregate_using(max(date), city, temp, date) from bar group by
> city, temp order by city;

select city, max(ROW(temp, date)) from bar group by city;

Does not work (alas) for lack of a default comparison for record type.

Another solution, which works wonders if you've got the list of cities in
a separate table, and an index on (city, temp) is this :

SELECT c.city, (SELECT ROW( t.date, t.temp ) FROM cities_temp t WHERE
t.city=c.city ORDER BY temp DESC LIMIT 1) FROM cities;

This will do a nested loop index scan and it is the fastest way, except if
you have very few rows per city.
The syntax is ugly and you have to extract the stuff from the ROW()
afterwards, though.

Unfortunately, this does not work :

SELECT c.city, (SELECT t.date, t.temp FROM cities_temp t WHERE
t.city=c.city ORDER BY temp DESC LIMIT 1) AS m FROM cities;

because the subselect isn't allowed to return more than 1 column.

Note that you can also get the usually annoying top-N by category to use
the index by doing something like :

SELECT c.city, (SELECT array_agg(date) FROM (SELECT t.date FROM
cities_temp t WHERE t.city=c.city ORDER BY temp DESC LIMIT 5)) AS m FROM
cities;

The results aren't in a very usable form either, but :

CREATE INDEX ti ON annonces( type_id, price ) WHERE price IS NOT NULL;

EXPLAIN ANALYZE SELECT
t.id, (SELECT ROW(a.id, a.price, a.date_annonce)
    FROM annonces a
    WHERE a.type_id = t.id AND price IS NOT NULL
    ORDER BY price DESC LIMIT 1)
 FROM types_bien t;
                                                                   QUERY
PLAN

----------------------------------------------------------------------------------------------------------------------------------------------
  Seq Scan on types_bien t  (cost=0.00..196.09 rows=57 width=4) (actual
time=0.025..0.511 rows=57 loops=1)
    SubPlan 1
      ->  Limit  (cost=0.00..3.41 rows=1 width=16) (actual
time=0.008..0.008 rows=1 loops=57)
            ->  Index Scan Backward using ti on annonces a
(cost=0.00..8845.65 rows=2592 width=16) (actual time=0.007..0.007 rows=1
loops=57)
                  Index Cond: (type_id = $0)
  Total runtime: 0.551 ms

explain analyze
select distinct type_id, first_value(price) over w as max_price
 from annonces where price is not null
window w as (partition by type_id order by price desc);
                                                              QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=30515.41..30626.87 rows=11146 width=10) (actual
time=320.927..320.971 rows=46 loops=1)
    ->  WindowAgg  (cost=27729.14..29958.16 rows=111451 width=10) (actual
time=195.289..282.150 rows=111289 loops=1)
          ->  Sort  (cost=27729.14..28007.76 rows=111451 width=10) (actual
time=195.278..210.762 rows=111289 loops=1)
                Sort Key: type_id, price
                Sort Method:  quicksort  Memory: 8289kB
                ->  Seq Scan on annonces  (cost=0.00..18386.17 rows=111451
width=10) (actual time=0.009..72.589 rows=111289 loops=1)
                      Filter: (price IS NOT NULL)
  Total runtime: 322.382 ms

Here using the index is 600x faster... worth a bit of ugly SQL, you decide.

By disabling seq_scan and bitmapscan, you can corecr this plan :

EXPLAIN ANALYZE SELECT DISTINCT ON (type_id) type_id, date_annonce, price
 FROM annonces WHERE price IS NOT NULL ORDER BY type_id, price LIMIT 40;
                                                                 QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..78757.61 rows=33 width=14) (actual time=0.021..145.509
rows=40 loops=1)
    ->  Unique  (cost=0.00..78757.61 rows=33 width=14) (actual
time=0.021..145.498 rows=40 loops=1)
          ->  Index Scan using ti on annonces  (cost=0.00..78478.99
rows=111451 width=14) (actual time=0.018..132.671 rows=110796 loops=1)
  Total runtime: 145.549 ms

This plan would be very bad (unless the whole table is in RAM) because I
guess the index scan isn't aware of the DISTINCT ON, so it scans all rows
in the index and in the table.








От:
"Julien Theulier"
Дата:

Hi all,

 

Just for your information, and this is not related to PG directly:

Teradata provides a “qualify” syntax which works as a filtering condition on a windowed function result. This is the only DB allowing this direct filtering on windowed functions, from what I know.

So, as an example, the query you ask for becomes very easy on this database:

select

city, temp, date

from bar

qualify row_number() over (partition by city order by temp desc)=1

 

This is very practical indeed (you can mix it also with classical where/having/group by syntaxes).

On postgres, you may get the same result using an inner query (sorry, I can’t test it for now) such as:

select

city, temp, date

from

(select city, temp, date, row_number() over (partition by city order by temp desc) as nr

from bar ) a1

where nr=1

 

Julien Theulier

 

De : [mailto:] De la part de Mose
Envoyé : mercredi 24 février 2010 22:50
À : Dave Crooke
Cc : pgsql-performance
Objet : Re: [PERFORM] Extracting superlatives - SQL design philosophy

 

Can you try using window functions?

 

Something like this:

 

select distinct

  city,

  first_value(temp) over w as max_temp,

  first_value(date) over w as max_temp_date

 from

  cities

  window w as (partition by city order by temp desc)

 

 

- Mose

 

On Wed, Feb 24, 2010 at 1:31 PM, Dave Crooke <> wrote:

This is a generic SQL issue and not PG specific, but I'd like to get
an opinion from this list.

Consider the following data:

# \d bar
                Table "public.bar"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 city   | character varying(255)      |
 temp   | integer                     |
 date   | timestamp without time zone |

# select * from bar order by city, date;
  city    | temp |        date
-----------+------+---------------------
 Austin    |   75 | 2010-02-21 15:00:00
 Austin    |   35 | 2010-02-23 15:00:00
 Edinburgh |   42 | 2010-02-23 15:00:00
 New York  |   56 | 2010-02-23 15:00:00
 New York  |   78 | 2010-06-23 15:00:00
(5 rows)

If you want the highest recorded temperature for a city, that's easy
to do, since the selection criteria works on the same column that we
are extracing:

# select city, max(temp) from bar group by city order by 1;
  city    | max
-----------+-----
 Austin    |  75
 Edinburgh |  42
 New York  |  78
(3 rows)


However there is (AFAIK) no simple way in plain SQL to write a query
that performs such an aggregation where the aggregation criteria is on
one column and you want to return another, e.g. adding the the *date
of* that highest temperature to the output above, or doing a query to
get the most recent temperature reading for each city.

What I'd like to do is something like the below (and I'm inventing
mock syntax here, the following is not valid SQL):

-- Ugly implicit syntax but no worse than an Oracle outer join ;-)
select city, temp, date from bar where date=max(date) group by city,
temp order by city;

or perhaps

-- More explicit
select aggregate_using(max(date), city, temp, date) from bar group by
city, temp order by city;

Both of the above, if they existed, would be a single data access
followed by and sort-merge.

The only way I know how to do it involves doing two accesses to the data, e.g.

# select city, temp, date from bar a where date=(select max(b.date)
from bar b where a.city=b.city) order by 1;
  city    | temp |        date
-----------+------+---------------------
 Austin    |   35 | 2010-02-23 15:00:00
 Edinburgh |   42 | 2010-02-23 15:00:00
 New York  |   78 | 2010-06-23 15:00:00
(3 rows)


# explain select * from bar a where date=(select max(b.date) from bar
b where a.city=b.city) order by 1;
                               QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=1658.86..1658.87 rows=1 width=528)
  Sort Key: a.city
  ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
        Filter: (date = (subplan))
        SubPlan
          ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
                ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1
width=8)     -- would be an index lookup in a real scenario
                      Filter: (($0)::text = (city)::text)
(8 rows)

--
Sent via pgsql-performance mailing list (
)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

 

От:
Tom Lane
Дата:

"Julien Theulier" <> writes:
> Teradata provides a �qualify� syntax which works as a filtering condition on
> a windowed function result. This is the only DB allowing this direct
> filtering on windowed functions, from what I know.

Seems like you could easily translate that into SQL-standard syntax by
adding a level of sub-select:

    select ... from (select *, window_function wf from ...) ss
    where wf=1;

            regards, tom lane

От:
Merlin Moncure
Дата:

On Wed, Feb 24, 2010 at 4:31 PM, Dave Crooke <> wrote:
> This is a generic SQL issue and not PG specific, but I'd like to get
> an opinion from this list.
>
> Consider the following data:
>
> # \d bar
>                 Table "public.bar"
>  Column |            Type             | Modifiers
> --------+-----------------------------+-----------
>  city   | character varying(255)      |
>  temp   | integer                     |
>  date   | timestamp without time zone |
>
> # select * from bar order by city, date;
>   city    | temp |        date
> -----------+------+---------------------
>  Austin    |   75 | 2010-02-21 15:00:00
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   56 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (5 rows)
>
> If you want the highest recorded temperature for a city, that's easy
> to do, since the selection criteria works on the same column that we
> are extracing:
>
> # select city, max(temp) from bar group by city order by 1;
>   city    | max
> -----------+-----
>  Austin    |  75
>  Edinburgh |  42
>  New York  |  78
> (3 rows)
>
>
> However there is (AFAIK) no simple way in plain SQL to write a query
> that performs such an aggregation where the aggregation criteria is on
> one column and you want to return another, e.g. adding the the *date
> of* that highest temperature to the output above, or doing a query to
> get the most recent temperature reading for each city.
>
> What I'd like to do is something like the below (and I'm inventing
> mock syntax here, the following is not valid SQL):
>
> -- Ugly implicit syntax but no worse than an Oracle outer join ;-)
> select city, temp, date from bar where date=max(date) group by city,
> temp order by city;
>
> or perhaps
>
> -- More explicit
> select aggregate_using(max(date), city, temp, date) from bar group by
> city, temp order by city;
>
> Both of the above, if they existed, would be a single data access
> followed by and sort-merge.
>
> The only way I know how to do it involves doing two accesses to the data, e.g.
>
> # select city, temp, date from bar a where date=(select max(b.date)
> from bar b where a.city=b.city) order by 1;
>   city    | temp |        date
> -----------+------+---------------------
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (3 rows)
>
>
> # explain select * from bar a where date=(select max(b.date) from bar
> b where a.city=b.city) order by 1;
>                                QUERY PLAN
> --------------------------------------------------------------------------
>  Sort  (cost=1658.86..1658.87 rows=1 width=528)
>   Sort Key: a.city
>   ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
>         Filter: (date = (subplan))
>         SubPlan
>           ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
>                 ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1
> width=8)     -- would be an index lookup in a real scenario
>                       Filter: (($0)::text = (city)::text)

Another cool way to do this is via a custom aggregate:
create function maxdata(data, data) returns data as
$$
  select case when ($1).date > ($2).date then $1 else $2 end;
$$ language sql;

create aggregate maxdata(data)
(
   sfunc=maxdata,
   stype=data
);

select (d).* from
(
  select maxdata(data) as d from data group by city
);

It does it in a single pass.  Where this approach can pay dividends is
when you have a very complicated 'max'-ing criteria to justify the
verbosity of creating the aggregate.  If you are not doing the whole
table, the self join is often faster.  I'm surprised custom aggregates
aren't used more...they seem very clean and neat to me.

merlin

От:
Dave Crooke
Дата:

Cool trick .... I didn't realise you could do this at the SQL level without a custom max() written in C.


What I ended up doing for my app is just going with straight SQL that generates the "key" tuples with a SELECT DISTINCT, and then has a dependent subquery that does a very small index scan to pull the data for each row (I care somewhat about portability). In order to make this perform, I created a second index on the raw data table that has the columns tupled in the order I need for this rollup, which allows PG to do a fairly efficient index range scan.

I had been trying to avoid using the disk space to carry this 2nd index, since it is only needed for the bulk rollup, and I then reliased I only have to keep it on the current day's partition, and I can drop it once that partition's data has been aggregated (the insert overhead of the index isn't as much of a concern).

Alternatively, I could have lived without the index by sharding the raw data right down to the rollup intervals, which would mean that rollups are effective as a full table scan anyway, but I didn't want to do that as it would make real-time data extration queries slower if they had to go across 10-20 tables.


Thanks everyone for the insights

Cheers
Dave

On Tue, Mar 9, 2010 at 6:46 AM, Merlin Moncure <> wrote:
On Wed, Feb 24, 2010 at 4:31 PM, Dave Crooke <> wrote:
> This is a generic SQL issue and not PG specific, but I'd like to get
> an opinion from this list.
>
> Consider the following data:
>
> # \d bar
>                 Table "public.bar"
>  Column |            Type             | Modifiers
> --------+-----------------------------+-----------
>  city   | character varying(255)      |
>  temp   | integer                     |
>  date   | timestamp without time zone |
>
> # select * from bar order by city, date;
>   city    | temp |        date
> -----------+------+---------------------
>  Austin    |   75 | 2010-02-21 15:00:00
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   56 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (5 rows)
>
> If you want the highest recorded temperature for a city, that's easy
> to do, since the selection criteria works on the same column that we
> are extracing:
>
> # select city, max(temp) from bar group by city order by 1;
>   city    | max
> -----------+-----
>  Austin    |  75
>  Edinburgh |  42
>  New York  |  78
> (3 rows)
>
>
> However there is (AFAIK) no simple way in plain SQL to write a query
> that performs such an aggregation where the aggregation criteria is on
> one column and you want to return another, e.g. adding the the *date
> of* that highest temperature to the output above, or doing a query to
> get the most recent temperature reading for each city.
>
> What I'd like to do is something like the below (and I'm inventing
> mock syntax here, the following is not valid SQL):
>
> -- Ugly implicit syntax but no worse than an Oracle outer join ;-)
> select city, temp, date from bar where date=max(date) group by city,
> temp order by city;
>
> or perhaps
>
> -- More explicit
> select aggregate_using(max(date), city, temp, date) from bar group by
> city, temp order by city;
>
> Both of the above, if they existed, would be a single data access
> followed by and sort-merge.
>
> The only way I know how to do it involves doing two accesses to the data, e.g.
>
> # select city, temp, date from bar a where date=(select max(b.date)
> from bar b where a.city=b.city) order by 1;
>   city    | temp |        date
> -----------+------+---------------------
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (3 rows)
>
>
> # explain select * from bar a where date=(select max(b.date) from bar
> b where a.city=b.city) order by 1;
>                                QUERY PLAN
> --------------------------------------------------------------------------
>  Sort  (cost=1658.86..1658.87 rows=1 width=528)
>   Sort Key: a.city
>   ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
>         Filter: (date = (subplan))
>         SubPlan
>           ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
>                 ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1
> width=8)     -- would be an index lookup in a real scenario
>                       Filter: (($0)::text = (city)::text)

Another cool way to do this is via a custom aggregate:
create function maxdata(data, data) returns data as
$$
 select case when ($1).date > ($2).date then $1 else $2 end;
$$ language sql;

create aggregate maxdata(data)
(
  sfunc=maxdata,
  stype=data
);

select (d).* from
(
 select maxdata(data) as d from data group by city
);

It does it in a single pass.  Where this approach can pay dividends is
when you have a very complicated 'max'-ing criteria to justify the
verbosity of creating the aggregate.  If you are not doing the whole
table, the self join is often faster.  I'm surprised custom aggregates
aren't used more...they seem very clean and neat to me.

merlin