Обсуждение: Rapidly finding maximal rows

Поиск
Список
Период
Сортировка

Rapidly finding maximal rows

От
James Cranch
Дата:
I have a slow query, based on the problem of finding the set of rows which
are maximal in some sense. I expect that there is either a more intelligent
way to write it, or that one could make some indexes that would speed it
up. I've tried various indexes, and am not getting anywhere.

I'd be grateful for any suggestions. Reasonably full details are below.


DESCRIPTION
===========

I am writing a database which stores scores from mass-entry competitions
("challenges"). Candidates who were the best in their school (schools being
identified by their "centre_number") receive a special certificate in each
competition.

Currently the database has data from six competitions, each entered by
something like 200,000 students from about 2,000 schools. I wish to produce
a view showing the students who had best-in-school performances.

I have made two attempts (both immediately after running VACUUM ANALYZE),
and both are surprisingly slow.


THE QUERIES
===========

I'm interested in running something like

SELECT * FROM best_in_school_methodN WHERE competition_name = 'X' AND
academic_year_beginning = 2010

and the following two variants have been tried:


CREATE VIEW best_in_school_method1 AS
    WITH best_scores(competition_name, academic_year_beginning,
centre_number, total_score) AS
        (SELECT competition_name, academic_year_beginning, centre_number,
MAX(total_score) AS total_score FROM challenge_entries GROUP BY
competition_name, academic_year_beginning, centre_number)
    SELECT competition_name, academic_year_beginning, centre_number,
entry_id, total_score, (true) AS best_in_school FROM challenge_entries
NATURAL JOIN best_scores;

This is EXPLAIN ANALYZEd here:
  http://explain.depesz.com/s/EiS


CREATE VIEW best_in_school_method2 AS
    WITH innertable(competition_name, academic_year_beginning,
centre_number, entry_id, total_score, school_max_score) AS
        (SELECT competition_name, academic_year_beginning, centre_number,
entry_id, total_score, MAX(total_score) OVER (PARTITION BY
competition_name, academic_year_beginning, centre_number) AS
centre_max_score FROM challenge_entries)
    SELECT competition_name, academic_year_beginning, centre_number,
entry_id, total_score, (true) AS best_in_school FROM innertable WHERE
centre_max_score = total_score;

This one is EXPLAIN ANALYZEd here:
  http://explain.depesz.com/s/6Eh


COMMENT
=======

In both cases, unless I've misunderstood, most of the time is taken up by
sorting all the results for that particular competition. It appears to me
that there should be much better ways: the results do not need to be fully
sorted.

If I were such an expert, I wouldn't be asking you all though.

By the way, the choice to SELECT a value of true is so that I can join the
results back into the original table to easily produce a best in school
boolean column.


SCHEMA
======

I should explain the tables(though probably only the last one is
interesting) and the index mentioned by one of the EXPLAINs. They can be
produced by

CREATE TABLE school_years
(
  yearname VARCHAR(5) PRIMARY KEY,
  minimum_usual_age_september1 INTERVAL,
  maximum_usual_age_september1 INTERVAL,
  usual_successor VARCHAR(5) REFERENCES school_years
);

CREATE TABLE challenge_types
(
  competition_name TEXT PRIMARY KEY,
  too_young_yeargroup VARCHAR(5) REFERENCES school_years
);

CREATE TABLE challenges
(
  competition_name TEXT REFERENCES challenge_types,
  academic_year_beginning INTEGER,
  competition_date DATE,
  CONSTRAINT competition_is_in_year CHECK (competition_date BETWEEN (date
(academic_year_beginning || '.001') + interval '9 months') AND (date
(academic_year_beginning || '.001') + interval '21 months')),
  CONSTRAINT one_challenge_per_year UNIQUE
(academic_year_beginning,competition_name),
  PRIMARY KEY (competition_name, academic_year_beginning)
);

CREATE TABLE challenge_entries
(
  entry_id SERIAL,
  competition_name TEXT,
  academic_year_beginning INTEGER,
  given_name TEXT,
  surname TEXT,
  centre_number CHAR(6),
  school_year VARCHAR(5),
  date_of_birth DATE,
  uk_educated BOOLEAN,
  uk_passport BOOLEAN,
  sex SEX,
  total_score INTEGER NOT NULL DEFAULT 0,
  PRIMARY KEY (competition_name,academic_year_beginning,entry_id),
  FOREIGN KEY (school_year) REFERENCES school_years,
  FOREIGN KEY (competition_name,academic_year_beginning) REFERENCES
challenges );

CREATE INDEX challenge_entries_by_competition_centre_number_and_total_score
  ON challenge_entries
  (competition_name,academic_year_beginning,centre_number,total_score DESC);


SOFTWARE AND HARDWARE
=====================

I'm running "PostgreSQL 8.4.8 on x86_64-pc-linux-gnu, compiled by GCC
gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit". It's the standard
installation from Debian stable (Squeeze), and I haven't messed around with
it.

My Linux kernel is 2.6.32-5-amd64.

I have a desktop PC with a Intel Core i7 CPU and 6GB of RAM, and a single
640GB Hitachi HDT72106 disk. My root partition is less than 30% full.



Re: Rapidly finding maximal rows

От
bricklen
Дата:
On Tue, Oct 11, 2011 at 3:16 AM, James Cranch <jdc41@cam.ac.uk> wrote:

>
> This is EXPLAIN ANALYZEd here:
>  http://explain.depesz.com/s/EiS

"Sort Method: external merge Disk: 35712kB"

>
> SOFTWARE AND HARDWARE
> =====================
>
> I'm running "PostgreSQL 8.4.8 on x86_64-pc-linux-gnu, compiled by GCC
> gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit". It's the standard installation
> from Debian stable (Squeeze), and I haven't messed around with it.
>
> My Linux kernel is 2.6.32-5-amd64.
>
> I have a desktop PC with a Intel Core i7 CPU and 6GB of RAM, and a single
> 640GB Hitachi HDT72106 disk. My root partition is less than 30% full.

Try setting work_mem to something larger, like 40MB to do that sort
step in memory, rather than spilling to disk. The usual caveats apply
though, like if you have many users/queries performing sorts or
aggregations, up to that amount of work_mem may be used at each step
potentially resulting in your system running out of memory/OOM etc.

Re: Rapidly finding maximal rows

От
Dave Crooke
Дата:
Hi James


I'm guessing the problem is that the combination of using a view and the way the view is defined with an in-line temporary table is too complex for the planner to introspect into, transform and figure out the equivalent direct query, and so it's creating that entire temporary table every time you evaluate the select.

Our app has some similar queries (get me the most recent row from a data logging table) and these work fine with a simple self-join, like this example (irrelevant columns omitted for discussion)

select t.entity, t.time_stamp, t.data from log_table t
where t.entity=cast('21EC2020-3AEA-1069-A2DD-08002B30309D' as uuid)
and t.time_stamp=
   (select max(time_stamp)
    from log_table u
    where t.entity=u.entity)

given a schema with the obvious indexes ...

create table log_table
   (entity UUID,
    time_stamp TIMESTAMP WITHOUT TIME ZONE,
    data TEXT);

create index log_table_index on log_table (entity, time_stamp);

.. and the plan for the dependent sub-query does the obvious reverse index scan as you'd expect / want.



If you still want / need to have the view, I suspect that getting rid of the temp table definition will fix it ... my effort is below, alternatively you might be able to take your first example and pull out best_scores and define it as a view alos,

CREATE VIEW best_in_school_method3 AS
  SELECT competition_name, academic_year_beginning, centre_number, entry_id, total_score, (true) AS best_in_school FROM challenge_entries ce1
  WHERE total_score =
      (SELECT MAX(total_score) FROM challenge_entries ce2
       WHERE ce1.
competition_name=ce2.competition_name
         AND ce1.
academic_year_beginning=ce2.academic_year_beginning
         AND ce1.
centre_number=ce2.centre_number
      )

If you don't actually need to have the view for other purposes, and just want to solve the original problem (listing certificates to be issued), you can do it as a direct query, e.g.

  SELECT competition_name, academic_year_beginning, centre_number, entry_id, total_score, (true) AS best_in_school FROM challenge_entries ce1
  WHERE total_score =
      (SELECT MAX(total_score) FROM challenge_entries ce2
       WHERE ce1.
competition_name=ce2.competition_name
         AND ce1.
academic_year_beginning=ce2.academic_year_beginning
         AND ce1.
centre_number=ce2.centre_number
      )
    AND competition_name = 'X'
    AND academic_year_beginning = 2010

PostgreSQL also has a proprietary extension SELECT DISTINCT ON which has a much nicer syntax, but unlike the above it will only show one (arbitrarily selected) pupil per school in the event of a tie, which is probably not what you want :-)

Looking at the schema, the constraint one_challenge_per_year is redundant with the primary key.

Cheers
Dave

P.S. Small world ... did my undergrad there, back when @cam.ac.uk email went to an IBM 3084 mainframe and the user ids typically ended in 10 :-)

On Tue, Oct 11, 2011 at 5:16 AM, James Cranch <jdc41@cam.ac.uk> wrote:

I have a slow query, based on the problem of finding the set of rows which are maximal in some sense. I expect that there is either a more intelligent way to write it, or that one could make some indexes that would speed it up. I've tried various indexes, and am not getting anywhere.

I'd be grateful for any suggestions. Reasonably full details are below.


DESCRIPTION
===========

I am writing a database which stores scores from mass-entry competitions ("challenges"). Candidates who were the best in their school (schools being identified by their "centre_number") receive a special certificate in each competition.

Currently the database has data from six competitions, each entered by something like 200,000 students from about 2,000 schools. I wish to produce a view showing the students who had best-in-school performances.

I have made two attempts (both immediately after running VACUUM ANALYZE), and both are surprisingly slow.


THE QUERIES
===========

I'm interested in running something like

SELECT * FROM best_in_school_methodN WHERE competition_name = 'X' AND academic_year_beginning = 2010

and the following two variants have been tried:


CREATE VIEW best_in_school_method1 AS
  WITH best_scores(competition_name, academic_year_beginning, centre_number, total_score) AS
      (SELECT competition_name, academic_year_beginning, centre_number, MAX(total_score) AS total_score FROM challenge_entries GROUP BY competition_name, academic_year_beginning, centre_number)
  SELECT competition_name, academic_year_beginning, centre_number, entry_id, total_score, (true) AS best_in_school FROM challenge_entries NATURAL JOIN best_scores;

This is EXPLAIN ANALYZEd here:
 http://explain.depesz.com/s/EiS


CREATE VIEW best_in_school_method2 AS
  WITH innertable(competition_name, academic_year_beginning, centre_number, entry_id, total_score, school_max_score) AS
      (SELECT competition_name, academic_year_beginning, centre_number, entry_id, total_score, MAX(total_score) OVER (PARTITION BY competition_name, academic_year_beginning, centre_number) AS centre_max_score FROM challenge_entries)
  SELECT competition_name, academic_year_beginning, centre_number, entry_id, total_score, (true) AS best_in_school FROM innertable WHERE centre_max_score = total_score;

This one is EXPLAIN ANALYZEd here:
 http://explain.depesz.com/s/6Eh


COMMENT
=======

In both cases, unless I've misunderstood, most of the time is taken up by sorting all the results for that particular competition. It appears to me that there should be much better ways: the results do not need to be fully sorted.

If I were such an expert, I wouldn't be asking you all though.

By the way, the choice to SELECT a value of true is so that I can join the results back into the original table to easily produce a best in school boolean column.


SCHEMA
======

I should explain the tables(though probably only the last one is interesting) and the index mentioned by one of the EXPLAINs. They can be produced by

CREATE TABLE school_years
(
 yearname VARCHAR(5) PRIMARY KEY,
 minimum_usual_age_september1 INTERVAL,
 maximum_usual_age_september1 INTERVAL,
 usual_successor VARCHAR(5) REFERENCES school_years
);

CREATE TABLE challenge_types
(
 competition_name TEXT PRIMARY KEY,
 too_young_yeargroup VARCHAR(5) REFERENCES school_years
);

CREATE TABLE challenges
(
 competition_name TEXT REFERENCES challenge_types,
 academic_year_beginning INTEGER,
 competition_date DATE,
 CONSTRAINT competition_is_in_year CHECK (competition_date BETWEEN (date (academic_year_beginning || '.001') + interval '9 months') AND (date (academic_year_beginning || '.001') + interval '21 months')),
 CONSTRAINT one_challenge_per_year UNIQUE (academic_year_beginning,competition_name),
 PRIMARY KEY (competition_name, academic_year_beginning)
);

CREATE TABLE challenge_entries
(
 entry_id SERIAL,
 competition_name TEXT,
 academic_year_beginning INTEGER,
 given_name TEXT,
 surname TEXT,
 centre_number CHAR(6),
 school_year VARCHAR(5),
 date_of_birth DATE,
 uk_educated BOOLEAN,
 uk_passport BOOLEAN,
 sex SEX,
 total_score INTEGER NOT NULL DEFAULT 0,
 PRIMARY KEY (competition_name,academic_year_beginning,entry_id),
 FOREIGN KEY (school_year) REFERENCES school_years,
 FOREIGN KEY (competition_name,academic_year_beginning) REFERENCES challenges );

CREATE INDEX challenge_entries_by_competition_centre_number_and_total_score
 ON challenge_entries
 (competition_name,academic_year_beginning,centre_number,total_score DESC);


SOFTWARE AND HARDWARE
=====================

I'm running "PostgreSQL 8.4.8 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit". It's the standard installation from Debian stable (Squeeze), and I haven't messed around with it.

My Linux kernel is 2.6.32-5-amd64.

I have a desktop PC with a Intel Core i7 CPU and 6GB of RAM, and a single 640GB Hitachi HDT72106 disk. My root partition is less than 30% full.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Rapidly finding maximal rows

От
James Cranch
Дата:
Dear Dave,

>CREATE VIEW best_in_school_method3 AS
>  SELECT competition_name, academic_year_beginning, centre_number,
> entry_id, total_score, (true) AS best_in_school FROM challenge_entries
> ce1
>  WHERE total_score =
>      (SELECT MAX(total_score) FROM challenge_entries ce2
>       WHERE ce1.competition_name=ce2.competition_name
>         AND ce1.academic_year_beginning=ce2.academic_year_beginning
>         AND ce1.centre_number=ce2.centre_number
>      )

Thanks! That works much better, as you can see here:

  http://explain.depesz.com/s/Jz1

>If you don't actually need to have the view for other purposes, and just
>want to solve the original problem (listing certificates to be issued), you
>can do it as a direct query, e.g.

I'll keep the view, please.

> PostgreSQL also has a proprietary extension SELECT DISTINCT ON which has
> a much nicer syntax, but unlike the above it will only show one
> (arbitrarily selected) pupil per school in the event of a tie, which is
> probably not what you want :-)

Indeed not, that's disastrous here.

>Looking at the schema, the constraint one_challenge_per_year is redundant
>with the primary key.

Oh, yes, thanks. It's a legacy from an earlier approach.

> P.S. Small world ... did my undergrad there, back when @cam.ac.uk email
> went to an IBM 3084 mainframe and the user ids typically ended in 10 :-)

Heh. The people with only two initials are generating bignums these days: I
know xy777@cam.ac.uk (here x and y are variables representing letters of
the alphabet).

Cheers,

James
\/\/\

Re: Rapidly finding maximal rows

От
James Cranch
Дата:
Dear Bricklen,

>Try setting work_mem to something larger, like 40MB to do that sort
>step in memory, rather than spilling to disk. The usual caveats apply
>though, like if you have many users/queries performing sorts or
>aggregations, up to that amount of work_mem may be used at each step
>potentially resulting in your system running out of memory/OOM etc.

Thanks, I'll bear that in mind as a strategy. It's good to know. But since
Dave has saved me the sort altogether, I'll go with his plan.

Best wishes,

James
\/\/\

--
------------------------------------------------------------
James Cranch                 http://www.srcf.ucam.org/~jdc41



Re: Rapidly finding maximal rows

От
Merlin Moncure
Дата:
On Tue, Oct 11, 2011 at 8:05 PM, Dave Crooke <dcrooke@gmail.com> wrote:
> Hi James
>
>
> I'm guessing the problem is that the combination of using a view and the way
> the view is defined with an in-line temporary table is too complex for the
> planner to introspect into, transform and figure out the equivalent direct
> query, and so it's creating that entire temporary table every time you
> evaluate the select.
>
> Our app has some similar queries (get me the most recent row from a data
> logging table) and these work fine with a simple self-join, like this
> example (irrelevant columns omitted for discussion)
>
> select t.entity, t.time_stamp, t.data from log_table t
> where t.entity=cast('21EC2020-3AEA-1069-A2DD-08002B30309D' as uuid)
> and t.time_stamp=
>    (select max(time_stamp)
>     from log_table u
>     where t.entity=u.entity)
>
> given a schema with the obvious indexes ...
>
> create table log_table
>    (entity UUID,
>     time_stamp TIMESTAMP WITHOUT TIME ZONE,
>     data TEXT);
>
> create index log_table_index on log_table (entity, time_stamp);
>
> .. and the plan for the dependent sub-query does the obvious reverse index
> scan as you'd expect / want.
>
>
>
> If you still want / need to have the view, I suspect that getting rid of the
> temp table definition will fix it ... my effort is below, alternatively you
> might be able to take your first example and pull out best_scores and define
> it as a view alos,
>
> CREATE VIEW best_in_school_method3 AS
>   SELECT competition_name, academic_year_beginning, centre_number, entry_id,
> total_score, (true) AS best_in_school FROM challenge_entries ce1
>   WHERE total_score =
>       (SELECT MAX(total_score) FROM challenge_entries ce2
>        WHERE ce1.competition_name=ce2.competition_name
>          AND ce1.academic_year_beginning=ce2.academic_year_beginning
>          AND ce1.centre_number=ce2.centre_number
>       )

This is a very common problem in SQL and has a lot of interesting solutions.

In queries like this I usually use the 'ORDER BY total_score DESC
LIMIT 1 trick.  Modern postgres is *usually* smart enough to convert
max to that, but not always.

WHERE total_score =
       (SELECT total_score FROM challenge_entries ce2
        WHERE ce1.competition_name=ce2.competition_name
          AND ce1.academic_year_beginning=ce2.academic_year_beginning
          AND ce1.centre_number=ce2.centre_number
          ORDER BY total_score DESC LIMIT 1
       )

Another clever, and more portable way to write it which can sometimes
give a better plan is like this:

WHERE NOT EXISTS
(
  SELECT 1 FROM challenge_entries ce2
    WHERE  ce1.competition_name=ce2.competition_name
          AND ce1.academic_year_beginning=ce2.academic_year_beginning
          AND ce1.centre_number=ce2.centre_number
          AND ce1.total_score > ce2.total_score
)

Yet another interesting way of getting the 'top' record based on a
certain criteria is to write a custom aggregate.  In postgres you can
aggregate over the entire record type, not just plain fields, so that
running your aggregate function looks like this:

SELECT competition_name, academic_year_beginning, centre_number,
max_challenge_entries(ce) FROM challenge_entries ce GROUP BY 1,2,3;

Your function aggregator in SQL would look something like this:

CREATE OR REPLACE FUNCTION
max_challenge_entries_impl(challenge_entries, challenge_entries)
returns challenge_entries  AS
$$
  SELECT CASE WHEN ($2).total_score > ($1).total_score THEN $2 ELSE $1 END;
$$ LANGUAGE SQL IMMUTABLE;

This very STLish approach is rarely the best way to go performance
wise although it can give better worst case plans in some cases
(although total_score if in index can never be used for optimization).
 I mention it because I find it to be very clean conceptually and can
be a great approach if your 'picking' algorithm is sufficiently more
complex than 'field > field' and is also otherwise not optimizable.
Anyways, food for thought.

merlin