Discussion: query optimization scenarios 17,701 times faster!!!
Hi everyone - this is my first post (but I have been lurking on and off for a couple of years). Congratulations on a steadily improving product!

This is not a question, just some observations on performance that I thought might trigger someone's thinking on ways to improve query optimization. The following is a list of query pairs (one fast, one slow) that must produce identical results by definition (and do), but have very different execution times - especially the last example. I did some development work on a commercial SQL database product 7 years ago (names will not be used), so although I am clueless about PostgreSQL internals, I (think) I have a grip on some of the query optimization issues (though not necessarily a grip on life).

The data set used for all of these queries was very small - most tables had a few hundred records or less. No, it's not very scientific, but I believe it's illustrative nonetheless. I'll just make a couple of observations on the last query and leave everyone else to reach their own conclusions.

1) The two versions of the last query must produce identical results by definition (and they do).
2) It appears that the optimizer is doing all of the join work before ever considering the impact of WHERE clause restrictions. (This may not be the case, but it appears so.)
3) It could have said: hey, I have a WHERE clause restriction on the primary key that is equal to a fixed value. So I have a single row from that table to deal with, all of the columns come from that table too, and furthermore it's left joined to the rest of the crap, so I can safely ignore it.

I always think it's illustrative to look at extreme examples like this to point out optimizations that may be overlooked. If I ever get some free time, I look forward to contributing to this wonderful project!

NOTES:
1) I didn't include the schema to keep this post reasonable. Send email to rdyas@adelphia.net if you want the schema to look into this further.
2) The primary keys for the following tables are: org_milestones.id, tasks.task_id, contacts.contact_id, organizations.org_id.

EXPLAIN ANALYZE SELECT DISTINCT org_milestones.completed_on, org_milestones.id,
    org_milestones.milestone_id, org_milestones.notes, org_milestones.org_id
FROM org_milestones
    RIGHT OUTER JOIN organizations ON (org_milestones.org_id = organizations.org_id)
    LEFT OUTER JOIN contacts ON (contacts.org_id = organizations.org_id)
    LEFT OUTER JOIN tasks ON (tasks.org_id = organizations.org_id)
WHERE (organizations.org_id = 71)
ORDER BY org_milestones.completed_on, org_milestones.id

79.20 msec

EXPLAIN ANALYZE SELECT org_milestones.completed_on, org_milestones.id,
    org_milestones.milestone_id, org_milestones.notes, org_milestones.org_id
FROM org_milestones
    RIGHT OUTER JOIN organizations ON (org_milestones.org_id = organizations.org_id)
WHERE (organizations.org_id = 71)
ORDER BY org_milestones.completed_on, org_milestones.id

6.44 msec = 12 times faster

---------------

EXPLAIN ANALYZE SELECT DISTINCT tasks.completed, tasks.contact_id, tasks.created_by,
    tasks.notes, tasks.objective, tasks.org_id, tasks.outcome, tasks.priority,
    tasks.start_date, tasks.start_time, tasks.task_id, tasks.task_type
FROM tasks
    RIGHT OUTER JOIN organizations ON (tasks.org_id = organizations.org_id)
    LEFT OUTER JOIN org_milestones ON (org_milestones.org_id = organizations.org_id)
    LEFT OUTER JOIN contacts ON (contacts.org_id = organizations.org_id)
WHERE (organizations.org_id = 71)
ORDER BY tasks.start_date, tasks.start_time

2,548.71 msec

EXPLAIN ANALYZE SELECT tasks.completed, tasks.contact_id, tasks.created_by,
    tasks.notes, tasks.objective, tasks.org_id, tasks.outcome, tasks.priority,
    tasks.start_date, tasks.start_time, tasks.task_id, tasks.task_type
FROM tasks
    RIGHT OUTER JOIN organizations ON (tasks.org_id = organizations.org_id)
WHERE (organizations.org_id = 71)
ORDER BY tasks.start_date, tasks.start_time

29.21 msec = 87 times faster

-----------

EXPLAIN ANALYZE SELECT DISTINCT contacts.address_1, contacts.address_2,
    contacts.assistant_email, contacts.assistant_name, contacts.assistant_phone,
    contacts.city, contacts.contact_id, contacts.email_address, contacts.first_name,
    contacts.functional_role, contacts.last_name, contacts.needs, contacts.notes,
    contacts.org_id, contacts.pain, contacts.phone, contacts.reasons,
    contacts.reports_to, contacts.state, contacts.title, contacts.zip_code,
    (contacts.first_name || ' ' || contacts.last_name) AS full_name
FROM contacts
    RIGHT OUTER JOIN organizations ON (contacts.org_id = organizations.org_id)
    LEFT OUTER JOIN org_milestones ON (org_milestones.org_id = organizations.org_id)
    LEFT OUTER JOIN tasks ON (tasks.org_id = organizations.org_id)
WHERE (organizations.org_id = 71)

2056.83 msec

EXPLAIN ANALYZE SELECT DISTINCT contacts.address_1, contacts.address_2,
    contacts.assistant_email, contacts.assistant_name, contacts.assistant_phone,
    contacts.city, contacts.contact_id, contacts.email_address, contacts.first_name,
    contacts.functional_role, contacts.last_name, contacts.needs, contacts.notes,
    contacts.org_id, contacts.pain, contacts.phone, contacts.reasons,
    contacts.reports_to, contacts.state, contacts.title, contacts.zip_code,
    (contacts.first_name || ' ' || contacts.last_name) AS full_name
FROM contacts
    RIGHT OUTER JOIN organizations ON (contacts.org_id = organizations.org_id)
WHERE (organizations.org_id = 71)

27.41 msec = 75 times faster

-----------------

EXPLAIN ANALYZE SELECT DISTINCT organizations.city, organizations.inactive,
    organizations.java_developers, organizations.name, organizations.org_id,
    organizations.overview, organizations.phone, organizations.salesperson,
    organizations.state, organizations.time_zone
FROM organizations
    LEFT OUTER JOIN org_milestones ON (org_milestones.org_id = organizations.org_id)
    LEFT OUTER JOIN contacts ON (contacts.org_id = organizations.org_id)
    LEFT OUTER JOIN tasks ON (tasks.org_id = organizations.org_id)
WHERE (organizations.org_id = 71)
ORDER BY organizations.name

12,567.87 msec

EXPLAIN ANALYZE SELECT organizations.city, organizations.inactive,
    organizations.java_developers, organizations.name, organizations.org_id,
    organizations.overview, organizations.phone, organizations.salesperson,
    organizations.state, organizations.time_zone
FROM organizations
WHERE (organizations.org_id = 71)

0.71 msec = 17,701 times faster

-----------------------
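As a sanity check on the claim that the two versions of the last query pair must return identical results, here is a toy reproduction of that pair (my own construction - SQLite via Python rather than PostgreSQL, with a made-up mini-schema; the table and column names merely echo the ones above):

```python
import sqlite3

# Toy reproduction of the thread's last query pair: selecting only
# organizations columns, with the other tables LEFT JOINed and a DISTINCT
# on top, versus no joins at all.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE organizations (org_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE org_milestones (id INTEGER PRIMARY KEY, org_id INTEGER);
CREATE TABLE contacts (contact_id INTEGER PRIMARY KEY, org_id INTEGER);
CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, org_id INTEGER);
INSERT INTO organizations VALUES (71, 'Acme');
INSERT INTO org_milestones (org_id) VALUES (71), (71), (71);
INSERT INTO contacts (org_id) VALUES (71), (71);
INSERT INTO tasks (org_id) VALUES (71), (71), (71), (71);
""")

slow = cur.execute("""
    SELECT DISTINCT organizations.org_id, organizations.name
    FROM organizations
    LEFT OUTER JOIN org_milestones ON org_milestones.org_id = organizations.org_id
    LEFT OUTER JOIN contacts ON contacts.org_id = organizations.org_id
    LEFT OUTER JOIN tasks ON tasks.org_id = organizations.org_id
    WHERE organizations.org_id = 71
""").fetchall()

fast = cur.execute("""
    SELECT org_id, name FROM organizations WHERE org_id = 71
""").fetchall()

print(slow == fast)  # → True
```

Because only organizations columns are selected, org_id = 71 pins down a single primary-key row, and the DISTINCT collapses the 3 × 2 × 4 = 24 joined rows back to one, the three LEFT JOINed tables cannot affect the result - which is exactly the equivalence being pointed out above.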
On Wed, 23 Apr 2003, Robert Dyas wrote:
> 2) it appears that the optimizer is doing all of the join work before ever
> considering the impact of where clause restrictions. (this may not be the
> case, but it appears so)

What version of Postgres do you use? I think the current CVS should handle the joins better than the currently released versions. I don't use the CVS version myself, but I think I've seen people say that it optimizes joins better. If you could try the same examples on the CVS version, it would be fun to see some numbers.

--
/Dennis
"Robert Dyas" <rdyas@adelphia.net> writes: > The following is a list of query pairs (one fast, one slow) that must > produce identical results by definition (and do), but have very different > execution times. With no details about the table schemas, nor the EXPLAIN ANALYZE output data, I really wonder how you expect any intelligent comments. We are programmers, not mind-readers. regards, tom lane
OK, so you lack mind-reading skills ;-) I wasn't sure if anybody would be interested in exploring this, but here's the full monty (full EXPLAIN ANALYZE output + entire schema). I can't provide the actual data, sorry, but each table has at most a couple of hundred records. The numbers are actually a little worse this time (33,623 times faster!!!) because there is slightly more data in the tables. Performance degrades very rapidly as table size increases.

EXPLAIN ANALYZE SELECT DISTINCT organizations.city, organizations.inactive,
    organizations.java_developers, organizations.name, organizations.org_id,
    organizations.overview, organizations.phone, organizations.salesperson,
    organizations.state, organizations.time_zone
FROM organizations
    LEFT OUTER JOIN org_milestones ON (org_milestones.org_id = organizations.org_id)
    LEFT OUTER JOIN contacts ON (contacts.org_id = organizations.org_id)
    LEFT OUTER JOIN tasks ON (tasks.org_id = organizations.org_id)
WHERE (organizations.org_id = 71)
ORDER BY organizations.name;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Unique  (cost=53.32..54.08 rows=3 width=884) (actual time=19200.24..24870.87 rows=1 loops=1)
   ->  Sort  (cost=53.32..53.39 rows=27 width=884) (actual time=19200.19..19315.98 rows=840 loops=1)
         Sort Key: organizations.name, organizations.city, organizations.inactive, organizations.java_developers, organizations.org_id, organizations.overview, organizations.phone, organizations.salesperson, organizations.state, organizations.time_zone
         ->  Hash Join  (cost=43.68..52.67 rows=27 width=884) (actual time=18.42..170.69 rows=840 loops=1)
               Hash Cond: ("outer".org_id = "inner".org_id)
               ->  Hash Join  (cost=12.38..20.75 rows=5 width=880) (actual time=7.18..19.42 rows=42 loops=1)
                     Hash Cond: ("outer".org_id = "inner".org_id)
                     ->  Nested Loop  (cost=0.00..8.24 rows=2 width=876) (actual time=1.03..6.16 rows=7 loops=1)
                           Join Filter: ("inner".org_id = "outer".org_id)
                           ->  Index Scan using organizations_pkey on organizations  (cost=0.00..4.66 rows=1 width=872) (actual time=0.32..0.34 rows=1 loops=1)
                                 Index Cond: (org_id = 71)
                           ->  Seq Scan on org_milestones  (cost=0.00..2.15 rows=115 width=4) (actual time=0.04..2.97 rows=116 loops=1)
                     ->  Hash  (cost=8.36..8.36 rows=136 width=4) (actual time=5.44..5.44 rows=0 loops=1)
                           ->  Seq Scan on contacts  (cost=0.00..8.36 rows=136 width=4) (actual time=0.05..3.21 rows=136 loops=1)
               ->  Hash  (cost=19.92..19.92 rows=292 width=4) (actual time=10.50..10.50 rows=0 loops=1)
                     ->  Seq Scan on tasks  (cost=0.00..19.92 rows=292 width=4) (actual time=0.06..6.38 rows=294 loops=1)
 Total runtime: 24881.28 msec
(17 rows)

EXPLAIN ANALYZE SELECT organizations.city, organizations.inactive,
    organizations.java_developers, organizations.name, organizations.org_id,
    organizations.overview, organizations.phone, organizations.salesperson,
    organizations.state, organizations.time_zone
FROM organizations
WHERE (organizations.org_id = 71);

                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using organizations_pkey on organizations  (cost=0.00..4.66 rows=1 width=872) (actual time=0.28..0.29 rows=1 loops=1)
   Index Cond: (org_id = 71)
 Total runtime: 0.74 msec
(3 rows)

Here is the complete schema:

CREATE TABLE organizations (
    org_id SERIAL PRIMARY KEY,
    salesperson INTEGER REFERENCES users(user_id),
    name VARCHAR(30) NOT NULL,
    phone VARCHAR(30),
    city VARCHAR(30),
    state VARCHAR(2),
    time_zone VARCHAR(3),
    overview TEXT,
    java_developers INTEGER,
    inactive BOOLEAN NOT NULL
);
CREATE INDEX organizations_salesperson ON organizations(salesperson);

CREATE TABLE milestones (
    milestone_id SERIAL PRIMARY KEY,
    name VARCHAR(60) NOT NULL,
    red_flag_days INTEGER NOT NULL
);

CREATE TABLE org_milestones (
    id SERIAL PRIMARY KEY,
    milestone_id INTEGER NOT NULL REFERENCES milestones(milestone_id),
    org_id INTEGER NOT NULL REFERENCES organizations(org_id),
    completed_on DATE NOT NULL,
    notes VARCHAR(250)
);

CREATE TABLE contacts (
    contact_id SERIAL PRIMARY KEY,
    org_id INTEGER NOT NULL REFERENCES organizations(org_id),
    first_name VARCHAR(30) NOT NULL,
    last_name VARCHAR(30),
    title VARCHAR(60),
    phone VARCHAR(30),
    email_address VARCHAR(120),
    assistant_name VARCHAR(30),
    assistant_phone VARCHAR(30),
    assistant_email VARCHAR(120),
    functional_role VARCHAR(30),
    reports_to INTEGER REFERENCES contacts(contact_id),
    notes TEXT,
    pain VARCHAR(120),
    reasons VARCHAR(250),
    needs VARCHAR(250),
    address_1 VARCHAR(60),
    address_2 VARCHAR(60),
    city VARCHAR(30),
    state VARCHAR(2),
    zip_code VARCHAR(10),
    time_zone VARCHAR(3)
);
CREATE INDEX contacts_org_id ON contacts(org_id);

CREATE TABLE tasks (
    task_id SERIAL PRIMARY KEY,
    org_id INTEGER NOT NULL REFERENCES organizations(org_id),
    contact_id INTEGER NOT NULL REFERENCES contacts(contact_id),
    created_by INTEGER NOT NULL REFERENCES users(user_id),
    start_date DATE NOT NULL,
    start_time TIME,
    completed BOOLEAN NOT NULL,
    task_type VARCHAR(30) NOT NULL,
    objective VARCHAR(120) NOT NULL,
    outcome VARCHAR(120),
    notes TEXT,
    priority INTEGER NOT NULL
);
CREATE INDEX tasks_org_id ON tasks(org_id);
CREATE INDEX tasks_contact_id ON tasks(contact_id);
CREATE INDEX tasks_start_date ON tasks(start_date);

CREATE TABLE attachments (
    attach_id SERIAL PRIMARY KEY,
    task_id INTEGER NOT NULL REFERENCES tasks(task_id) ON DELETE CASCADE,
    attachment BYTEA
);
CREATE INDEX attachments_task_id ON attachments(task_id);

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, April 23, 2003 11:00 PM
To: Robert Dyas
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] query optimization scenarios 17,701 times faster!!!
"Robert Dyas" <rdyas@adelphia.net> writes: > The following is a list of query pairs (one fast, one slow) that must > produce identical results by definition (and do), but have very different > execution times. With no details about the table schemas, nor the EXPLAIN ANALYZE output data, I really wonder how you expect any intelligent comments. We are programmers, not mind-readers. regards, tom lane
On Wed, Apr 23, 2003 at 02:04:02PM -0400, Robert Dyas wrote:
> The following is a list of query pairs (one fast, one slow) that must
> produce identical results by definition (and do), but have very different
> execution times.

I think what you see is what is described in the documentation too. If you use outer joins, it will do them in the order of the query. I think the reason for that was that the SQL standard required it. So if you put your query in another order, you will get a different result. I believe they removed that restriction in the latest/CVS version of PostgreSQL.

> Especially the last example.

Where it's joining a lot of tables it doesn't use in the first place. You even removed the conditions on how to join the tables.

Kurt
I was hoping to point out an extreme example in hopes of showing that some very important optimizations are not taking place.

In a perfect world, two different queries that must by definition return the same result would run the exact same query plan, which would be the fastest one available. Yes, I know I'm stating the obvious, but sometimes that helps.

There are some basic optimizations that I think may not be taking place that could have a dramatic impact in many different scenarios. The biggest one that I can see could be stated like this: "if a WHERE clause includes a restriction on a primary key column with a fixed value, only that row should be used in subsequent query processing". Conceptually that seems like an "easy win" for improving performance, possibly very significantly, in a wide variety of circumstances.

-----Original Message-----
From: Q@ping.be [mailto:Q@ping.be]
Sent: Thursday, April 24, 2003 2:08 PM
To: Robert Dyas
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] query optimization scenarios 17,701 times faster!!!

On Wed, Apr 23, 2003 at 02:04:02PM -0400, Robert Dyas wrote:
> The following is a list of query pairs (one fast, one slow) that must
> produce identical results by definition (and do), but have very different
> execution times.

I think what you see is what is described in the documentation too. If you use outer joins, it will do them in the order of the query. I think the reason for that was that the SQL standard required it. So if you put your query in another order, you will get a different result. I believe they removed that restriction in the latest/CVS version of PostgreSQL.

> Especially the last example.

Where it's joining a lot of tables it doesn't use in the first place. You even removed the conditions on how to join the tables.

Kurt
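Robert's "restrict then join" idea can be illustrated with a toy example (mine - plain Python nested loops standing in for a join executor, with made-up table sizes and key 7 standing in for org_id = 71). Joining first materializes every matching combination and filters at the end; restricting first never looks past the one key:

```python
# Toy illustration of "restrict then join" vs "join then restrict".
# Three tables keyed by org_id: each org matches 2 contacts and 3 tasks.
orgs = [{"org_id": i} for i in range(40)]
contacts = [{"org_id": i % 40} for i in range(80)]
tasks = [{"org_id": i % 40} for i in range(120)]

# Join then restrict: build the full three-way join, filter at the end.
joined = [(o, c, t) for o in orgs for c in contacts for t in tasks
          if o["org_id"] == c["org_id"] == t["org_id"]]
join_first = [row for row in joined if row[0]["org_id"] == 7]

# Restrict then join: apply the WHERE restriction before joining anything.
restricted = [o for o in orgs if o["org_id"] == 7]
restrict_first = [(o, c, t)
                  for o in restricted
                  for c in contacts if c["org_id"] == o["org_id"]
                  for t in tasks if t["org_id"] == o["org_id"]]

print(len(joined), len(join_first), len(restrict_first))  # → 240 6 6
```

Both orders return the same 6 rows, but "join then restrict" materializes 240 intermediate rows (and scans every combination) only to throw almost all of them away - the 24-row vs 1-row asymmetry from the thread's examples, in miniature.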
On Thu, 24 Apr 2003, Robert Dyas wrote:
> I was hoping to point out an extreme example in hopes of showing that some
> very important optimizations are not taking place.
>
> In a perfect world, two different queries that must by definition return the
> same result would run the exact same query plan, which would be the fastest
> one available. Yes, I know I'm stating the obvious, but sometimes that
> helps.

Of course in a perfect world doing the optimizations would take no time. ;)

> There are some basic optimizations that I think may not be taking place that
> could have a dramatic impact in many different scenarios. The biggest one
> that I can see could be stated like this: "if a where clause includes a
> restriction on a primary key column with a fixed value, only that row should
> be used in subsequent query processing"

The particular conversion you're doing for that query requires more than just that AFAICS. The outerness of the join makes it so you know there won't be 0 rows after the join, and the DISTINCT cancels out the possibility of multiple rows. Both of those conditions seem necessary to make the conversion valid as well.

And for your first example (with the right join), it seems you'd need to know that if there were multiple matching rows with the right join, the output rows were guaranteed to be distinct (since otherwise the SELECT DISTINCT might roll the two rows together, which would cause the output to change from the second version of that example).

A more detailed description of the actual optimizations you're envisioning (rather than example queries) might be helpful. Also, I think there exists some chance of side effects that would be elided by the second version in the case of views or set-returning functions, although I think we can probably ignore that.
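Stephan's caveat - the equivalence leans on the DISTINCT and on the output rows being inherently unique - can be shown concretely with a small SQLite sketch (my own construction; a LEFT JOIN from organizations stands in for the thread's RIGHT JOIN, and the data is made up). If the selected columns do not include a unique key, the DISTINCT and non-DISTINCT versions genuinely differ:

```python
import sqlite3

# Two milestone rows with identical visible content for org 71: once the
# unique id column is left out of the SELECT list, DISTINCT changes the answer.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE organizations (org_id INTEGER PRIMARY KEY);
CREATE TABLE org_milestones (id INTEGER PRIMARY KEY, org_id INTEGER, notes TEXT);
INSERT INTO organizations VALUES (71);
INSERT INTO org_milestones VALUES (1, 71, 'call back'), (2, 71, 'call back');
""")

with_distinct = cur.execute("""
    SELECT DISTINCT m.notes
    FROM organizations o LEFT OUTER JOIN org_milestones m ON m.org_id = o.org_id
    WHERE o.org_id = 71
""").fetchall()

without_distinct = cur.execute("""
    SELECT m.notes
    FROM organizations o LEFT OUTER JOIN org_milestones m ON m.org_id = o.org_id
    WHERE o.org_id = 71
""").fetchall()

print(with_distinct)     # → [('call back',)]
print(without_distinct)  # → [('call back',), ('call back',)]
```

The thread's "fast" versions drop the DISTINCT; that is only safe because the selected columns include the table's primary key, which guarantees the rows were distinct to begin with - exactly the extra condition Stephan is pointing at.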
Actually, what I'm suggesting first is much more straightforward. Each operation must take place on a row set. When you've got a bunch of tables that are joined together, and columns with unique indexes are specified in the WHERE clause to be equal to some simple value (regardless of which table), then before you do any other processing, make sure you only process that single row! Don't join a million rows together only to throw all of them out but one!! Which, it appears, is exactly what 7.3.1 was doing.

It should be easy to scan a WHERE clause for columns that have unique indexes and are equal to a fixed specified value. If so, in all cases, use that to limit rows before you do any other processing. That is the simple case, but that alone will have a major impact on a good percentage of real-world queries. It is a simple optimization and will work in all cases, no exceptions. If you join tables before ever considering the WHERE clause restrictions, you've got to be doing a ton of extra work in many, many cases.

If the above makes sense, then we can talk about the next most general case (i.e. determining the likely number of rows that will be returned by a WHERE clause restriction on a column). But the concept is still the same - don't spend time joining rows when you are going to later throw them away because of some WHERE clause restriction. Stated another way, "restrict then join" is far more efficient than "join then restrict".

-----Original Message-----
From: Stephan Szabo [mailto:sszabo@megazone23.bigpanda.com]
Sent: Thursday, April 24, 2003 4:39 PM
To: Robert Dyas
Subject: RE: [HACKERS] query optimization scenarios 17,701 times faster!!!

On Thu, 24 Apr 2003, Robert Dyas wrote:
> Actually, I wish I would have sent along only the last pair of queries to
> focus the conversation more. So I will focus on the last pair of queries for
> this comment.

Okay, that one did seem weirder.
> In that case, the first table specified in the FROM clause is the table from
> which all of the columns are retrieved. Furthermore, the WHERE clause
> indicates that the table's primary key must equal a single value. This can
> only return one row (or zero rows). So all subsequent processing should only
> occur on that one row. That is the simple case involving unique indexes, and
> the easiest to optimize.

Right, but I was pointing out that the particular two queries you gave aren't necessarily the same without that DISTINCT on the query. The fact that it gives you the same results depends on knowing that only one (or zero, because it's outer) row on the right side will match, or that you're doing something like DISTINCT, DISTINCT ON, or GROUP BY, AFAICS. Otherwise I think you'd get duplicated rows in the first version and not in the second.

> The more general case would be an estimate (based on index statistics) of
> the number of rows that would be returned by a column equaling some value.
> Those where the estimated number of hits is sufficiently small relative to
> total table size would be processed first (in the case of multiple
> restrictions, those resulting in the least number of estimated result rows
> would be processed first). I would assume this is one of the first
> optimizations somebody would address when writing a query optimizer, yet it
> appears to be absent or broken in 7.3.1.

The problem is that explicit JOIN syntax is also overloaded right now to mean "do these joins in this order", which does inhibit some optimizations. I think there's been talk of making this optional at some point. I believe there were cases with outer joins where (A LJ B) LJ C is not the same as A LJ (B LJ C) as well (admittedly they're really weird, using things like IS NOT NULL or COALESCE in the ON conditions), and so you have to be careful when reordering them (if that's part of what you're suggesting).
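The non-associativity of outer joins that Stephan mentions is real. Here is one minimal concrete case (my construction, in SQLite via Python, using an ON condition that ignores the middle table's join key - a subquery stands in for the right-associated grouping) where (A LJ B) LJ C and A LJ (B LJ C) return different rows:

```python
import sqlite3

# With b empty and an ON condition (1 = 1) that doesn't reference b,
# the two groupings of the same left joins disagree.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE a(x INTEGER);
CREATE TABLE b(y INTEGER);
CREATE TABLE c(z INTEGER);
INSERT INTO a VALUES (1);
INSERT INTO c VALUES (10);   -- b is deliberately left empty
""")

# (a LEFT JOIN b) LEFT JOIN c: the c join still fires even though b was null-extended.
left_assoc = cur.execute("""
    SELECT a.x, b.y, c.z
    FROM a LEFT JOIN b ON a.x = b.y
           LEFT JOIN c ON 1 = 1
""").fetchall()

# a LEFT JOIN (b LEFT JOIN c): the inner join is empty, so everything from it is NULL.
right_assoc = cur.execute("""
    SELECT a.x, s.y, s.z
    FROM a LEFT JOIN (SELECT b.y AS y, c.z AS z
                      FROM b LEFT JOIN c ON 1 = 1) AS s
           ON a.x = s.y
""").fetchall()

print(left_assoc)    # → [(1, None, 10)]
print(right_assoc)   # → [(1, None, None)]
```

This is why a planner cannot freely reorder outer joins the way it reorders inner joins - and why explicit JOIN syntax doubling as a join-order directive was, at the time, a real constraint.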
On Thu, 24 Apr 2003, Robert Dyas wrote:
> clause restriction on a column). But the concept is still the same - don't
> spend time joining rows when you are going to later throw them away because
> of some where clause restriction. Stated another way, "restrict then join"
> is far more efficient than "join then restrict".

True, but "join then restrict" is guaranteed to get the right answer through sheer force of will, while "restrict then join" requires that you not only try to optimize the query, but also make sure you are not throwing away the wrong rows. I.e. accidentally leaving in extra rows to throw away costs you CPU time; accidentally tossing the wrong rows gives bogus results.

The real answer here is that SQL is the answer. It allows you to restrict the data sets you're playing with before the join, not after. Subselects, unions, etc. allow you to build a more efficient query that is handling, in theory, smaller data sets, and should therefore be faster.

I'd love for the planner to be able to optimize everything, but let's face it: since all databases live in the real world where optimization can never be perfect, we should all strive to create SQL queries that hit the fewest rows needed to do the job, and THEN let the planner take it from there. We all benefit from faster sorting algorithms, better indexing methods, etc. Only people who write very inefficient SQL benefit from the types of optimizations you're talking about. So, if someone has to put in time programming, I'd rather it be on things we can all use and benefit from the most.

When we have async/sync multi-master multi-slave replication, with bitmapped indexes and super fast hashing, along with maybe a ccNUMA-friendly caching method that can efficiently handle tens of gigabytes of free RAM, sure, maybe someone should get around to optimizing the corner cases. But unconstrained joins, or even just poorly thought out ones, are NOT a corner case; they're a SQL mistake.
On Thu, 24 Apr 2003, Robert Dyas wrote:
> Actually, what I'm suggesting first is much more straightforward. Each
> operation must take place on a row set. When you've got a bunch of tables
> that are joined together, and columns with unique indexes are specified in
> the where clause to be equal to some simple value (regardless of which
> table), then before you do any other processing, make sure you only process
> that single row! Don't join a million rows together only to throw all of
> them out but one!! Which it appears is exactly what 7.3.1 was doing.

It doesn't look that way to me. It looks to me from the explain output that it was doing an index scan getting the one row from the table with the condition, then joining that with each successive table (possibly getting additional rows from those), then doing a sort and unique for the DISTINCT.
On Thu, 24 Apr 2003, Stephan Szabo wrote:
> On Thu, 24 Apr 2003, Robert Dyas wrote:
> > Actually, what I'm suggesting first is much more straightforward. Each
> > operation must take place on a row set. When you've got a bunch of tables
> > that are joined together, and columns with unique indexes are specified in
> > the where clause to be equal to some simple value (regardless of which
> > table), then before you do any other processing, make sure you only process
> > that single row! Don't join a million rows together only to throw all of
> > them out but one!! Which it appears is exactly what 7.3.1 was doing.
>
> It doesn't look that way to me. It looks to me from the explain output
> that it was doing an index scan getting the one row from the table with
> the condition and then joining that with each successive table (possibly
> getting additional rows from those), then doing a sort and unique for the
> distinct.

As a note, relooking at the EXPLAIN ANALYZE output, it looked like most of the time was in the sort and unique for the DISTINCT. I wonder if raising sort_mem would give a gain. Not sure about that. (At best it'd still be on the order of 300x the other query rather than 17,000x.)
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> The particular conversion you're doing for that query requires more than
> just that AFAICS. The outerness of the join makes it so you know there
> won't be 0 rows after the join, and the distinct cancels out the
> possibility of multiple rows. Both of those conditions seem necessary to
> make the conversion valid as well.

Yeah. I think the optimization rule being proposed here is something like "if a table is not used in the SELECT output list, *and* it's on the nullable side of an outer join (so that it can't cause rows to be omitted from the output), *and* there's a DISTINCT (so that generating multiple matched rows isn't interesting either), then don't bother to include the table in the join tree".

But I'm not convinced that's a safe transformation --- are there any other conditions that would have to be checked? And of course the $64 question is whether the combination would hit often enough to make it worth the cycles to check for. It seems like a fairly unusual case to me.

The variant where you know there are not multiple rows because you're joining on a unique column, rather than depending on DISTINCT, might be more useful in practice ... but I'm still not convinced it's worth checking for.

			regards, tom lane
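Tom's proposed rule can be restated as a toy checker (my sketch over a simplified dictionary description of a query - this is not PostgreSQL source code, and it deliberately tests only the three conditions he names):

```python
# Toy predicate for the proposed join-removal rule: a joined table may be
# dropped only if (1) none of its columns appear in the output list,
# (2) it sits on the nullable side of an outer join (it can't remove rows),
# and (3) DISTINCT will collapse any row multiplication it could cause.
def can_drop_join(table, query):
    not_in_output = all(not col.startswith(table + ".")
                        for col in query["output_columns"])
    nullable_side = table in query["nullable_side_tables"]
    return not_in_output and nullable_side and query["distinct"]

# The thread's last query: only organizations columns are selected, the
# other three tables are on the nullable side of LEFT JOINs, DISTINCT is on.
query = {
    "output_columns": ["organizations.org_id", "organizations.name"],
    "nullable_side_tables": {"org_milestones", "contacts", "tasks"},
    "distinct": True,
}

print([t for t in ("org_milestones", "contacts", "tasks", "organizations")
       if can_drop_join(t, query)])
# → ['org_milestones', 'contacts', 'tasks']
```

All three joined tables qualify for removal while organizations (which supplies the output columns) does not - matching the hand-optimized "fast" version of the last query. Tom's unique-column variant would replace condition (3) with a check that the join key is unique on the dropped side.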