Обсуждение: Help analyzing 7.2.4 EXPLAIN

Поиск
Список
Период
Сортировка

Help analyzing 7.2.4 EXPLAIN

От
Josh Berkus
Дата:
Folks,

What follows is a 7.2.4 EXPLAIN ANALYZE statement for the shown query.  This
query is currently taking 570 msec, an OK amount of time until you realize
that the system only has test data currently, and the tables in full
production will have 100-1000 times as much data.

 Becuase it's 7.2.4, it's a little hard to tell exactly which part of the
query is taking up 90% of the processing time.    The line which claims to be
taking that time is:
  ->  Seq Scan on users  (cost=0.00..3595.33 rows=12 width=87) (actual
time=13.50..547.59 rows=41 loops=1)

However, since users has only 200 records, I suspect that what's actually
being represented here is the processing time for the PL/pgSQL procedure in
the correlated subselect, if_addendee_conflict().

Questions:
1. Can anyone confirm my analysis in the paragraph above?
2. Can anyone point out any obvious ways to speed up the query below?
3. In the query below, if_attendee_conflict needs to be run once for each
(attorney times events) on the same day.   Further, if_attendee_conflict
involves a database lookup in 10-25% of cases.   Given that
if_attendee_conflict needs to apply complex criteria to determine whether or
not there is a conflict, can anyone suggest possible ways to cut down on the
number of required loops?

Thanks everyone!   Query and analyze follows.


j_test=> explain analyze
SELECT users.user_id, (users.fname || COALESCE(' ' || users.minit, '') || ' '
|| users.lname) as atty_name,
     users.lname,
     (SELECT if_addendee_conflict(users.user_id, 3272, '2003-04-15 10:00', '1
days'::INTERVAL,
         events.event_id, events.event_date, events.duration, event_cats.status, '30
minutes') as cflt
        FROM events, event_types, event_cats, event_days
        WHERE events.event_id = event_days.event_id
            and events.etype_id = event_types.etype_id
             AND event_types.ecat_id = event_cats.ecat_id
             AND event_days.event_day
                 BETWEEN '2003-04-15' AND '2003-04-16 10:00'
        ORDER BY cflt LIMIT 1)  AS conflict
FROM users
WHERE EXISTS (SELECT teams_users.user_id FROM teams_users JOIN teams_tree
    ON teams_users.team_id = teams_tree.team_id WHERE teams_tree.treeno
    BETWEEN 3 and 4 AND teams_users.user_id = users.user_id)
AND users.status > 0
    AND NOT EXISTS (SELECT staff_id FROM event_staff WHERE event_id = 3272
         AND staff_id = users.user_id)
ORDER BY conflict, users.lname, atty_name;

NOTICE:  QUERY PLAN:

Sort  (cost=3595.55..3595.55 rows=12 width=87) (actual time=547.89..547.91
rows=41 loops=1)
  ->  Seq Scan on users  (cost=0.00..3595.33 rows=12 width=87) (actual
time=13.50..547.59 rows=41 loops=1)
        SubPlan
          ->  Limit  (cost=54.03..54.03 rows=1 width=46) (actual
time=13.14..13.14 rows=1 loops=41)
                ->  Sort  (cost=54.03..54.03 rows=1 width=46) (actual
time=13.13..13.13 rows=2 loops=41)
                      ->  Hash Join  (cost=52.77..54.02 rows=1 width=46)
(actual time=5.09..12.94 rows=95 loops=41)
                            ->  Seq Scan on event_cats  (cost=0.00..1.16
rows=16 width=6) (actual time=0.01..0.05 rows=16 loops=41)
                            ->  Hash  (cost=52.77..52.77 rows=1 width=40)
(actual time=4.72..4.72 rows=0 loops=41)
                                  ->  Hash Join  (cost=49.94..52.77 rows=1
width=40) (actual time=4.19..4.59 rows=95 loops=41)
                                        ->  Seq Scan on event_types
(cost=0.00..2.54 rows=54 width=8) (actual time=0.01..0.12 rows=54 loops=41)
                                        ->  Hash  (cost=49.93..49.93 rows=5
width=32) (actual time=4.10..4.10 rows=0 loops=41)
                                              ->  Nested Loop
(cost=0.00..49.93 rows=5 width=32) (actual time=0.16..3.95 rows=95 loops=41)
                                                    ->  Seq Scan on event_days
(cost=0.00..25.00 rows=5 width=4) (actual time=0.12..2.31 rows=95 loops=41)
                                                    ->  Index Scan using
events_pkey on events  (cost=0.00..4.97 rows=1 width=28) (actual
time=0.01..0.01 rows=1 loops=3895)
          ->  Nested Loop  (cost=0.00..19.47 rows=1 width=12) (actual
time=0.04..0.04 rows=0 loops=147)
                ->  Index Scan using idx_teams_tree_node on teams_tree
(cost=0.00..8.58 rows=2 width=4) (actual time=0.01..0.02 rows=2 loops=147)
                ->  Index Scan using teams_users_pk on teams_users
(cost=0.00..4.83 rows=1 width=8) (actual time=0.01..0.01 rows=0 loops=252)
          ->  Index Scan using event_staff_table_pk on event_staff
(cost=0.00..4.95 rows=1 width=4) (actual time=0.01..0.01 rows=0 loops=41)
Total runtime: 548.20 msec

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Help analyzing 7.2.4 EXPLAIN

От
Tom Lane
Дата:
Josh Berkus <josh@agliodbs.com> writes:
>  Becuase it's 7.2.4, it's a little hard to tell exactly which part of the
> query is taking up 90% of the processing time.

Keep in mind that in the subqueries, the "actual time" shown is the time
per iteration --- you should multiply by the "loops" value to get an
accurate idea of where the time is going.  With that in mind, it's real
clear that the first subplan is eating the bulk of the time.

I think you are probably right that execution of the
if_addendee_conflict() function is the main cost.  But
given this subquery that's not too surprising:

>      (SELECT if_addendee_conflict(users.user_id, 3272, '2003-04-15 10:00', '1 days'::INTERVAL,
>          events.event_id, events.event_date, events.duration, event_cats.status, '30 minutes') as cflt
>         FROM events, event_types, event_cats, event_days
>         WHERE events.event_id = event_days.event_id
>             and events.etype_id = event_types.etype_id
>              AND event_types.ecat_id = event_cats.ecat_id
>              AND event_days.event_day
>                  BETWEEN '2003-04-15' AND '2003-04-16 10:00'
>         ORDER BY cflt LIMIT 1)  AS conflict

What you have here is a subquery that will execute
if_addendee_conflict() for *each* row of the events table; then throw
away all but one of the results.  And then do that over again for each
user row.  It looks to me like if_addendee_conflict() is being called
nearly 4000 times in this query.  No wonder it's slow.

The first thing that pops to mind is whether you really need the *first*
conflict, or would it be enough to find any old conflict?  If you could
dispense with the ORDER BY then at least some evaluations of
if_addendee_conflict() could be saved.

Realistically, though, I think you're going to have to refactor the work
to make this perform reasonably.  How much of what
if_addendee_conflict() does is actually dependent on the user_id?  Could
you separate out tests that depend only on the event, and do that in a
separate pass that is done only once per event, instead once per
event*user?  If you could reduce the number of events that need to be
examined for any given user, you could get somewhere.

Also, I don't see where this query checks to see if the user is actually
interested in attending the event.  Is that one of the things
if_addendee_conflict checks?  If so, you should pull it out and make it
a join condition.  You're essentially forcing the stupidest possible
join algorithm by burying that condition inside a user-defined function.
It would win to check that sooner instead of later, since presumably the
set of interesting events for any one user is a lot smaller than the set
of all events.

            regards, tom lane


Re: Help analyzing 7.2.4 EXPLAIN

От
Josh Berkus
Дата:
Tom,

> Keep in mind that in the subqueries, the "actual time" shown is the time
> per iteration --- you should multiply by the "loops" value to get an
> accurate idea of where the time is going.  With that in mind, it's real
> clear that the first subplan is eating the bulk of the time.

Thanks, that's what I thought, but I wanted confirmation.

> The first thing that pops to mind is whether you really need the *first*
> conflict, or would it be enough to find any old conflict?  If you could
> dispense with the ORDER BY then at least some evaluations of
> if_addendee_conflict() could be saved.

The problem is that I need the lowest-sorted non-NULL conflict.   The majority
(95%) of the runs of if_attendee_conflict will return NULL.  But we can't
know that until we run the test, which is a bit too complex for a case
statement.

Now, if I could figure out a way to stop testing for a particular user the
first time if_attendee_conflict returned a particular result, that could cut
the number of subquery loops by 1/3.   Any ideas?

> Realistically, though, I think you're going to have to refactor the work
> to make this perform reasonably.  How much of what
> if_addendee_conflict() does is actually dependent on the user_id?

Almost all of it.  The question being answered by the query is "Please give me
the list of all users, plus which of them have a conflict for that particular
date and time and what kind of conflict it is".

>Could
> you separate out tests that depend only on the event, and do that in a
> separate pass that is done only once per event, instead once per
> event*user?  If you could reduce the number of events that need to be
> examined for any given user, you could get somewhere.

Regrettably, no.  We have to run it for each user.   I was acutally hoping to
come up with a way of running for less events, acutally ....

>
> Also, I don't see where this query checks to see if the user is actually
> interested in attending the event.  Is that one of the things
> if_addendee_conflict checks?

No.  <grin> the users aren't given a choice about what they want to attend --
the purpose of the query is to supply the calendar staff with a list of who's
available so the users can be assigned -- whether they want to or not.

Well, we'll see if the current incarnation bogs down in a couple of months,
and I'll rework the query if so.  Thanks for the advice!

--
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Help analyzing 7.2.4 EXPLAIN

От
Josh Berkus
Дата:
Tom,

If you're interested, here's the query I ended up with.  It's much uglier than
the original query, but gives me slightly more data (one bit of information
is seperated into 2 columns rather than rolled up), is 100ms faster, and
should not slow down much with the growth of the tables:

SELECT users.user_id, (users.fname || COALESCE(' ' || users.minit, '') || ' '
|| users.lname) as atty_name,
    users.lname,
    COALESCE (
    (SELECT if_addendee_conflict(users.user_id, 3272, '2003-04-15 10:00', '1
days'::INTERVAL,
        events.event_id, events.event_date, events.duration, event_cats.status,
        '30 minutes', staff_id) as cflt
        FROM event_types, event_cats, event_days, events, event_staff
        WHERE events.event_id = event_days.event_id
            and events.etype_id = event_types.etype_id
            AND event_types.ecat_id = event_cats.ecat_id
            AND event_days.event_day BETWEEN '2003-04-15' AND '2003-04-16 10:00'
            AND events.event_id <> 3272
            AND events.event_id = event_staff.event_id
            AND event_staff.staff_id = users.user_id
            AND event_cats.status IN (1,3)
        ORDER BY cflt LIMIT 1),
    (SELECT 'LEAVE'::TEXT
     FROM event_types, event_cats, event_days, events
     WHERE events.event_id = event_days.event_id
            and events.etype_id = event_types.etype_id
            AND event_types.ecat_id = event_cats.ecat_id
            AND event_days.event_day BETWEEN '2003-04-15' AND '2003-04-16 10:00'
            AND events.event_id <> 3272
            AND event_cats.status = 4)
         ) AS conflict,
    (SELECT (staff_id > 0) FROM event_staff
        WHERE event_id = 3272
        AND staff_id = users.user_id) as assigned
FROM users
WHERE EXISTS (SELECT teams_users.user_id FROM teams_users JOIN teams_tree
    ON teams_users.team_id = teams_tree.team_id WHERE teams_tree.treeno
    BETWEEN 3 and 4 AND teams_users.user_id = users.user_id)
    AND users.status > 0
ORDER BY conflict, users.lname, atty_name;

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco