Poor Performance with Distinct Subqueries with EXISTS and EXCEPT

Поиск
Список
Период
Сортировка
От Thomas F.O'Connell
Тема Poor Performance with Distinct Subqueries with EXISTS and EXCEPT
Дата
Msg-id DAD07F94-435F-11D9-A917-000D93AE0944@alumni.brown.edu
обсуждение исходный текст
Ответы Re: Poor Performance with Distinct Subqueries with EXISTS and EXCEPT  (Pierre-Frédéric Caillaud<lists@boutiquenumerique.com>)
Список pgsql-general
I'm trying to do some research and reporting for an email application
by domain name. This has led to a confounding attempt to do any of the
legwork in SQL via postgres.

Here is my foundational query:

SELECT DISTINCT split_part( u.email, '@', 2 )
FROM user AS u, message AS m
WHERE u.id = m.user_id
AND NOT EXISTS (
    SELECT 1
    FROM message_action ma
    WHERE ma.message_id = m.id
)
EXCEPT
SELECT DISTINCT split_part( u.email, '@', 2 )
FROM user AS u, message AS m
WHERE u.id = m.user_id
AND EXISTS (
    SELECT 1
    FROM message_action ma
    WHERE ma.message_id = m.id
)

This is designed to give me unique domain names for all users who have
not committed an action on an email. The way I figure this needed to
work was to get all members joined to messages that didn't have an
action. That's the part of the query before the EXCEPT. The EXISTS
checks for an action on the message.

Now, since I'm actually interested in unique domain names rather than
unique users, I need to get all the unique domain names corresponding
to users who have acted on a message. That's what the part of the query
after the EXCEPT is.

This query performs abysmally for even small numbers of users and
messages (each on the order of 1-10 thousand). Honestly, I have not
gotten it to finish for even these small cases. In one situation on a
development database, it filled up $PGDATA/base. This is another
instance of my surprising myself with my ability to generate
slow-running queries where I don't fully understand the interaction
between postgres and what I think I'm asking of postgres in SQL.

I'd love to deepen my knowledge of joins in general, so if anyone has
any suggestions for improvements on the EXISTS checks, then I'm anxious
to learn alternatives. One thing I don't fully understand is why all
the scans are Seq Scans. These IDs are all integer primary/foreign keys
(with indexes).

Here's an example of a plan for a user with no messages:


          QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------
  SetOp Except  (cost=479.01..479.02 rows=1 width=24) (actual
time=176.258..176.258 rows=0 loops=1)
    ->  Sort  (cost=479.01..479.02 rows=2 width=24) (actual
time=176.253..176.253 rows=0 loops=1)
          Sort Key: split_part
          ->  Append  (cost=239.48..479.00 rows=2 width=24) (actual
time=176.230..176.230 rows=0 loops=1)
                ->  Subquery Scan "*SELECT* 1"  (cost=239.48..239.50
rows=1 width=24) (actual time=90.390..90.390 rows=0 loops=1)
                      ->  Unique  (cost=239.48..239.49 rows=1 width=24)
(actual time=90.386..90.386 rows=0 loops=1)
                            ->  Sort  (cost=239.48..239.49 rows=1
width=24) (actual time=90.382..90.382 rows=0 loops=1)
                                  Sort Key: split_part((u.email)::text,
'@'::text, 2)
                                  ->  Nested Loop  (cost=0.00..239.47
rows=1 width=24) (actual time=90.370..90.370 rows=0 loops=1)
                                        ->  Seq Scan on user u
(cost=0.00..239.46 rows=1 width=24) (actual time=0.096..36.555
rows=10117 loops=1)
                                              Filter: (id = user_id)
                                        ->  Seq Scan on message m
(cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.001 rows=0
loops=10117)
                                              Filter: (NOT (subplan))
                                              SubPlan
                                                ->  Seq Scan on
message_action ma  (cost=0.00..0.00 rows=1 width=0) (never executed)
                                                      Filter:
(message_id = $0)
                ->  Subquery Scan "*SELECT* 2"  (cost=239.48..239.50
rows=1 width=24) (actual time=85.835..85.835 rows=0 loops=1)
                      ->  Unique  (cost=239.48..239.49 rows=1 width=24)
(actual time=85.830..85.830 rows=0 loops=1)
                            ->  Sort  (cost=239.48..239.49 rows=1
width=24) (actual time=85.828..85.828 rows=0 loops=1)
                                  Sort Key: split_part((u.email)::text,
'@'::text, 2)
                                  ->  Nested Loop  (cost=0.00..239.47
rows=1 width=24) (actual time=85.792..85.792 rows=0 loops=1)
                                        ->  Seq Scan on user u
(cost=0.00..239.46 rows=1 width=24) (actual time=0.017..32.322
rows=10117 loops=1)
                                              Filter: (id = user_id)
                                        ->  Seq Scan on message m
(cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.001 rows=0
loops=10117)
                                              Filter: (subplan)
                                              SubPlan
                                                ->  Seq Scan on
message_action ma  (cost=0.00..0.00 rows=1 width=0) (never executed)
                                                      Filter:
(message_id = $0)
  Total runtime: 176.691 ms
(29 rows)

Here's version information: PostgreSQL 7.4.6 on i686-pc-linux-gnu,
compiled by GCC 2.95.4

Thanks.

-tfo


В списке pgsql-general по дате отправления:

Предыдущее
От: Joachim Zobel
Дата:
Сообщение: createlang plperl fails with 8.0 beta5
Следующее
От: "Artistic-HO- IT-Department"
Дата:
Сообщение: Need Help and suggestion.