Обсуждение: Hash join seq scan slow

Поиск
Список
Период
Сортировка

Hash join seq scan slow

От
Aldo Sarmiento
Дата:
Hello,

I'm assuming this topic has probably been bludgeoned to a pulp, but my google-fu can't seem to find a solution.

I have two relatively largish tables that I'm trying to join that result in a slow query.

Hardware: 

2014 iMac w/ SSD & i5 processor

Tables:
contacts: 1.14 million rows
permissions: 2.49 million rows

contacts have many permissions

Goal: get first page of contacts (limit 40) a user has access to, sorted by creation date.

For simplicity's sake, I've taken out some of the complexity of how I retrieve the permissions & just got all permissions with id less than 2,100,000 to simulate access of ~151k contacts.

stage4=# EXPLAIN ANALYZE WITH perms AS (
stage4(#   SELECT DISTINCT(contact_id) from permissions where id < 2100000
stage4(# ) SELECT
stage4-#     contacts.id,
stage4-#     contacts.first_name
stage4-#   FROM contacts
stage4-#   INNER JOIN perms ON  perms.contact_id = contacts.id
stage4-#   ORDER BY contacts.updated_at desc NULLS LAST LIMIT 40 OFFSET 0;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=226777.45..226777.55 rows=40 width=20) (actual time=1556.133..1556.143 rows=40 loops=1)
   CTE perms
     ->  HashAggregate  (cost=34891.61..35404.25 rows=51264 width=4) (actual time=107.107..151.455 rows=151920 loops=1)
           Group Key: permissions.contact_id
           ->  Bitmap Heap Scan on permissions  (cost=3561.50..34416.43 rows=190074 width=4) (actual time=14.839..50.445 rows=192372 loops=1)
                 Recheck Cond: (id < 2100000)
                 Heap Blocks: exact=2435
                 ->  Bitmap Index Scan on permissions_pkey  (cost=0.00..3513.98 rows=190074 width=0) (actual time=14.496..14.496 rows=192372 loops=1)
                       Index Cond: (id < 2100000)
   ->  Sort  (cost=191373.19..191501.35 rows=51264 width=20) (actual time=1556.132..1556.137 rows=40 loops=1)
         Sort Key: contacts.updated_at DESC NULLS LAST
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Hash Join  (cost=180911.60..189752.76 rows=51264 width=20) (actual time=1124.969..1532.269 rows=124152 loops=1)
               Hash Cond: (perms.contact_id = contacts.id)
               ->  CTE Scan on perms  (cost=0.00..1025.28 rows=51264 width=4) (actual time=107.110..203.330 rows=151920 loops=1)
               ->  Hash  (cost=159891.71..159891.71 rows=1144871 width=20) (actual time=1017.354..1017.354 rows=1145174 loops=1)
                     Buckets: 65536  Batches: 32  Memory Usage: 2521kB
                     ->  Seq Scan on contacts  (cost=0.00..159891.71 rows=1144871 width=20) (actual time=0.035..684.361 rows=1145174 loops=1)
 Planning time: 0.222 ms
 Execution time: 1561.693 ms
(20 rows)

It is to my understanding that the query requires the entire 150k matched contacts to be joined in order for the Sort to run its operation. I don't see a way around this, however, I can't also see how this can be a unique case where it's acceptable to have a 1.5 second query time? There has to be lots of other companies out there that manage much more data that needs to be sorted for the presentation layer.

Thanks in advance,

-Aldo

Re: Hash join seq scan slow

От
Jeff Janes
Дата:
On Tue, Apr 19, 2016 at 1:07 AM, Aldo Sarmiento <aldo@bigpurpledot.com> wrote:
> Hello,
>
> I'm assuming this topic has probably been bludgeoned to a pulp, but my
> google-fu can't seem to find a solution.
>
> I have two relatively largish tables that I'm trying to join that result in
> a slow query.
>
> Hardware:
>
> 2014 iMac w/ SSD & i5 processor
>
> Tables:
> contacts: 1.14 million rows
> permissions: 2.49 million rows
>
> contacts have many permissions
>
> Goal: get first page of contacts (limit 40) a user has access to, sorted by
> creation date.
>
> For simplicity's sake, I've taken out some of the complexity of how I
> retrieve the permissions & just got all permissions with id less than
> 2,100,000 to simulate access of ~151k contacts.

I think you have simplified it rather too much.  There is no reason to
think anything we suggest would carry over to the real case.


>
> stage4=# EXPLAIN ANALYZE WITH perms AS (
> stage4(#   SELECT DISTINCT(contact_id) from permissions where id < 2100000
> stage4(# ) SELECT
> stage4-#     contacts.id,
> stage4-#     contacts.first_name
> stage4-#   FROM contacts
> stage4-#   INNER JOIN perms ON  perms.contact_id = contacts.id
> stage4-#   ORDER BY contacts.updated_at desc NULLS LAST LIMIT 40 OFFSET 0;
>                                                                       QUERY
> PLAN
>
------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=226777.45..226777.55 rows=40 width=20) (actual
> time=1556.133..1556.143 rows=40 loops=1)
>    CTE perms
>      ->  HashAggregate  (cost=34891.61..35404.25 rows=51264 width=4) (actual
> time=107.107..151.455 rows=151920 loops=1)
>            Group Key: permissions.contact_id
>            ->  Bitmap Heap Scan on permissions  (cost=3561.50..34416.43
> rows=190074 width=4) (actual time=14.839..50.445 rows=192372 loops=1)
>                  Recheck Cond: (id < 2100000)
>                  Heap Blocks: exact=2435
>                  ->  Bitmap Index Scan on permissions_pkey
> (cost=0.00..3513.98 rows=190074 width=0) (actual time=14.496..14.496
> rows=192372 loops=1)
>                        Index Cond: (id < 2100000)
>    ->  Sort  (cost=191373.19..191501.35 rows=51264 width=20) (actual
> time=1556.132..1556.137 rows=40 loops=1)
>          Sort Key: contacts.updated_at DESC NULLS LAST
>          Sort Method: top-N heapsort  Memory: 27kB
>          ->  Hash Join  (cost=180911.60..189752.76 rows=51264 width=20)
> (actual time=1124.969..1532.269 rows=124152 loops=1)
>                Hash Cond: (perms.contact_id = contacts.id)
>                ->  CTE Scan on perms  (cost=0.00..1025.28 rows=51264
> width=4) (actual time=107.110..203.330 rows=151920 loops=1)
>                ->  Hash  (cost=159891.71..159891.71 rows=1144871 width=20)
> (actual time=1017.354..1017.354 rows=1145174 loops=1)
>                      Buckets: 65536  Batches: 32  Memory Usage: 2521kB
>                      ->  Seq Scan on contacts  (cost=0.00..159891.71
> rows=1144871 width=20) (actual time=0.035..684.361 rows=1145174 loops=1)
>  Planning time: 0.222 ms
>  Execution time: 1561.693 ms
> (20 rows)

In this type of plan (visiting lots of tuples, but returning only a
few), the timing overhead of EXPLAIN (ANALYZE) can be massive.

I would also run it with "EXPLAIN (ANALYZE, TIMING OFF)" and make sure
the overall execution time between the two methods is comparable.  If
they are not, then you cannot trust the data from "EXPLAIN (ANALYZE)".
(Run it several times, alternating between them, to make sure you
aren't just seeing cache effect)


> It is to my understanding that the query requires the entire 150k matched
> contacts to be joined in order for the Sort to run its operation. I don't
> see a way around this, however, I can't also see how this can be a unique
> case where it's acceptable to have a 1.5 second query time? There has to be
> lots of other companies out there that manage much more data that needs to
> be sorted for the presentation layer.

There are lots of approaches to solving it.  One is to realizing that
none of your customers are actually interested in scrolling through
150k records, 40 records at a time.  You could also use materialized
views, or denormalize the data so that you can build and use a
multi-column index (although there are risks in that as well)

Cheers,

Jeff