Discussion: Nested Loops vs. Hash Joins or Merge Joins

Nested Loops vs. Hash Joins or Merge Joins

From:
Ketema Harris
Date:
I am attempting to learn more about the way Pg decides which operators to use in its query planning and execution.  I have a moderately complicated table layout, but it is mostly normalized.  I recently created this query:

select account.acct_name as "Customer Name", NULL as "Facility", account.acct_number as "LDC Acct#", account.svc_address as "Service Address",
account.svc_city as "Service City", account.svc_state as "Service State",
account.svc_zip as "Service Zip", product.ldc_name as "LDC", NULL as "ESI Rate", NULL as "LDC Rate",
account.billing_address as "Mailing Address1", account.billing_address_2 as "Mailing Address2",
account.billing_city || ', ' || account.billing_state as "City, State", account.billing_zip as "Zip", customer.first_name || ' ' || customer.last_name
as "Contact", customer.phone as "Phone", customer.class as "Customer Class", NULL as "Tax Exempt", NULL as "Exempt%",
marketer_divisions.channel_partner_code as "Channel Partner", NULL as "AE", NULL as "Annual Use MCF", account.rate as "Trigger Price",
marketer_divisions.channel_partner_fee as "Channel Partner Fee"
from naes.reconciliation
   inner join naes.application
      inner join naes.account
         inner join naes.marketer_product
            inner join naes.marketer_divisions
               inner join naes.cities
               on marketer_divisions.city_id = cities.city_id
            on marketer_product.division_id = marketer_divisions.division_id
            inner join naes.product
            on marketer_product.ldc_id = product.ldc_id
         on account.marketer_product_id = marketer_product.marketer_product_id
         inner join naes.customer
         on account.customer_id = customer.customer_id
      on account.app_id = application.app_id and account.acct_id = application.acct_id
   on reconciliation.app_id = application.app_id and reconciliation.transferred_date is NULL;

The query runs fine and I have no performance issues with it, but here are two query plans for the above query, one with nested loops on and the other with them off.
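(Both plans come from EXPLAIN ANALYZE; the nested-loops toggle shown here is assumed to be the session-level enable_nestloop setting, roughly:)

-- first plan: nested loops allowed (the default)
EXPLAIN ANALYZE <the query above>;

-- second plan: tell the planner to avoid nested loops for this session
SET enable_nestloop = off;
EXPLAIN ANALYZE <the query above>;

-- restore the default afterwards
RESET enable_nestloop;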

Nested Loops on:

Nested Loop  (cost=3.33..11.37 rows=1 width=268) (actual time=2.166..2.982 rows=3 loops=1)
   Join Filter: ("outer".city_id = "inner".city_id)
   ->  Nested Loop  (cost=3.33..10.32 rows=1 width=272) (actual time=2.136..2.863 rows=3 loops=1)
         Join Filter: ("outer".division_id = "inner".division_id)plication.app_id and reco=
         ->  Nested Loop  (cost=3.33..9.27 rows=1 width=231) (actual time=2.119..2.763 rows=3 loops=1)
               Join Filter: ("outer".ldc_id = "inner".ldc_id)
               ->  Nested Loop  (cost=3.33..8.23 rows=1 width=218) (actual time=2.101..2.659 rows=3 loops=1)
                     ->  Nested Loop  (cost=3.33..5.15 rows=1 width=151) (actual time=2.068..2.559 rows=3 loops=1)
                           Join Filter: ("inner".app_id = "outer".app_id)
                           ->  Merge Join  (cost=3.33..4.11 rows=1 width=159) (actual time=1.096..1.477 rows=31 loops=1)
                                 Merge Cond: ("outer".marketer_product_id = "inner".marketer_product_id)
                                 ->  Index Scan using "PK_marketer_product_id" on marketer_product  (cost=0.00..3.04 rows=4 width=12) (actual time=0.017..0.033 rows=4 loops=1)
                                 ->  Sort  (cost=3.33..3.33 rows=1 width=155) (actual time=1.065..1.180 rows=31 loops=1)
                                       Sort Key: account.marketer_product_id
                                       ->  Hash Join  (cost=1.75..3.32 rows=1 width=155) (actual time=0.457..0.848 rows=31 loops=1)
                                             Hash Cond: (("outer".app_id = "inner".app_id) AND ("outer".acct_id = "inner".acct_id))
                                             ->  Seq Scan on account  (cost=0.00..1.28 rows=28 width=155) (actual time=0.007..0.160 rows=34 loops=1)
                                             ->  Hash  (cost=1.50..1.50 rows=50 width=8) (actual time=0.413..0.413 rows=50 loops=1)
                                                   ->  Seq Scan on application  (cost=0.00..1.50 rows=50 width=8) (actual time=0.006..0.209 rows=50 loops=1)
                           ->  Seq Scan on reconciliation  (cost=0.00..1.03 rows=1 width=4) (actual time=0.005..0.016 rows=3 loops=31)
                                 Filter: (transferred_date IS NULL)
                     ->  Index Scan using customer_pkey on customer  (cost=0.00..3.06 rows=1 width=75) (actual time=0.011..0.015 rows=1 loops=3)
                           Index Cond: ("outer".customer_id = customer.customer_id)
               ->  Seq Scan on product  (cost=0.00..1.02 rows=2 width=21) (actual time=0.005..0.013 rows=2 loops=3)
         ->  Seq Scan on marketer_divisions  (cost=0.00..1.02 rows=2 width=49) (actual time=0.005..0.013 rows=2 loops=3)
   ->  Seq Scan on cities  (cost=0.00..1.02 rows=2 width=4) (actual time=0.005..0.013 rows=2 loops=3)
 Total runtime: 3.288 ms

Nested Loops off:

Hash Join  (cost=8.27..11.78 rows=1 width=268) (actual time=1.701..1.765 rows=3 loops=1)
   Hash Cond: ("outer".city_id = "inner".city_id)
   ->  Hash Join  (cost=7.24..10.73 rows=1 width=272) (actual time=1.629..1.667 rows=3 loops=1)
         Hash Cond: ("outer".customer_id = "inner".customer_id)
         ->  Seq Scan on customer  (cost=0.00..3.32 rows=32 width=75) (actual time=0.006..0.136 rows=33 loops=1)
         ->  Hash  (cost=7.24..7.24 rows=1 width=205) (actual time=1.366..1.366 rows=3 loops=1)
               ->  Hash Join  (cost=6.43..7.24 rows=1 width=205) (actual time=1.243..1.333 rows=3 loops=1)
                     Hash Cond: ("outer".division_id = "inner".division_id)
                     ->  Hash Join  (cost=5.40..6.20 rows=1 width=164) (actual time=1.184..1.252 rows=3 loops=1)
                           Hash Cond: ("outer".ldc_id = "inner".ldc_id)
                           ->  Merge Join  (cost=4.38..5.16 rows=1 width=151) (actual time=1.124..1.169 rows=3 loops=1)
                                 Merge Cond: ("outer".marketer_product_id = "inner".marketer_product_id)
                                 ->  Index Scan using "PK_marketer_product_id" on marketer_product  (cost=0.00..3.04 rows=4 width=12) (actual time=0.012..0.019 rows=2 loops=1)
                                 ->  Sort  (cost=4.38..4.38 rows=1 width=147) (actual time=1.098..1.109 rows=3 loops=1)
                                       Sort Key: account.marketer_product_id
                                       ->  Hash Join  (cost=2.78..4.37 rows=1 width=147) (actual time=1.007..1.064 rows=3 loops=1)
                                             Hash Cond: ("outer".app_id = "inner".app_id)
                                             ->  Hash Join  (cost=1.75..3.32 rows=1 width=155) (actual time=0.494..0.875 rows=31 loops=1)
                                                   Hash Cond: (("outer".app_id = "inner".app_id) AND ("outer".acct_id = "inner".acct_id))
                                                   ->  Seq Scan on account  (cost=0.00..1.28 rows=28 width=155) (actual time=0.007..0.154 rows=34 loops=1)
                                                   ->  Hash  (cost=1.50..1.50 rows=50 width=8) (actual time=0.451..0.451 rows=50 loops=1)
                                                         ->  Seq Scan on application  (cost=0.00..1.50 rows=50 width=8) (actual time=0.006..0.223 rows=50 loops=1)
                                             ->  Hash  (cost=1.03..1.03 rows=1 width=4) (actual time=0.042..0.042 rows=3 loops=1)
                                                   ->  Seq Scan on reconciliation  (cost=0.00..1.03 rows=1 width=4) (actual time=0.007..0.019 rows=3 loops=1)
                                                         Filter: (transferred_date IS NULL)
                           ->  Hash  (cost=1.02..1.02 rows=2 width=21) (actual time=0.036..0.036 rows=2 loops=1)
                                 ->  Seq Scan on product  (cost=0.00..1.02 rows=2 width=21) (actual time=0.005..0.014 rows=2 loops=1)
                     ->  Hash  (cost=1.02..1.02 rows=2 width=49) (actual time=0.036..0.036 rows=2 loops=1)
                           ->  Seq Scan on marketer_divisions  (cost=0.00..1.02 rows=2 width=49) (actual time=0.007..0.016 rows=2 loops=1)
   ->  Hash  (cost=1.02..1.02 rows=2 width=4) (actual time=0.039..0.039 rows=2 loops=1)
         ->  Seq Scan on cities  (cost=0.00..1.02 rows=2 width=4) (actual time=0.009..0.017 rows=2 loops=1)
 Total runtime: 2.084 ms

With nested loops enabled, does it choose to use them because it sees the estimated start-up cost with loops as less?  Does it not know that the total query would be faster with the Hash Joins?  This query is in development right now, and as such there are not many rows.  When it goes to production, the reconciliation table will grow by about 50-100 rows per day where the transferred_date is NULL (this is the driving criterion behind this query).  As the table grows, can I expect Pg to realize that the nested loops will be slower, and will it switch to the Hash Joins?  If not, how would I force it to use the Hash Joins without just turning off nested loops completely?  Is it a good idea to turn off nested loops completely?
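(For illustration, scoping the setting to a single transaction, rather than turning it off globally, would look roughly like this; it is assumed here that enable_nestloop is the knob in question:)

BEGIN;
-- SET LOCAL only lasts until the end of this transaction,
-- so nested loops stay enabled for everything else
SET LOCAL enable_nestloop = off;
-- run the report query here
COMMIT;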

Statistics collection and autovacuum are enabled, btw.  I have an ERD diagram showing the table structures if anyone is interested in looking at it; just let me know.

Thanks,
Ketema

Re: Nested Loops vs. Hash Joins or Merge Joins

From:
"Jim C. Nasby"
Date:
On Thu, May 11, 2006 at 08:57:48AM -0400, Ketema Harris wrote:
> Nested Loops on:
> Nested Loop  (cost=3.33..11.37 rows=1 width=268) (actual time=2.166..2.982
>
> Nested Loops off:
> Hash Join  (cost=8.27..11.78 rows=1 width=268) (actual time=1.701..1.765
>
> With nested loops enabled does it choose to use them because it sees the
> estimated start up cost with loops as less?  Does it not know that the total
> query would be faster with the Hash Joins?  This query is in development

Yes it does know; re-read the output. The estimated total cost of the nested loop plan (11.37) is actually lower than that of the hash join plan (11.78), so the planner is picking what it estimates to be the cheaper plan overall.

I believe the cases where the planner will look at startup cost over
total cost are pretty limited: when LIMIT is used, and I think sometimes
when a CURSOR is used.
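A sketch of the kind of case I mean, using one of your tables purely as an example (whether an index on acct_name exists is an assumption):

-- With a LIMIT, the planner only needs the first few rows, so a plan with a
-- low startup cost (e.g. walking an index) can beat a plan with a lower total
-- cost but a high startup cost (e.g. sorting the whole table first).
EXPLAIN
SELECT acct_name
FROM naes.account
ORDER BY acct_name
LIMIT 5;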

> Statistics collecting and auto vacuum is enabled btw.  I have an erd diagram
> showing the table structures if anyone is interested in looking at it, just
> let me know.

Note that it's not terribly uncommon for the default stats target to be
woefully inadequate for large sets of data, not that 100 rows a day is
large. But it probably wouldn't hurt to bump the default stats target
up to 30 or 50 anyway.
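For example (a sketch; the column and the value 50 are just illustrative):

-- raise the statistics target for the column that drives the query,
-- then re-analyze so the planner picks up the new statistics
ALTER TABLE naes.reconciliation ALTER COLUMN transferred_date SET STATISTICS 50;
ANALYZE naes.reconciliation;

-- or raise it for all columns via postgresql.conf:
--   default_statistics_target = 50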
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461