Обсуждение: Query performance issue

Поиск
Список
Период
Сортировка

Query performance issue

От
Nagaraj Raj
Дата:
Hi,

I have a query performance issue, it takes a long time, and not even getting explain analyze the output. this query joining on 3 tables which have around 
a - 176223509
b - 286887780
c - 214219514



explain
select  Count(a."individual_entity_proxy_id")
from "prospect" a
inner join "individual_demographic" b
on a."individual_entity_proxy_id" = b."individual_entity_proxy_id"
inner join "household_demographic" c 
on a."household_entity_proxy_id" = c."household_entity_proxy_id"
where (((a."last_contacted_anychannel_dttm" is null) 
           or (a."last_contacted_anychannel_dttm" < TIMESTAMP '2020-11-23 0:00:00.000000'))
                   and (a."shared_paddr_with_customer_ind" = 'N') 
               and (a."profane_wrd_ind" = 'N') 
               and (a."tmo_ofnsv_name_ind" = 'N')
                   and (a."has_individual_address" = 'Y') 
               and (a."has_last_name" = 'Y') 
       and (a."has_first_name" = 'Y'))
                   and ((b."tax_bnkrpt_dcsd_ind" = 'N') 
   and (b."govt_prison_ind" = 'N') 
   and (b."cstmr_prspct_ind" = 'Prospect'))
                   and (( c."hspnc_lang_prfrnc_cval" in ('B', 'E', 'X') ) 
   or (c."hspnc_lang_prfrnc_cval" is null));
-- Explain output

 "Finalize Aggregate  (cost=32813309.28..32813309.29 rows=1 width=8)"
"  ->  Gather  (cost=32813308.45..32813309.26 rows=8 width=8)"
"        Workers Planned: 8"
"        ->  Partial Aggregate  (cost=32812308.45..32812308.46 rows=1 width=8)"
"              ->  Merge Join  (cost=23870130.00..32759932.46 rows=20950395 width=8)"
"                    Merge Cond: (a.individual_entity_proxy_id = b.individual_entity_proxy_id)"
"                    ->  Sort  (cost=23870127.96..23922503.94 rows=20950395 width=8)"
"                          Sort Key: a.individual_entity_proxy_id"
"                          ->  Hash Join  (cost=13533600.42..21322510.26 rows=20950395 width=8)"
"                                Hash Cond: (a.household_entity_proxy_id = c.household_entity_proxy_id)"
"                                ->  Parallel Seq Scan on prospect a  (cost=0.00..6863735.60 rows=22171902 width=16)"
"                                      Filter: (((last_contacted_anychannel_dttm IS NULL) OR (last_contacted_anychannel_dttm < '2020-11-23 00:00:00'::timestamp without time zone)) AND (shared_paddr_with_customer_ind = 'N'::bpchar) AND (profane_wrd_ind = 'N'::bpchar) AND (tmo_ofnsv_name_ind = 'N'::bpchar) AND (has_individual_address = 'Y'::bpchar) AND (has_last_name = 'Y'::bpchar) AND (has_first_name = 'Y'::bpchar))"
"                                ->  Hash  (cost=10801715.18..10801715.18 rows=166514899 width=8)"
"                                      ->  Seq Scan on household_demographic c  (cost=0.00..10801715.18 rows=166514899 width=8)"
"                                            Filter: (((hspnc_lang_prfrnc_cval)::text = ANY ('{B,E,X}'::text[])) OR (hspnc_lang_prfrnc_cval IS NULL))"
"                    ->  Index Only Scan using indx_individual_demographic_prxyid_taxind_prspctind_prsnind on individual_demographic b  (cost=0.57..8019347.13 rows=286887776 width=8)"
"                          Index Cond: ((tax_bnkrpt_dcsd_ind = 'N'::bpchar) AND (cstmr_prspct_ind = 'Prospect'::text) AND (govt_prison_ind = 'N'::bpchar))"
                         

Tables ddl are attached in dbfiddle -- Postgres 11 | db<>fiddle




Server configuration is: 
Version: 10.11
RAM - 320GB
vCPU - 32
 "maintenance_work_mem" 256MB
"work_mem"             1GB
"shared_buffers" 64GB


Any suggestions? 


Thanks,
Rj







Re: Query performance issue

От
Justin Pryzby
Дата:
On Fri, Jan 22, 2021 at 01:53:26AM +0000, Nagaraj Raj wrote:
> Tables ddl are attached in dbfiddle -- Postgres 11 | db<>fiddle
> Postgres 11 | db<>fiddle
> Server configuration is: Version: 10.11RAM - 320GBvCPU - 32 "maintenance_work_mem" 256MB"work_mem"            
1GB"shared_buffers"64GB
 

> Aggregate  (cost=31.54..31.55 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1)
>   ->  Nested Loop  (cost=0.00..31.54 rows=1 width=8) (actual time=0.007..0.008 rows=0 loops=1)
>         Join Filter: (a.household_entity_proxy_id = c.household_entity_proxy_id)
>         ->  Nested Loop  (cost=0.00..21.36 rows=1 width=16) (actual time=0.006..0.007 rows=0 loops=1)
>               Join Filter: (a.individual_entity_proxy_id = b.individual_entity_proxy_id)
>               ->  Seq Scan on prospect a  (cost=0.00..10.82 rows=1 width=16) (actual time=0.006..0.006 rows=0
loops=1)
>                     Filter: (((last_contacted_anychannel_dttm IS NULL) OR (last_contacted_anychannel_dttm <
'2020-11-2300:00:00'::timestamp without time zone)) AND (shared_paddr_with_customer_ind = 'N'::bpchar) AND
(profane_wrd_ind= 'N'::bpchar) AND (tmo_ofnsv_name_ind = 'N'::bpchar) AND (has_individual_address = 'Y'::bpchar) AND
(has_last_name= 'Y'::bpchar) AND (has_first_name = 'Y'::bpchar))
 
>               ->  Seq Scan on individual_demographic b  (cost=0.00..10.53 rows=1 width=8) (never executed)
>                     Filter: ((tax_bnkrpt_dcsd_ind = 'N'::bpchar) AND (govt_prison_ind = 'N'::bpchar) AND
((cstmr_prspct_ind)::text= 'Prospect'::text))
 
>         ->  Seq Scan on household_demographic c  (cost=0.00..10.14 rows=3 width=8) (never executed)
>               Filter: (((hspnc_lang_prfrnc_cval)::text = ANY ('{B,E,X}'::text[])) OR (hspnc_lang_prfrnc_cval IS
NULL))
> Planning Time: 1.384 ms
> Execution Time: 0.206 ms
> 13 rows

It's doing nested loops with estimated rowcount=1, which indicates a bad
underestimate, and suggests that the conditions are redundant or correlated.

Maybe you can handle this with MV stats on the correlated columns:

CREATE STATISTICS prospect_stats (dependencies) ON
    shared_paddr_with_customer_ind, profane_wrd_ind, tmo_ofnsv_name_ind, has_individual_address, has_last_name,
has_first_name
    FROM prospect;
CREATE STATISTICS individual_demographic_stats (dependencies) ON
    tax_bnkrpt_dcsd_ind, govt_prison_ind, cstmr_prspct_ind
    FROM individual_demographic_stats 
ANALYZE prospect, individual_demographic_stats ;

Since it's expensive to compute stats on large number of columns, I'd then
check *which* are correlated and then only compute MV stats on those.  This
will show col1=>col2: X where X approaches 1, the conditions are highly
correlated:
SELECT * FROM pg_statistic_ext; -- pg_statistic_ext_data since v12

Also, as a diagnostic tool to get "explain analyze" to finish, you can
SET enable_nestloop=off;

-- 
Justin



Re: Query performance issue

От
Tomas Vondra
Дата:

On 1/22/21 3:35 AM, Justin Pryzby wrote:
> On Fri, Jan 22, 2021 at 01:53:26AM +0000, Nagaraj Raj wrote:
>> Tables ddl are attached in dbfiddle -- Postgres 11 | db<>fiddle
>> Postgres 11 | db<>fiddle
>> Server configuration is: Version: 10.11RAM - 320GBvCPU - 32 "maintenance_work_mem" 256MB"work_mem"            
1GB"shared_buffers"64GB
 
> 
>> Aggregate  (cost=31.54..31.55 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1)
>>    ->  Nested Loop  (cost=0.00..31.54 rows=1 width=8) (actual time=0.007..0.008 rows=0 loops=1)
>>          Join Filter: (a.household_entity_proxy_id = c.household_entity_proxy_id)
>>          ->  Nested Loop  (cost=0.00..21.36 rows=1 width=16) (actual time=0.006..0.007 rows=0 loops=1)
>>                Join Filter: (a.individual_entity_proxy_id = b.individual_entity_proxy_id)
>>                ->  Seq Scan on prospect a  (cost=0.00..10.82 rows=1 width=16) (actual time=0.006..0.006 rows=0
loops=1)
>>                      Filter: (((last_contacted_anychannel_dttm IS NULL) OR (last_contacted_anychannel_dttm <
'2020-11-2300:00:00'::timestamp without time zone)) AND (shared_paddr_with_customer_ind = 'N'::bpchar) AND
(profane_wrd_ind= 'N'::bpchar) AND (tmo_ofnsv_name_ind = 'N'::bpchar) AND (has_individual_address = 'Y'::bpchar) AND
(has_last_name= 'Y'::bpchar) AND (has_first_name = 'Y'::bpchar))
 
>>                ->  Seq Scan on individual_demographic b  (cost=0.00..10.53 rows=1 width=8) (never executed)
>>                      Filter: ((tax_bnkrpt_dcsd_ind = 'N'::bpchar) AND (govt_prison_ind = 'N'::bpchar) AND
((cstmr_prspct_ind)::text= 'Prospect'::text))
 
>>          ->  Seq Scan on household_demographic c  (cost=0.00..10.14 rows=3 width=8) (never executed)
>>                Filter: (((hspnc_lang_prfrnc_cval)::text = ANY ('{B,E,X}'::text[])) OR (hspnc_lang_prfrnc_cval IS
NULL))
>> Planning Time: 1.384 ms
>> Execution Time: 0.206 ms
>> 13 rows
> 
> It's doing nested loops with estimated rowcount=1, which indicates a bad
> underestimate, and suggests that the conditions are redundant or correlated.
> 

No, it's not. The dbfiddle does that because it's using empty tables, 
but the plan shared by Nagaraj does not contain any nested loops.

Nagaraj, if the EXPLAIN ANALYZE does not complete, there are two things 
you can do to determine which part of the plan is causing trouble.

Firstly, you can profile the backend using perf or some other profiles, 
and if we're lucky the function will give us some hints about which node 
type is using the CPU.

Secondly, you can "cut" the query into smaller parts, to run only parts 
of the plan - essentially start from inner-most join, and incrementally 
add more and more tables until it gets too long.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Query performance issue

От
Michael Lewis
Дата:
What indexes exist on those tables? How many rows do you expect to get back in total? Is the last_contacted_anychannel_dttm clause restrictive, or does that include most of the prospect table (check pg_stats for the histogram if you don't know).

and (a."shared_paddr_with_customer_ind" = 'N') 
               and (a."profane_wrd_ind" = 'N') 
               and (a."tmo_ofnsv_name_ind" = 'N')
                   and (a."has_individual_address" = 'Y') 
               and (a."has_last_name" = 'Y') 
       and (a."has_first_name" = 'Y'))

Are these conditions expected to throw out very few rows, or most of the table?

If you change both joins to EXISTS clauses, do you get the same plan when you run explain?