Обсуждение: Oddball data distribution giving me planner headaches

Поиск
Список
Период
Сортировка

Oddball data distribution giving me planner headaches

От
Jeff Amiel
Дата:
Oddball data distribution giving me headaches.

We have a distinct 'customer' table with customer_id, type and name/demographic information.
Assume some 1 million rows in the customer table.

We then have a customer 'relationship' table which simply contains 2 columns…designating parent and child
relationships…butallowing complex hierarchies between customers. 


CREATE TABLE customer_rel
(
  parent_customer integer NOT NULL,
  child_customer integer NOT NULL,
 )

8 million rows in this table.  Oddball distribution.  We have some 8 levels of hierarchy (customer type) represented
withthis table.  Every customer gets an entry where parent/child is themselves…and then for every 'upline'.  At the
highestlevel, we have 'distributors' which have all other customer types underneath them.  Assuming we had some 68
distributors,the entries where THEY are the parent_customer represent nearly a million rows of the 8 million. 

I have extracted a simple case from a larger query that was generating an off-beat plan because of the unexpected
plannerrow-counts being spewed by a low level query. 

explain analyze
select * from customer_rel where parent_customer in (select customer_id from customer where
customer_type='DISTRIBUTOR')


"  Nested Loop  (cost=25429.44..29626.39 rows=931 width=0) (actual time=216.325..1238.091 rows=1025401 loops=1)"
"        ->  HashAggregate  (cost=25429.44..25430.80 rows=136 width=4) (actual time=216.304..216.339 rows=68 loops=1)"
"              ->  Seq Scan on customer  (cost=0.00..25429.10 rows=136 width=4) (actual time=0.018..216.226 rows=68
loops=1)"
"                    Filter: (customer_type = 'DISTRIBUTOR'::bpchar)"
"        ->  Index Scan using rel_parent on customer_rel  (cost=0.00..30.76 rows=7 width=4) (actual time=0.006..8.190
rows=15079loops=68)" 
"              Index Cond: (parent_customer = customer.customer_id)"
"Total runtime: 1514.810 ms"

The fact that the top level nested loop THINKS it only will be returning 931 rows (instead of over 1 million) is the
killerhere…3 orders of magnitude.  The results of this  query are used as part of a bigger query and the screwed up
statsare causing all sorts of havoc upline. 

I'm experimenting in 9.1.0…have set the statistics to 1000 (and 10000) on both columns (parent and child) to little
effect. Have hardcoded the n_distinct on the parent_customer column to be 1,000,000 also…with no effect (doing analyze
oftable after each change) 

Does this oddball data distribution doom me to poor planning forever?
Any other thoughts?


Re: Oddball data distribution giving me planner headaches

От
"David Johnston"
Дата:
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Jeff Amiel
Sent: Friday, December 02, 2011 3:20 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] Oddball data distribution giving me planner headaches

explain analyze
select * from customer_rel where parent_customer in (select customer_id from customer where
customer_type='DISTRIBUTOR')


"  Nested Loop  (cost=25429.44..29626.39 rows=931 width=0) (actual time=216.325..1238.091 rows=1025401 loops=1)"
"        ->  HashAggregate  (cost=25429.44..25430.80 rows=136 width=4) (actual time=216.304..216.339 rows=68 loops=1)"
"              ->  Seq Scan on customer  (cost=0.00..25429.10 rows=136 width=4) (actual time=0.018..216.226 rows=68
loops=1)"
"                    Filter: (customer_type = 'DISTRIBUTOR'::bpchar)"
"        ->  Index Scan using rel_parent on customer_rel  (cost=0.00..30.76 rows=7 width=4) (actual time=0.006..8.190
rows=15079loops=68)" 
"              Index Cond: (parent_customer = customer.customer_id)"
"Total runtime: 1514.810 ms"

------------------------------------------------------------

What kind of plan does the following give?

EXPLAIN ANALYZE
SELECT *
FROM customer_rel p
JOIN customer c ON (p.parent_customer = c.customer_id)
WHERE c.customer_type = 'DISTRIBUTOR'
;

David J.


Re: Oddball data distribution giving me planner headaches

От
Jeff Amiel
Дата:
--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> From: David Johnston <polobo@yahoo.com>
> What kind of plan does the following give?
>
> EXPLAIN ANALYZE
> SELECT *
> FROM customer_rel p
> JOIN customer c ON (p.parent_customer = c.customer_id)
> WHERE c.customer_type = 'DISTRIBUTOR'

Nearly identical output

"Nested Loop  (cost=0.00..29624.69 rows=931 width=97) (actual time=0.032..1330.208 rows=1025401 loops=1)"
"  ->  Seq Scan on customer c  (cost=0.00..25429.10 rows=136 width=71) (actual time=0.017..212.059 rows=68 loops=1)"
"        Filter: (customer_type = 'DISTRIBUTOR'::bpchar)"
"  ->  Index Scan using rel_parent on customer_rel p  (cost=0.00..30.76 rows=7 width=26) (actual time=0.006..7.732
rows=15079loops=68)" 
"        Index Cond: (parent_customer = c.customer_id)"
"Total runtime: 1544.281 ms"


Re: Oddball data distribution giving me planner headaches

От
"David Johnston"
Дата:
-----Original Message-----
From: Jeff Amiel [mailto:becauseimjeff@yahoo.com]
Sent: Friday, December 02, 2011 3:52 PM
To: pgsql-general@postgresql.org; David Johnston
Subject: RE: [GENERAL] Oddball data distribution giving me planner headaches


--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> From: David Johnston <polobo@yahoo.com> What kind of plan does the
> following give?
>
> EXPLAIN ANALYZE
> SELECT *
> FROM customer_rel p
> JOIN customer c ON (p.parent_customer = c.customer_id) WHERE
> c.customer_type = 'DISTRIBUTOR'

Nearly identical output

"Nested Loop  (cost=0.00..29624.69 rows=931 width=97) (actual
time=0.032..1330.208 rows=1025401 loops=1)"
"  ->  Seq Scan on customer c  (cost=0.00..25429.10 rows=136 width=71)
(actual time=0.017..212.059 rows=68 loops=1)"
"        Filter: (customer_type = 'DISTRIBUTOR'::bpchar)"
"  ->  Index Scan using rel_parent on customer_rel p  (cost=0.00..30.76
rows=7 width=26) (actual time=0.006..7.732 rows=15079 loops=68)"
"        Index Cond: (parent_customer = c.customer_id)"
"Total runtime: 1544.281 ms"

-----------------------------------------------------

What happens if you disable, say, nested loops and/or index scans?

David J.



Re: Oddball data distribution giving me planner headaches

От
Jeff Amiel
Дата:

--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> What happens if you disable, say, nested loops and/or index
> scans?

planner selects different join/indexing techniques (query is slower) but row estimates (bad) remain identical.

Re: Oddball data distribution giving me planner headaches

От
"David Johnston"
Дата:
-----Original Message-----
From: Jeff Amiel [mailto:becauseimjeff@yahoo.com]
Sent: Friday, December 02, 2011 4:15 PM
To: pgsql-general@postgresql.org; David Johnston
Subject: RE: [GENERAL] Oddball data distribution giving me planner headaches



--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> What happens if you disable, say, nested loops and/or index scans?

planner selects different join/indexing techniques (query is slower) but row
estimates (bad) remain identical.


---------------------------------------------------------
My, possibly naïve, observation:

So aside from the fact the estimates seem to be off the planner has still
chosen the most effective plan?  In that situation no matter how accurate
you get the statistics you will not gain any performance because the planner
will never choose a different plan.

IF you can find a combination of settings that forces a better performing
plan THEN steps can be taken (by others more skilled than myself) to
configure the statistics/planner system to get that better plan naturally.

David J.



Re: Oddball data distribution giving me planner headaches

От
Jeff Amiel
Дата:

--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> From: David Johnston <polobo@yahoo.com>

> ---------------------------------------------------------
> My, possibly naïve, observation:
>
> So aside from the fact the estimates seem to be off the
> planner has still
> chosen the most effective plan?  In that situation no
> matter how accurate
> you get the statistics you will not gain any performance
> because the planner
> will never choose a different plan.
>
Thanks.....
I'm not interested in optimizing this query.....it's fast and efficient.  However, the planner thinks that it is going
toreturn a million rows when it is only going to return one thousand.  When this query is used as an element of a
BIGGERquery, that causes me all sorts of planner issues.... 
That's the problem I am trying to solve...why the planner is 3 orders of magnitude off in row estimation.


Re: Oddball data distribution giving me planner headaches

От
Tom Lane
Дата:
Jeff Amiel <becauseimjeff@yahoo.com> writes:
> Oddball data distribution giving me headaches.
> [ 'distributor' customers have many more child customers than average ]
> Does this oddball data distribution doom me to poor planning forever?

The only real fix for that will require cross-column statistics, which
we don't have yet --- without such, there's no way for the planner to
know that distributors have an atypical number of child customers.

At the moment I think the only way to work around this is to denormalize
your schema a bit.  For instance, if you were to include the parent
customer_type directly into the customer_rel table, that would not only
save one join in this query but it would expose the stats the planner
needs to realize that it's going to get a lot of matches.  If you don't
like that particular solution there are probably other ways to get the
same result, but they're all going to require schema changes.

            regards, tom lane

Re: Oddball data distribution giving me planner headaches

От
"David Johnston"
Дата:
-----Original Message-----
From: Jeff Amiel [mailto:becauseimjeff@yahoo.com]
Sent: Friday, December 02, 2011 5:07 PM
To: pgsql-general@postgresql.org; David Johnston
Subject: RE: [GENERAL] Oddball data distribution giving me planner headaches



--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> From: David Johnston <polobo@yahoo.com>

> ---------------------------------------------------------
> My, possibly naïve, observation:
>
> So aside from the fact the estimates seem to be off the planner has
> still chosen the most effective plan?  In that situation no matter how
> accurate you get the statistics you will not gain any performance
> because the planner will never choose a different plan.
>
Thanks.....
I'm not interested in optimizing this query.....it's fast and efficient.
However, the planner thinks that it is going to return a million rows when
it is only going to return one thousand.  When this query is used as an
element of a BIGGER query, that causes me all sorts of planner issues....
That's the problem I am trying to solve...why the planner is 3 orders of
magnitude off in row estimation.

--------------------------------------------------------

Can you wrap the query into an SQL or PL/pgSQL function so that, at least,
then planner will not be able to see the embedded plan info in the outer
queries?  You use-case may allow you to create one or more VIEWs wrapping
the function call with pre-determined parameters so that you do not lose the
ability to write simple select queries without the need for explicit
function calls.

No idea how that would impact the planner for the other queries but maybe
worth a try while waiting for someone more knowledgeable than myself to
respond.

David J.



Re: Oddball data distribution giving me planner headaches

От
Jeff Amiel
Дата:

--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> --------------------------------------------------------
>
> Can you wrap the query into an SQL or PL/pgSQL function so
> that, at least,
> then planner will not be able to see the embedded plan info
> in the outer
> queries?  You use-case may allow you to create one or
> more VIEWs wrapping
> the function call with pre-determined parameters so that
> you do not lose the
> ability to write simple select queries without the need for
> explicit
> function calls.
>
> No idea how that would impact the planner for the other
> queries but maybe
> worth a try while waiting for someone more knowledgeable
> than myself to
> respond.


I had actually considered this...because you can define what the 'row estimate' is for a function....except in my case,
Ihave some dynamic inputs that would change what those estimates would need to be. 
Thanks for the help though!


Re: Oddball data distribution giving me planner headaches

От
Jeff Amiel
Дата:

--- On Fri, 12/2/11, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The only real fix for that will require cross-column
> statistics, which
> we don't have yet --- without such, there's no way for the
> planner to
> know that distributors have an atypical number of child
> customers.
>

I suspected as such.


> At the moment I think the only way to work around this is
> to denormalize
> your schema a bit. 

And I feared as much.
It's biting me in other areas as well...this unusual distribution of data...certain types of customers have completely
differentdata patterns than others.   
Back to the drawing board...thanks!



Re: Oddball data distribution giving me planner headaches

От
Jeff Amiel
Дата:

--- On Fri, 12/2/11, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The only real fix for that will require cross-column
> statistics, which
> we don't have yet --- without such, there's no way for the
> planner to
> know that distributors have an atypical number of child
> customers.


The only caveat that I can think of here is that each of the customer_id's in the subselect ARE in the MCV list for the
parent_customercolumn for that customer_rel table.  If I had listed SPECIFIC customer ids in the subselect, I would get
spot-onrow estimates.  Is there anything there that gives me any hope? 




Re: Oddball data distribution giving me planner headaches

От
Harald Fuchs
Дата:
Jeff Amiel <becauseimjeff@yahoo.com> writes:

>> At the moment I think the only way to work around this is
>> to denormalize
>> your schema a bit.

> And I feared as much.
> It's biting me in other areas as well...this unusual distribution of data...certain types of customers have
completelydifferent data patterns than others.   
> Back to the drawing board...thanks!

I find your table structure anyway somewhat strange.  For an ordinary
parent/child relationship a parent_id column in the customer table would
be enough.  Do you really have an m:n relationship between parents and
children?