Discussion: Oddball data distribution giving me planner headaches
Oddball data distribution giving me headaches.

We have a distinct 'customer' table with customer_id, type and name/demographic information. Assume some 1 million rows in the customer table. We then have a customer 'relationship' table which simply contains 2 columns designating parent and child relationships, allowing complex hierarchies between customers:

CREATE TABLE customer_rel (
    parent_customer integer NOT NULL,
    child_customer integer NOT NULL
);

8 million rows in this table. Oddball distribution. We have some 8 levels of hierarchy (customer type) represented with this table. Every customer gets an entry where parent/child is themselves, and then one for every 'upline'. At the highest level, we have 'distributors', which have all other customer types underneath them. Assuming we had some 68 distributors, the entries where THEY are the parent_customer represent nearly a million rows of the 8 million.

I have extracted a simple case from a larger query that was generating an off-beat plan because of the unexpected planner row-counts being spewed by a low-level query:

explain analyze select * from customer_rel where parent_customer in
    (select customer_id from customer where customer_type='DISTRIBUTOR')

" Nested Loop  (cost=25429.44..29626.39 rows=931 width=0) (actual time=216.325..1238.091 rows=1025401 loops=1)"
"   ->  HashAggregate  (cost=25429.44..25430.80 rows=136 width=4) (actual time=216.304..216.339 rows=68 loops=1)"
"         ->  Seq Scan on customer  (cost=0.00..25429.10 rows=136 width=4) (actual time=0.018..216.226 rows=68 loops=1)"
"               Filter: (customer_type = 'DISTRIBUTOR'::bpchar)"
"   ->  Index Scan using rel_parent on customer_rel  (cost=0.00..30.76 rows=7 width=4) (actual time=0.006..8.190 rows=15079 loops=68)"
"         Index Cond: (parent_customer = customer.customer_id)"
"Total runtime: 1514.810 ms"

The fact that the top-level nested loop THINKS it will only be returning 931 rows (instead of over 1 million) is the killer here... 3 orders of magnitude.
The results of this query are used as part of a bigger query, and the screwed-up stats are causing all sorts of havoc upline.

I'm experimenting in 9.1.0. I have set the statistics to 1000 (and 10000) on both columns (parent and child) to little effect. I have also hard-coded the n_distinct on the parent_customer column to be 1,000,000, with no effect (doing an analyze of the table after each change).

Does this oddball data distribution doom me to poor planning forever? Any other thoughts?
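For reference, the statistics experiments described above correspond to commands along these lines (a sketch only; the n_distinct attribute option requires PostgreSQL 9.0 or later):

```sql
-- Raise the per-column statistics target (more MCV entries / histogram buckets):
ALTER TABLE customer_rel ALTER COLUMN parent_customer SET STATISTICS 1000;
ALTER TABLE customer_rel ALTER COLUMN child_customer  SET STATISTICS 1000;

-- Hard-code the distinct-value count the planner assumes for the column:
ALTER TABLE customer_rel ALTER COLUMN parent_customer SET (n_distinct = 1000000);

-- The new settings only take effect after re-analyzing the table:
ANALYZE customer_rel;
```

Neither knob helps here, as the thread goes on to establish, because the misestimate comes from the join between the two tables, not from either column's statistics in isolation.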
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Jeff Amiel
Sent: Friday, December 02, 2011 3:20 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] Oddball data distribution giving me planner headaches

> explain analyze select * from customer_rel where parent_customer in
>     (select customer_id from customer where customer_type='DISTRIBUTOR')
> [EXPLAIN ANALYZE output snipped -- quoted in full above]

------------------------------------------------------------

What kind of plan does the following give?

EXPLAIN ANALYZE
SELECT *
FROM customer_rel p
JOIN customer c ON (p.parent_customer = c.customer_id)
WHERE c.customer_type = 'DISTRIBUTOR'
;

David J.
--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> What kind of plan does the following give?
>
> EXPLAIN ANALYZE
> SELECT *
> FROM customer_rel p
> JOIN customer c ON (p.parent_customer = c.customer_id)
> WHERE c.customer_type = 'DISTRIBUTOR'

Nearly identical output:

"Nested Loop  (cost=0.00..29624.69 rows=931 width=97) (actual time=0.032..1330.208 rows=1025401 loops=1)"
"  ->  Seq Scan on customer c  (cost=0.00..25429.10 rows=136 width=71) (actual time=0.017..212.059 rows=68 loops=1)"
"        Filter: (customer_type = 'DISTRIBUTOR'::bpchar)"
"  ->  Index Scan using rel_parent on customer_rel p  (cost=0.00..30.76 rows=7 width=26) (actual time=0.006..7.732 rows=15079 loops=68)"
"        Index Cond: (parent_customer = c.customer_id)"
"Total runtime: 1544.281 ms"
-----Original Message-----
From: Jeff Amiel [mailto:becauseimjeff@yahoo.com]
Sent: Friday, December 02, 2011 3:52 PM
To: pgsql-general@postgresql.org; David Johnston
Subject: RE: [GENERAL] Oddball data distribution giving me planner headaches

> Nearly identical output
> [EXPLAIN ANALYZE output snipped -- quoted in full above]

-----------------------------------------------------

What happens if you disable, say, nested loops and/or index scans?

David J.
--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> What happens if you disable, say, nested loops and/or index scans?

Planner selects different join/indexing techniques (query is slower) but row estimates (bad) remain identical.
-----Original Message-----
From: Jeff Amiel [mailto:becauseimjeff@yahoo.com]
Sent: Friday, December 02, 2011 4:15 PM
To: pgsql-general@postgresql.org; David Johnston
Subject: RE: [GENERAL] Oddball data distribution giving me planner headaches

> Planner selects different join/indexing techniques (query is slower) but
> row estimates (bad) remain identical.

---------------------------------------------------------

My, possibly naïve, observation: so, aside from the fact that the estimates seem to be off, the planner has still chosen the most effective plan? In that situation, no matter how accurate you get the statistics, you will not gain any performance, because the planner will never choose a different plan.

IF you can find a combination of settings that forces a better-performing plan, THEN steps can be taken (by others more skilled than myself) to configure the statistics/planner system to get that better plan naturally.

David J.
--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> So aside from the fact the estimates seem to be off the
> planner has still chosen the most effective plan? In that
> situation no matter how accurate you get the statistics you
> will not gain any performance because the planner will never
> choose a different plan.

Thanks... I'm not interested in optimizing this query; it's fast and efficient. However, the planner thinks that it is going to return about a thousand rows when it is actually going to return over a million. When this query is used as an element of a BIGGER query, that causes me all sorts of planner issues...

That's the problem I am trying to solve: why the planner is 3 orders of magnitude off in row estimation.
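For concreteness, the gap can be reconstructed with back-of-envelope arithmetic (illustrative only; the planner's actual selectivity math is more involved). With no cross-column knowledge, the planner can only apply the average fan-out per parent_customer: roughly 8,000,000 rel rows / ~1,000,000 distinct parents = 8 children per parent.

```sql
-- Illustrative arithmetic only, using numbers from the EXPLAIN ANALYZE output:
-- 136 = estimated distributors, 68 = actual, 15079 = actual children per distributor.
SELECT 136 * 8     AS planner_ballpark,  -- 1088: same order as the estimated 931
       68  * 15079 AS actual_ballpark;   -- 1025372: close to the actual 1025401
```

The roughly thousand-fold ratio between the two columns is exactly the "3 orders of magnitude" being complained about.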
Jeff Amiel <becauseimjeff@yahoo.com> writes:
> Oddball data distribution giving me headaches.
> [ 'distributor' customers have many more child customers than average ]
> Does this oddball data distribution doom me to poor planning forever?

The only real fix for that will require cross-column statistics, which we don't have yet --- without such, there's no way for the planner to know that distributors have an atypical number of child customers.

At the moment I think the only way to work around this is to denormalize your schema a bit. For instance, if you were to include the parent customer_type directly into the customer_rel table, that would not only save one join in this query, but it would expose the stats the planner needs to realize that it's going to get a lot of matches.

If you don't like that particular solution, there are probably other ways to get the same result, but they're all going to require schema changes.

			regards, tom lane
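A sketch of the denormalization Tom suggests (the column name and type here are hypothetical, chosen to match the customer table's customer_type):

```sql
-- Carry the parent's customer_type directly in the relationship table:
ALTER TABLE customer_rel ADD COLUMN parent_type character(20);  -- hypothetical name/type

UPDATE customer_rel r
SET    parent_type = c.customer_type
FROM   customer c
WHERE  c.customer_id = r.parent_customer;

-- Gather statistics on the new column:
ANALYZE customer_rel;

-- The filter now hits a column whose own single-column statistics
-- reflect the skew, so no cross-column knowledge is needed:
EXPLAIN ANALYZE
SELECT * FROM customer_rel WHERE parent_type = 'DISTRIBUTOR';
```

In a real deployment the column would need to be kept in sync with the customer table (e.g. via a trigger), which is the maintenance cost of this kind of denormalization.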
-----Original Message-----
From: Jeff Amiel [mailto:becauseimjeff@yahoo.com]
Sent: Friday, December 02, 2011 5:07 PM
To: pgsql-general@postgresql.org; David Johnston
Subject: RE: [GENERAL] Oddball data distribution giving me planner headaches

> I'm not interested in optimizing this query; it's fast and efficient.
> [...] That's the problem I am trying to solve: why the planner is 3
> orders of magnitude off in row estimation.

--------------------------------------------------------

Can you wrap the query into an SQL or PL/pgSQL function so that, at least, the planner will not be able to see the embedded plan info in the outer queries? Your use-case may allow you to create one or more VIEWs wrapping the function call with pre-determined parameters, so that you do not lose the ability to write simple select queries without the need for explicit function calls.

No idea how that would impact the planner for the other queries, but maybe worth a try while waiting for someone more knowledgeable than myself to respond.

David J.
--- On Fri, 12/2/11, David Johnston <polobo@yahoo.com> wrote:

> Can you wrap the query into an SQL or PL/pgSQL function so that, at least,
> the planner will not be able to see the embedded plan info in the outer
> queries?

I had actually considered this, because you can define what the 'row estimate' is for a function... except in my case, I have some dynamic inputs that would change what those estimates would need to be.

Thanks for the help though!
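The 'row estimate' for a function being referred to here is the ROWS clause of CREATE FUNCTION. A minimal sketch (the function name is hypothetical):

```sql
-- A set-returning function declares one fixed row estimate to the planner:
CREATE FUNCTION children_of_distributors()
RETURNS SETOF customer_rel AS $$
    SELECT r.*
    FROM   customer_rel r
    WHERE  r.parent_customer IN (SELECT customer_id FROM customer
                                 WHERE customer_type = 'DISTRIBUTOR')
$$ LANGUAGE sql STABLE
ROWS 1000000;  -- planner uses this instead of estimating from statistics
```

This illustrates the limitation noted above: ROWS is a single constant per function, so it cannot adapt when dynamic inputs would change the true result size.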
--- On Fri, 12/2/11, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The only real fix for that will require cross-column statistics, which
> we don't have yet --- without such, there's no way for the planner to
> know that distributors have an atypical number of child customers.

I suspected as much.

> At the moment I think the only way to work around this is to denormalize
> your schema a bit.

And I feared as much. It's biting me in other areas as well, this unusual distribution of data... certain types of customers have completely different data patterns than others.

Back to the drawing board... thanks!
--- On Fri, 12/2/11, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The only real fix for that will require cross-column statistics, which
> we don't have yet --- without such, there's no way for the planner to
> know that distributors have an atypical number of child customers.

The only caveat that I can think of here is that each of the customer_ids in the subselect IS in the MCV list for the parent_customer column of the customer_rel table. If I had listed SPECIFIC customer ids in the subselect, I would get spot-on row estimates. Is there anything there that gives me any hope?
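The MCV (most-common-values) list being referred to can be inspected through the pg_stats view, e.g.:

```sql
-- Most-common values and their frequencies for customer_rel.parent_customer:
SELECT most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'customer_rel'
  AND  attname   = 'parent_customer';
```

With a literal value such as `parent_customer = 42`, the planner can look that value up in most_common_vals directly, which is why specific customer ids yield spot-on estimates; with an IN (subselect), the matching values are not known at plan time, so only average-case statistics apply.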
Jeff Amiel <becauseimjeff@yahoo.com> writes:
>> At the moment I think the only way to work around this is to denormalize
>> your schema a bit.

> And I feared as much.
> It's biting me in other areas as well, this unusual distribution of data...
> certain types of customers have completely different data patterns than others.
> Back to the drawing board... thanks!

I find your table structure somewhat strange anyway. For an ordinary parent/child relationship, a parent_id column in the customer table would be enough. Do you really have an m:n relationship between parents and children?