Oddball data distribution giving me planner headaches

Поиск

Список

Период

Сортировка

От	Jeff Amiel
Тема	Oddball data distribution giving me planner headaches
Дата	2 декабря 2011 г. 16:19:54
Msg-id	1322857179.53033.YahooMailClassic@web65513.mail.ac4.yahoo.com обсуждение исходный текст
Ответы	Re: Oddball data distribution giving me planner headaches Re: Oddball data distribution giving me planner headaches
Список	pgsql-general

Дерево обсуждения

Oddball data distribution giving me headaches.

We have a distinct 'customer' table with customer_id, type and name/demographic information.
Assume some 1 million rows in the customer table.

We then have a customer 'relationship' table which simply contains 2 columns…designating parent and child
relationships…butallowing complex hierarchies between customers. 


CREATE TABLE customer_rel
(
  parent_customer integer NOT NULL,
  child_customer integer NOT NULL,
 )

8 million rows in this table.  Oddball distribution.  We have some 8 levels of hierarchy (customer type) represented
withthis table.  Every customer gets an entry where parent/child is themselves…and then for every 'upline'.  At the
highestlevel, we have 'distributors' which have all other customer types underneath them.  Assuming we had some 68
distributors,the entries where THEY are the parent_customer represent nearly a million rows of the 8 million. 

I have extracted a simple case from a larger query that was generating an off-beat plan because of the unexpected
plannerrow-counts being spewed by a low level query. 

explain analyze
select * from customer_rel where parent_customer in (select customer_id from customer where
customer_type='DISTRIBUTOR')


"  Nested Loop  (cost=25429.44..29626.39 rows=931 width=0) (actual time=216.325..1238.091 rows=1025401 loops=1)"
"        ->  HashAggregate  (cost=25429.44..25430.80 rows=136 width=4) (actual time=216.304..216.339 rows=68 loops=1)"
"              ->  Seq Scan on customer  (cost=0.00..25429.10 rows=136 width=4) (actual time=0.018..216.226 rows=68
loops=1)"
"                    Filter: (customer_type = 'DISTRIBUTOR'::bpchar)"
"        ->  Index Scan using rel_parent on customer_rel  (cost=0.00..30.76 rows=7 width=4) (actual time=0.006..8.190
rows=15079loops=68)" 
"              Index Cond: (parent_customer = customer.customer_id)"
"Total runtime: 1514.810 ms"

The fact that the top level nested loop THINKS it only will be returning 931 rows (instead of over 1 million) is the
killerhere…3 orders of magnitude.  The results of this  query are used as part of a bigger query and the screwed up
statsare causing all sorts of havoc upline. 

I'm experimenting in 9.1.0…have set the statistics to 1000 (and 10000) on both columns (parent and child) to little
effect. Have hardcoded the n_distinct on the parent_customer column to be 1,000,000 also…with no effect (doing analyze
oftable after each change) 

Does this oddball data distribution doom me to poor planning forever?
Any other thoughts?

В списке pgsql-general по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Oddball data distribution giving me planner headaches