Discussion: Slow query
I am running a query that joins against several large tables (~5 million rows each). The query takes an extremely long time to run, and the explain output is a bit beyond my level of understanding. It is an auto-generated query, so the aliases are fairly ugly. I can clean them up (rename them) if it would help. Also, let me know if I can send any more information that would help (e.g. table schema).

Also, are there any resources where I can get a better understanding of what PostgreSQL means when it says "Sort", "Sort Key", "Bitmap Index Scan", "Hash Cond", etc. - and how to recognize problems by looking at the output? I can understand the output for simple queries (e.g. whether the planner is using an index or performing a seq. scan), but when I get to more complex queries like the one below I lose my way =)

I would really appreciate it if someone from this list could tell me if there is anything that is obviously wrong with the query or schema, and what I could do to improve the performance.
PostgreSQL 8.1
RedHat Enterprise Linux 4

--QUERY
select distinct city4_.region_id as region1_29_, city4_1_.name as name29_,
    city4_.state_id as state2_30_
from registered_voters registered0_
inner join registered_voter_addresses addresses1_
    on registered0_.registered_voter_id=addresses1_.registered_voter_id
inner join registered_voter_addresses_regions regions2_
    on addresses1_.address_id=regions2_.registered_voter_addresses_address_id
inner join regions region3_
    on regions2_.regions_region_id=region3_.region_id
inner join cities city4_
    on addresses1_.city_id=city4_.region_id
inner join regions city4_1_
    on city4_.region_id=city4_1_.region_id
where region3_.region_id='093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'
order by city4_1_.name

--EXPLAIN/ANALYZE OUTPUT
Unique  (cost=3572907.42..3623589.94 rows=4076438 width=93) (actual time=2980825.714..3052333.753 rows=1124 loops=1)
  ->  Sort  (cost=3572907.42..3585578.05 rows=5068252 width=93) (actual time=2980825.710..2987407.888 rows=4918204 loops=1)
        Sort Key: city4_1_.name, city4_.region_id, city4_.state_id
        ->  Hash Join  (cost=717783.40..1430640.10 rows=5068252 width=93) (actual time=1400141.559..2016131.467 rows=4918204 loops=1)
              Hash Cond: (("outer".registered_voter_addresses_address_id)::text = ("inner".address_id)::text)
              ->  Bitmap Heap Scan on registered_voter_addresses_regions regions2_  (cost=54794.95..575616.49 rows=5116843 width=80) (actual time=45814.469..155044.478 rows=4918205 loops=1)
                    Recheck Cond: ('093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'::text = (regions_region_id)::text)
                    ->  Bitmap Index Scan on reg_voter_address_region_region_idx  (cost=0.00..54794.95 rows=5116843 width=0) (actual time=45807.157..45807.157 rows=4918205 loops=1)
                          Index Cond: ('093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'::text = (regions_region_id)::text)
              ->  Hash  (cost=642308.89..642308.89 rows=741420 width=173) (actual time=1354217.934..1354217.934 rows=4918204 loops=1)
                    ->  Hash Join  (cost=328502.66..642308.89 rows=741420 width=173) (actual time=204565.031..1268303.832 rows=4918204 loops=1)
                          Hash Cond: (("outer".registered_voter_id)::text = ("inner".registered_voter_id)::text)
                          ->  Seq Scan on registered_voters registered0_  (cost=0.00..173703.02 rows=4873202 width=40) (actual time=0.005..39364.261 rows=4873167 loops=1)
                          ->  Hash  (cost=303970.34..303970.34 rows=748528 width=213) (actual time=204523.861..204523.861 rows=4918204 loops=1)
                                ->  Hash Join  (cost=263.22..303970.34 rows=748528 width=213) (actual time=101.628..140936.062 rows=4918204 loops=1)
                                      Hash Cond: (("outer".city_id)::text = ("inner".region_id)::text)
                                      ->  Seq Scan on registered_voter_addresses addresses1_  (cost=0.00..271622.23 rows=4919923 width=120) (actual time=0.025..98416.667 rows=4918205 loops=1)
                                      ->  Hash  (cost=260.35..260.35 rows=1147 width=173) (actual time=101.582..101.582 rows=1147 loops=1)
                                            ->  Hash Join  (cost=48.80..260.35 rows=1147 width=173) (actual time=88.608..98.984 rows=1147 loops=1)
                                                  Hash Cond: (("outer".region_id)::text = ("inner".region_id)::text)
                                                  ->  Seq Scan on regions city4_1_  (cost=0.00..162.39 rows=7539 width=53) (actual time=0.048..35.204 rows=7539 loops=1)
                                                  ->  Hash  (cost=45.93..45.93 rows=1147 width=120) (actual time=48.896..48.896 rows=1147 loops=1)
                                                        ->  Nested Loop  (cost=0.00..45.93 rows=1147 width=120) (actual time=35.791..47.012 rows=1147 loops=1)
                                                              ->  Index Scan using regions_pkey on regions region3_  (cost=0.00..5.99 rows=1 width=40) (actual time=35.761..35.763 rows=1 loops=1)
                                                                    Index Cond: ((region_id)::text = '093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'::text)
                                                              ->  Seq Scan on cities city4_  (cost=0.00..28.47 rows=1147 width=80) (actual time=0.022..9.476 rows=1147 loops=1)
Total runtime: 3052707.269 ms
"Jeremy Haile" <jhaile@fastmail.fm> writes:
> I am running a query that joins against several large tables (~5 million
> rows each). The query takes an extremely long time to run, and the
> explain output is a bit beyond my level of understanding. It is an
> auto-generated query, so the aliases are fairly ugly.

Yah :-(

> select distinct city4_.region_id as region1_29_, city4_1_.name as
> name29_, city4_.state_id as state2_30_
> from registered_voters registered0_
> inner join registered_voter_addresses addresses1_ on
> registered0_.registered_voter_id=addresses1_.registered_voter_id
> inner join registered_voter_addresses_regions regions2_ on
> addresses1_.address_id=regions2_.registered_voter_addresses_address_id
> inner join regions region3_ on
> regions2_.regions_region_id=region3_.region_id
> inner join cities city4_ on
> addresses1_.city_id=city4_.region_id
> inner join regions city4_1_ on
> city4_.region_id=city4_1_.region_id
> where region3_.region_id='093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'
> order by city4_1_.name

AFAICS the planner is doing about the best you can hope the machine to do --- it's not making any serious estimation errors, and the plan is pretty reasonable for the given query. The problem is that you are forming a very large join result (4918204 rows) and then doing a DISTINCT that reduces this to only 1124 rows ... but the damage of computing that huge join has already been done. The machine is not going to be able to think its way out of this one --- it's up to you to think of a better formulation of the query.

Offhand I'd try something involving joining just city4_/city4_1_ (which should not need DISTINCT, I think) and then using WHERE EXISTS(SELECT ... FROM the-other-tables) to filter out the cities you don't want.
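[A rough sketch of the EXISTS formulation being suggested, using the table and column names from the original query. It is untested, and it folds the region3_ join into a direct filter on regions2_.regions_region_id, which assumes every regions_region_id in that table references an existing region; adjust as needed for your schema.]

```sql
-- Join only the two small tables that produce the output columns,
-- and use EXISTS to test for at least one matching voter address
-- in the target region, instead of materializing the huge join.
select city4_.region_id as region1_29_, city4_1_.name as name29_,
       city4_.state_id as state2_30_
from cities city4_
inner join regions city4_1_
    on city4_.region_id = city4_1_.region_id
where exists (
    select 1
    from registered_voter_addresses addresses1_
    inner join registered_voter_addresses_regions regions2_
        on addresses1_.address_id = regions2_.registered_voter_addresses_address_id
    inner join registered_voters registered0_
        on registered0_.registered_voter_id = addresses1_.registered_voter_id
    where regions2_.regions_region_id = '093c44e8-f3b2-4c60-8be3-2b4d148f9f5a'
      and addresses1_.city_id = city4_.region_id
)
order by city4_1_.name;
```

No DISTINCT should be needed here, since each city appears at most once in the cities/regions join, and the subselect can stop at the first matching row per city.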
The reason this can be a win is that the EXISTS formulation will stop running the sub-select as soon as it's produced a single row for the current city, rather than generating thousands of similar rows that will be thrown away by DISTINCT as you have here. This assumes that the fraction of cities passing the query is substantial, as it appears from the rowcounts in your EXPLAIN output. If only a tiny fraction of them passed, then the time wasted in failing EXISTS probes might eat up the savings.

			regards, tom lane