Discussion: Rewriting DISTINCT and losing performance
Hi all,

I know we've covered this before, but I'm having trouble with it today. I have some geographic data in tables that I'm working with: a country, state, and city table. I was selecting country_name out of the country table but discovered that some countries (like Antarctica) didn't have cities in the city table. I resolved to query the country table for only the country_names whose country_ids appear in the city table -- meaning the country has cities listed.

The complication was that I had a couple of different sources (in separate tables) with some extraneous column data, so I chose to consolidate the city tables from the different sources and drop the columns I don't need, because I don't have the hardware to support them. That was the end of my query performance. Here's the original table and query:

# \d geo.world_city
       Table "geo.world_city"
   Column   |          Type          | Modifiers
------------+------------------------+-----------
 city_id    | integer                | not null
 state_id   | smallint               |
 country_id | smallint               |
 rc         | smallint               |
 latitude   | numeric(9,7)           |
 longitude  | numeric(10,7)          |
 dsg        | character(5)           |
 cc1        | character(2)           |
 adm1       | character(2)           |
 city_name  | character varying(200) |
Indexes:
    "world_city_pk" PRIMARY KEY, btree (city_id)
    "idx_world_city_cc1" btree (cc1)
    "idx_world_city_cc1_adm1" btree (cc1, adm1)
    "idx_world_city_country_id" btree (country_id)
    "idx_world_city_name_first_letter" btree (state_id, "substring"(lower(city_name::text), 1, 1))
    "idx_world_city_state_id" btree (state_id)

explain analyze
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
 (select country_id FROM geo.world_city);

                                 QUERY PLAN
---------------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..167.97 rows=155 width=15) (actual time=85.502..3479.449 rows=231 loops=1)
   ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15) (actual time=0.089..0.658 rows=244 loops=1)
   ->  Index Scan using idx_world_city_country_id on world_city  (cost=0.00..8185.05 rows=12602 width=2) (actual time=14.250..14.250 rows=1 loops=244)
         Index Cond: (country.country_id = world_city.country_id)
 Total runtime: 3479.921 ms

Odd that it took 3 seconds, because every previous run had been much quicker. The next run was:

                                 QUERY PLAN
---------------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..167.97 rows=155 width=15) (actual time=0.087..6.967 rows=231 loops=1)
   ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15) (actual time=0.028..0.158 rows=244 loops=1)
   ->  Index Scan using idx_world_city_country_id on world_city  (cost=0.00..8185.05 rows=12602 width=2) (actual time=0.026..0.026 rows=1 loops=244)
         Index Cond: (country.country_id = world_city.country_id)
 Total runtime: 7.132 ms
(5 rows)

But that turned out to be a side issue. I created a new table, eliminated the data I didn't need, and it looks like this:

# \d geo.city
        Table "geo.city"
   Column   |          Type          | Modifiers
------------+------------------------+-----------
 city_id    | integer                | not null
 state_id   | smallint               |
 country_id | smallint               |
 latitude   | numeric(9,7)           |
 longitude  | numeric(10,7)          |
 city_name  | character varying(100) |
Indexes:
    "city_pk" PRIMARY KEY, btree (city_id)
    "idx_city_country_id" btree (country_id) CLUSTER
Foreign-key constraints:
    "city_state_id_fk" FOREIGN KEY (state_id) REFERENCES geo.state(state_id) ON UPDATE CASCADE ON DELETE CASCADE

explain analyze
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
 (select country_id FROM geo.city);

-- won't complete in a reasonable amount of time.

This one won't use the country_id index. The two tables have almost the same number of rows:

cmi=# select count(*) from geo.world_city;
  count
---------
 1953314
(1 row)

cmi=# select count(*) from geo.city;
  count
---------
 2122712
(1 row)

I tried to force it and didn't see any improvement. I've vacuumed, analyzed, and clustered. Can someone help me get only the countries that have cities in the city table in a reasonable amount of time?
-------------------------------------------------------
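As an aside, the "countries that have at least one city" question can be phrased in several semantically equivalent ways (IN, EXISTS, JOIN ... DISTINCT), and planners sometimes pick better plans for one form than another. A minimal sketch of the equivalence, using Python's sqlite3 with made-up toy data rather than the PostgreSQL tables from the thread (table and column names are borrowed from the post; the rows are invented):

```python
# Sketch (not from the thread): three equivalent spellings of
# "countries that have at least one city" -- IN, EXISTS, and
# JOIN ... DISTINCT -- demonstrated on an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (country_id INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE city (city_id INTEGER PRIMARY KEY, country_id INTEGER,
                       city_name TEXT);
    INSERT INTO country VALUES (1, 'France'), (2, 'Antarctica'), (3, 'Japan');
    INSERT INTO city VALUES (10, 1, 'Paris'), (11, 3, 'Tokyo'), (12, 3, 'Osaka');
""")

q_in = """SELECT country_id, country_name FROM country
          WHERE country_id IN (SELECT country_id FROM city)"""
q_exists = """SELECT c.country_id, c.country_name FROM country c
              WHERE EXISTS (SELECT 1 FROM city ci
                            WHERE ci.country_id = c.country_id)"""
q_join = """SELECT DISTINCT c.country_id, c.country_name
            FROM country c JOIN city ci ON ci.country_id = c.country_id"""

# All three return the same set; Antarctica (no cities) is excluded.
results = [sorted(conn.execute(q).fetchall()) for q in (q_in, q_exists, q_join)]
print(results[0])  # [(1, 'France'), (3, 'Japan')]
assert results[0] == results[1] == results[2]
```

The equivalence holds in any SQL database; which form gets the fastest plan is planner- and version-dependent, so trying all three is a common diagnostic step.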
Chuck,

> explain analyze
> SELECT country_id, country_name
> FROM geo.country
> WHERE country_id IN
>  (select country_id FROM geo.city)
> ;
>
> -- won't complete in a reasonable amount of time.

Can we see the plan?

--Josh
Chuck D. wrote:
>         Table "geo.city"
>    Column   |          Type          | Modifiers
> ------------+------------------------+-----------
>  city_id    | integer                | not null
>  state_id   | smallint               |
>  country_id | smallint               |
>  latitude   | numeric(9,7)           |
>  longitude  | numeric(10,7)          |
>  city_name  | character varying(100) |
> Indexes:
>     "city_pk" PRIMARY KEY, btree (city_id)
>     "idx_city_country_id" btree (country_id) CLUSTER
> Foreign-key constraints:
>     "city_state_id_fk" FOREIGN KEY (state_id) REFERENCES geo.state(state_id)
> ON UPDATE CASCADE ON DELETE CASCADE

Any good reason why country_id is NULLable?

--
  Richard Huxton
  Archonet Ltd
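The NULLability question has teeth beyond tidiness: under standard SQL three-valued logic, a single NULL in the subquery column leaves IN untouched but silently empties any NOT IN over the same subquery. A minimal sketch of that trap, using Python's sqlite3 with invented toy data (not the thread's tables):

```python
# Sketch of why a NULLable country_id deserves suspicion: a NULL in the
# subquery makes every NOT IN comparison UNKNOWN, so NOT IN returns no
# rows at all, while IN is unaffected. Made-up SQLite data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (country_id INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE city (city_id INTEGER PRIMARY KEY, country_id INTEGER);
    INSERT INTO country VALUES (1, 'France'), (2, 'Antarctica');
    INSERT INTO city VALUES (10, 1), (11, NULL);  -- a city with no country
""")

in_rows = conn.execute(
    "SELECT country_name FROM country "
    "WHERE country_id IN (SELECT country_id FROM city)"
).fetchall()
not_in_rows = conn.execute(
    "SELECT country_name FROM country "
    "WHERE country_id NOT IN (SELECT country_id FROM city)"
).fetchall()

print(in_rows)      # [('France',)] -- the NULL row does no harm here
print(not_in_rows)  # [] -- 2 <> NULL is UNKNOWN, so Antarctica is dropped too
```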
On Monday 21 May 2007 03:14, Josh Berkus wrote:
> Chuck,
>
> Can we see the plan?
>
> --Josh

Sorry Josh, I guess I could have just used EXPLAIN instead of EXPLAIN ANALYZE.

# explain
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
 (select country_id FROM geo.city);

                             QUERY PLAN
--------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..1252.60 rows=155 width=15)
   Join Filter: (country.country_id = city.country_id)
   ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15)
   ->  Seq Scan on city  (cost=0.00..43409.12 rows=2122712 width=2)
(4 rows)

Versus the same query using the older, larger world_city table:

# explain
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
 (select country_id FROM geo.world_city);

                             QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..23.16 rows=155 width=15)
   ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15)
   ->  Index Scan using idx_world_city_country_id on world_city  (cost=0.00..706.24 rows=12602 width=2)
         Index Cond: (country.country_id = world_city.country_id)
(4 rows)
On Monday 21 May 2007 05:40, Richard Huxton wrote:
> Chuck D. wrote:
>
> Any good reason why country_id is NULLable?

It has been a while since I imported the data, so it took some time to examine, but here is what I found. In the original data, some cities do not have countries. Strange, huh? Most were in the Gaza Strip, No Man's Land, or disputed territory where several countries claimed ownership, according to the USGS and the Board on Geographic Names. Recognizing that this did me no good in my application, I decided to repair that data so that country_id could have a NOT NULL modifier.
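Worth noting: rows with a NULL country_id could never have matched the IN condition in the first place, so cleaning them out changes the plan (and permits the NOT NULL constraint) without changing the query's answer. A toy sketch of that invariant, using Python's sqlite3 with invented data rather than the thread's tables:

```python
# Sketch: a NULL country_id never satisfies "country_id IN (...)", so
# deleting such rows (a prerequisite in PostgreSQL for SET NOT NULL)
# leaves the result of the countries-with-cities query identical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (country_id INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE city (city_id INTEGER PRIMARY KEY, country_id INTEGER);
    INSERT INTO country VALUES (1, 'France'), (2, 'Antarctica');
    INSERT INTO city VALUES (10, 1), (11, NULL), (12, NULL);  -- disputed areas
""")

query = """SELECT country_name FROM country
           WHERE country_id IN (SELECT country_id FROM city)"""

before = conn.execute(query).fetchall()
conn.execute("DELETE FROM city WHERE country_id IS NULL")
after = conn.execute(query).fetchall()

print(before, after)  # [('France',)] [('France',)] -- same answer either way
assert before == after
```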
Chuck D. wrote:
> On Monday 21 May 2007 03:14, Josh Berkus wrote:
>> Chuck,
>>
>> Can we see the plan?
>>
>> --Josh
>
> Sorry Josh, I guess I could have just used EXPLAIN instead of EXPLAIN
> ANALYZE.
>
> # explain
> SELECT country_id, country_name
> FROM geo.country
> WHERE country_id IN
>  (select country_id FROM geo.city);
>
>                              QUERY PLAN
> --------------------------------------------------------------------
>  Nested Loop IN Join  (cost=0.00..1252.60 rows=155 width=15)
>    Join Filter: (country.country_id = city.country_id)
>    ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15)
>    ->  Seq Scan on city  (cost=0.00..43409.12 rows=2122712 width=2)

The only thing I can think of is that the CLUSTERing on city.country_id makes the system think it'll be cheaper to seq-scan the whole table.

I take it you have got 2 million rows in "city"?

--
  Richard Huxton
  Archonet Ltd
On Monday 21 May 2007 11:34, Richard Huxton wrote:
> The only thing I can think of is that the CLUSTERing on city.country_id
> makes the system think it'll be cheaper to seq-scan the whole table.
>
> I take it you have got 2 million rows in "city"?

Well, here is where it gets strange. The CLUSTER was just one of the things I tried to improve performance; I had the same result before clustering. However, after updating the country_id column to eliminate NULL values and marking it NOT NULL, the planner uses the country_id index and the query performs quickly.

Oddly enough, the original world_city table still has NULL values in its country_id column, and it has always used the country_id index. Doesn't that seem a bit strange? Does it have to do with the smaller size of the new table, maybe?
"Chuck D." <pgsql-performance@nullmx.com> writes: > Doesn't that seem a bit strange? Does it have to do with the smaller size of > the new table maybe? No, it seems to be a planner bug: http://archives.postgresql.org/pgsql-hackers/2007-05/msg00920.php I imagine that your table statistics are close to the critical point where a bitmap scan looks cheaper or more expensive than a plain index scan, and so the chosen plan varies depending on more-or-less chance factors. Certainly getting rid of NULLs shouldn't have had any direct impact on this choice. regards, tom lane