Thread: Rewriting DISTINCT and losing performance


Rewriting DISTINCT and losing performance

From: "Chuck D."
Date:
Hi all,

I know we've covered this before but I'm having trouble with it today.

I have some geographic data in tables that I'm working with.  I have a
country, state and city table.  I was selecting the country_name out of the
country table but discovered that some countries (like Antarctica) didn't
have cities in the city table.

I resolved to query the country table for only those country_names that had
country_ids in the city table - meaning the country had cities listed.

The problem was that I had a couple of different sources (in separate tables)
with some extraneous column data, so I chose to consolidate the city tables
from the different sources and drop the column data I don't need, because I
don't have the hardware to support it.
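Roughly, the consolidation amounted to something like this (a sketch only;
the second source table name here is hypothetical, since only world_city
appears in this thread):

CREATE TABLE geo.city AS
SELECT city_id, state_id, country_id, latitude, longitude,
       city_name::varchar(100) AS city_name
FROM   geo.world_city
UNION ALL
SELECT city_id, state_id, country_id, latitude, longitude, city_name
FROM   geo.city_source2;  -- hypothetical second source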

That was the end of my fast query times.

Here's the original table and query:

# \d geo.world_city
             Table "geo.world_city"
   Column   |          Type          | Modifiers
------------+------------------------+-----------
 city_id    | integer                | not null
 state_id   | smallint               |
 country_id | smallint               |
 rc         | smallint               |
 latitude   | numeric(9,7)           |
 longitude  | numeric(10,7)          |
 dsg        | character(5)           |
 cc1        | character(2)           |
 adm1       | character(2)           |
 city_name  | character varying(200) |
Indexes:
    "world_city_pk" PRIMARY KEY, btree (city_id)
    "idx_world_city_cc1" btree (cc1)
    "idx_world_city_cc1_adm1" btree (cc1, adm1)
    "idx_world_city_country_id" btree (country_id)
    "idx_world_city_name_first_letter" btree (state_id, "substring"(lower(city_name::text), 1, 1))
    "idx_world_city_state_id" btree (state_id)

explain analyze
SELECT   country_id, country_name
FROM     geo.country
WHERE country_id IN
 (select country_id FROM geo.world_city)
;

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..167.97 rows=155 width=15) (actual time=85.502..3479.449 rows=231 loops=1)
   ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15) (actual time=0.089..0.658 rows=244 loops=1)
   ->  Index Scan using idx_world_city_country_id on world_city  (cost=0.00..8185.05 rows=12602 width=2) (actual time=14.250..14.250 rows=1 loops=244)
         Index Cond: (country.country_id = world_city.country_id)
 Total runtime: 3479.921 ms

Odd that it took 3 seconds, because every previous run had been much quicker.
The next run was:

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..167.97 rows=155 width=15) (actual time=0.087..6.967 rows=231 loops=1)
   ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15) (actual time=0.028..0.158 rows=244 loops=1)
   ->  Index Scan using idx_world_city_country_id on world_city  (cost=0.00..8185.05 rows=12602 width=2) (actual time=0.026..0.026 rows=1 loops=244)
         Index Cond: (country.country_id = world_city.country_id)
 Total runtime: 7.132 ms
(5 rows)


But that was irrelevant.  I created a new table, eliminated the extraneous
data, and it looks like this:

# \d geo.city
                Table "geo.city"
   Column   |          Type          | Modifiers
------------+------------------------+-----------
 city_id    | integer                | not null
 state_id   | smallint               |
 country_id | smallint               |
 latitude   | numeric(9,7)           |
 longitude  | numeric(10,7)          |
 city_name  | character varying(100) |
Indexes:
    "city_pk" PRIMARY KEY, btree (city_id)
    "idx_city_country_id" btree (country_id) CLUSTER
Foreign-key constraints:
    "city_state_id_fk" FOREIGN KEY (state_id) REFERENCES geo.state(state_id) ON UPDATE CASCADE ON DELETE CASCADE

explain analyze
SELECT   country_id, country_name
FROM     geo.country
WHERE country_id IN
 (select country_id FROM geo.city)
;

-- won't complete in a reasonable amount of time.

This one won't use the country_id index. The two tables have almost the same
number of rows:

cmi=# select count(*) from geo.world_city;
  count
---------
 1953314
(1 row)

cmi=# select count(*) from geo.city;
  count
---------
 2122712
(1 row)


I tried to force it and didn't see any improvement.  I've vacuumed,
analyzed, and clustered.  Can someone help me to get only the countries that
have cities in the city table in a reasonable amount of time?
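For reference, the DISTINCT version the subject line refers to was roughly
this (an approximate reconstruction), along with an equivalent EXISTS form:

-- the DISTINCT formulation being rewritten away from (approximate)
SELECT DISTINCT country.country_id, country.country_name
FROM     geo.country
JOIN     geo.city ON city.country_id = country.country_id;

-- an equivalent EXISTS form, which the planner handles much like IN
SELECT   country_id, country_name
FROM     geo.country
WHERE EXISTS
 (SELECT 1 FROM geo.city
  WHERE city.country_id = country.country_id);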

-------------------------------------------------------

Re: Rewriting DISTINCT and losing performance

From: Josh Berkus
Date:
Chuck,

> explain analyze
> SELECT   country_id, country_name
> FROM     geo.country
> WHERE country_id IN
>  (select country_id FROM geo.city)
> ;
>
> -- won't complete in a reasonable amount of time.

Can we see the plan?

--Josh


Re: Rewriting DISTINCT and losing performance

From: Richard Huxton
Date:
Chuck D. wrote:
>                 Table "geo.city"
>    Column   |          Type          | Modifiers
> ------------+------------------------+-----------
>  city_id    | integer                | not null
>  state_id   | smallint               |
>  country_id | smallint               |
>  latitude   | numeric(9,7)           |
>  longitude  | numeric(10,7)          |
>  city_name  | character varying(100) |
> Indexes:
>     "city_pk" PRIMARY KEY, btree (city_id)
>     "idx_city_country_id" btree (country_id) CLUSTER
> Foreign-key constraints:
>     "city_state_id_fk" FOREIGN KEY (state_id) REFERENCES geo.state(state_id)
> ON UPDATE CASCADE ON DELETE CASCADE

Any good reason why country_id is NULLable?

--
   Richard Huxton
   Archonet Ltd

Re: Rewriting DISTINCT and losing performance

From: "Chuck D."
Date:
On Monday 21 May 2007 03:14, Josh Berkus wrote:
> Chuck,
>
> Can we see the plan?
>
> --Josh
>

Sorry Josh, I guess I could have just used EXPLAIN instead of EXPLAIN
ANALYZE.

# explain
SELECT   country_id, country_name
FROM     geo.country
WHERE country_id IN
 (select country_id FROM geo.city)
;
                             QUERY PLAN
--------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..1252.60 rows=155 width=15)
   Join Filter: (country.country_id = city.country_id)
   ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15)
   ->  Seq Scan on city  (cost=0.00..43409.12 rows=2122712 width=2)
(4 rows)


Versus the same query using the older, larger world_city table:

# explain
SELECT   country_id, country_name
FROM     geo.country
WHERE country_id IN
 (select country_id FROM geo.world_city)
;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Nested Loop IN Join  (cost=0.00..23.16 rows=155 width=15)
   ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15)
   ->  Index Scan using idx_world_city_country_id on world_city  (cost=0.00..706.24 rows=12602 width=2)
         Index Cond: (country.country_id = world_city.country_id)
(4 rows)



Re: Rewriting DISTINCT and losing performance

From: "Chuck D."
Date:
On Monday 21 May 2007 05:40, Richard Huxton wrote:
> Chuck D. wrote:
>
> Any good reason why country_id is NULLable?

It has been a while since I imported the data, so it took some time to
examine it, but here is what I found.

In the original data, some cities do not have countries.  Strange, huh?  Most
were in the Gaza Strip, No Man's Land, or disputed territory where several
countries claimed ownership.  This is according to USGS and the Board on
Geographic Names.

Recognizing that this did me no good in my application, I decided to repair
that data so that country_id could have a NOT NULL modifier.
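The repair itself is short.  A sketch, assuming the country-less rows are
either deleted or first given a real country_id:

-- remove (or first fix) rows with no country, then lock the column down
DELETE FROM geo.city WHERE country_id IS NULL;
ALTER TABLE geo.city ALTER COLUMN country_id SET NOT NULL;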



Re: Rewriting DISTINCT and losing performance

From: Richard Huxton
Date:
Chuck D. wrote:
> On Monday 21 May 2007 03:14, Josh Berkus wrote:
>> Chuck,
>>
>> Can we see the plan?
>>
>> --Josh
>>
>
> Sorry Josh,  I guess I could have just used EXPLAIN instead of EXPLAIN
> ANALYZE.
>
> # explain
> SELECT   country_id, country_name
> FROM     geo.country
> WHERE country_id IN
>  (select country_id FROM geo.city)
> ;
>                              QUERY PLAN
> --------------------------------------------------------------------
>  Nested Loop IN Join  (cost=0.00..1252.60 rows=155 width=15)
>    Join Filter: (country.country_id = city.country_id)
>    ->  Seq Scan on country  (cost=0.00..6.44 rows=244 width=15)
>    ->  Seq Scan on city  (cost=0.00..43409.12 rows=2122712 width=2)

The only thing I can think of is that the CLUSTERing on city.country_id
makes the system think it'll be cheaper to seq-scan the whole table.

I take it you have got 2 million rows in "city"?
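One quick way to test that theory (a standard diagnostic, not something
shown in this thread) is to make seq-scans artificially unattractive and see
whether the index plan comes back:

-- session-local diagnostic only, not for production
SET enable_seqscan = off;
EXPLAIN
SELECT   country_id, country_name
FROM     geo.country
WHERE country_id IN
 (select country_id FROM geo.city);
RESET enable_seqscan;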
--
   Richard Huxton
   Archonet Ltd

Re: Rewriting DISTINCT and losing performance

From: "Chuck D."
Date:
On Monday 21 May 2007 11:34, Richard Huxton wrote:
> Chuck D. wrote:
>
> The only thing I can think of is that the CLUSTERing on city.country_id
> makes the system think it'll be cheaper to seq-scan the whole table.
>
> I take it you have got 2 million rows in "city"?

Well here is where it gets strange.  The CLUSTER was just one thing I tried
in order to enhance the performance.  I had the same result prior to clustering.

However, after updating that country_id column to NOT NULL and eliminating
NULL values, it will use the country_id index and perform quickly.  Oddly
enough, the original table, world_city, still has NULL values in the
country_id column, and it has always used the country_id index.

Doesn't that seem a bit strange?  Does it have to do with the smaller size of
the new table maybe?

Re: Rewriting DISTINCT and losing performance

From: Tom Lane
Date:
"Chuck D." <pgsql-performance@nullmx.com> writes:
> Doesn't that seem a bit strange?  Does it have to do with the smaller size of
> the new table maybe?

No, it seems to be a planner bug:
http://archives.postgresql.org/pgsql-hackers/2007-05/msg00920.php

I imagine that your table statistics are close to the critical point
where a bitmap scan looks cheaper or more expensive than a plain index
scan, and so the chosen plan varies depending on more-or-less chance
factors.  Certainly getting rid of NULLs shouldn't have had any direct
impact on this choice.

            regards, tom lane
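If the statistics really are sitting near the crossover point described
above, one standard way to firm them up (not something prescribed in this
thread) is to raise the per-column statistics target and re-analyze:

-- collect a finer-grained histogram for the join column
ALTER TABLE geo.city ALTER COLUMN country_id SET STATISTICS 1000;
ANALYZE geo.city;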