2 machines, same database, same query, 10 times slower?

From: Antonio Goméz Soto
Date:
Hi,

I am running PostgreSQL 8.1 on CentOS 5. I have two machines with the same hardware and the same database layout;
they hold different data, and the same query runs 10 times slower on one machine than on the other.

I have run ANALYZE on both machines, here are the query plans for both machines. Can someone explain
this to me? I have trouble reading EXPLAIN output...

Query:

explain select company.name as cname, call_flags, bill_duration as call_duration,
       date_part('epoch',start_time) as start_time,
       src_name, src_type, src_num, src_id,
       dial_name, dial_type, dial_num, dial_id,
       ans_name, ans_type, ans_num, ans_id,
       sessionid
  from cdr, company, phoneline, contact
  where (src_id = contact.id or dial_id = contact.id or ans_id = contact.id)
        and contact.id = '2' and phoneline.function='contact' and
        phoneline.lookupid = contact.id and phoneline.status != 'deleted' and
        (src_company=company.id or dial_company=company.id) and company.id > 2
  order by start_time DESC
  limit 10;

This is the query plan on machine #1 (the query takes 2 seconds):

 Limit  (cost=106128.33..106128.36 rows=10 width=160)
   ->  Sort  (cost=106128.33..106166.98 rows=15458 width=160)
         Sort Key: date_part('epoch'::text, cdr.start_time)
         ->  Nested Loop  (cost=49.38..104275.65 rows=15458 width=160)
               ->  Nested Loop  (cost=2.10..103880.57 rows=7729 width=164)
                     Join Filter: (("outer".src_company = "inner".id) OR ("outer".dial_company = "inner".id))
                     ->  Nested Loop  (cost=0.00..103054.09 rows=6595 width=163)
                            Join Filter: (("inner".src_id = "outer".id) OR ("inner".dial_id = "outer".id) OR ("inner".ans_id = "outer".id))
                           ->  Index Scan using contact_pkey on contact  (cost=0.00..5.94 rows=1 width=4)
                                 Index Cond: (id = 2)
                           ->  Seq Scan on cdr  (cost=0.00..77039.87 rows=1486187 width=159)
                     ->  Materialize  (cost=2.10..2.16 rows=5 width=13)
                           ->  Seq Scan on company  (cost=0.00..2.10 rows=5 width=13)
                                 Filter: (id > 2)
               ->  Materialize  (cost=47.28..47.30 rows=2 width=4)
                     ->  Seq Scan on phoneline  (cost=0.00..47.28 rows=2 width=4)
                            Filter: ((("function")::text = 'contact'::text) AND ((status)::text <> 'deleted'::text) AND (lookupid = 2))
(17 rows)

This is the query plan on machine #2 (the query takes 38 seconds):

 Limit  (cost=424555.76..424555.79 rows=10 width=170)
   ->  Sort  (cost=424555.76..424574.34 rows=7432 width=170)
         Sort Key: date_part('epoch'::text, cdr.start_time)
         ->  Nested Loop  (cost=422353.60..424077.90 rows=7432 width=170)
               ->  Nested Loop  (cost=422064.10..423621.19 rows=3716 width=174)
                     Join Filter: (("inner".src_company = "outer".id) OR ("inner".dial_company = "outer".id))
                     ->  Bitmap Heap Scan on company  (cost=2.09..49.23 rows=26 width=21)
                           Recheck Cond: (id > 2)
                           ->  Bitmap Index Scan on company_pkey  (cost=0.00..2.09 rows=26 width=0)
                                 Index Cond: (id > 2)
                     ->  Materialize  (cost=422062.01..422085.24 rows=2323 width=165)
                           ->  Nested Loop  (cost=0.00..422059.69 rows=2323 width=165)
                                  Join Filter: (("inner".src_id = "outer".id) OR ("inner".dial_id = "outer".id) OR ("inner".ans_id = "outer".id))
                                 ->  Index Scan using contact_pkey on contact  (cost=0.00..6.01 rows=1 width=4)
                                       Index Cond: (id = 2)
                                 ->  Seq Scan on cdr  (cost=0.00..408379.70 rows=781370 width=161)
               ->  Materialize  (cost=289.50..289.52 rows=2 width=4)
                     ->  Seq Scan on phoneline  (cost=0.00..289.50 rows=2 width=4)
                            Filter: ((("function")::text = 'contact'::text) AND ((status)::text <> 'deleted'::text) AND (lookupid = 2))
(19 rows)

Thanks,
Antonio

Re: 2 machines, same database, same query, 10 times slower?

From: "Tomas Vondra"
Date:
Hi,

On 8 May 2012, 16:48, Antonio Goméz Soto wrote:
> Hi,
>
> I am running PostgreSQL 8.1, on CentOS 5. I have two machines, same
> hardware, with the same database layout,
> they have different data, and the same query run 10 times as slow on one
> machine compared to the other.

First of all, to analyze runtime differences it's important to provide
EXPLAIN ANALYZE output, not just EXPLAIN. Re-run the queries and use
explain.depesz.com to post the output.

Second, what do you mean by 'different data'? If the two machines hold
different amounts of data, it may be perfectly expected that the query runs
much slower on the machine with more data. For example, the plans contain this:

A: Seq Scan on cdr  (cost=0.00..77039.87 rows=1486187 width=159)
B: Seq Scan on cdr (cost=0.00..408379.70 rows=781370 width=161)

That suggests the second database contains about half as many rows.

The seq scan nodes reveal another interesting fact: while the expected
row count in the second plan is about 50% of the first, the estimated cost
is about 5x higher.

The important thing here is that most of the cost estimate comes from the
number of pages, so I suppose the cdr table occupies about 5x the space
in the second case, i.e. it's much more 'sparse'.
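As a back-of-envelope check, the planner's sequential-scan estimate is roughly `seq_page_cost * relpages + cpu_tuple_cost * reltuples`. The sketch below uses the default cost constants (1.0 and 0.01 — an assumption, since postgresql.conf may override them) and the relpages/reltuples figures that appear further down the thread:

```python
# Sketch of the planner's sequential-scan cost model:
#   cost ~= seq_page_cost * relpages + cpu_tuple_cost * reltuples
# The constants below are the PostgreSQL defaults (an assumption;
# check your postgresql.conf).
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01

def seq_scan_cost(relpages, reltuples):
    return SEQ_PAGE_COST * relpages + CPU_TUPLE_COST * reltuples

# Slow machine: 400566 pages, 781370 estimated rows
print(seq_scan_cost(400566, 781370))   # ~408379.7, matching plan B

# Fast machine: ~62k pages, ~1.49M estimated rows (pg_class numbers
# were captured later, so this is only close to plan A's 77039.87)
print(seq_scan_cost(62076, 1486187))   # ~76938
```

The page term dominates both estimates, which is why the half-sized but five-times-larger-on-disk table costs five times as much to scan.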

Do this on both machines to verify:

   SELECT relpages, reltuples FROM pg_class WHERE relname = 'cdr';

That might happen, for example, after deleting a lot of rows (without
running VACUUM FULL afterwards) or by not running autovacuum at all. The
latter is quite likely, because autovacuum was introduced in 8.1 and was
off by default.

BTW, if you care about performance, you should upgrade to a more recent
version (preferably 9.x); 8.1 has been out of support for several years
IIRC, and there have been many improvements since then.

Tomas


Re: 2 machines, same database, same query, 10 times slower?

From: Antonio Goméz Soto
Date:
Hi Tomas,

thanks for responding.

On 08-05-12 17:34, Tomas Vondra wrote:
> Hi,
>
> On 8 May 2012, 16:48, Antonio Goméz Soto wrote:
>> Hi,
>>
>> I am running PostgreSQL 8.1, on CentOS 5. I have two machines, same
>> hardware, with the same database layout,
>> they have different data, and the same query run 10 times as slow on one
>> machine compared to the other.
>
> First of all, to analyze runtime differences it's important to provide
> EXPLAIN ANALYZE output, not just EXPLAIN. Re-run the queries and use
> explain.depesz.com to post the output.
>

All right, thanks, I didn't know that. I reran the queries, and the output is posted here:

The slow one: http://explain.depesz.com/s/2Si

The fast one: http://explain.depesz.com/s/c9m3

> Second, what do you mean 'different data'? If there is different amount of
> data, it may be perfectly expected that the query runs much slower on the
> machine with more data. For example the plans contain this:
>
> A: Seq Scan on cdr  (cost=0.00..77039.87 rows=1486187 width=159)
> B: Seq Scan on cdr (cost=0.00..408379.70 rows=781370 width=161)
>
> That suggests that the second database contains about 1/2 the rows.
>

That is true.

> The seq scan nodes reveal another interesting fact - while the expected
> row count is about 50% in the second plan, the estimated cost is about 5x
> higher (both compared to the first plan).
>
> The important thing here is that most of the cost estimate comes from the
> number of pages, therefore I suppose the cdr occupies about 5x the space
> in the second case, although it's much more 'sparse'.
>
> Do this on both machines to verify that
>
>    SELECT relpages, reltuples FROM pg_class WHERE relname = 'cdr';

Slow machine:

 relpages | reltuples
----------+-----------
   400566 |    982321

Fast machine:

 relpages |  reltuples
----------+-------------
    62076 | 1.48375e+06


>
> That might happen for example by deleting a lot of rows recently (without
> running VACUUM FULL after) or by not running autovacuum at all. Which is
> quite likely, because it was introduced in 8.1 and was off by default.
>

Autovacuum is running on both machines and does not report errors, but
I did not run a VACUUM FULL. There are currently users on the machine,
so I will try that later tonight.

> BTW if you care about performance, you should upgrade to a more recent
> version (preferably 9.x) because 8.1 is not supported for several years
> IIRC and there were many improvements since then.
>

I would like to, but I am bound to distribution-supplied software versions.

Thanks a lot for helping,
Antonio

> Tomas
>


Re: 2 machines, same database, same query, 10 times slower?

From: Pavel Stehule
Date:
2012/5/8 Antonio Goméz Soto <antonio.gomez.soto@gmail.com>:
> Hi Tomas,
>
> thanks for responding.
>
> On 08-05-12 17:34, Tomas Vondra wrote:
>> Hi,
>>
>> On 8 May 2012, 16:48, Antonio Goméz Soto wrote:
>>> Hi,
>>>
>>> I am running PostgreSQL 8.1, on CentOS 5. I have two machines, same
>>> hardware, with the same database layout,
>>> they have different data, and the same query run 10 times as slow on one
>>> machine compared to the other.
>>
>> First of all, to analyze runtime differences it's important to provide
>> EXPLAIN ANALYZE output, not just EXPLAIN. Re-run the queries and use
>> explain.depesz.com to post the output.
>>
>
> Allright, thanks, didn't know that. Reran the queries, and they are posted here:
>
> The slow one: http://explain.depesz.com/s/2Si
>
> The fast one: http://explain.depesz.com/s/c9m3
>

The cdr table on the "slow machine" probably needs a VACUUM FULL.

Regards

Pavel

>> Second, what do you mean 'different data'? If there is different amount of
>> data, it may be perfectly expected that the query runs much slower on the
>> machine with more data. For example the plans contain this:
>>
>> A: Seq Scan on cdr  (cost=0.00..77039.87 rows=1486187 width=159)
>> B: Seq Scan on cdr (cost=0.00..408379.70 rows=781370 width=161)
>>
>> That suggests that the second database contains about 1/2 the rows.
>>
>
> That is true.
>
>> The seq scan nodes reveal another interesting fact - while the expected
>> row count is about 50% in the second plan, the estimated cost is about 5x
>> higher (both compared to the first plan).
>>
>> The important thing here is that most of the cost estimate comes from the
>> number of pages, therefore I suppose the cdr occupies about 5x the space
>> in the second case, although it's much more 'sparse'.
>>
>> Do this on both machines to verify that
>>
>>    SELECT relpages, reltuples FROM pg_class WHERE relname = 'cdr';
>
> Slow machine:
>
>  relpages | reltuples
> ----------+-----------
>   400566 |    982321
>
> Fast machine:
>
>  relpages |  reltuples
> ----------+-------------
>    62076 | 1.48375e+06
>
>
>>
>> That might happen for example by deleting a lot of rows recently (without
>> running VACUUM FULL after) or by not running autovacuum at all. Which is
>> quite likely, because it was introduced in 8.1 and was off by default.
>>
>
> Autovacuum is running on both machines and does not report errors. But
> I did not run a vacuum full. There currently are users on the machine,
> so I can try that later tonight.
>
>> BTW if you care about performance, you should upgrade to a more recent
>> version (preferably 9.x) because 8.1 is not supported for several years
>> IIRC and there were many improvements since then.
>>
>
> I would like to, but I am bound to distribution-supplied software versions.
>
> Thanks a lot for helping,
> Antonio
>
>> Tomas
>>
>
>

Re: 2 machines, same database, same query, 10 times slower?

From: Tomas Vondra
Date:
On 8.5.2012 20:32, Antonio Goméz Soto wrote:
>> > Do this on both machines to verify that
>> >
>> >    SELECT relpages, reltuples FROM pg_class WHERE relname = 'cdr';
> Slow machine:
>
>  relpages | reltuples
> ----------+-----------
>    400566 |    982321
>
> Fast machine:
>
>  relpages |  reltuples
> ----------+-------------
>     62076 | 1.48375e+06
>
>

Yup, that's probably the culprit. The slow machine has to read 400,000
pages while the fast one reads just 62,000. Each page is 8 kB, so that's
roughly a 3 GB vs. 500 MB difference.
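The arithmetic is easy to check (8192-byte pages are the default PostgreSQL block size):

```python
PAGE_SIZE = 8192  # default PostgreSQL block size in bytes

# relpages values from pg_class, reported earlier in the thread
slow_bytes = 400566 * PAGE_SIZE  # ~3.28e9 bytes, about 3 GB
fast_bytes = 62076 * PAGE_SIZE   # ~5.09e8 bytes, about 485 MB

print(slow_bytes / 2**30, "GiB on the slow machine")
print(fast_bytes / 2**20, "MiB on the fast machine")
```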

T.

Re: 2 machines, same database, same query, 10 times slower?

From: Antonio Goméz Soto
Date:
Tomas,

Running VACUUM FULL cdr brought the query time down to 4 seconds.

Thanks a lot,
Antonio


On 08-05-12 21:15, Tomas Vondra wrote:
> On 8.5.2012 20:32, Antonio Goméz Soto wrote:
>>>> Do this on both machines to verify that
>>>>
>>>>    SELECT relpages, reltuples FROM pg_class WHERE relname = 'cdr';
>> Slow machine:
>>
>>  relpages | reltuples
>> ----------+-----------
>>    400566 |    982321
>>
>> Fast machine:
>>
>>  relpages |  reltuples
>> ----------+-------------
>>     62076 | 1.48375e+06
>>
>>
>
> Yup, that's probably the culprit. The slow machine has to read 400000
> pages while the fast one just 62000. Each page is 8kB, so this is 3GB
> vs. 500MB difference.
>
> T.
>


Re: 2 machines, same database, same query, 10 times slower?

From: Tomas Vondra
Date:
On 9.5.2012 22:50, Antonio Goméz Soto wrote:
> Tomas,
>
> running vacuum full cdr brought down the query time to 4 seconds..

Good. I'd recommend checking the other tables too; if this one was
bloated, it's likely there are others.
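One rough way to scan for other bloated tables without reading them is to compare relpages against the page count the live rows would actually need. A sketch, assuming the ~160-byte row width visible in the plans and approximate per-tuple overheads (both are assumptions, not exact PostgreSQL internals):

```python
import math

PAGE_SIZE = 8192        # default PostgreSQL block size
PAGE_HEADER = 24        # approximate page header size
TUPLE_OVERHEAD = 28 + 4 # approximate tuple header + line pointer

def expected_pages(reltuples, row_width):
    """Pages the live tuples would need if packed densely."""
    per_page = (PAGE_SIZE - PAGE_HEADER) // (row_width + TUPLE_OVERHEAD)
    return math.ceil(reltuples / per_page)

def bloat_ratio(relpages, reltuples, row_width=160):
    """relpages divided by the densely-packed page count; ~1 means no bloat."""
    return relpages / expected_pages(reltuples, row_width)

# pg_class numbers reported earlier in the thread:
print(bloat_ratio(400566, 982321))   # slow machine's cdr: heavily bloated
print(bloat_ratio(62076, 1483750))   # fast machine's cdr: close to packed
```

A ratio near 1 means the table is tightly packed; by this estimate the slow machine's cdr comes out around 17x, while the fast machine's is under 2x.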

Anyway, you should find out how this happened and fix it. This kind of
bloat is usually caused either by frequent changes (updates/deletes)
without sufficient vacuuming, or by batch updates hitting many rows at
once (e.g. during maintenance).

Or maybe vacuum was not running at all; I do remember there were
frequent issues with the free space map. See these:

  http://www.varlena.com/GeneralBits/Tidbits/perf.html#maxfsmp
  http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

That should give you a good idea.

Tomas