Re: Why is PostgreSQL so slow on Windows ( Postgres 8.3.7) version

От: PFC
Тема: Re: Why is PostgreSQL so slow on Windows ( Postgres 8.3.7) version
Дата: ,
Msg-id: op.ux1x15arcigqcu@soyouz
(см: обсуждение, исходный текст)
Ответ на: Why is PostgreSQL so slow on Windows ( Postgres 8.3.7) version  (parimala)
Список: pgsql-performance

> ,some of our queries to the database taking long time to return the
> results.

>  fsync: off     (even we tested this parameter is on ,we observed the same
> slowness )

    If your queries take long time to return results, I suppose you are
talking about SELECTs.

    fsync = off will not make SELECTs faster (only inserts, updates, deletes)
but it is not worth it as you risk data loss.

    synchronous_commit = on has about the same advantages (faster...) as
fsync=off, but with no risk of data loss, so it is much better !


> We have 300k row's in PolledData Table.In each STATSDATA table ,we have
> almost 12 to 13 million rows.

    OK. So you insert 13 million rows per day ?
    That is about 150 rows per second.

> Every one minute interval ,we insert data into to STATSDATA table.

    I assume you are making an INSERT INTO statsdata VALUES (...... 150
values .....)
    and not 150 inserts, yes ?

> First Query :
> SELECT COUNT(*) FROM (
    SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME, VAL
    FROM PolledData, STATSDATA8_21_2009 WHERE
    ( ( PolledData.ID=STATSDATA8_21_2009.POLLID)
    AND ( ( TTIME >= 1250838027454)
    AND ( TTIME <=1250838079654) ) ) ) t1;

* You could rewrite as :

SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME, VAL
 FROM PolledData
JOIN STATSDATA8_21_2009 ON ( PolledData.ID = STATSDATA8_21_2009.POLLID)
WHERE TTIME BETWEEN ... AND ...

- It is exactly the same query, but much easier to read.

* some ANALYZE-ing of your tables would be useful, since the estimates
 from the planner look suspiciously different from reality
- ANALYZE is fast, you can run it often if you INSERT rows all the time

* You are joining on POLLID which is a NUMERIC in one table and a BIGINT
in the other table.
- Is there any reason for this type difference ?
- Could you use BIGINT in both tables ?
- BIGINT is faster than NUMERIC and uses less space.
- Type conversions use CPU cycles too.

* Should StatsData.ID have a foreign key REFERENCES PolledData.ID ?
- This won't make the query faster, but if you know all rows in StatsData
reference rows in PolledData (because of the FK constraint) and you want a
count(*) like above, you don't need to JOIN.

* TTIME >= 1250838027454 AND TTIME <=1250838079654
- TTIME should be TIMESTAMP (with or without TIMEZONE) or BIGINT but
certainly not NUMERIC
- An index on StatsData.TTIME would be useful, it would avoid Seq Scan,
replacing it with a Bitmap Scan, much faster

* work_mem
- since you have few connections you could increase work_mem

> Second Query :

    Same as first query

> Third Query

SELECT COUNT(*) FROM ( SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME,
VAL
 FROM PolledData, STATSDATA8_21_2009
WHERE ( ( PolledData.ID=STATSDATA8_21_2009.POLLID)
AND ( ( TTIME >= 1250838027454) AND ( TTIME <=1250838027454) ) )

union all  SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME, VAL FROM
PolledData, STATSDATA8_20_2009
WHERE ( ( PolledData.ID=STATSDATA8_20_2009.POLLID) AND ( ( TTIME >=
1250767134601) AND ( TTIME <= 1250767134601) ) ) )t1 ;

Basically this is, again, exactly the same query as above, but two times,
and UNION ALL'ed

* You could rewrite it like this :

SELECT ID, PolledData.AGENT, NAME, INSTANCE, TTIME, VAL
 FROM
( SELECT ... FROM STATSDATA8_21_2009 WHERE TTIME BETWEEN ... AND ... )
UNION ALL SELECT ... FROM STATSDATA8_20_2009 WHERE TTIME BETWEEN ... AND
... )
)
JOIN STATSDATA8_21_2009 ON ( PolledData.ID = STATSDATA8_21_2009.POLLID)

* If TTIME is the current time, and you insert data as it comes, data in
StatsData tables is probably already ordered on TTIME.
- If it is not the case, once a table is filled and becomes read-only,
consider CLUSTER on the index you created on TTIME
- It will make range queries on TTIME much faster

* Query plan
Seq Scan on statsdata8_21_2009  (cost=0.00..70574.88 rows=1 width=32)
(actual time=0.047..29066.227 rows=227 loops=1)
Seq Scan on statsdata8_20_2009  (cost=0.00..382519.60 rows=1 width=32)
(actual time=3136.008..93985.540 rows=1 loops=1)

Postgres thinks there is 1 row in those tables... that's probably not the
case !
The first one returns 227 rows, so the plan chosen in a catastrophe.

I was a bit intrigued by your query, so I made a little test...

BEGIN;
CREATE TABLE test( x INT, y INT );
INSERT INTO test (SELECT n,n FROM generate_series( 1,1000000 ) AS n );
CREATE INDEX test_x ON test( x );
CREATE INDEX test_y ON test( y );
COMMIT;

ANALYZE test;

test=> EXPLAIN ANALYZE SELECT * FROM test a JOIN test b ON (b.x=a.x) WHERE
a.x BETWEEN 0 AND 10000;
                                                              QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
  Hash Join  (cost=480.53..23759.14 rows=10406 width=16) (actual
time=15.614..1085.085 rows=10000 loops=1)
    Hash Cond: (b.x = a.x)
    ->  Seq Scan on test b  (cost=0.00..14424.76 rows=999976 width=8)
(actual time=0.013..477.516 rows=1000000 loops=1)
    ->  Hash  (cost=350.46..350.46 rows=10406 width=8) (actual
time=15.581..15.581 rows=10000 loops=1)
          ->  Index Scan using test_x on test a  (cost=0.00..350.46
rows=10406 width=8) (actual time=0.062..8.537 rows=10000 loops=1)
                Index Cond: ((x >= 0) AND (x <= 10000))
  Total runtime: 1088.462 ms
(7 lignes)

test=> set enable_seqscan TO 0;
SET
test=> EXPLAIN ANALYZE SELECT * FROM test a JOIN test b ON (b.x=a.x) WHERE
a.x BETWEEN 0 AND 10000;
                                                           QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.00..30671.03 rows=10406 width=16) (actual
time=0.075..85.897 rows=10000 loops=1)
    ->  Index Scan using test_x on test a  (cost=0.00..350.46 rows=10406
width=8) (actual time=0.066..8.377 rows=10000 loops=1)
          Index Cond: ((x >= 0) AND (x <= 10000))
    ->  Index Scan using test_x on test b  (cost=0.00..2.90 rows=1 width=8)
(actual time=0.005..0.006 rows=1 loops=10000)
          Index Cond: (b.x = a.x)
  Total runtime: 90.160 ms
(6 lignes)

test=> set enable_nestloop TO 0;
SET
test=> EXPLAIN ANALYZE SELECT * FROM test a JOIN test b ON (b.x=a.x) WHERE
a.x BETWEEN 0 AND 10000;
                                                             QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
  Merge Join  (cost=0.00..31200.45 rows=10406 width=16) (actual
time=0.081..35.735 rows=10000 loops=1)
    Merge Cond: (a.x = b.x)
    ->  Index Scan using test_x on test a  (cost=0.00..350.46 rows=10406
width=8) (actual time=0.059..8.093 rows=10000 loops=1)
          Index Cond: ((x >= 0) AND (x <= 10000))
    ->  Index Scan using test_x on test b  (cost=0.00..28219.98 rows=999976
width=8) (actual time=0.016..7.494 rows=10001 loops=1)
  Total runtime: 40.013 ms
(6 lignes)


I wonder why it doesn't choose the merge join at first...






В списке pgsql-performance по дате сообщения:

От: Robert Haas
Дата:
Сообщение: Re: FW: Performance 8.4.0
От: Richard Huxton
Дата:
Сообщение: Re: Greenplum MapReduce