Thread: selects from large tables


selects from large tables

From
Nikk Anderson
Date:

Hi All,

We are using Postgres 7.1 on Solaris 8 - the hardware is a 400MHz Netra X1 with 512MB RAM, and the database is on a separate partition.

Our main result tables are getting really big, and we don't want to delete any data yet.  Currently, our largest table has around 10 million rows and is growing at a rate of around 1 million per month.  The table has 13 integer columns, one boolean and one timestamp.  We index the table on an ID number and the timestamp, and we vacuum analyse the table every night.  Performance has steadily degraded, and the more data we try to select, the longer the select queries take.

The queries are not complex, and do not involve any unions etc, eg:

SELECT * FROM table_name WHERE column1 = 454 AND time BETWEEN '2002-10-13 13:44:00.0' AND '2002-11-14'

SELECT count(DISTINCT id) FROM table_name WHERE column1 = 454 AND time BETWEEN '2002-10-13 13:44:00.0' AND '2002-11-14'

See various queries and explains at the end of this email for more info on the type of queries we are doing.
Most of the queries use a sequential scan - disabling this and forcing an index scan decreases performance further for those queries.
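For reference, the way we force the index scan is just to toggle the planner setting for the session - a rough sketch, reusing the table and column names from the example query above:

SET enable_seqscan TO off;   -- discourage sequential scans for this session only
EXPLAIN SELECT * FROM table_name
 WHERE column1 = 454
   AND time BETWEEN '2002-10-13 13:44:00.0' AND '2002-11-14';
SET enable_seqscan TO on;    -- restore the default planner behaviour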

These queries are sometimes taking over 2 minutes to perform!  If we reduce the table size significantly (i.e. to around 1 million rows) it is obviously faster - down to a few seconds.

We then tried the DB on a clean installation of Solaris 9, on a dual 400MHz processor Sun E250 with 2GB RAM and two 17GB SCSI disks.  We put the database onto the second disk.  Surprisingly, the performance is only 5-10% greater.  I expected far more, due to the increased power of the machine.  Looking at the OS info on this machine, the IO wait is negligible, as is the CPU usage.  So this machine is not working as hard as the Netra X1, though the time taken to perform queries is not much different.

We have tried tweaking the shared buffers and sort mem (also tweaking the kernel shared memory size), which makes little difference; in fact, increasing shared buffers from the default to 64000 - about 25% of total system memory - degrades performance slightly.

Any ideas on how we can select data more quickly from large tables?

Another idea we had was to split the data over multiple tables by id (resulting in several thousand tables); however, this would make management of the database in terms of keys, triggers and integrity very difficult and messy.
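For the record, a rough sketch of what that split would look like (the table names here are invented, and only two of the several thousand per-id tables are shown):

CREATE TABLE result_job_1 () INHERITS (table_name);  -- one such table per id
CREATE TABLE result_job_2 () INHERITS (table_name);
-- ...and several thousand more; every key, trigger and constraint would
-- have to be maintained on each child table, which is what gets messy.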

I hope someone can offer some advice.

Cheers

Nikk

- Queries and explain plans

hawkdb=# explain select count(*) from table_name;
NOTICE:  QUERY PLAN:
Aggregate  (cost=488700.65..488700.65 rows=1 width=0)
  ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=0)

hawkdb=# explain select count(job_id) from table_name;
NOTICE:  QUERY PLAN:
Aggregate  (cost=488700.65..488700.65 rows=1 width=4)
  ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=4)

hawkdb=# explain select * from table_name;
NOTICE:  QUERY PLAN:
Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=57)

hawkdb=# explain select count(*) from table_name where job_id = 13;
NOTICE:  QUERY PLAN:
Aggregate  (cost=537874.18..537874.18 rows=1 width=0)
  ->  Seq Scan on table_name  (cost=0.00..488700.65 rows=19669412 width=0)

hawkdb=# explain select * from table_name where job_id = 13;
NOTICE:  QUERY PLAN:
Seq Scan on http_result  (cost=0.00..488700.65 rows=19669412 width=57)

hawkdb=# explain select * from table_name where job_id = 1;
NOTICE:  QUERY PLAN:
Index Scan using http_result_pk on table_name  (cost=0.00..5.01 rows=1 width=57)

hawkdb=# explain select * from table_name where time > '2002-10-10';
NOTICE:  QUERY PLAN:
Seq Scan on table_name  (cost=0.00..488700.65 rows=19649743 width=57)

hawkdb=# explain select * from http_result where time < '2002-10-10';
NOTICE:  QUERY PLAN:
Index Scan using table_name_time on table_name  (cost=0.00..75879.17 rows=19669 width=57)

Nikk Anderson

Parallel ltd.
Cranfield Innovation Centre
University Way
Cranfield
Bedfordshire
MK43 0BT

http://www.nexuswatch.com
http://www.parallel.ltd.uk

Tel: +44 (0)8700 PARALLEL (727255)
Fax: +44 (0)8700 PARAFAX  (727232)

******************************************************************
Privileged/Confidential Information may be contained in this
message.  If you are not the addressee indicated in this message
(or responsible for delivery of the message to such person), you
may not copy or deliver this message to anyone. In such case, you
should destroy this message and kindly notify the sender by reply
email. Please advise immediately if you or your employer do not
consent to Internet email for messages of this kind.  Opinions,
conclusions and other information in this message that do not
relate to the official business of Parallel shall be understood
as neither given nor endorsed by it.

Unless agreed otherwise by way of a signed agreement, any business
conducted by Parallel shall be subject to its Standard Terms
and Conditions which are available upon request.
******************************************************************

Re: selects from large tables

From
Stephan Szabo
Date:
On Mon, 18 Nov 2002, Nikk Anderson wrote:

> Any ideas on how we can select data more quickly from large tables?

Are these row estimates realistic? It's estimating nearly 20 million rows
to be returned by some of the queries (unless I'm misreading the
number - possible since it's 5am here).  At that point you almost
certainly want to be using a cursor rather than plain queries since even a
small width result (say 50 bytes) gives a very large (1 gig) result set.
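A minimal sketch of the cursor approach, reusing the column values from one of your example queries (the batch size is arbitrary):

BEGIN;
DECLARE result_cur CURSOR FOR
    SELECT * FROM table_name
    WHERE column1 = 454
      AND time BETWEEN '2002-10-13 13:44:00.0' AND '2002-11-14';
FETCH 1000 FROM result_cur;   -- repeat until it returns no rows
CLOSE result_cur;
COMMIT;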

> - Queries and explain plans
>
> select count(*) from table_name;
> NOTICE:  QUERY PLAN:
> Aggregate  (cost=488700.65..488700.65 rows=1 width=0)
>   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=0)
>
> hawkdb=# explain select count(job_id) from table_name;
> NOTICE:  QUERY PLAN:
> Aggregate  (cost=488700.65..488700.65 rows=1 width=4)
>   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=4)
>
> hawkdb=# explain select * from table_name;
> NOTICE:  QUERY PLAN:
> Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=57)
>
> hawkdb=# explain select count(*) from table_name where job_id = 13;
> NOTICE:  QUERY PLAN:
> Aggregate  (cost=537874.18..537874.18 rows=1 width=0)
>   ->  Seq Scan on table_name  (cost=0.00..488700.65 rows=19669412 width=0)
>
> hawkdb=# explain select * from table_name where job_id = 13;
> NOTICE:  QUERY PLAN:
> Seq Scan on http_result  (cost=0.00..488700.65 rows=19669412 width=57)
>
> hawkdb=# explain select * from table_name where job_id = 1;
> NOTICE:  QUERY PLAN:
> Index Scan using http_result_pk on table_name  (cost=0.00..5.01 rows=1
> width=57)
>
> hawkdb=#explain select * from table_name where time > '2002-10-10';
> NOTICE:  QUERY PLAN:
> Seq Scan on table_name  (cost=0.00..488700.65 rows=19649743 width=57)
>
> hawkdb=# explain select * from http_result where time < '2002-10-10';
> NOTICE:  QUERY PLAN:
> Index Scan using table_name_time on table_name  (cost=0.00..75879.17
> rows=19669 width=57)


Re: selects from large tables

From
Tom Lane
Date:
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> On Mon, 18 Nov 2002, Nikk Anderson wrote:

>> Any ideas on how we can select data more quickly from large tables?

> Are these row estimates realistic?

Showing EXPLAIN ANALYZE results would be much more useful than just
EXPLAIN.

            regards, tom lane

Re: selects from large tables

From
Nikk Anderson
Date:

Hi,
Thanks for the reply Stephan - the data is 'somewhat' realistic...

The data in the table is actually synthetic, but the structure is the same as our live system, and the queries are similar to those we actually carry out. 

As the data was synthetic there was a bit of repetition (19 million rows of repetition!) of the item used in the where clause, meaning that most of the table was returned by the queries - oops!  So, what I have done is run some more realistic queries from our live system, and included the time each takes and the explain results.  Just to note that explain's estimated number of rows is way out - its guesses are far too low.

Typically a normal query on our live system returns between 200 and 30000 rows depending on the reports a user wants to generate.  In prior testing, we noted that using SELECT COUNT(...) was slower than other queries, which is why we thought we would test counts first.

Here are some more realistic results, which still take a fair whack of time........

Starting query 0
Query 0: SELECT * FROM xx WHERE time BETWEEN '2002-11-17 14:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335
Time taken = 697 ms
Index Scan using http_timejobid on xx  (cost=0.00..17.01 rows=4 width=57)
This query returns 500 rows of data

Starting query 1
Query 1: SELECT * FROM xx WHERE time BETWEEN '2002-11-11 14:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335
Time taken = 15 seconds
Index Scan using http_timejobid on xx  (cost=0.00..705.57 rows=175 width=57)
This query returns 3582 rows

Starting query 2
Query 2: SELECT * FROM xx WHERE time BETWEEN '2002-10-19 15:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335;
Time taken = 65 seconds
Index Scan using http_timejobid on xx  (cost=0.00..3327.55 rows=832 width=57)
This query returns 15692 rows

Starting query 3
Query 3: SELECT * FROM xx_result WHERE time BETWEEN '2002-08-20 15:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335;

Time taken = 241 seconds
Index Scan using http_timejobid on xx  (cost=0.00..10111.36 rows=2547 width=57)
This query returns 48768 rows

Cheers

Nikk


-----Original Message-----
From: Stephan Szabo [mailto:sszabo@megazone23.bigpanda.com]
Sent: 18 November 2002 13:02
To: Nikk Anderson
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables

On Mon, 18 Nov 2002, Nikk Anderson wrote:

> Any ideas on how we can select data more quickly from large tables?

Are these row estimates realistic? It's estimating nearly 20 million rows
to be returned by some of the queries (unless I'm misreading the
number - possible since it's 5am here).  At that point you almost
certainly want to be using a cursor rather than plain queries since even a
small width result (say 50 bytes) gives a very large (1 gig) result set.

> - Queries and explain plans
>
> select count(*) from table_name;
> NOTICE:  QUERY PLAN:
> Aggregate  (cost=488700.65..488700.65 rows=1 width=0)
>   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=0)
>
> hawkdb=# explain select count(job_id) from table_name;
> NOTICE:  QUERY PLAN:
> Aggregate  (cost=488700.65..488700.65 rows=1 width=4)
>   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=4)
>
> hawkdb=# explain select * from table_name;
> NOTICE:  QUERY PLAN:
> Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=57)
>
> hawkdb=# explain select count(*) from table_name where job_id = 13;
> NOTICE:  QUERY PLAN:
> Aggregate  (cost=537874.18..537874.18 rows=1 width=0)
>   ->  Seq Scan on table_name  (cost=0.00..488700.65 rows=19669412 width=0)
>
> hawkdb=# explain select * from table_name where job_id = 13;
> NOTICE:  QUERY PLAN:
> Seq Scan on http_result  (cost=0.00..488700.65 rows=19669412 width=57)
>
> hawkdb=# explain select * from table_name where job_id = 1;
> NOTICE:  QUERY PLAN:
> Index Scan using http_result_pk on table_name  (cost=0.00..5.01 rows=1
> width=57)
>
> hawkdb=#explain select * from table_name where time > '2002-10-10';
> NOTICE:  QUERY PLAN:
> Seq Scan on table_name  (cost=0.00..488700.65 rows=19649743 width=57)
>
> hawkdb=# explain select * from http_result where time < '2002-10-10';
> NOTICE:  QUERY PLAN:
> Index Scan using table_name_time on table_name  (cost=0.00..75879.17
> rows=19669 width=57)

Re: selects from large tables

From
Nikk Anderson
Date:

Hi,

Unfortunately explain analyze does not work on our postgres version (7.1).

I think I will download and compile 7.2, and try to compile it in 64-bit mode to see if that helps improve performance.

Cheers

Nikk

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 18 November 2002 15:04
To: Stephan Szabo
Cc: Nikk Anderson; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables

Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> On Mon, 18 Nov 2002, Nikk Anderson wrote:

>> Any ideas on how we can select data more quickly from large tables?

> Are these row estimates realistic?

Showing EXPLAIN ANALYZE results would be much more useful than just
EXPLAIN.

                        regards, tom lane

Re: selects from large tables

From
Andrew Sullivan
Date:
On Mon, Nov 18, 2002 at 03:36:08PM -0000, Nikk Anderson wrote:
> Hi,
>
> Unfortunately explain analyze does not work on our postgres version (7.1) ?

No, it doesn't.

> I think I will download and compile 7.2, and try to compile in 64bit mode to
> see if that helps improve performance.

I have seen something like a 40% improvement in performance from 7.1
to 7.2 on Solaris 7 in my tests.  There are some problems with the 64
bit compilation, by the way, so make sure that you check out the
FAQ and test carefully.  You need to make some modifications to the
source files in order to avoid some buggy libraries on Solaris.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: selects from large tables

From
"Charles H. Woloszynski"
Date:
Nikk:

Are you doing vacuums on these tables?  I was under the impression
that the estimated row count should be close to the real row count
returned, and when it is not (as it looks in your case), the primary
reason for the disconnect is that the stats for the tables are
out-of-date.

Since it used the indexes, I am not sure if the old stats are causing
any issues, but I suspect they are not helping.

Also, do you do any clustering of the data (since the queries are mostly
time limited)?  I am wondering if the system is doing lots of seeks to
get the data (implying that the data is all over the disk and not
clustered).

Charlie

Nikk Anderson wrote:

> Hi,
> Thanks for the reply Stephen, the data is 'somewhat' realistic.....
>
> The data in the table is actually synthetic, but the structure is the
> same as our live system, and the queries are similar to those we
> actually carry out.
>
> As the data was synthetic there was a bit of repetition (19 million
> rows of repetition!! ) of the item used in the where clause, meaning
> that most of the table was returned by the queries - oops!  So, I have
> done is some more realistic queries from our live system, and put the
> time it takes, and the explain results.  Just to note that the
> explain's estimated number of rows is way out - its guesses are way
> too low.
>
> Typically a normal query on our live system returns between 200 and
> 30000 rows depending on the reports a user wants to generate.  In
> prior testing, we noted that using SELECT COUNT( ..   was slower than
> other queries, which is why we though we would test counts first.
>
>
> Here are some more realistic results, which still take a fair whack of
> time........
>
>
> Starting query 0
> Query 0: SELECT * FROM xx WHERE time BETWEEN '2002-11-17 14:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 697 ms
> Index Scan using http_timejobid on xx  (cost=0.00..17.01 rows=4 width=57)
> This query returns 500 rows of data
>
>
> Starting query 1
> Query 1: SELECT * FROM xx WHERE time BETWEEN '2002-11-11 14:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 15 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..705.57 rows=175
> width=57)
> This query return 3582 rows
>
> Starting query 2
> Query 2: SELECT * FROM xx WHERE time BETWEEN '2002-10-19 15:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335;
> Time taken = 65 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..3327.55 rows=832
> width=57)
> This query returns 15692 rows
>
> Starting query 3
> Query 3: SELECT * FROM xx_result WHERE time BETWEEN '2002-08-20
> 15:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335;
>
> Time taken = 241 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..10111.36 rows=2547
> width=57)
> This query returns 48768 rows
>
>
> Cheers
>
> Nikk
>
>
>
>
> -----Original Message-----
> From: Stephan Szabo [mailto:sszabo@megazone23.bigpanda.com]
> Sent: 18 November 2002 13:02
> To: Nikk Anderson
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] selects from large tables
>
>
>
> On Mon, 18 Nov 2002, Nikk Anderson wrote:
>
> > Any ideas on how we can select data more quickly from large tables?
>
> Are these row estimates realistic? It's estimating nearly 20 million rows
> to be returned by some of the queries (unless I'm misreading the
> number - possible since it's 5am here).  At that point you almost
> certainly want to be using a cursor rather than plain queries since even a
> small width result (say 50 bytes) gives a very large (1 gig) result set.
>
> > - Queries and explain plans
> >
> > select count(*) from table_name;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=488700.65..488700.65 rows=1 width=0)
> >   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412
> width=0)
> >
> > hawkdb=# explain select count(job_id) from table_name;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=488700.65..488700.65 rows=1 width=4)
> >   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412
> width=4)
> >
> > hawkdb=# explain select * from table_name;
> > NOTICE:  QUERY PLAN:
> > Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=57)
> >
> > hawkdb=# explain select count(*) from table_name where job_id = 13;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=537874.18..537874.18 rows=1 width=0)
> >   ->  Seq Scan on table_name  (cost=0.00..488700.65 rows=19669412
> width=0)
> >
> > hawkdb=# explain select * from table_name where job_id = 13;
> > NOTICE:  QUERY PLAN:
> > Seq Scan on http_result  (cost=0.00..488700.65 rows=19669412 width=57)
> >
> > hawkdb=# explain select * from table_name where job_id = 1;
> > NOTICE:  QUERY PLAN:
> > Index Scan using http_result_pk on table_name  (cost=0.00..5.01 rows=1
> > width=57)
> >
> > hawkdb=#explain select * from table_name where time > '2002-10-10';
> > NOTICE:  QUERY PLAN:
> > Seq Scan on table_name  (cost=0.00..488700.65 rows=19649743 width=57)
> >
> > hawkdb=# explain select * from http_result where time < '2002-10-10';
> > NOTICE:  QUERY PLAN:
> > Index Scan using table_name_time on table_name  (cost=0.00..75879.17
> > rows=19669 width=57)
>

--


Charles H. Woloszynski

ClearMetrix, Inc.
115 Research Drive
Bethlehem, PA 18015

tel: 610-419-2210 x400
fax: 240-371-3256
web: www.clearmetrix.com





Re: selects from large tables

From
Tom Lane
Date:
"Charles H. Woloszynski" <chw@clearmetrix.com> writes:
> Are you doing vaccums on these tables?  I was under the understanding
> that the estimated row count should be close to the real row count
> returned, and when it is not (as it looks in your case), the primary
> reason for the disconnect is that the stats for the tables are
> out-of-date.

The fact that he's using 7.1 doesn't help any; the statistics mechanisms
in 7.1 are pretty weak compared to 7.2.

> Also, do you do any clustering of the data (since the queries are mostly
> time limited)?  I am wondering if the system is doing lots of seeks to
> get the data (implying that the data is all over the disk and not
> clustered).

It would also be interesting to try a two-column index ordered the other
way (timestamp as the major sort key instead of ID).  Can't tell if that
will be a win without more info about the data properties, but it's
worth looking at.
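Something along these lines, as a sketch (index name invented, column names taken from the plans you posted):

CREATE INDEX table_name_time_jobid ON table_name ("time", job_id);
-- then re-run the problem queries with EXPLAIN to see whether the planner picks it up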

            regards, tom lane

Re: selects from large tables

From
Stephan Szabo
Date:
On Mon, 18 Nov 2002, Nikk Anderson wrote:

> Hi,
> Thanks for the reply Stephen, the data is 'somewhat' realistic.....

Tom's said most of what I would have, except that if you've got a wide
variation based on job_id you may want to change the statistics gathering
defaults for that column with ALTER TABLE ALTER COLUMN SET STATISTICS when
you get to 7.2.
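Roughly like this once you are on 7.2 (the statistics target of 100 is just an example value):

ALTER TABLE table_name ALTER COLUMN job_id SET STATISTICS 100;
ANALYZE table_name;   -- re-analyze so the larger sample takes effect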



Re: selects from large tables

From
Nikk Anderson
Date:

Hi Charlie,
We do a vacuum analyze every night at midnight.  I thought that perhaps the analyzing was not being done correctly, so I manually did a vacuum analyze, and the estimated row counts were still way out.

I will look into clustering the data and see what effect that may have.
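For reference, the command we would be testing is something like the following, using the old-style syntax of our 7.1 install - treat it as a sketch, as we still need to check exactly what CLUSTER preserves on this version:

CLUSTER http_timejobid ON table_name;   -- rewrite the table in http_timejobid order
VACUUM ANALYZE table_name;              -- refresh the statistics afterwards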

Thanks

Nikk

-----Original Message-----
From: Charles H. Woloszynski [mailto:chw@clearmetrix.com]
Sent: 18 November 2002 15:46
To: Nikk Anderson
Cc: 'Stephan Szabo'; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables

Nikk:

Are you doing vaccums on these tables?  I was under the understanding
that the estimated row count should be close to the real row count
returned, and when it is not (as it looks in your case), the primary
reason for the disconnect is that the stats for the tables are
out-of-date. 

Since it used the indexes, I am not sure if the old stats are causing
any issues, but I suspect they are not helping. 

Also, do you do any clustering of the data (since the queries are mostly
time limited)?  I am wondering if the system is doing lots of seeks to
get the data (implying that the data is all over the disk and not
clustered). 

Charlie

Nikk Anderson wrote:

> Hi,
> Thanks for the reply Stephen, the data is 'somewhat' realistic.....
>
> The data in the table is actually synthetic, but the structure is the
> same as our live system, and the queries are similar to those we
> actually carry out.
>
> As the data was synthetic there was a bit of repetition (19 million
> rows of repetition!! ) of the item used in the where clause, meaning
> that most of the table was returned by the queries - oops!  So, I have
> done is some more realistic queries from our live system, and put the
> time it takes, and the explain results.  Just to note that the
> explain's estimated number of rows is way out - its guesses are way
> too low.
>
> Typically a normal query on our live system returns between 200 and
> 30000 rows depending on the reports a user wants to generate.  In
> prior testing, we noted that using SELECT COUNT( ..   was slower than
> other queries, which is why we though we would test counts first.
>
>
> Here are some more realistic results, which still take a fair whack of
> time........
>
>
> Starting query 0
> Query 0: SELECT * FROM xx WHERE time BETWEEN '2002-11-17 14:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 697 ms
> Index Scan using http_timejobid on xx  (cost=0.00..17.01 rows=4 width=57)
> This query returns 500 rows of data
>
>
> Starting query 1
> Query 1: SELECT * FROM xx WHERE time BETWEEN '2002-11-11 14:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 15 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..705.57 rows=175
> width=57)
> This query return 3582 rows
>
> Starting query 2
> Query 2: SELECT * FROM xx WHERE time BETWEEN '2002-10-19 15:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335;
> Time taken = 65 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..3327.55 rows=832
> width=57)
> This query returns 15692 rows
>
> Starting query 3
> Query 3: SELECT * FROM xx_result WHERE time BETWEEN '2002-08-20
> 15:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335;
>
> Time taken = 241 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..10111.36 rows=2547
> width=57)
> This query returns 48768 rows
>
>
> Cheers
>
> Nikk
>
>
>
>
> -----Original Message-----
> From: Stephan Szabo [mailto:sszabo@megazone23.bigpanda.com]
> Sent: 18 November 2002 13:02
> To: Nikk Anderson
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] selects from large tables
>
>
>
> On Mon, 18 Nov 2002, Nikk Anderson wrote:
>
> > Any ideas on how we can select data more quickly from large tables?
>
> Are these row estimates realistic? It's estimating nearly 20 million rows
> to be returned by some of the queries (unless I'm misreading the
> number - possible since it's 5am here).  At that point you almost
> certainly want to be using a cursor rather than plain queries since even a
> small width result (say 50 bytes) gives a very large (1 gig) result set.
>
> > - Queries and explain plans
> >
> > select count(*) from table_name;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=488700.65..488700.65 rows=1 width=0)
> >   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412
> width=0)
> >
> > hawkdb=# explain select count(job_id) from table_name;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=488700.65..488700.65 rows=1 width=4)
> >   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412
> width=4)
> >
> > hawkdb=# explain select * from table_name;
> > NOTICE:  QUERY PLAN:
> > Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=57)
> >
> > hawkdb=# explain select count(*) from table_name where job_id = 13;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=537874.18..537874.18 rows=1 width=0)
> >   ->  Seq Scan on table_name  (cost=0.00..488700.65 rows=19669412
> width=0)
> >
> > hawkdb=# explain select * from table_name where job_id = 13;
> > NOTICE:  QUERY PLAN:
> > Seq Scan on http_result  (cost=0.00..488700.65 rows=19669412 width=57)
> >
> > hawkdb=# explain select * from table_name where job_id = 1;
> > NOTICE:  QUERY PLAN:
> > Index Scan using http_result_pk on table_name  (cost=0.00..5.01 rows=1
> > width=57)
> >
> > hawkdb=#explain select * from table_name where time > '2002-10-10';
> > NOTICE:  QUERY PLAN:
> > Seq Scan on table_name  (cost=0.00..488700.65 rows=19649743 width=57)
> >
> > hawkdb=# explain select * from http_result where time < '2002-10-10';
> > NOTICE:  QUERY PLAN:
> > Index Scan using table_name_time on table_name  (cost=0.00..75879.17
> > rows=19669 width=57)
>

--

Charles H. Woloszynski

ClearMetrix, Inc.
115 Research Drive
Bethlehem, PA 18015

tel: 610-419-2210 x400
fax: 240-371-3256
web: www.clearmetrix.com




Re: selects from large tables

From
Nikk Anderson
Date:

Hi Tom,

Yes, we should upgrade to 7.2 soon; it's just that as it is a live system running 24x7 we are careful about upgrading core components so we do not disrupt our data collection agents too much.

Here is some table info; we currently index by time then ID.  Generally, data will be selected by ID, then time range.  Clustering may help with this.

    Attribute    |           Type           | Modifier
-----------------+--------------------------+----------
 job_id          | integer                  | not null
 server_id       | integer                  | not null
 time            | timestamp with time zone | not null
 availability    | boolean                  |
 connection_time | integer                  |
 dns_setup       | integer                  |
 server_response | integer                  |
 frontpage_size  | integer                  |
 frontpage_time  | integer                  |
 transfer_size   | integer                  |
 transfer_time   | integer                  |
 error_id        | integer                  |
 redirect_time   | integer                  |
 polling_id      | integer                  | not null
Indices: http_result_pk,
         http_timejobid
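Given that access pattern (an exact match on the ID, then a time range), one alternative we could test alongside http_timejobid is an index led by the ID - just a sketch, and the index name is made up:

CREATE INDEX http_jobid_time ON http_result (job_id, "time");
-- equality column first, range column second, so a query like
-- WHERE job_id = 335 AND time BETWEEN ... AND ... scans one tight index range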

Thanks

Nikk

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 18 November 2002 16:25
To: Charles H. Woloszynski
Cc: Nikk Anderson; 'Stephan Szabo'; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables

"Charles H. Woloszynski" <chw@clearmetrix.com> writes:
> Are you doing vaccums on these tables?  I was under the understanding
> that the estimated row count should be close to the real row count
> returned, and when it is not (as it looks in your case), the primary
> reason for the disconnect is that the stats for the tables are
> out-of-date. 

The fact that he's using 7.1 doesn't help any; the statistics mechanisms
in 7.1 are pretty weak compared to 7.2.

> Also, do you do any clustering of the data (since the queries are mostly
> time limited)?  I am wondering if the system is doing lots of seeks to
> get the data (implying that the data is all over the disk and not
> clustered).

It would also be interesting to try a two-column index ordered the other
way (timestamp as the major sort key instead of ID).  Can't tell if that
will be a win without more info about the data properties, but it's
worth looking at.

                        regards, tom lane

Re: selects from large tables

From
Robert Treat
Date:
You might want to check out 7.3 while you're at it.  It's currently
planned to be released around Dec 1st, which might fit in nicely with
your upgrade schedule.

Robert Treat

On Mon, 2002-11-18 at 11:36, Nikk Anderson wrote:
> Hi Tom,
>
> Yes, we should upgrade to 7.2 soon, its just that as it is a live system
> running 24x7 we are careful about upgrading core components so we do not
> disrupt our data collection agents too much.
>
> Here is some table info, we currently index by time then ID.  Generally,
> data will be selected by ID, then time range.  Clustering may help on
> this.
>
>
>     Attribute    |           Type           | Modifier
> -----------------+--------------------------+----------
>  job_id          | integer                  | not null
>  server_id       | integer                  | not null
>  time            | timestamp with time zone | not null
>  availability    | boolean                  |
>  connection_time | integer                  |
>  dns_setup       | integer                  |
>  server_response | integer                  |
>  frontpage_size  | integer                  |
>  frontpage_time  | integer                  |
>  transfer_size   | integer                  |
>  transfer_time   | integer                  |
>  error_id        | integer                  |
>  redirect_time   | integer                  |
>  polling_id      | integer                  | not null
> Indices: http_result_pk,
>          http_timejobid
>
> Thanks
>
> Nikk
>
>
> -----Original Message-----
> From: Tom Lane [ mailto:tgl@sss.pgh.pa.us <mailto:tgl@sss.pgh.pa.us> ]
> Sent: 18 November 2002 16:25
> To: Charles H. Woloszynski
> Cc: Nikk Anderson; 'Stephan Szabo'; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] selects from large tables
>
>
> "Charles H. Woloszynski" <chw@clearmetrix.com> writes:
> > Are you doing vaccums on these tables?  I was under the understanding
> > that the estimated row count should be close to the real row count
> > returned, and when it is not (as it looks in your case), the primary
> > reason for the disconnect is that the stats for the tables are
> > out-of-date.
>
> The fact that he's using 7.1 doesn't help any; the statistics mechanisms
>
> in 7.1 are pretty weak compared to 7.2.
>
> > Also, do you do any clustering of the data (since the queries are
> mostly
> > time limited)?  I am wondering if the system is doing lots of seeks to
>
> > get the data (implying that the data is all over the disk and not
> > clustered).
>
> It would also be interesting to try a two-column index ordered the other
>
> way (timestamp as the major sort key instead of ID).  Can't tell if that
>
> will be a win without more info about the data properties, but it's
> worth looking at.
>
>                         regards, tom lane
>




Re: selects from large tables

From
Nikk Anderson
Date:

Hi,

I tried a test cluster on a copy of our real data - all 10 million rows or so.  WOW!   The normal select performance improved drastically. 

Selecting 3 months worth of data was taking 146 seconds to retrieve.  After clustering it took 7.7 seconds!  We are now looking into ways we can automate clustering to keep the table up to date.  The cluster itself took around 2.5 hours.

As our backend systems are writing hundreds of rows of data per minute into the table that needs clustering - will cluster handle locking the tables when dropping the old data and renaming the clustered data?  What happens to the data being added to the table while cluster is running?  Our backend systems may have some problems if the table does not exist when they try to insert, and we don't want to lose any data.

Thanks

Nikk

-----Original Message-----
From: Charles H. Woloszynski [mailto:chw@clearmetrix.com]
Sent: 18 November 2002 15:46
To: Nikk Anderson
Cc: 'Stephan Szabo'; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] selects from large tables

Nikk:

Are you doing vaccums on these tables?  I was under the understanding
that the estimated row count should be close to the real row count
returned, and when it is not (as it looks in your case), the primary
reason for the disconnect is that the stats for the tables are
out-of-date. 

Since it used the indexes, I am not sure if the old stats are causing
any issues, but I suspect they are not helping. 

Also, do you do any clustering of the data (since the queries are mostly
time limited)?  I am wondering if the system is doing lots of seeks to
get the data (implying that the data is all over the disk and not
clustered). 

Charlie

Nikk Anderson wrote:

> Hi,
> Thanks for the reply Stephen, the data is 'somewhat' realistic.....
>
> The data in the table is actually synthetic, but the structure is the
> same as our live system, and the queries are similar to those we
> actually carry out.
>
> As the data was synthetic there was a bit of repetition (19 million
> rows of repetition!! ) of the item used in the where clause, meaning
> that most of the table was returned by the queries - oops!  So, I have
> done is some more realistic queries from our live system, and put the
> time it takes, and the explain results.  Just to note that the
> explain's estimated number of rows is way out - its guesses are way
> too low.
>
> Typically a normal query on our live system returns between 200 and
> 30000 rows depending on the reports a user wants to generate.  In
> prior testing, we noted that using SELECT COUNT( ..   was slower than
> other queries, which is why we though we would test counts first.
>
>
> Here are some more realistic results, which still take a fair whack of
> time........
>
>
> Starting query 0
> Query 0: SELECT * FROM xx WHERE time BETWEEN '2002-11-17 14:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 697 ms
> Index Scan using http_timejobid on xx  (cost=0.00..17.01 rows=4 width=57)
> This query returns 500 rows of data
>
>
> Starting query 1
> Query 1: SELECT * FROM xx WHERE time BETWEEN '2002-11-11 14:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335
> Time taken = 15 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..705.57 rows=175
> width=57)
> This query return 3582 rows
>
> Starting query 2
> Query 2: SELECT * FROM xx WHERE time BETWEEN '2002-10-19 15:08:58.021'
> AND '2002-11-18 14:08:58.021' AND job_id = 335;
> Time taken = 65 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..3327.55 rows=832
> width=57)
> This query returns 15692 rows
>
> Starting query 3
> Query 3: SELECT * FROM xx_result WHERE time BETWEEN '2002-08-20
> 15:08:58.021' AND '2002-11-18 14:08:58.021' AND job_id = 335;
>
> Time taken = 241 seconds
> Index Scan using http_timejobid on xx  (cost=0.00..10111.36 rows=2547
> width=57)
> This query returns 48768 rows
>
>
> Cheers
>
> Nikk
>
>
>
>
> -----Original Message-----
> From: Stephan Szabo [mailto:sszabo@megazone23.bigpanda.com]
> Sent: 18 November 2002 13:02
> To: Nikk Anderson
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] selects from large tables
>
>
>
> On Mon, 18 Nov 2002, Nikk Anderson wrote:
>
> > Any ideas on how we can select data more quickly from large tables?
>
> Are these row estimates realistic? It's estimating nearly 20 million rows
> to be returned by some of the queries (unless I'm misreading the
> number - possible since it's 5am here).  At that point you almost
> certainly want to be using a cursor rather than plain queries since even a
> small width result (say 50 bytes) gives a very large (1 gig) result set.
>
> > - Queries and explain plans
> >
> > select count(*) from table_name;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=488700.65..488700.65 rows=1 width=0)
> >   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412
> width=0)
> >
> > hawkdb=# explain select count(job_id) from table_name;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=488700.65..488700.65 rows=1 width=4)
> >   ->  Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412
> width=4)
> >
> > hawkdb=# explain select * from table_name;
> > NOTICE:  QUERY PLAN:
> > Seq Scan on table_name  (cost=0.00..439527.12 rows=19669412 width=57)
> >
> > hawkdb=# explain select count(*) from table_name where job_id = 13;
> > NOTICE:  QUERY PLAN:
> > Aggregate  (cost=537874.18..537874.18 rows=1 width=0)
> >   ->  Seq Scan on table_name  (cost=0.00..488700.65 rows=19669412
> width=0)
> >
> > hawkdb=# explain select * from table_name where job_id = 13;
> > NOTICE:  QUERY PLAN:
> > Seq Scan on http_result  (cost=0.00..488700.65 rows=19669412 width=57)
> >
> > hawkdb=# explain select * from table_name where job_id = 1;
> > NOTICE:  QUERY PLAN:
> > Index Scan using http_result_pk on table_name  (cost=0.00..5.01 rows=1
> > width=57)
> >
> > hawkdb=#explain select * from table_name where time > '2002-10-10';
> > NOTICE:  QUERY PLAN:
> > Seq Scan on table_name  (cost=0.00..488700.65 rows=19649743 width=57)
> >
> > hawkdb=# explain select * from http_result where time < '2002-10-10';
> > NOTICE:  QUERY PLAN:
> > Index Scan using table_name_time on table_name  (cost=0.00..75879.17
> > rows=19669 width=57)
>

--

Charles H. Woloszynski

ClearMetrix, Inc.
115 Research Drive
Bethlehem, PA 18015

tel: 610-419-2210 x400
fax: 240-371-3256
web: www.clearmetrix.com




Re: selects from large tables

From
Bruce Momjian
Date:
Nikk Anderson wrote:
> Hi,
>
> I tried a test cluster on a copy of our real data - all 10 million rows or
> so.  WOW!   The normal select performance improved drastically.
> Selecting 3 months worth of data was taking 146 seconds to retrieve.  After
> clustering it took 7.7 seconds!  We are now looking into ways we can
> automate clustering to keep the table up to date.  The cluster itself took
> around 2.5 hours.
>
> As our backend systems are writing hundreds of rows of data in per minute
> into the table that needs clustering - will cluster handle locking the
> tables when dropping the old, and renaming the clustered data?  What happens
> to the data being added to the table while cluster is running? Our backend
> systems may have some problems if the table does not exist when it tries to
> insert, and we don't want to lose any data.

CLUSTER will exclusively lock the table from read/write during the
CLUSTER.  Sorry.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: selects from large tables

From
Tom Lane
Date:
Nikk Anderson <Nikk.Anderson@parallel.ltd.uk> writes:
> As our backend systems are writing hundreds of rows of data in per minute
> into the table that needs clustering - will cluster handle locking the
> tables when dropping the old, and renaming the clustered data?  What happens
> to the data being added to the table while cluster is running?

Nothing, because there won't be any: cluster acquires exclusive lock on
the table while it runs.  Any would-be inserter will block till it's done.

If you are clustering by timestamp of insertion, and you never update or
delete rows, then I think it's a one-time-and-you're-done kind of task
anyway --- newly inserted rows will always get added at the end, and so
will be in timestamp order anyway.  But if you need to update the table
then things aren't so nice :-(

            regards, tom lane

PS: it's not really necessary to quote the entire thread in every
message, and it's definitely not nice to do so twice in both plain
text and HTML :-(.  Please have some consideration for the size of
your emails that Marc is archiving for posterity ...

Re: selects from large tables

From
Rod Taylor
Date:
On Wed, 2002-11-20 at 10:08, Nikk Anderson wrote:
> Hi,
>
> I tried a test cluster on a copy of our real data - all 10 million
> rows or so.  WOW!   The normal select performance improved
> drastically.
>
> Selecting 3 months worth of data was taking 146 seconds to retrieve.
> After clustering it took 7.7 seconds!  We are now looking into ways we
> can automate clustering to keep the table up to date.  The cluster
> itself took around 2.5 hours.
>
> As our backend systems are writing hundreds of rows of data in per
> minute into the table that needs clustering - will cluster handle
> locking the tables when dropping the old, and renaming the clustered
> data?  What happens to the data being added to the table while cluster
> is running? Our backend systems may have some problems if the table
> does not exist when it tries to insert, and we don't want to lose any
> data.

The table will be locked while cluster is running.  Meaning, any new
data will have to sit and wait.

Cluster won't buy much on a mostly clustered table.  But it's probably
worth it for you to do it when around 20% of the tuples have turned over
(deleted, updated, inserted, etc.).


I'm a little curious to know when the last time you had run a VACUUM
FULL on that table was.

--
Rod Taylor <rbt@rbt.ca>


Re: selects from large tables

From
Nikk Anderson
Date:

Hi,
> I'm a little curious to know when the last time you had run a VACUUM
> FULL on that table was.
>
>--
>Rod Taylor <rbt@rbt.ca>

We do a VACUUM ANALYZE every night, there is no option for FULL on our version (7.1)

Nikk

Re: selects from large tables

From
Rod Taylor
Date:
> We do a VACUUM ANALYZE every night, there is no option for FULL on our
> version (7.1)

Oh, I see.  Old stuff :)

--
Rod Taylor <rbt@rbt.ca>


Re: selects from large tables

From
Hannu Krosing
Date:
On Wed, 20.11.2002 at 20:08, Nikk Anderson wrote:
> Hi,
>
> I tried a test cluster on a copy of our real data - all 10 million
> rows or so.  WOW!   The normal select performance improved
> drastically.
>
> Selecting 3 months worth of data was taking 146 seconds to retrieve.
> After clustering it took 7.7 seconds!  We are now looking into ways we
> can automate clustering to keep the table up to date.  The cluster
> itself took around 2.5 hours.
>
> As our backend systems are writing hundreds of rows of data in per
> minute into the table that needs clustering - will cluster handle
> locking the tables when dropping the old, and renaming the clustered
> data?  What happens to the data being added to the table while cluster
> is running? Our backend systems may have some problems if the table
> does not exist when it tries to insert, and we don't want to lose any
> data.

You could use a staging table that takes all the inserts, the contents
of which are moved to the main table (begin; insert into big select *
from small; delete from small; commit; vacuum full small;) once a day
(or week or month), just before clustering the big one.

Then do all your selects from a UNION view over both - thus you have a
big, fast, clustered table and a non-clustered "live" table which stays
small.  That should make your selects fast(er).
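A rough sketch of that layout, using the same "big" and "small" names as
above (VACUUM FULL assumes a 7.2-or-later server):

-- staging table with the same columns as the big one (constraints and
-- defaults would still need to be added by hand)
CREATE TABLE small AS SELECT * FROM big WHERE false;

-- selects go through a view over both tables; UNION ALL avoids a
-- needless duplicate-elimination step
CREATE VIEW all_results AS
    SELECT * FROM big
    UNION ALL
    SELECT * FROM small;

-- periodic move, just before re-clustering "big"
BEGIN;
INSERT INTO big SELECT * FROM small;
DELETE FROM small;
COMMIT;
VACUUM FULL small;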

-----------------
Hannu