Discussion: Advice for optimizing queries using Large Tables


Advice for optimizing queries using Large Tables

From: "Shaun Grannis"
Date:
Hi,



I hope someone out there has experience with large tables in Postgres.



I'm working with a table containing over 65 million records in Postgres v 7.1.3. The machine is a dual-processor Athlon MP1900 (Tyan Tiger board) with 3 GB of PC2100 DDR RAM and three 80 GB IBM 120GXP hard drives configured in a software RAID 0 array, running under Red Hat Linux 7.2. Queries don't seem to be running as fast as "they should".



One of the columns (of type INTEGER) in this table holds a 4-digit value.
The values are all within the range 1930-2001. I've indexed this column
using the statements:



    CREATE INDEX value_idx ON table (value);

    VACUUM ANALYZE table;



This query:



    SELECT count(*) FROM table WHERE value=1999;



takes approximately 45 minutes to execute, and returns a count of approx 2.2
million records. My "instinct" is that this is much too slow for a query of
an indexed column running on this hardware. Here is the EXPLAIN output:



    EXPLAIN SELECT count(*) FROM table WHERE value=1999;

    NOTICE: QUERY PLAN:



    Aggregate (cost=477861.60..477861.60 rows=1 width=0)

        -> Index Scan using value_idx on table (cost=0.00..477553.70 rows=123157 width=0)



    EXPLAIN



Here's the table schema:



Table "table"

Attribute  | Type          | Modifier
-----------+---------------+----------
col01      | character(9)  |
col02      | character(15) |
col03      | character(15) |
col04      | character(1)  |
col05      | character(15) |
col06      | character(15) |
col07      | character(15) |
col08      | character(15) |
col09      | character(1)  |
col10      | integer       |
col11      | integer       |
col12      | integer       |
col13      | integer       |
col14      | integer       |
value      | integer       |


Indices: value_idx



I've set SHMMAX and SHMALL to 2075721728 in /proc/sys/kernel/, and in
postgresql.conf I've set the parameters to:



sort_mem = 262144 # sort memory is specified in KB; 262144 KB = 256 MB

shared_buffers = 249036 # 8 KB buffers each; roughly 1.9 GB in total

fsync = true
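
A quick sanity check (a sketch; SHOW is a standard command, not something from the original post) is to confirm that the running server actually picked these values up after a restart:

    -- Run from psql once the postmaster has been restarted
    SHOW shared_buffers;
    SHOW sort_mem;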



Does anyone have any advice for optimizing the SELECT query listed above? Is
this as fast as PostgreSQL will perform? Any good pointers on working with
large tables in Postgres?



I appreciate your advice.



Best regards,



Shaun Grannis



Re: Advice for optimizing queries using Large Tables

From: Tom Lane
Date:
"Shaun Grannis" <shaun_grannis@hotmail.com> writes:
> I'm working with a table containing over 65 million records in Postgres v
> 7.1.3.

> This query:
>     SELECT count(*) FROM table WHERE value=1999;
> takes approximately 45 minutes to execute, and returns a count of approx 2.2
> million records.

A query scanning 1/30th of the table (roughly 2.2 million of 65 million rows)
almost certainly should use a seqscan, not an indexscan, since an indexscan
over that many rows turns into random I/O across most of the heap.  Does it
get faster if you force a seqscan with "set enable_indexscan to off"?
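
A minimal way to run that experiment (a sketch; the table and column names follow the original post):

    -- Disable index scans for this session only, so the planner falls back to a seqscan
    SET enable_indexscan TO off;
    EXPLAIN SELECT count(*) FROM table WHERE value = 1999;
    SELECT count(*) FROM table WHERE value = 1999;
    -- Restore the default afterwards
    SET enable_indexscan TO on;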

>     Aggregate (cost=477861.60..477861.60 rows=1 width=0)
>         -> Index Scan using value_idx on table (cost=0.00..477553.70 rows=123157 width=0)

Hmm.  The reason that the planner is making the wrong plan choice is the
drastic underestimation of the number of matched rows: the plan expects
123157 rows, while the query actually matches about 2.2 million.  With so
few distinct values in the column I'd have expected 7.1 to get a more
accurate estimate, but it's probably not worth worrying about at this
point.  The short answer is to update to 7.2 --- it has much better
statistics-gathering code and should pick the right plan.
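
For what it's worth, a rough sketch of the 7.2-era statistics knobs being referred to (the target value of 100 is purely illustrative, not a recommendation from this thread):

    -- Available in 7.2 and later: raise the per-column statistics target, then re-analyze
    ALTER TABLE table ALTER COLUMN value SET STATISTICS 100;
    ANALYZE table;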

            regards, tom lane

PS: this is a refreshing change from the usual "I want an indexscan,
why aren't I getting one?" type of planner mistake ;-)

Re: Advice for optimizing queries using Large Tables

From: Francisco Reyes
Date:
On Sat, 9 Mar 2002, Shaun Grannis wrote:

> I'm working with a table containing over 65 million records in Postgres v
> 7.1.3. The machine is a dual-processor Athlon MP1900 (Tyan Tiger board) with
> 3GB of PC2100 DDR RAM, and 3-80GB IBM 120GXP hard drives configured in a
> software RAID 0 Array running under RedHat Linux v. 7.2. Queries don't seem
> to be running as fast as "they should".


Have you considered moving to a SCSI setup?

>     SELECT count(*) FROM table WHERE value=1999;
> takes approximately 45 minutes to execute, and returns a count of approx 2.2
> million records. My "instinct" is that this is much too slow for a query of
> an indexed column running on this hardware.

As was already suggested, you may want to consider going to 7.2.

> Here's the table schema:
> Table "table"
>
> Attribute  | Type          | Modifier
> -----------+---------------+----------
> col01      | character(9)  |
> col02      | character(15) |
> col03      | character(15) |
> col04      | character(1)  |
> col05      | character(15) |
> col06      | character(15) |
> col07      | character(15) |
> col08      | character(15) |
> col09      | character(1)  |
> col10      | integer       |
> col11      | integer       |
> col12      | integer       |
> col13      | integer       |
> col14      | integer       |
> value      | integer       |


> Does anyone have any advice for optimizing the SELECT query listed above?

Another "optimization"/trick we do here is to split tables into the most
needed info and the least needed info. If you have a part of the data
which is used often you put it in the "main" table. If you have data which
is big and not used often you put it in a second table. In our case we
have a few tables where the size of the data which is not used often can
be from 2 to 4 times the size of the data used often. This helps a lot
with joins and sequential scans.
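
A rough sketch of that kind of split (all table and column names below are illustrative, not the poster's actual schema):

    -- "Main" table: the small, frequently queried columns
    CREATE TABLE records_main (
        record_id  integer PRIMARY KEY,
        value      integer
    );
    CREATE INDEX records_main_value_idx ON records_main (value);

    -- Companion table: the wide, rarely used columns, joined on the key
    CREATE TABLE records_detail (
        record_id  integer REFERENCES records_main (record_id),
        col02      character(15),
        col03      character(15)
    );

    -- Hot queries only ever touch the narrow table
    SELECT count(*) FROM records_main WHERE value = 1999;

    -- The wide columns are still reachable when needed
    SELECT d.col02
    FROM records_main m
    JOIN records_detail d ON d.record_id = m.record_id
    WHERE m.value = 1999;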


>Is this as fast as Postgresql will perform? Any good pointers on working
>with large tables in Postgres?

I would suspect this is not as fast as PostgreSQL can perform. Although my
data set is not 60+ million records, a somewhat complex aggregate plus join
over 5+ million records takes in the 30-minute range here.


Re: Advice for optimizing queries using Large Tables

From: "Gregory Wood"
Date:
> > I'm working with a table containing over 65 million records in Postgres v
> > 7.1.3. The machine is a dual-processor Athlon MP1900 (Tyan Tiger board) with
> > 3GB of PC2100 DDR RAM, and 3-80GB IBM 120GXP hard drives configured in a
> > software RAID 0 Array running under RedHat Linux v. 7.2. Queries don't seem
> > to be running as fast as "they should".
>
> Have you considered moving to a SCSI setup?

Particularly in light of IBM's recent decision to change the drive specs...
according to IBM you shouldn't have those drives powered on more than 333
hours a month (or roughly 8 hours a day):

"Q: Would you recommend this drive in a server role?
A: No, the drive is intended to be on for no more than about 8 hours a day.
If it were only used during that period and then shut down for the day, then
it would be fine, but it definitely should NOT be used in a 24/7 role for
those customers concerned with reliability."

Quote from: http://www.storagereview.com/

Greg