Re: Slow performance when querying millions of rows

From: Greg Smith
Subject: Re: Slow performance when querying millions of rows
Date:
Msg-id: 4E0A58EB.90509@2ndQuadrant.com
In reply to: Slow performance when querying millions of rows  (Craig McIlwee <craig.mcilwee@openroadsconsulting.com>)
Responses: Re: Slow performance when querying millions of rows
List: pgsql-performance
On 06/28/2011 05:28 PM, Craig McIlwee wrote:
> Autovacuum is disabled for these tables since the data is never
> updated.  The tables that we are testing with at the moment will not
> grow any larger and have been both clustered and analyzed.

Note that any such prep to keep from ever needing to maintain these
tables in the future should include the FREEZE option, possibly with
some parameters tweaked first to make it more aggressive.  Even with
autovacuum disabled, it will eventually revisit these tables in order
to prevent transaction ID wrap-around.  If you're going to the trouble
of prepping them so they are never touched again, do the freeze with
the right parameters so that pass never becomes necessary.
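
A minimal sketch of what I mean, using a made-up daily table name, run
once after the final load and cluster of each table:

    -- hypothetical table name; FREEZE behaves as though
    -- vacuum_freeze_min_age were 0, so every row gets a frozen xid and
    -- autovacuum has no anti-wraparound work left to do on this table
    VACUUM FREEZE ANALYZE daily_counts_20110628;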

> work_mem: 512MB
> shared_buffers: 64MB, 512MB, and 1024MB, each yielded the same query
> plan and took the same amount of time to execute give or take a few
> seconds

shared_buffers doesn't normally impact the query plan; it impacts how
much churn there is between the database and the operating system
cache, which mainly matters for making write-heavy work efficient.  On
Windows, you'll probably be safe to set this to 512MB and forget about
it; Windows doesn't benefit from large values there anyway.

This is a very large work_mem setting however, so be careful that you
won't have many users connecting at once if you're going to use it.
Each connection can use a multiple of work_mem, making it quite possible
you could run out of memory with this configuration.  If that low user
count is true, you may want to make sure you're enforcing it by lowering
max_connections, as a safety measure to prevent problems.
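
As a rough postgresql.conf sketch of that combination (the numbers are
assumptions to illustrate the trade-off, not a recommendation tuned to
your hardware):

    shared_buffers = 512MB   # larger values don't help much on Windows
    work_mem = 512MB         # each sort/hash step in each session can use this much
    max_connections = 10     # caps the worst case near connections x work_mem,
                             # possibly more than once per query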

> Since the daily tables are only 360MB, I would hope that the entire
> table could be pulled into memory with one large sequential read.  Of
> course this assumes that the file pieces are stored contiguously, but
> auto defrag is enabled and shows low fragmentation so I'm trusting (as
> much as one can) Windows to do the right thing here.  My drives have a
> 150MB/s sustained max throughput, and considering that data is spread
> across 5 drives I would hope to at least be able to reach the single
> disk theoretical limit and read an entire table plus the index into
> memory in about 4 to 5 seconds.  Based on the analyze output, each
> daily table averages 6 to 7 seconds, so I'm pretty close there and
> maybe just limited by disk speed?

One thing to note is that your drive speed varies based on what part of
the disk things are located at; the slower parts of the drive will be
much less than 150MB/s.

On Linux servers it's impossible to get anywhere close to the disk's
raw speed without making the operating system read-ahead feature much
more aggressive than it is by default.  Because PostgreSQL fetches a
single block at a time, to keep the drive completely busy something has
to notice the pattern of access and read data ahead of when the
database even asks for it.  You may find a parameter you can tune in
the properties for the drives somewhere in the Windows Control Panel.
And there's a read-ahead setting on your PERC card, which you may not
have turned on, that's better than nothing (not as good as the Linux
one, but it's useful).  There are two settings there ("on" and
"adaptive" if I recall correctly) that you can try, to see which works
better.
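
For comparison, the Linux-side tuning I'm referring to is a one-liner
along these lines (the device name and value are placeholders for
whatever actually holds the database):

    # Linux only, shown for comparison; /dev/sda is a placeholder device
    blockdev --setra 4096 /dev/sda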

> I've read documentation that says I should be able to set statistics
> values for an entire table as opposed to per column, but haven't found
> how to do that.  I guess I was either too lazy to update statistics on
> each column or just didn't think it would help much.

You can adjust the statistics target across the entire database using
the default_statistics_target setting, or you can tweak them per column
using ALTER TABLE; there is no table-level control.  I find it
difficult to answer questions about whether the statistics are detailed
enough without actually looking at pg_stats to see how the database is
interpreting the data, and comparing that against the real distribution.
This is an area where flailing about trying things doesn't work very
well; you need to be very systematic about the analysis and testing
strategy if you're going to get anywhere useful.  It's not easy to do.
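
To show the mechanics I mean, here's a sketch with made-up table and
column names; the important part is comparing what pg_stats reports
against what you know about the real data:

    -- hypothetical names: bump the per-column sample, re-analyze, inspect
    ALTER TABLE daily_counts ALTER COLUMN count_date SET STATISTICS 1000;
    ANALYZE daily_counts;

    SELECT attname, n_distinct, correlation
      FROM pg_stats
     WHERE tablename = 'daily_counts' AND attname = 'count_date';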

As a larger commentary on what you're trying to do, applications like
this often find themselves at a point one day where you just can't
allow arbitrary user queries to run against them anymore.  What
normally happens then is that the most common things people really need
end up being run once and stored in some summary form, using techniques
such as materialized views:  http://wiki.postgresql.org/wiki/Materialized_Views
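
Since there's no built-in materialized view support in the server yet,
that usually means a summary table you rebuild on a schedule.  A rough
sketch, with made-up table and column names:

    -- hypothetical rollup, rebuilt by a scheduled job after each day's load
    CREATE TABLE daily_count_summary AS
        SELECT count_date, station_id, sum(volume) AS total_volume
          FROM daily_counts
         GROUP BY count_date, station_id;

    CREATE INDEX daily_count_summary_date_idx
        ON daily_count_summary (count_date);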

In your case, I would start now on trying to find the common patterns
in the long running reports that people generate, and see if it's
possible to pre-compute some portion of them and save that summary.
And you may find yourself in a continuous battle with business requests
regardless.  It's often key decision makers who feel they should be
able to query any way they want, regardless of the impact on the
database.  Finding a middle ground there is usually challenging.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
Comprehensive and Customized PostgreSQL Training Classes:
http://www.2ndquadrant.us/postgresql-training/

