Thread: Defining performance.

Defining performance.

From:
Paul Lathrop
Date:
Hello all,

I've been struggling with some performance questions regarding our
Postgres databases. Here's the background:

We run 4 ~25-30Gb databases which cache information from eBay. These
databases have had performance issues since before I joined the company.
The databases have gone through a number of iterations. Initially, they
were deployed as one huge database - performance was apparently
unacceptable. They were split, and tried on a variety of hardware
platforms. When I joined the company last year, the databases were
deployed on 12-disk RAID5 arrays on dual-proc AMD machines with 4Gb of
RAM, running Debian Woody and Postgres 7.2. These systems seemed to
suffer a gradually decreasing performance accompanied by a gradually
growing disk space usage. The DBA had come to the conclusion that the
VACUUM command did/does not work on these systems, because even after a
VACUUM FULL, the size of the database was continually increasing. So, as
things stand with the PG7.2 machines, vacuuming is run nightly, and
whenever the database size reaches 40Gb on disk (the point at which
performance has degraded below tolerance), the DBA exports the data,
deletes the database, and then imports the data, shrinking it to the
actual size of the dataset.

This process is time-consuming, costly, and the servers that we are
deployed on do not meet our stability requirements. So, after much
pushing, I was able to deploy a 12-disk RAID5 dual-proc AMD64 machine
with 16Gb of RAM running FreeBSD and Postgres 8.1.

The performance increase was immediate, obvious, and dramatic, as you
might expect from such a large boost in the underlying hardware.

This upgrade appears to have solved the VACUUM issue - regular VACUUM
commands now seem able to maintain the database size at a steady-state
(taking into account fluctuations associated with actual changes in the
dataset size!). We are now planning on moving the other three databases
to the new platform and hardware.

However, we still are suffering a gradual decrease in performance over
time - or so the application engineers claim. The DBA and I have been
banging our heads against this for a month.

Which brings me to the questions:

1) How does one define 'performance' anyway? Is it average time to
complete a query? If so, what kind of query? Is it some other metric?

2) I've combed the archives and seen evidence that people out there are
running much much larger databases on comparable hardware with decent
performance. Is this true, or is my dataset at about the limit of my
hardware?

3) Though this may seem irrelevant, since we are moving away from the
platform, it would still be good to know - was VACUUM actually
completely useless on PG7.2 or is there some other culprit on these
legacy machines?

4) Much of my reading of the PG docs and list archives seems to suggest
that much of performance tuning is done at the query level - you have to
know how to ask for information in an efficient way. To that end, I took
a look at some of the queries we get on a typical day. On average, 24
times per minute, our application causes a unique key violation. This
struck me as strange, but the VP of Engineering says this is a
performance ENHANCEMENT - the code doesn't bother checking for the
unique key because it depends on the database to enforce that. My
interpretation of reading the archives & docs seems to indicate that
this could be causing the constantly increasing database size... so now
that I've rambled about it, does an INSERT transaction that is rolled
back due to a unique key violation leave dead rows in the table? If so, why?

I really appreciate any information you guys can give me. I'm convinced
that PG is the best database for our needs, but I need to be able to get
this database performing well enough to convince the bigwigs.

Regards,
Paul Lathrop
Systems Administrator



Re: Defining performance.

From:
Tobias Brox
Date:
[Paul Lathrop - Thu at 02:59:27PM -0800]
> growing disk space usage. The DBA had come to the conclusion that the
> VACUUM command did/does not work on these systems, because even after a
> VACUUM FULL, the size of the database was continually increasing. So, as
> things stand with the PG7.2 machines, vacuuming is run nightly, and
> whenever the database size reaches 40Gb on disk (the point at which
> performance has degraded below tolerance), the DBA exports the data,
> deletes the database, and then imports the data, shrinking it to the
> actual size of the dataset.

We found one reason why vacuuming didn't always work for us - long-running
transactions. Besides keeping vacuum from doing its job, they did really
nasty things to performance in general.

To check for those transactions, I believe you need to turn on
stats_command_string in the config.

I use this query to check:

select * from pg_stat_activity where current_query<>'<IDLE>' order by
query_start ;

If you spot any "<IDLE> in transaction" with an old query_start
timestamp, then that's most probably the reason.

Long-running transactions don't have to be idle ... check the pg_locks
view for the lowest transaction id and compare (through the pid) with the
pg_stat_activity view to find the actual backend.
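
A rough sketch of such a check (written against the 8.1 catalogs - procpid
and the 'transactionid' lock type may be named differently in other
versions, so verify before relying on it):

  -- backends that currently hold a transaction id, oldest transaction first
  SELECT a.procpid, a.usename, a.query_start, a.current_query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.procpid = l.pid
   WHERE l.locktype = 'transactionid'
     AND l.granted
   ORDER BY age(l.transactionid) DESC;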

> However, we still are suffering a gradual decrease in performance over
> time - or so the application engineers claim. The DBA and I have been
> banging our heads against this for a month.

We're having the same issues, so we do the dumping and restoring every
now and then to be sure everything is properly cleaned up.  With 8.1.

> 1) How does one define 'performance' anyway? Is it average time to
> complete a query? If so, what kind of query? Is it some other metric?

We have the same kind of problem, and the project leader (I sometimes
refer to him as the "bottleneck" ;-) is most concerned about iowait on our
CPU graphs.  Anyway, we do have other measures:

 - our application logs the duration of each request to the application
   as well as each query to the database.  If a request (these are web
   servers) takes "too long", it's logged as an error instead of debug.
   If a significant number of such errors is due to database calls taking
   too much time, then performance is bad.  Unfortunately, we have no way
   to automate such checking.

 - I've set up two scripts that ping the pg_stat_activity view every now
   and then, logging how much "gruff" they find there.  The two scripts
   will eventually be merged.  One simply logs what it finds, the other
   is a plugin for the Munin graphing package.

I've thrown the scripts we use out here:

http://oppetid.no/~tobixen/pg_activity_log.txt
http://oppetid.no/~tobixen/pg_activity.munin.txt

(I had to rename them to .txt to get the web server to play along).

Those are very much as-is and would certainly need to be adjusted a bit to
fit any other production environment. :-)

The pg_activity_log script dumps a single number indicating the "stress
level" of the database to a file.  I think this stress number - when
picking out, say, the worst 20% of the numbers from the file for each
day - can indicate something about the performance of the database server.
However, I haven't had the chance to discuss it with the bottleneck yet.


Re: Defining performance.

From:
Tom Lane
Date:
Paul Lathrop <plathrop@squaretrade.com> writes:
> ... When I joined the company last year, the databases were
> deployed on 12-disk RAID5 arrays on dual-proc AMD machines with 4Gb of
> RAM, running Debian Woody and Postgres 7.2. These systems seemed to
> suffer a gradually decreasing performance accompanied by a gradually
> growing disk space usage. The DBA had come to the conclusion that the
> VACUUM command did/does not work on these systems, because even after a
> VACUUM FULL, the size of the database was continually increasing.

The very first thing you need to do is get off 7.2.

After that, I'd recommend looking at *not* using VACUUM FULL.  FULL is
actually counterproductive in a lot of scenarios, because it shrinks the
tables at the price of bloating the indexes.  And 7.2's poor ability to
reuse index space turns that into a double whammy.  Have you checked
into the relative sizes of tables and indexes and tracked the trend over
time?
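
A quick sketch of how to snapshot that (relpages is only as fresh as the
last VACUUM/ANALYZE, and on 8.1 you could use pg_relation_size() instead;
this is just an illustration):

  -- biggest tables and indexes by on-disk pages (one page = 8 kB)
  SELECT relname, relkind, relpages, reltuples
    FROM pg_class
   WHERE relkind IN ('r', 'i')
   ORDER BY relpages DESC
   LIMIT 20;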

            regards, tom lane

Re: Defining performance.

From:
Scott Marlowe
Date:
On Thu, 2006-11-30 at 18:26, Tom Lane wrote:
> Paul Lathrop <plathrop@squaretrade.com> writes:
> > ... When I joined the company last year, the databases were
> > deployed on 12-disk RAID5 arrays on dual-proc AMD machines with 4Gb of
> > RAM, running Debian Woody and Postgres 7.2. These systems seemed to
> > suffer a gradually decreasing performance accompanied by a gradually
> > growing disk space usage. The DBA had come to the conclusion that the
> > VACUUM command did/does not work on these systems, because even after a
> > VACUUM FULL, the size of the database was continually increasing.
>
> The very first thing you need to do is get off 7.2.
>
> After that, I'd recommend looking at *not* using VACUUM FULL.  FULL is
> actually counterproductive in a lot of scenarios, because it shrinks the
> tables at the price of bloating the indexes.  And 7.2's poor ability to
> reuse index space turns that into a double whammy.  Have you checked
> into the relative sizes of tables and indexes and tracked the trend over
> time?

And if you can't get off 7.2, look into scheduling some downtime to run
REINDEX on the bloated indexes.
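
Something along these lines (table and index names are made up; REINDEX
takes an exclusive lock, hence the downtime):

  REINDEX TABLE listings;               -- rebuild all indexes on one table
  REINDEX INDEX listings_item_id_key;   -- or just the one bloated index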

In all honesty, a simple single-processor workstation with a gig of RAM, a
couple of good-sized SATA drives, and a modern Linux distro can probably
outrun your 7.2 server if it's running 8.1 / 8.2.

It's that much faster now.

For the love of all that's holy, as well as your data, start planning
your migration now, and if you can, have it done by the end of next week
or so.

And backup every night religiously.

Re: Defining performance.

From:
Jeff Davis
Date:
On Fri, 2006-12-01 at 01:05 +0100, Tobias Brox wrote:
> > However, we still are suffering a gradual decrease in performance over
> > time - or so the application engineers claim. The DBA and I have been
> > banging our heads against this for a month.
>
> We're having the same issues, so we do the dumping and restoring every
> now and then to be sure everything is properly cleaned up.  With 8.1.
>

What's causing that? Is it index bloat?

I would think a REINDEX would avoid having to dump/restore, right? A
CLUSTER might also be necessary, depending on what kind of performance
degradation you're experiencing.

Am I missing something?

Regards,
    Jeff Davis


Re: Defining performance.

From:
Tobias Brox
Date:
[Jeff Davis - Thu at 04:57:54PM -0800]
> > We're having the same issues, so we do the dumping and restoring every
> > now and then to be sure everything is properly cleaned up.  With 8.1.
> >
>
> What's causing that? Is it index bloat?
>
> I would think a REINDEX would avoid having to dump/restore, right? A
> CLUSTER might also be necessary, depending on what kind of performance
> degradation you're experiencing.
>
> Am I missing something?

Just as in Paul Lathrop's case, the performance degradation is something
perceived by the application developers.  We haven't had time to reliably
verify that the performance is actually being degraded, nor that the
reload being done helps (after we resolved the pending transaction issue,
anyway), nor to look into what the possible reasons for this perceived
degradation could be.


Re: Defining performance.

From:
nospam@hardgeus.com
Date:
> 4) Much of my reading of the PG docs and list archives seems to suggest
> that much of performance tuning is done at the query level - you have to
> know how to ask for information in an efficient way.

Performance does not exist in a vacuum (ho ho, PostgreSQL joke).  The
person writing the queries has to understand what is actually happening
inside the database.  When I ported my app from MS SQL over to PostgreSQL
several years ago, many queries that had been zippy ran like turds on
PostgreSQL, and vice versa!

As my dataset has gotten larger I have had to throw more metal at the
problem, but I have also had to rethink my table and query design.  Just
because your data set grows linearly does NOT mean that the performance of
your query is guaranteed to grow linearly!  A sloppy query that runs OK
with 3000 rows in your table may choke horribly when you hit 50000.

John




Re: Defining performance.

From:
Tobias Brox
Date:
[nospam@hardgeus.com - Thu at 06:37:12PM -0600]
> As my dataset has gotten larger I have had to throw more metal at the
> problem, but I have also had to rethink my table and query design.  Just
> because your data set grows linearly does NOT mean that the performance of
> your query is guaranteed to grow linearly!  A sloppy query that runs OK
> with 3000 rows in your table may choke horribly when you hit 50000.

Then some limit is hit ... either the memory cache is exhausted, or the
planner makes an unlucky change of strategy when hitting 50000.

Anyway, when testing queries it's very important that they are actually
tested on a (copy of the) production database, and not on an empty
database or a database containing some random test data.  If testing
queries on a copy rather than the production database itself, it's
important to have equal hardware and configuration on both servers.
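
For instance, run the candidate query through EXPLAIN ANALYZE on the
production-sized copy (table and columns here are made up; the point is
the realistic row counts):

  EXPLAIN ANALYZE
  SELECT *
    FROM listings
   WHERE seller_id = 12345
   ORDER BY end_time DESC
   LIMIT 50;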


Re: Defining performance.

From:
Chris
Date:
Tobias Brox wrote:
> [nospam@hardgeus.com - Thu at 06:37:12PM -0600]
>> As my dataset has gotten larger I have had to throw more metal at the
>> problem, but I have also had to rethink my table and query design.  Just
>> because your data set grows linearly does NOT mean that the performance of
>> your query is guaranteed to grow linearly!  A sloppy query that runs OK
>> with 3000 rows in your table may choke horribly when you hit 50000.
>
> Then some limit is hit ... either the memory cache, or that the planner
> is doing an unlucky change of strategy when hitting 50000.

Not really. A bad query is a bad query (e.g. missing a join condition). It
won't show up for 3000 rows, but will very quickly if you increase that
by a reasonable amount. Even something as simple as a missing index on a
join column won't show up for a small dataset but will for a larger one.

It's a pretty common mistake to assume that a small dataset will behave
exactly the same as a larger one - not always the case.
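
For instance (hypothetical tables), a forgotten join condition silently
turns into a cross join - harmless-looking with 3000 rows per table,
crippling with 50000:

  SELECT o.order_id, c.name
    FROM orders o, customers c
   WHERE o.created_at > now() - interval '1 day';
   -- missing: AND o.customer_id = c.id, so every order pairs with every customer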

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Defining performance.

From:
Tobias Brox
Date:
[Chris - Fri at 02:32:05PM +1100]
> Not really. A bad query is a bad query (e.g. missing a join condition). It
> won't show up for 3000 rows, but will very quickly if you increase that
> by a reasonable amount. Even something as simple as a missing index on a
> join column won't show up for a small dataset but will for a larger one.

Ok, you're talking about O(n^2) and such stuff :-)

> It's a pretty common mistake to assume that a small dataset will behave
> exactly the same as a larger one - not always the case.

No. :-)  We had the worst experience when launching our product - it had
been stress tested, but only by increasing the number of customers, not
by increasing the overall size of the data set available for browsing.
When we opened the web pages to the public, this data set was already
some ten to a hundred times bigger than in the version that endured the
stress tests - and the servers had no chance of processing all the traffic.

The worst bottleneck was not the database this time, but some horror
algorithm one of the programmers had sneaked in ... poorly documented,
cryptically programmed, slow ... and since I never understood that
algorithm, I can only guess it must have been O(n^2).


Re: Defining performance.

From:
"Heikki Linnakangas"
Date:
Paul Lathrop wrote:
> We run 4 ~25-30Gb databases which cache information from eBay. These
> databases have had performance issues since before I joined the company.
> The databases have gone through a number of iterations. Initially, they
> were deployed as one huge database - performance was apparently
> unacceptable. They were split, and tried on a variety of hardware
> platforms. When I joined the company last year, the databases were
> deployed on 12-disk RAID5 arrays on dual-proc AMD machines with 4Gb of
> RAM, running Debian Woody and Postgres 7.2.

Well, first of all you need to upgrade. 7.2 is old and not supported
anymore.

> These systems seemed to
> suffer a gradually decreasing performance accompanied by a gradually
> growing disk space usage. The DBA had come to the conclusion that the
> VACUUM command did/does not work on these systems, because even after a
> VACUUM FULL, the size of the database was continually increasing. So, as
> things stand with the PG7.2 machines, vacuuming is run nightly, and
> whenever the database size reaches 40Gb on disk (the point at which
> performance has degraded below tolerance), the DBA exports the data,
> deletes the database, and then imports the data, shrinking it to the
> actual size of the dataset.

Vacuum didn't reclaim empty index pages until 7.4, so you might be
suffering from index bloat. A nightly reindex would help with that.

> This process is time-consuming, costly, and the servers that we are
> deployed on do not meet our stability requirements. So, after much
> pushing, I was able to deploy a 12-disk RAID5 dual-proc AMD64 machine
> with 16Gb of RAM running FreeBSD and Postgres 8.1.

You should give 8.2 (now in beta stage) a try as well. There are some
significant performance enhancements; for example, vacuums should run faster.

> 1) How does one define 'performance' anyway? Is it average time to
> complete a query? If so, what kind of query? Is it some other metric?

Performance is really an umbrella term that can mean a lot of things.
You'll have to come up with a metric that's most meaningful to you and
that you can easily measure. Some typical metrics are:

* Average response time to a query/queries
* Max or 90th percentile response time to a query
* Throughput, transactions per second

You'll have to start measuring performance somehow. You might find out
that your performance is actually bad only during certain hours of the day,
for example. Take a look at the log_min_duration_statement parameter in
more recent versions for a start.
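
For example, in postgresql.conf (the threshold is only illustrative; pick
one that matches your latency expectations):

  # log every statement that runs longer than 250 ms (-1 disables, 0 logs all)
  log_min_duration_statement = 250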

> 2) I've combed the archives and seen evidence that people out there are
> running much much larger databases on comparable hardware with decent
> performance. Is this true, or is my dataset at about the limit of my
> hardware?

It depends on your load, really. A dataset of ~ 40 GB is certainly not
that big compared to what some people have.

> 3) Though this may seem irrelevant, since we are moving away from the
> platform, it would still be good to know - was VACUUM actually
> completely useless on PG7.2 or is there some other culprit on these
> legacy machines?

It's certainly better nowadays.

> 4) Much of my reading of the PG docs and list archives seems to suggest
> that much of performance tuning is done at the query level - you have to
> know how to ask for information in an efficient way. To that end, I took
> a look at some of the queries we get on a typical day. On average, 24
> times per minute, our application causes a unique key violation. This
> struck me as strange, but the VP of Engineering says this is a
> performance ENHANCEMENT - the code doesn't bother checking for the
> unique key because it depends on the database to enforce that. My
> interpretation of reading the archives & docs seems to indicate that
> this could be causing the constantly increasing database size... so now
> that I've rambled about it, does an INSERT transaction that is rolled
> back due to a unique key violation leave dead rows in the table? If so, why?

The way unique checking works in PostgreSQL is:

1. The row is inserted into the heap.
2. The corresponding index tuple is inserted into the index. While doing
   that, we check that there's no duplicate key there already.

So yes, a unique key violation will leave the dead tuple in the heap,
and it will be removed by vacuum later on.
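
A tiny illustration (made-up table):

  CREATE TABLE items (item_id integer PRIMARY KEY);
  INSERT INTO items VALUES (1);
  INSERT INTO items VALUES (1);  -- fails with a unique violation, but the
                                 -- heap tuple has already been written
  VACUUM VERBOSE items;          -- reports and reclaims the dead row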

I think it's a valid and sane approach to leave the uniqueness check to
the database. Unless a very large proportion of your transactions abort
due to unique key violations, the dead rows won't be a problem. The
space will be reclaimed by vacuum.

In general, it's normal that there's some dead rows in the database. As
long as you vacuum regularly, the database size should eventually reach
a steady-state where it doesn't grow anymore, unless the real live
dataset size increases.
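
On 8.1 the autovacuum daemon can take care of the regular vacuuming; a
sketch of the relevant postgresql.conf settings (values are illustrative,
not tuned for your workload):

  stats_start_collector = on           # both stats settings are required
  stats_row_level = on                 # for autovacuum in 8.1
  autovacuum = on
  autovacuum_naptime = 60              # seconds between checks of a database
  autovacuum_vacuum_scale_factor = 0.2 # vacuum after ~20% of a table changes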

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com