Обсуждение: explain analyze reports that my queries are fast but they run very slowly

Поиск
Список
Период
Сортировка

explain analyze reports that my queries are fast but they run very slowly

От
Nikolas Everett
Дата:
We just upgraded from 8.3 to 9.1 and we're seeing some performance problems.  When we EXPLAIN ANALYZE our queries the explain result claim that the queries are reasonably fast but the wall clock time is way way longer.  Does anyone know why this might happen?

Like so:
db=>\timing
db=>EXPLAIN ANALYZE SELECT max(id) FROM foo WHERE blah_id = 1209123;

The plan is sensible.  The estimates are sensible.  The actual DB time reads like it is very sensible.  But the wall clock time is like 11 seconds and the \timing report confirms it.

Any ideas?

Thanks!

Nik

Re: explain analyze reports that my queries are fast but they run very slowly

От
François Beausoleil
Дата:
Le 2012-12-27 à 12:10, Nikolas Everett a écrit :

> We just upgraded from 8.3 to 9.1 and we're seeing some performance problems.  When we EXPLAIN ANALYZE our queries the
explainresult claim that the queries are reasonably fast but the wall clock time is way way longer.  Does anyone know
whythis might happen? 
>
> Like so:
> db=>\timing
> db=>EXPLAIN ANALYZE SELECT max(id) FROM foo WHERE blah_id = 1209123;
>
> The plan is sensible.  The estimates are sensible.  The actual DB time reads like it is very sensible.  But the wall
clocktime is like 11 seconds and the \timing report confirms it. 
>
> Any ideas?

Could you post the actual plans? On both versions? That would help a lot.

Also, http://explain.depesz.com/ helps readability.

Bye,
François



Re: explain analyze reports that my queries are fast but they run very slowly

От
Richard Neill
Дата:
On 27/12/12 17:21, François Beausoleil wrote:
>
> Le 2012-12-27 à 12:10, Nikolas Everett a écrit :
>
>> We just upgraded from 8.3 to 9.1 and we're seeing some performance problems.  When we EXPLAIN ANALYZE our queries
theexplain result claim that the queries are reasonably fast but the wall clock time is way way longer.  Does anyone
knowwhy this might happen? 
>>

Is it possible you missed an optimisation setting in the migration
process? I made that mistake, and much later found performance was
somewhat degraded (but surprisingly not as much as I'd expected) by my
having failed to set effective_cache_size.

Also, if you just did a dump/restore, it might help to run Analyse once
(it seems that Analyse is normally run automatically via vacuum, but the
first time you insert the data, it may not happen).

A side-effect of Analyse it that it will pull all the tables into the OS
memory cache (or try to) - which may give significantly faster results
(try running the same query twice in succession: it's often 5x faster
the 2nd time).

HTH,

Richard


Re: explain analyze reports that my queries are fast but they run very slowly

От
Nikolas Everett
Дата:
Thanks!


Looks like we're running a load of about 6.  The machines have two physical cores hyperthreaded to 32 cores.

Interesting - the data is stored on nfs on a netapp.  We don't seem to have a ton of nfs traffic.

Also we've got shared memory set to 48 gigs which is comfortably less than the 146 gigs in the machine.


On Thu, Dec 27, 2012 at 12:21 PM, François Beausoleil <francois@teksol.info> wrote:

Le 2012-12-27 à 12:10, Nikolas Everett a écrit :

> We just upgraded from 8.3 to 9.1 and we're seeing some performance problems.  When we EXPLAIN ANALYZE our queries the explain result claim that the queries are reasonably fast but the wall clock time is way way longer.  Does anyone know why this might happen?
>
> Like so:
> db=>\timing
> db=>EXPLAIN ANALYZE SELECT max(id) FROM foo WHERE blah_id = 1209123;
>
> The plan is sensible.  The estimates are sensible.  The actual DB time reads like it is very sensible.  But the wall clock time is like 11 seconds and the \timing report confirms it.
>
> Any ideas?

Could you post the actual plans? On both versions? That would help a lot.

Also, http://explain.depesz.com/ helps readability.

Bye,
François


Re: explain analyze reports that my queries are fast but they run very slowly

От
Nikolas Everett
Дата:
New news - the hot slave seems to be performing as expected with no long pauses.

It looks like we're using an archive_timeout of 60 seconds and default checkout_timeout and checkpoint_completion_target.  I didn't do any of the research on this.  It seems like we're asking postgres to clear all of the dirty buffers every 60 seconds.  With 48 gigs of shared buffers we could have quite a few buffers to clear.  Is there some place I could check on how all that is going?

On Thu, Dec 27, 2012 at 12:45 PM, Nikolas Everett <nik9000@gmail.com> wrote:
p

Re: explain analyze reports that my queries are fast but they run very slowly

От
Nikolas Everett
Дата:
Actually that last paragraph doesn't make much sense.  Please ignore it.


On Thu, Dec 27, 2012 at 12:58 PM, Nikolas Everett <nik9000@gmail.com> wrote:
New news - the hot slave seems to be performing as expected with no long pauses.

It looks like we're using an archive_timeout of 60 seconds and default checkout_timeout and checkpoint_completion_target.  I didn't do any of the research on this.  It seems like we're asking postgres to clear all of the dirty buffers every 60 seconds.  With 48 gigs of shared buffers we could have quite a few buffers to clear.  Is there some place I could check on how all that is going?

On Thu, Dec 27, 2012 at 12:45 PM, Nikolas Everett <nik9000@gmail.com> wrote:
p


Re: explain analyze reports that my queries are fast but they run very slowly

От
Tom Lane
Дата:
Nikolas Everett <nik9000@gmail.com> writes:
> We just upgraded from 8.3 to 9.1 and we're seeing some performance
> problems.  When we EXPLAIN ANALYZE our queries the explain result claim
> that the queries are reasonably fast but the wall clock time is way way
> longer.  Does anyone know why this might happen?

> Like so:
> db=>\timing
> db=>EXPLAIN ANALYZE SELECT max(id) FROM foo WHERE blah_id = 1209123;

> The plan is sensible.  The estimates are sensible.  The actual DB time
> reads like it is very sensible.  But the wall clock time is like 11 seconds
> and the \timing report confirms it.

Seems like the extra time would have to be in parsing/planning, or in
waiting to acquire AccessShareLock on the table.  It's hard to believe
the former for such a simple query, unless the table has got thousands
of indexes or something silly like that.  Lock waits are surely possible
if there is something else contending for exclusive lock on the table,
but it's hard to see how the wait time would be so consistent.

BTW, the explain.depesz.com link you posted clearly does not correspond
to the above query (it's not doing a MAX), so another possibility is
confusion about what query is really causing trouble.  We've seen people
remove essential details before while trying to anonymize their query.

            regards, tom lane


Re: explain analyze reports that my queries are fast but they run very slowly

От
Nikolas Everett
Дата:
Sorry for the confusion around the queries.  Both queries are causing trouble.  We've noticed that just EXPLAINING the very simple queries takes forever.

After more digging it looks like this table has an inordinate number of indices (10 ish).  There a whole buch of conditional indicies for other columns that we're not checking.  The particular column that is causing us trouble exists in both a regular (con_id) and a composite index (con_id, somthing_else).

We checked on locks and don't see any ungranted locks.  Would waiting on the AccessShareLock not appear in pg_locks?

Thanks!

Nik


On Thu, Dec 27, 2012 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Nikolas Everett <nik9000@gmail.com> writes:
> We just upgraded from 8.3 to 9.1 and we're seeing some performance
> problems.  When we EXPLAIN ANALYZE our queries the explain result claim
> that the queries are reasonably fast but the wall clock time is way way
> longer.  Does anyone know why this might happen?

> Like so:
> db=>\timing
> db=>EXPLAIN ANALYZE SELECT max(id) FROM foo WHERE blah_id = 1209123;

> The plan is sensible.  The estimates are sensible.  The actual DB time
> reads like it is very sensible.  But the wall clock time is like 11 seconds
> and the \timing report confirms it.

Seems like the extra time would have to be in parsing/planning, or in
waiting to acquire AccessShareLock on the table.  It's hard to believe
the former for such a simple query, unless the table has got thousands
of indexes or something silly like that.  Lock waits are surely possible
if there is something else contending for exclusive lock on the table,
but it's hard to see how the wait time would be so consistent.

BTW, the explain.depesz.com link you posted clearly does not correspond
to the above query (it's not doing a MAX), so another possibility is
confusion about what query is really causing trouble.  We've seen people
remove essential details before while trying to anonymize their query.

                        regards, tom lane

Re: explain analyze reports that my queries are fast but they run very slowly

От
Nikolas Everett
Дата:
Another other thing - the query seems to get faster after the first time we plan it.  I'm not sure that this is the case but I think it might be.


On Thu, Dec 27, 2012 at 2:28 PM, Nikolas Everett <nik9000@gmail.com> wrote:
Sorry for the confusion around the queries.  Both queries are causing trouble.  We've noticed that just EXPLAINING the very simple queries takes forever.

After more digging it looks like this table has an inordinate number of indices (10 ish).  There a whole buch of conditional indicies for other columns that we're not checking.  The particular column that is causing us trouble exists in both a regular (con_id) and a composite index (con_id, somthing_else).

We checked on locks and don't see any ungranted locks.  Would waiting on the AccessShareLock not appear in pg_locks?

Thanks!

Nik


On Thu, Dec 27, 2012 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Nikolas Everett <nik9000@gmail.com> writes:
> We just upgraded from 8.3 to 9.1 and we're seeing some performance
> problems.  When we EXPLAIN ANALYZE our queries the explain result claim
> that the queries are reasonably fast but the wall clock time is way way
> longer.  Does anyone know why this might happen?

> Like so:
> db=>\timing
> db=>EXPLAIN ANALYZE SELECT max(id) FROM foo WHERE blah_id = 1209123;

> The plan is sensible.  The estimates are sensible.  The actual DB time
> reads like it is very sensible.  But the wall clock time is like 11 seconds
> and the \timing report confirms it.

Seems like the extra time would have to be in parsing/planning, or in
waiting to acquire AccessShareLock on the table.  It's hard to believe
the former for such a simple query, unless the table has got thousands
of indexes or something silly like that.  Lock waits are surely possible
if there is something else contending for exclusive lock on the table,
but it's hard to see how the wait time would be so consistent.

BTW, the explain.depesz.com link you posted clearly does not correspond
to the above query (it's not doing a MAX), so another possibility is
confusion about what query is really causing trouble.  We've seen people
remove essential details before while trying to anonymize their query.

                        regards, tom lane


Re: explain analyze reports that my queries are fast but they run very slowly

От
Tom Lane
Дата:
Nikolas Everett <nik9000@gmail.com> writes:
> After more digging it looks like this table has an inordinate number
> of indices (10 ish).

10 doesn't sound like a lot.

> There a whole buch of conditional indicies for other
> columns that we're not checking.  The particular column that is causing us
> trouble exists in both a regular (con_id) and a composite index (con_id,
> somthing_else).

You're not being at all clear here.  Are you trying to say that only
queries involving "WHERE col = constant" for a particular column seem
to be slow?  If so, maybe the column has a weird datatype or a wildly
out of line statistics target?  (Still hard to see how you get to 11-ish
seconds in planning, though.)

One thing you might do is watch the backend process in "top" or local
equivalent, and see if it's spending most of the 11 seconds sleeping, or
accumulating CPU time, or accumulating I/O.  That info would eliminate
a lot of possibilities.

            regards, tom lane


Re: explain analyze reports that my queries are fast but they run very slowly

От
Tom Lane
Дата:
Nikolas Everett <nik9000@gmail.com> writes:
> We straced the backend during the explain and it looked like the open
> commands were taking several seconds each.

Kind of makes me wonder if you have a whole lot of tables ("whole lot"
in this context probably means tens of thousands) and are storing the
database on a filesystem that doesn't scale well to lots of files in one
directory.  If that's the explanation, the reason the 8.3 installation
was okay was likely that it was stored on a more modern filesystem.

BTW, please keep the list cc'd on replies.

            regards, tom lane


Re: explain analyze reports that my queries are fast but they run very slowly

От
Nikolas Everett
Дата:



On Thu, Dec 27, 2012 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Nikolas Everett <nik9000@gmail.com> writes:
> We straced the backend during the explain and it looked like the open
> commands were taking several seconds each.

Kind of makes me wonder if you have a whole lot of tables ("whole lot"
in this context probably means tens of thousands) and are storing the
database on a filesystem that doesn't scale well to lots of files in one
directory.  If that's the explanation, the reason the 8.3 installation
was okay was likely that it was stored on a more modern filesystem.

We have 1897 files for our largest database which really isn't a whole lot.  The old servers were EXT3 over FC to a NetApp running RHEL5 PPC.  The new servers are on NFS to the same NetApp running RHEL5 Intel.  We've failed from our physical primary to a virtual secondary both of which seem to have the same problem.  We're in the process of rebuilding the a hot slave on EXT3 over iSCSI.  We'll fail over to it as soon as we can.

We never tried stracing the PPC infrastructure but it obviously didn't have this problem.

We also have another cluster running with an identical setup which doesn't seem to have the problem.  In fact, the problem never came up durring correctness testing for this problem either - it specifically required load before it came up.
 

BTW, please keep the list cc'd on replies.

Itchy reply finger.
 

                        regards, tom lane

Re: explain analyze reports that my queries are fast but they run very slowly

От
Tom Lane
Дата:
Nikolas Everett <nik9000@gmail.com> writes:
> On Thu, Dec 27, 2012 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Nikolas Everett <nik9000@gmail.com> writes:
>>> We straced the backend during the explain and it looked like the open
>>> commands were taking several seconds each.

>> Kind of makes me wonder if you have a whole lot of tables ("whole lot"
>> in this context probably means tens of thousands) and are storing the
>> database on a filesystem that doesn't scale well to lots of files in one
>> directory.  If that's the explanation, the reason the 8.3 installation
>> was okay was likely that it was stored on a more modern filesystem.

> We have 1897 files for our largest database which really isn't a whole lot.

OK...

>  The old servers were EXT3 over FC to a NetApp running RHEL5 PPC.  The new
> servers are on NFS to the same NetApp running RHEL5 Intel.

Now I'm wondering about network glitches or NFS configuration problems.
This is a bit outside my expertise unfortunately, but it seems clear
that your performance issue is somewhere in that area.

            regards, tom lane


Re: explain analyze reports that my queries are fast but they run very slowly

От
Nikolas Everett
Дата:
It looks like it was a problem with NFS.  We're not really sure what was wrong with it but once we failed over to an iSCSI mount for the data everything is running just fine.


On Thu, Dec 27, 2012 at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Nikolas Everett <nik9000@gmail.com> writes:
> On Thu, Dec 27, 2012 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Nikolas Everett <nik9000@gmail.com> writes:
>>> We straced the backend during the explain and it looked like the open
>>> commands were taking several seconds each.

>> Kind of makes me wonder if you have a whole lot of tables ("whole lot"
>> in this context probably means tens of thousands) and are storing the
>> database on a filesystem that doesn't scale well to lots of files in one
>> directory.  If that's the explanation, the reason the 8.3 installation
>> was okay was likely that it was stored on a more modern filesystem.

> We have 1897 files for our largest database which really isn't a whole lot.

OK...

>  The old servers were EXT3 over FC to a NetApp running RHEL5 PPC.  The new
> servers are on NFS to the same NetApp running RHEL5 Intel.

Now I'm wondering about network glitches or NFS configuration problems.
This is a bit outside my expertise unfortunately, but it seems clear
that your performance issue is somewhere in that area.

                        regards, tom lane