Обсуждение: Performance issue with 8.2.3 - "C" application

От:
Karl Denninger
Дата:

I've got an interesting issue here that I'm running into with 8.2.3

This is an application that has run quite well for a long time, and has
been operating without significant changes (other than recompilation)
since back in the early 7.x Postgres days.  But now we're seeing a LOT
more load than we used to with it, and suddenly, we're seeing odd
performance issues.

It APPEARS that the problem isn't query performance per-se.  That is,
while I can find a few processes here and there in a run state when I
look with a PS, I don't see them consistently churning.

But.... here's the query that has a habit of taking the most time....

select forum, * from post where toppost = 1 and (replied > (select
lastview from forumlog where login='theuser' and forum=post.forum and
number is null)) is not false AND (replied > (select lastview from
forumlog where login='theuser' and forum=post.forum and
number=post.number)) is not f
alse order by pinned desc, replied desc offset 0 limit 20

Now the numeric and "login" fields may change; when I plug it into
explain what I get back is:

                                             QUERY
PLAN
-------------------------------------------------------------------------------------------------------
 Limit  (cost=57270.22..57270.27 rows=20 width=757)
   ->  Sort  (cost=57270.22..57270.50 rows=113 width=757)
         Sort Key: pinned, replied
         ->  Index Scan using post_top on post  (cost=0.00..57266.37
rows=113 width=757)
               Index Cond: (toppost = 1)
               Filter: (((replied > (subplan)) IS NOT FALSE) AND
((replied > (subplan)) IS NOT FALSE))
               SubPlan
                 ->  Index Scan using forumlog_composite on forumlog
(cost=0.00..8.29 rows=1 width
                       Index Cond: (("login" = 'theuser'::text) AND
(forum = $0) AND (number = $1))
                 ->  Bitmap Heap Scan on forumlog  (cost=4.39..47.61
rows=1 width=8)
                       Recheck Cond: (("login" = 'theuser'::text) AND
(forum = $0))
                       Filter: (number IS NULL)
                       ->  Bitmap Index Scan on forumlog_composite
(cost=0.00..4.39 rows=12 width=0)
                             Index Cond: (("login" = 'theuser'::text)
AND (forum = $0))
(14 rows)

And indeed, it returns a fairly reasonable number of rows.

This takes a second or two to return - not all that bad - although this
is one that people hit a LOT.

One thing that DOES bother me is this line from the EXPLAIN output:
->  Index Scan using post_top on post  (cost=0.00..57266.53 rows=113
width=757)

This is indexed using:

  "post_top" btree (toppost)

Ain't nothing fancy there.  So how come the planner thinks this is going
to take that long?!?

More interesting, if I do a simple query on that line, I get....

ticker=> explain select forum from post where toppost='1';
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using post_top on post  (cost=0.00..632.03 rows=1013 width=11)
   Index Cond: (toppost = 1)

Hmmmmmm; that's a bit more reasonable.  So what's up with the above line?

What I'm seeing is that as concurrency increases, I see the CPU load
spike.  Disk I/O is low to moderate at less than 10% of maximum
according to systat -vm, no swap in use, 300mb dedicated to shared
memory buffers for Postgres (machine has 1GB of RAM and is a P4/3G/HT
running FreeBSD 6.2-STABLE)  It does not swap at all, so it does not
appear I've got a problem with running out of physical memory.  shmem is
pinned to physical memory via the sysctl tuning parameter to prevent
page table thrashing.

However, load goes up substantially and under moderate to high
concurrency gets into the 4-5 range with response going somewhat to
crap.  The application is still usable, but its not "crisp".  If I do a
"ps" during times that performance is particularly bad, I don't see any
particular overrepresentation of this query .vs. others (I have the
application doing a "setproctitle" so I know what command - and thus
what sets of queries - it is executing.)

Not sure where to start here.  It appears that I'm CPU limited and the
problem may be that this is a web-served application that must connect
to the Postgres backend for each transaction, perform its queries, and
then close the connection down - in other words the load may be coming
not from Postgres but rather from places I can't fix at the application
layer (e.g. fork() overhead, etc).  The DBMS and Apache server are on
the same machine, so there's no actual network overhead involved.

If that's the case the only solution is to throw more hardware at it.  I
can do that, but before I go tossing more CPU at the problem I'd like to
know I'm not just wasting money.

The application uses the "C" language interface and just calls
"Connectdb" - the only parameter is the dbname, so it should be
defaulting to the local socket.  It appears that this is the case.

The system load is typically divided up as about 2:1 system .vs. user
time on the CPU meter.

Ideas on where to start in trying to profile where the bottlenecks are?

The indices are btrees - I'm wondering if perhaps I should be using
something different here?

Thanks in advance.....

--
Karl Denninger ()
http://www.denninger.net




%SPAMBLOCK-SYS: Matched [@postgresql.org+], message ok

От:
Tom Lane
Дата:

Karl Denninger <> writes:
> But.... here's the query that has a habit of taking the most time....

> select forum, * from post where toppost = 1 and (replied > (select
> lastview from forumlog where login='theuser' and forum=post.forum and
> number is null)) is not false AND (replied > (select lastview from
> forumlog where login='theuser' and forum=post.forum and
> number=post.number)) is not f
> alse order by pinned desc, replied desc offset 0 limit 20

Did that ever perform well for you?  It's the sub-selects that are
likely to hurt ... in particular,

>          ->  Index Scan using post_top on post  (cost=0.00..57266.37
> rows=113 width=757)
>                Index Cond: (toppost = 1)
>                Filter: (((replied > (subplan)) IS NOT FALSE) AND
> ((replied > (subplan)) IS NOT FALSE))

versus

>  Index Scan using post_top on post  (cost=0.00..632.03 rows=1013 width=11)
>    Index Cond: (toppost = 1)

The planner thinks that the two subplan filter conditions will eliminate
about 90% of the rows returned by the bare indexscan (IIRC this is
purely a rule of thumb, not based on any statistics) and that testing
them 1013 times will add over 50000 cost units to the basic indexscan.
That part I believe --- correlated subqueries are expensive.

            regards, tom lane

От:
Karl Denninger
Дата:

Yeah, the problem doesn't appear to be there.  As I said, if I look at the PS of the system when its bogging, there aren't a whole bunch of processes stuck doing these, so while this does take a second or two to come back, that's not that bad.

Its GENERAL performance that just bites - the system is obviously out of CPU, but what I can't get a handle on is WHY.  It does not appear to be accumulating large amounts of runtime in processes I can catch, but the load average is quite high.

This is why I'm wondering if what I'm taking here is a hit on the fork/exec inside the portmaster, in the setup internally in there, in the IPC between my process via libPQ, etc - and how I can profile what's going on.
Karl Denninger ()
http://www.denninger.net



Tom Lane wrote:
Karl Denninger <> writes: 
But.... here's the query that has a habit of taking the most time....   
 
select forum, * from post where toppost = 1 and (replied > (select 
lastview from forumlog where login='theuser' and forum=post.forum and 
number is null)) is not false AND (replied > (select lastview from 
forumlog where login='theuser' and forum=post.forum and 
number=post.number)) is not f
alse order by pinned desc, replied desc offset 0 limit 20   
Did that ever perform well for you?  It's the sub-selects that are
likely to hurt ... in particular,
 
         ->  Index Scan using post_top on post  (cost=0.00..57266.37 
rows=113 width=757)              Index Cond: (toppost = 1)              Filter: (((replied > (subplan)) IS NOT FALSE) AND 
((replied > (subplan)) IS NOT FALSE))   
versus
 
 Index Scan using post_top on post  (cost=0.00..632.03 rows=1013 width=11)  Index Cond: (toppost = 1)   
The planner thinks that the two subplan filter conditions will eliminate
about 90% of the rows returned by the bare indexscan (IIRC this is
purely a rule of thumb, not based on any statistics) and that testing
them 1013 times will add over 50000 cost units to the basic indexscan.
That part I believe --- correlated subqueries are expensive.
		regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings


%SPAMBLOCK-SYS: Matched [hub.org+], message ok 
От:
"Merlin Moncure"
Дата:

On 7/25/07, Karl Denninger <> wrote:
>
>  Yeah, the problem doesn't appear to be there.  As I said, if I look at the
> PS of the system when its bogging, there aren't a whole bunch of processes
> stuck doing these, so while this does take a second or two to come back,
> that's not that bad.
>
>  Its GENERAL performance that just bites - the system is obviously out of
> CPU, but what I can't get a handle on is WHY.  It does not appear to be
> accumulating large amounts of runtime in processes I can catch, but the load
> average is quite high.

8.2.3 has the 'stats collector bug' (fixed in 8.2.4) which increased
load in high concurrency conditions.  on a client's machine after
patching the postmaster load drop from the 4-5 range to 1-2 range on a
500 tps server.  maybe this is biting you?  symptoms are high load avg
and high cpu usage of stats collector process.

merlin

От:
Karl Denninger
Дата:

Hmmmmm..... now that's interesting.  Stats collector IS accumulating
quite a bit of runtime..... me thinks its time to go grab 8.2.4.

Karl Denninger ()
http://www.denninger.net




Merlin Moncure wrote:
> On 7/25/07, Karl Denninger <> wrote:
>>
>>  Yeah, the problem doesn't appear to be there.  As I said, if I look
>> at the
>> PS of the system when its bogging, there aren't a whole bunch of
>> processes
>> stuck doing these, so while this does take a second or two to come back,
>> that's not that bad.
>>
>>  Its GENERAL performance that just bites - the system is obviously
>> out of
>> CPU, but what I can't get a handle on is WHY.  It does not appear to be
>> accumulating large amounts of runtime in processes I can catch, but
>> the load
>> average is quite high.
>
> 8.2.3 has the 'stats collector bug' (fixed in 8.2.4) which increased
> load in high concurrency conditions.  on a client's machine after
> patching the postmaster load drop from the 4-5 range to 1-2 range on a
> 500 tps server.  maybe this is biting you?  symptoms are high load avg
> and high cpu usage of stats collector process.
>
> merlin
>
>
> %SPAMBLOCK-SYS: Matched [google.com+], message ok


%SPAMBLOCK-SYS: Matched [@postgresql.org+], message ok

От:
Tom Lane
Дата:

Karl Denninger <> writes:
> Hmmmmm..... now that's interesting.  Stats collector IS accumulating
> quite a bit of runtime..... me thinks its time to go grab 8.2.4.

I think Merlin might have nailed it --- the "stats collector bug" is
that it tries to write out the stats file way more often than it
should.  So any excessive userland CPU time you see is just the tip
of the iceberg compared to the system and I/O costs incurred.

            regards, tom lane

От:
Karl Denninger
Дата:

Aha!

BIG difference.  I won't know for sure until the biz day tomorrow but
the "first blush" look is that it makes a HUGE difference in system
load, and I no longer have the stats collector process on the top of the
"top" list......

Karl Denninger ()
http://www.denninger.net




Tom Lane wrote:
> Karl Denninger <> writes:
>
>> Hmmmmm..... now that's interesting.  Stats collector IS accumulating
>> quite a bit of runtime..... me thinks its time to go grab 8.2.4.
>>
>
> I think Merlin might have nailed it --- the "stats collector bug" is
> that it tries to write out the stats file way more often than it
> should.  So any excessive userland CPU time you see is just the tip
> of the iceberg compared to the system and I/O costs incurred.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to  so that your
>        message can get through to the mailing list cleanly
>
>
> %SPAMBLOCK-SYS: Matched [hub.org+], message ok
>


%SPAMBLOCK-SYS: Matched [@postgresql.org+], message ok

От:
Nis Jørgensen
Дата:

Karl Denninger skrev:
> I've got an interesting issue here that I'm running into with 8.2.3
>
> This is an application that has run quite well for a long time, and has
> been operating without significant changes (other than recompilation)
> since back in the early 7.x Postgres days.  But now we're seeing a LOT
> more load than we used to with it, and suddenly, we're seeing odd
> performance issues.
>
> It APPEARS that the problem isn't query performance per-se.  That is,
> while I can find a few processes here and there in a run state when I
> look with a PS, I don't see them consistently churning.
>
> But.... here's the query that has a habit of taking the most time....
>
> select forum, * from post where toppost = 1 and (replied > (select
> lastview from forumlog where login='theuser' and forum=post.forum and
> number is null)) is not false AND (replied > (select lastview from
> forumlog where login='theuser' and forum=post.forum and
> number=post.number)) is not false order by pinned desc, replied desc offset 0 limit 20

Since I can do little to help you with anything else, here is a little
help from a guy with a hammer. It seems you may be able to convert the
subqueries into a left join. Not sure whether this helps, nor whether I
got some bits of the logic wrong, but something like this might help the
planner find a better plan:

SELECT forum, *
FROM post
LEFT JOIN forumlog
ON post.forum = forumlog.forum
AND forumlog.login = 'theuser'
AND (post.number = forumlog.number OR forumlog.number IS NULL)
AND post.replied <= lastview
WHERE forumlog.forum IS NULL
AND forum.toppost = 1
ORDER BY pinned DESC, replied DESC OFFSET 0 LIMIT 20 ;


Nis

От:
Gregory Stark
Дата:

"Karl Denninger" <> writes:

> Not sure where to start here.  It appears that I'm CPU limited and the problem
> may be that this is a web-served application that must connect to the Postgres
> backend for each transaction, perform its queries, and then close the
> connection down - in other words the load may be coming not from Postgres but
> rather from places I can't fix at the application layer (e.g. fork() overhead,
> etc).  The DBMS and Apache server are on the same machine, so there's no actual
> network overhead involved.
>
> If that's the case the only solution is to throw more hardware at it.  I can do
> that, but before I go tossing more CPU at the problem I'd like to know I'm not
> just wasting money.

I know you found the proximate cause of your current problems, but it sounds
like you have something else you should consider looking at here. There are
techniques for avoiding separate database connections for each request.

If you're using Apache you can reduce the CPU usage a lot by writing your
module as an Apache module instead of a CGI or whatever type of program it is
now. Then your module would live as long as a single Apache instance which you
can configure to be hours or days instead of a single request. It can keep
around the database connection for that time.

If that's impossible there are still techniques that can help. You can set up
PGPool or PGBouncer or some other connection aggregating tool to handle the
connections. This is a pretty low-impact change which shouldn't require making
any application changes aside from changing the database connection string.
Effectively this is a just a connection pool that lives in a separate
process.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


От:
Karl Denninger
Дата:

Looks like that was the problem - got a day under the belt now with the
8.2.4 rev and all is back to normal.

Karl Denninger ()
http://www.denninger.net




Karl Denninger wrote:
> Aha!
>
> BIG difference.  I won't know for sure until the biz day tomorrow but
> the "first blush" look is that it makes a HUGE difference in system
> load, and I no longer have the stats collector process on the top of
> the "top" list......
>
> Karl Denninger ()
> http://www.denninger.net
>
>
>
>
> Tom Lane wrote:
>> Karl Denninger <> writes:
>>
>>> Hmmmmm..... now that's interesting.  Stats collector IS accumulating
>>> quite a bit of runtime..... me thinks its time to go grab 8.2.4.
>>>
>>
>> I think Merlin might have nailed it --- the "stats collector bug" is
>> that it tries to write out the stats file way more often than it
>> should.  So any excessive userland CPU time you see is just the tip
>> of the iceberg compared to the system and I/O costs incurred.
>>
>>             regards, tom lane
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 1: if posting/reading through Usenet, please send an appropriate
>>        subscribe-nomail command to  so that your
>>        message can get through to the mailing list cleanly
>>
>>
>> %SPAMBLOCK-SYS: Matched [hub.org+], message ok
>>
>
>
> %SPAMBLOCK-SYS: Matched [@postgresql.org+], message ok
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>
>
> %SPAMBLOCK-SYS: Matched [hub.org+], message ok


%SPAMBLOCK-SYS: Matched [@postgresql.org+], message ok