Обсуждение: Performance issue with 8.2.3 - "C" application
I've got an interesting issue here that I'm running into with 8.2.3 This is an application that has run quite well for a long time, and has been operating without significant changes (other than recompilation) since back in the early 7.x Postgres days. But now we're seeing a LOT more load than we used to with it, and suddenly, we're seeing odd performance issues. It APPEARS that the problem isn't query performance per-se. That is, while I can find a few processes here and there in a run state when I look with a PS, I don't see them consistently churning. But.... here's the query that has a habit of taking the most time.... select forum, * from post where toppost = 1 and (replied > (select lastview from forumlog where login='theuser' and forum=post.forum and number is null)) is not false AND (replied > (select lastview from forumlog where login='theuser' and forum=post.forum and number=post.number)) is not f alse order by pinned desc, replied desc offset 0 limit 20 Now the numeric and "login" fields may change; when I plug it into explain what I get back is: QUERY PLAN ------------------------------------------------------------------------------------------------------- Limit (cost=57270.22..57270.27 rows=20 width=757) -> Sort (cost=57270.22..57270.50 rows=113 width=757) Sort Key: pinned, replied -> Index Scan using post_top on post (cost=0.00..57266.37 rows=113 width=757) Index Cond: (toppost = 1) Filter: (((replied > (subplan)) IS NOT FALSE) AND ((replied > (subplan)) IS NOT FALSE)) SubPlan -> Index Scan using forumlog_composite on forumlog (cost=0.00..8.29 rows=1 width Index Cond: (("login" = 'theuser'::text) AND (forum = $0) AND (number = $1)) -> Bitmap Heap Scan on forumlog (cost=4.39..47.61 rows=1 width=8) Recheck Cond: (("login" = 'theuser'::text) AND (forum = $0)) Filter: (number IS NULL) -> Bitmap Index Scan on forumlog_composite (cost=0.00..4.39 rows=12 width=0) Index Cond: (("login" = 'theuser'::text) AND (forum = $0)) (14 rows) And indeed, it returns a fairly reasonable number of rows. This takes a second or two to return - not all that bad - although this is one that people hit a LOT. One thing that DOES bother me is this line from the EXPLAIN output: -> Index Scan using post_top on post (cost=0.00..57266.53 rows=113 width=757) This is indexed using: "post_top" btree (toppost) Ain't nothing fancy there. So how come the planner thinks this is going to take that long?!? More interesting, if I do a simple query on that line, I get.... ticker=> explain select forum from post where toppost='1'; QUERY PLAN --------------------------------------------------------------------------- Index Scan using post_top on post (cost=0.00..632.03 rows=1013 width=11) Index Cond: (toppost = 1) Hmmmmmm; that's a bit more reasonable. So what's up with the above line? What I'm seeing is that as concurrency increases, I see the CPU load spike. Disk I/O is low to moderate at less than 10% of maximum according to systat -vm, no swap in use, 300mb dedicated to shared memory buffers for Postgres (machine has 1GB of RAM and is a P4/3G/HT running FreeBSD 6.2-STABLE) It does not swap at all, so it does not appear I've got a problem with running out of physical memory. shmem is pinned to physical memory via the sysctl tuning parameter to prevent page table thrashing. However, load goes up substantially and under moderate to high concurrency gets into the 4-5 range with response going somewhat to crap. The application is still usable, but its not "crisp". If I do a "ps" during times that performance is particularly bad, I don't see any particular overrepresentation of this query .vs. others (I have the application doing a "setproctitle" so I know what command - and thus what sets of queries - it is executing.) Not sure where to start here. It appears that I'm CPU limited and the problem may be that this is a web-served application that must connect to the Postgres backend for each transaction, perform its queries, and then close the connection down - in other words the load may be coming not from Postgres but rather from places I can't fix at the application layer (e.g. fork() overhead, etc). The DBMS and Apache server are on the same machine, so there's no actual network overhead involved. If that's the case the only solution is to throw more hardware at it. I can do that, but before I go tossing more CPU at the problem I'd like to know I'm not just wasting money. The application uses the "C" language interface and just calls "Connectdb" - the only parameter is the dbname, so it should be defaulting to the local socket. It appears that this is the case. The system load is typically divided up as about 2:1 system .vs. user time on the CPU meter. Ideas on where to start in trying to profile where the bottlenecks are? The indices are btrees - I'm wondering if perhaps I should be using something different here? Thanks in advance..... -- Karl Denninger (karl@denninger.net) http://www.denninger.net %SPAMBLOCK-SYS: Matched [@postgresql.org+], message ok
Karl Denninger <karl@denninger.net> writes: > But.... here's the query that has a habit of taking the most time.... > select forum, * from post where toppost = 1 and (replied > (select > lastview from forumlog where login='theuser' and forum=post.forum and > number is null)) is not false AND (replied > (select lastview from > forumlog where login='theuser' and forum=post.forum and > number=post.number)) is not f > alse order by pinned desc, replied desc offset 0 limit 20 Did that ever perform well for you? It's the sub-selects that are likely to hurt ... in particular, > -> Index Scan using post_top on post (cost=0.00..57266.37 > rows=113 width=757) > Index Cond: (toppost = 1) > Filter: (((replied > (subplan)) IS NOT FALSE) AND > ((replied > (subplan)) IS NOT FALSE)) versus > Index Scan using post_top on post (cost=0.00..632.03 rows=1013 width=11) > Index Cond: (toppost = 1) The planner thinks that the two subplan filter conditions will eliminate about 90% of the rows returned by the bare indexscan (IIRC this is purely a rule of thumb, not based on any statistics) and that testing them 1013 times will add over 50000 cost units to the basic indexscan. That part I believe --- correlated subqueries are expensive. regards, tom lane
Yeah, the problem doesn't appear to be there. As I said, if I look at the PS of the system when its bogging, there aren't a whole bunch of processes stuck doing these, so while this does take a second or two to come back, that's not that bad.
Its GENERAL performance that just bites - the system is obviously out of CPU, but what I can't get a handle on is WHY. It does not appear to be accumulating large amounts of runtime in processes I can catch, but the load average is quite high.
This is why I'm wondering if what I'm taking here is a hit on the fork/exec inside the portmaster, in the setup internally in there, in the IPC between my process via libPQ, etc - and how I can profile what's going on.
Tom Lane wrote:
Its GENERAL performance that just bites - the system is obviously out of CPU, but what I can't get a handle on is WHY. It does not appear to be accumulating large amounts of runtime in processes I can catch, but the load average is quite high.
This is why I'm wondering if what I'm taking here is a hit on the fork/exec inside the portmaster, in the setup internally in there, in the IPC between my process via libPQ, etc - and how I can profile what's going on.
Karl Denninger (karl@denninger.net) http://www.denninger.net
Tom Lane wrote:
Karl Denninger <karl@denninger.net> writes:But.... here's the query that has a habit of taking the most time....select forum, * from post where toppost = 1 and (replied > (select lastview from forumlog where login='theuser' and forum=post.forum and number is null)) is not false AND (replied > (select lastview from forumlog where login='theuser' and forum=post.forum and number=post.number)) is not f alse order by pinned desc, replied desc offset 0 limit 20Did that ever perform well for you? It's the sub-selects that are likely to hurt ... in particular,-> Index Scan using post_top on post (cost=0.00..57266.37 rows=113 width=757) Index Cond: (toppost = 1) Filter: (((replied > (subplan)) IS NOT FALSE) AND ((replied > (subplan)) IS NOT FALSE))versusIndex Scan using post_top on post (cost=0.00..632.03 rows=1013 width=11) Index Cond: (toppost = 1)The planner thinks that the two subplan filter conditions will eliminate about 90% of the rows returned by the bare indexscan (IIRC this is purely a rule of thumb, not based on any statistics) and that testing them 1013 times will add over 50000 cost units to the basic indexscan. That part I believe --- correlated subqueries are expensive. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings %SPAMBLOCK-SYS: Matched [hub.org+], message ok
On 7/25/07, Karl Denninger <karl@denninger.net> wrote: > > Yeah, the problem doesn't appear to be there. As I said, if I look at the > PS of the system when its bogging, there aren't a whole bunch of processes > stuck doing these, so while this does take a second or two to come back, > that's not that bad. > > Its GENERAL performance that just bites - the system is obviously out of > CPU, but what I can't get a handle on is WHY. It does not appear to be > accumulating large amounts of runtime in processes I can catch, but the load > average is quite high. 8.2.3 has the 'stats collector bug' (fixed in 8.2.4) which increased load in high concurrency conditions. on a client's machine after patching the postmaster load drop from the 4-5 range to 1-2 range on a 500 tps server. maybe this is biting you? symptoms are high load avg and high cpu usage of stats collector process. merlin
Hmmmmm..... now that's interesting. Stats collector IS accumulating quite a bit of runtime..... me thinks its time to go grab 8.2.4. Karl Denninger (karl@denninger.net) http://www.denninger.net Merlin Moncure wrote: > On 7/25/07, Karl Denninger <karl@denninger.net> wrote: >> >> Yeah, the problem doesn't appear to be there. As I said, if I look >> at the >> PS of the system when its bogging, there aren't a whole bunch of >> processes >> stuck doing these, so while this does take a second or two to come back, >> that's not that bad. >> >> Its GENERAL performance that just bites - the system is obviously >> out of >> CPU, but what I can't get a handle on is WHY. It does not appear to be >> accumulating large amounts of runtime in processes I can catch, but >> the load >> average is quite high. > > 8.2.3 has the 'stats collector bug' (fixed in 8.2.4) which increased > load in high concurrency conditions. on a client's machine after > patching the postmaster load drop from the 4-5 range to 1-2 range on a > 500 tps server. maybe this is biting you? symptoms are high load avg > and high cpu usage of stats collector process. > > merlin > > > %SPAMBLOCK-SYS: Matched [google.com+], message ok %SPAMBLOCK-SYS: Matched [@postgresql.org+], message ok
Karl Denninger <karl@denninger.net> writes: > Hmmmmm..... now that's interesting. Stats collector IS accumulating > quite a bit of runtime..... me thinks its time to go grab 8.2.4. I think Merlin might have nailed it --- the "stats collector bug" is that it tries to write out the stats file way more often than it should. So any excessive userland CPU time you see is just the tip of the iceberg compared to the system and I/O costs incurred. regards, tom lane
Aha! BIG difference. I won't know for sure until the biz day tomorrow but the "first blush" look is that it makes a HUGE difference in system load, and I no longer have the stats collector process on the top of the "top" list...... Karl Denninger (karl@denninger.net) http://www.denninger.net Tom Lane wrote: > Karl Denninger <karl@denninger.net> writes: > >> Hmmmmm..... now that's interesting. Stats collector IS accumulating >> quite a bit of runtime..... me thinks its time to go grab 8.2.4. >> > > I think Merlin might have nailed it --- the "stats collector bug" is > that it tries to write out the stats file way more often than it > should. So any excessive userland CPU time you see is just the tip > of the iceberg compared to the system and I/O costs incurred. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > > > %SPAMBLOCK-SYS: Matched [hub.org+], message ok > %SPAMBLOCK-SYS: Matched [@postgresql.org+], message ok
Karl Denninger skrev: > I've got an interesting issue here that I'm running into with 8.2.3 > > This is an application that has run quite well for a long time, and has > been operating without significant changes (other than recompilation) > since back in the early 7.x Postgres days. But now we're seeing a LOT > more load than we used to with it, and suddenly, we're seeing odd > performance issues. > > It APPEARS that the problem isn't query performance per-se. That is, > while I can find a few processes here and there in a run state when I > look with a PS, I don't see them consistently churning. > > But.... here's the query that has a habit of taking the most time.... > > select forum, * from post where toppost = 1 and (replied > (select > lastview from forumlog where login='theuser' and forum=post.forum and > number is null)) is not false AND (replied > (select lastview from > forumlog where login='theuser' and forum=post.forum and > number=post.number)) is not false order by pinned desc, replied desc offset 0 limit 20 Since I can do little to help you with anything else, here is a little help from a guy with a hammer. It seems you may be able to convert the subqueries into a left join. Not sure whether this helps, nor whether I got some bits of the logic wrong, but something like this might help the planner find a better plan: SELECT forum, * FROM post LEFT JOIN forumlog ON post.forum = forumlog.forum AND forumlog.login = 'theuser' AND (post.number = forumlog.number OR forumlog.number IS NULL) AND post.replied <= lastview WHERE forumlog.forum IS NULL AND forum.toppost = 1 ORDER BY pinned DESC, replied DESC OFFSET 0 LIMIT 20 ; Nis
"Karl Denninger" <karl@denninger.net> writes: > Not sure where to start here. It appears that I'm CPU limited and the problem > may be that this is a web-served application that must connect to the Postgres > backend for each transaction, perform its queries, and then close the > connection down - in other words the load may be coming not from Postgres but > rather from places I can't fix at the application layer (e.g. fork() overhead, > etc). The DBMS and Apache server are on the same machine, so there's no actual > network overhead involved. > > If that's the case the only solution is to throw more hardware at it. I can do > that, but before I go tossing more CPU at the problem I'd like to know I'm not > just wasting money. I know you found the proximate cause of your current problems, but it sounds like you have something else you should consider looking at here. There are techniques for avoiding separate database connections for each request. If you're using Apache you can reduce the CPU usage a lot by writing your module as an Apache module instead of a CGI or whatever type of program it is now. Then your module would live as long as a single Apache instance which you can configure to be hours or days instead of a single request. It can keep around the database connection for that time. If that's impossible there are still techniques that can help. You can set up PGPool or PGBouncer or some other connection aggregating tool to handle the connections. This is a pretty low-impact change which shouldn't require making any application changes aside from changing the database connection string. Effectively this is a just a connection pool that lives in a separate process. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Looks like that was the problem - got a day under the belt now with the 8.2.4 rev and all is back to normal. Karl Denninger (karl@denninger.net) http://www.denninger.net Karl Denninger wrote: > Aha! > > BIG difference. I won't know for sure until the biz day tomorrow but > the "first blush" look is that it makes a HUGE difference in system > load, and I no longer have the stats collector process on the top of > the "top" list...... > > Karl Denninger (karl@denninger.net) > http://www.denninger.net > > > > > Tom Lane wrote: >> Karl Denninger <karl@denninger.net> writes: >> >>> Hmmmmm..... now that's interesting. Stats collector IS accumulating >>> quite a bit of runtime..... me thinks its time to go grab 8.2.4. >>> >> >> I think Merlin might have nailed it --- the "stats collector bug" is >> that it tries to write out the stats file way more often than it >> should. So any excessive userland CPU time you see is just the tip >> of the iceberg compared to the system and I/O costs incurred. >> >> regards, tom lane >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 1: if posting/reading through Usenet, please send an appropriate >> subscribe-nomail command to majordomo@postgresql.org so that your >> message can get through to the mailing list cleanly >> >> >> %SPAMBLOCK-SYS: Matched [hub.org+], message ok >> > > > %SPAMBLOCK-SYS: Matched [@postgresql.org+], message ok > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > > > %SPAMBLOCK-SYS: Matched [hub.org+], message ok %SPAMBLOCK-SYS: Matched [@postgresql.org+], message ok