Обсуждение: Unpredictable shark slowdown after migrating to 8.4
Hello community, Second time after migration 8.3.7 --> 8.4.1 I was caught by this problem. Migration was 8 days ago. (note, I never seen such situation on 8.3) Environment: PostgreSQL 8.4.1 + patch http://archives.postgresql.org/pgsql-committers/2009-10/msg00056.php CentOS release 5.4 (Final) SunFire X4600 M2; 3U; 8xOpteron 8380 2.5GHz; 96GB; 12x146GB 15K RPM DB size ~90G TPS~1000 Symptoms: In short period of time total query execution time increases incredibly. Time Sum duration (ms) 17:17 12811 17:18 8870 17:19 33744 17:20 9991 17:21 13392 17:22 163928 17:23 1151709 17:24 12112797 <-- here it cuts due to connection threshold 17:25 12305390 17:26 12970853 17:27 12957648 LA changes rather insignificantly from ~6 to ~8. CPU user time increases from ~350 to ~600 ), system from ~50 to ~250. Neither additional significant disc activity nor iowait. Another thing I noticed is autoanalyze finish on tables that few minutes later were used in the first and mostly canceled by timeout queries. First time I assigned the blame to multiply locks on one of the mentioned above tables. There was heavy delete. I started delete rows manually by small batches and found a record that hung my delete for a long time (many times greater then stmt timeout) even when I tried to delete it by PK. Situation was saved by disabling some functional that uses this table until next day when I got rid of the aggressive deletes. Second time I didn't find any reason that explains the situation. Was this situation mentioned before and is there a solution or workaround? (I didn't find any) If not please give me a glue where to dig or what information should I provide? Thank you. -- Regards, Sergey Konoplev -- PostgreSQL articles in english & russian http://gray-hemp.blogspot.com/search/label/postgresql/
On Wed, Nov 11, 2009 at 12:50 PM, Sergey Konoplev <gray.ru@gmail.com> wrote: > Was this situation mentioned before and is there a solution or > workaround? (I didn't find any) If not please give me a glue where to > dig or what information should I provide? I think you should use log_min_duration_statement or auto_explain to figure out which queries are giving you grief. I don't believe that 8.4 is in general slower than 8.3, so there must be something about how you are using it that is making it slower for you. But without more information it's hard to speculate what it might be. It's also not entirely clear that your 8.4 setup is really the same as your 8.3 setup. You might have different configuration, differences in your tables or table data, differences in workload, etc. Without controlling for all those factors it's hard to draw any conclusions. Also, I don't believe this is an appropriate topic for pgsql-hackers. If you have EXPLAIN ANALYZE results for the affected queries, try pgsql-performance. ...Robert
On Thu, Nov 12, 2009 at 4:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Nov 11, 2009 at 12:50 PM, Sergey Konoplev <gray.ru@gmail.com> wrote: >> Was this situation mentioned before and is there a solution or >> workaround? (I didn't find any) If not please give me a glue where to >> dig or what information should I provide? > > I think you should use log_min_duration_statement or auto_explain to > figure out which queries are giving you grief. I don't believe that > 8.4 is in general slower than 8.3, so there must be something about > how you are using it that is making it slower for you. But without > more information it's hard to speculate what it might be. It's also > not entirely clear that your 8.4 setup is really the same as your 8.3 > setup. You might have different configuration, differences in your > tables or table data, differences in workload, etc. Without > controlling for all those factors it's hard to draw any conclusions. Well I turned on log_min_duration_statement, set up auto_explain, pg_stat_statements, pgrowlocks, pg_buffercache, wrote some monitoring queries and started to wait when the situation repeats. Today it happens! Situation was absolutely the same as I described in my previous letter. One more thing I noticed about CPU user-time this time is that after connections count gets close to pgbouncer threshold it decreased from ~800 to ~10 very fast. Log monitoring shows me that query plans were the same as usual (thanx auto_explain). I reset pg_stat_statements and few minutes later did select from it. I noticed that slowest queries was on tables with high number of updates (but isn't it usual?). I tried to get locks with this queries SELECT t.tablename, (SELECT count(*) FROM pgrowlocks(schemaname || '.' || t.tablename)) AS locks FROM pg_tables t WHERE schemaname = 'public' ORDER BY 2 DESC LIMIT 10; SELECT * FROM pgrowlocks('public.person_online'); but nothing was returned. Here is portions of vmstat and iostat results http://pastie.org/701326 This time situation was saved by PG restart to. Obviously all I provided tells almost nothing and I'm very confused with it. So please tell me what else could I do to get more info? > Also, I don't believe this is an appropriate topic for pgsql-hackers. > If you have EXPLAIN ANALYZE results for the affected queries, try > pgsql-performance. I do have but this results are good and the same as when nothing has happened when everything is allright. -- Regards, Sergey Konoplev
Sergey Konoplev escribió: > I tried to get locks with this queries Did you try pg_locks? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Mon, Nov 16, 2009 at 1:53 PM, Sergey Konoplev <gray.ru@gmail.com> wrote: > On Thu, Nov 12, 2009 at 4:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Nov 11, 2009 at 12:50 PM, Sergey Konoplev <gray.ru@gmail.com> wrote: >>> Was this situation mentioned before and is there a solution or >>> workaround? (I didn't find any) If not please give me a glue where to >>> dig or what information should I provide? >> >> I think you should use log_min_duration_statement or auto_explain to >> figure out which queries are giving you grief. I don't believe that >> 8.4 is in general slower than 8.3, so there must be something about >> how you are using it that is making it slower for you. But without >> more information it's hard to speculate what it might be. It's also >> not entirely clear that your 8.4 setup is really the same as your 8.3 >> setup. You might have different configuration, differences in your >> tables or table data, differences in workload, etc. Without >> controlling for all those factors it's hard to draw any conclusions. > > Well I turned on log_min_duration_statement, set up auto_explain, > pg_stat_statements, pgrowlocks, pg_buffercache, wrote some monitoring > queries and started to wait when the situation repeats. > > Today it happens! Situation was absolutely the same as I described in > my previous letter. One more thing I noticed about CPU user-time this > time is that after connections count gets close to pgbouncer threshold > it decreased from ~800 to ~10 very fast. > > Log monitoring shows me that query plans were the same as usual (thanx > auto_explain). > > I reset pg_stat_statements and few minutes later did select from it. I > noticed that slowest queries was on tables with high number of updates > (but isn't it usual?). > > I tried to get locks with this queries > > SELECT > t.tablename, > (SELECT count(*) FROM pgrowlocks(schemaname || '.' || t.tablename)) AS locks > FROM pg_tables t > WHERE schemaname = 'public' > ORDER BY 2 DESC > LIMIT 10; > > SELECT * FROM pgrowlocks('public.person_online'); > > but nothing was returned. > > Here is portions of vmstat and iostat results http://pastie.org/701326 > > This time situation was saved by PG restart to. Obviously all I > provided tells almost nothing and I'm very confused with it. So please > tell me what else could I do to get more info? > >> Also, I don't believe this is an appropriate topic for pgsql-hackers. >> If you have EXPLAIN ANALYZE results for the affected queries, try >> pgsql-performance. > > I do have but this results are good and the same as when nothing has > happened when everything is allright. Can you show us the non-commented settings from your postgresql.conf? Can you show us what the vmstat output looks like when everything is running normally? It looks like the blocks out are pretty high, but I don't know how that compares to normal for you. ...Robert
On Wednesday 11 November 2009 18:50:46 Sergey Konoplev wrote: > Hello community, > > > Second time after migration 8.3.7 --> 8.4.1 I was caught by this > problem. Migration was 8 days ago. > (note, I never seen such situation on 8.3) Is 8.4 configured similarly to 8.3? Andres
On Mon, Nov 16, 2009 at 9:56 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Sergey Konoplev escribió: > >> I tried to get locks with this queries > > Did you try pg_locks? > I tried monitor locks with pgrowlocks. Isn't it better way? If it isn't what points should I pay attention with pg_lock? I've just write the query SELECT pid, count(1) AS locks, current_query FROM pg_locks AS l LEFT JOIN pg_stat_activity AS a ON pid = procpid WHERE l.mode IN ('RowExclusiveLock', 'ShareUpdateExclusiveLock', 'ExclusiveLock') GROUP BY 1,3 ORDER BY 2 DESC LIMIT 10; would it be what we need? -- Regards, Sergey Konoplev
> > Can you show us the non-commented settings from your postgresql.conf? Working postgresql.conf http://pastie.org/702748 > > Can you show us what the vmstat output looks like when everything is > running normally? It looks like the blocks out are pretty high, but I > don't know how that compares to normal for you. > Here it is http://pastie.org/702742 -- Regards, Sergey Konoplev
On Mon, Nov 16, 2009 at 10:17 PM, Andres Freund <andres@anarazel.de> wrote: > On Wednesday 11 November 2009 18:50:46 Sergey Konoplev wrote: >> Hello community, >> >> >> Second time after migration 8.3.7 --> 8.4.1 I was caught by this >> problem. Migration was 8 days ago. >> (note, I never seen such situation on 8.3) > Is 8.4 configured similarly to 8.3? > It is. 8.3 conf - http://pastie.org/702752 8.4 conf - http://pastie.org/702748 -- Regards, Sergey Konoplev -- PostgreSQL articles in english & russian http://gray-hemp.blogspot.com/search/label/postgresql/
Sergey Konoplev escribió: > On Mon, Nov 16, 2009 at 9:56 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: > > Sergey Konoplev escribió: > > > >> I tried to get locks with this queries > > > > Did you try pg_locks? > > > > I tried monitor locks with pgrowlocks. Isn't it better way? If it > isn't what points should I pay attention with pg_lock? pgrowlocks shows row locks only. pg_locks shows all regular locks, i.e. locks on tables, pages, transactions, etc. You should be concerned with pgrowlocks only if you see a transaction waiting for locktype=transaction. > I've just write the query > > SELECT pid, count(1) AS locks, current_query > FROM pg_locks AS l LEFT JOIN pg_stat_activity AS a ON pid = procpid > WHERE l.mode IN ('RowExclusiveLock', 'ShareUpdateExclusiveLock', > 'ExclusiveLock') > GROUP BY 1,3 ORDER BY 2 DESC LIMIT 10; Why only those modes? I'd search for locks with granted=false, then see all the other locks held by the process that's holding the conflicting lock with granted=true (i.e. the one you're waiting on). -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Thank you for the hints. > Why only those modes? I'd search for locks with granted=false, then see > all the other locks held by the process that's holding the conflicting > lock with granted=true (i.e. the one you're waiting on). Something like this? SELECT granted, pid, virtualxid, transactionid, virtualtransaction, count(1) AS locks, current_query FROM pg_locks AS l LEFT JOIN pg_stat_activity AS a ON pid = procpid GROUP BY 1, 2, 3, 4, 5, 7 ORDER BY 1, 6 DESC; And two more queries to do extended analysis of its results after restarting PG: SELECT pg_stat_activity.datname, pg_class.relname, pg_locks.transactionid, pg_locks.mode, pg_locks.granted, pg_stat_activity.usename, pg_stat_activity.current_query, pg_stat_activity.query_start, age(now(),pg_stat_activity.query_start)AS "age", pg_stat_activity.procpid FROM pg_stat_activity, pg_locks LEFT OUTER JOIN pg_class ON pg_locks.relation = pg_class.oid WHERE pg_locks.pid = pg_stat_activity.procpid ORDER BY query_start; SELECT * FROM pg_locks; Are there another things I should do when the problem rise up again? -- Regards, Sergey Konoplev