Discussion: swarm of processes in BIND state?
hi,
we have the following situation:
pg 9.3.11 on ubuntu.
we have a master and a slave.
the db is large-ish, but we're removing *most* of its data from all
across the tables, and lots of tables too.

while we're doing it, sometimes, we get LOTS of processes, but only on
the slave, never on the master, that spend a long time in BIND state.
the same queries run on the master without any problem, and are fast.

any clues on where to start diagnosing it?

So far we've:
1. ruled out IO problems (enough io both in terms of bandwidth and iops)
2. ruled out memory (plenty)
3. vacuumed all tables, including system ones, with analyze
4. did pg_repack on most of the tables to remove bloat.

we are in the process of doing the same operation to a series of
servers, so i'm not sure if the criteria on all of them are the same,
but perhaps you can think of something we didn't?

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
Hi
2016-05-28 7:19 GMT+02:00 hubert depesz lubaczewski <depesz@depesz.com>:
> hi,
> we have the following situation:
> pg 9.3.11 on ubuntu.
> we have a master and a slave.
> the db is large-ish, but we're removing *most* of its data from all
> across the tables, and lots of tables too.
>
> while we're doing it, sometimes, we get LOTS of processes, but only on
> the slave, never on the master, that spend a long time in BIND state.
> the same queries run on the master without any problem, and are fast.
>
> any clues on where to start diagnosing it?
>
> So far we've:
> 1. ruled out IO problems (enough io both in terms of bandwidth and iops)
> 2. ruled out memory (plenty)
> 3. vacuumed all tables, including system ones, with analyze
> 4. did pg_repack on most of the tables to remove bloat.
>
> we are in the process of doing the same operation to a series of
> servers, so i'm not sure if the criteria on all of them are the same,
> but perhaps you can think of something we didn't?
It looks like a spinlock issue.
Try looking at it with "perf top".
Pavel
> depesz
>
> --
> The best thing about modern society is how easy it is to avoid contact with it.
> http://depesz.com/
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
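Pavel's "perf top" suggestion can be scripted roughly like this (a sketch, not from the thread; the process-title pattern and the "postgres" user name are assumptions based on the ps output shown later):

```shell
# Sketch: find backends whose process title ends in "BIND" and profile them.
# Assumes Linux perf is installed and backends run as the "postgres" user.

# bind_pids: read `ps -o pid=,cmd=` style lines on stdin, print a
# comma-separated list of PIDs whose command line ends in "BIND".
bind_pids() {
    awk '/postgres: .* BIND$/ { pids = pids (pids ? "," : "") $1 }
         END { print pids }'
}

# Live usage (needs root for perf):
#   sudo perf top -p "$(ps -u postgres -o pid=,cmd= | bind_pids)"
```

Restricting perf to the suspect PIDs keeps unrelated processes (ruby, pgbouncer, the kernel idle loop) out of the profile.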
On Sat, May 28, 2016 at 07:25:18AM +0200, Pavel Stehule wrote:
> It looks like a spinlock issue.
> Try looking at it with "perf top".

First results look like:

Samples: 64K of event 'cpu-clock', Event count (approx.): 2394094576
 32.08%  [unknown]             [.] 0x00007f27ed20a296
  6.31%  [kernel]              [k] finish_task_switch
  5.72%  [kernel]              [k] _raw_spin_unlock_irqrestore
  3.46%  postgres              [.] 0x00000000002764ec
  3.38%  libruby-2.1.so.2.1.0  [.] 0x000000000008a8d0
  2.76%  [kernel]              [k] __do_page_fault
  2.08%  postgres              [.] hash_search_with_hash_value
  1.94%  libc-2.19.so          [.] 0x0000000000160b96
  1.79%  postgres              [.] LWLockAcquire
  1.50%  postgres              [.] heap_page_prune_opt
  1.28%  postgres              [.] LWLockRelease
  1.09%  postgres              [.] heap_hot_search_buffer
  1.00%  [kernel]              [k] tick_nohz_idle_enter

I guess I have to use pg with debugging flags?

Best regards,

depesz
2016-05-28 7:45 GMT+02:00 hubert depesz lubaczewski <depesz@depesz.com>:
> On Sat, May 28, 2016 at 07:25:18AM +0200, Pavel Stehule wrote:
>> It looks like a spinlock issue.
>> Try looking at it with "perf top".
>
> First results look like:
>
> Samples: 64K of event 'cpu-clock', Event count (approx.): 2394094576
>  32.08%  [unknown]             [.] 0x00007f27ed20a296
>   6.31%  [kernel]              [k] finish_task_switch
>   5.72%  [kernel]              [k] _raw_spin_unlock_irqrestore
>   3.46%  postgres              [.] 0x00000000002764ec
>   3.38%  libruby-2.1.so.2.1.0  [.] 0x000000000008a8d0
>   2.76%  [kernel]              [k] __do_page_fault
>   2.08%  postgres              [.] hash_search_with_hash_value
>   1.94%  libc-2.19.so          [.] 0x0000000000160b96
>   1.79%  postgres              [.] LWLockAcquire
>   1.50%  postgres              [.] heap_page_prune_opt
>   1.28%  postgres              [.] LWLockRelease
>   1.09%  postgres              [.] heap_hot_search_buffer
>   1.00%  [kernel]              [k] tick_nohz_idle_enter
>
> I guess I have to use pg with debugging flags?
You should install debug info - or compile with debug symbols.
Pavel
> Best regards,
>
> depesz
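On Debian/Ubuntu, Pavel's suggestion usually means installing the server's debug-symbol package; this sketch just derives the package name (the postgresql-X.Y-dbg naming is an assumption to verify against your apt repository):

```shell
# Sketch: derive the debug-symbol package name for a given PostgreSQL
# major version on Debian/Ubuntu. The package name is an assumption;
# check with `apt-cache search postgresql | grep dbg`.
ver="9.3"                       # server major version from this thread
pkg="postgresql-${ver}-dbg"
echo "$pkg"

# Then, roughly:
#   sudo apt-get install "$pkg"
# and re-run `sudo perf top` to get resolved symbol names instead of
# raw addresses like 0x00000000002764ec.
```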
On Sat, May 28, 2016 at 07:46:52AM +0200, Pavel Stehule wrote:
> You should install debug info - or compile with debug symbols.

Installed debug info, and the problem stopped. I don't think it's
related - it could be just timing.

I'll report back if/when the problem re-appears.

Best regards,

depesz
On Fri, May 27, 2016 at 10:19 PM, hubert depesz lubaczewski <depesz@depesz.com> wrote:
> hi,
> we have the following situation:
> pg 9.3.11 on ubuntu.
> we have a master and a slave.
> the db is large-ish, but we're removing *most* of its data from all
> across the tables, and lots of tables too.
>
> while we're doing it, sometimes, we get LOTS of processes, but only on
> the slave, never on the master, that spend a long time in BIND state.
> the same queries run on the master without any problem, and are fast.

This sounds like a known issue, except the fix for that issue was
already incorporated into 9.3.11:

commit 4162a55c77cbb54acb4ac442ef3565b813b9d07a
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Tue Feb 25 16:04:09 2014 -0500

    Use SnapshotDirty rather than an active snapshot to probe index endpoints.

But it might be some variant not covered by that fix. The essence is
that if the planner is thinking about doing a merge join (even if it
never actually uses one) with the merge key being an indexed column, it
will consult the index to find the current min and max values. If the
min and max portions of the index are full of dead or uncommitted
tuples, it does a lot of work digging through the index looking for a
non-dead one.

> any clues on where to start diagnosing it?

I'd start by using strace (with -y -ttt -T) on one of the processes and
see what it is doing. A lot of IO, and on what file? A lot of semops?

If that wasn't informative, I'd attach to one of the processes with the
gdb debugger and get a backtrace. (You might want to do that a few
times, just in case the first one accidentally caught the code during a
part of its execution which was not in the bottlenecked spot.)

> So far we've:
> 1. ruled out IO problems (enough io both in terms of bandwidth and iops)

Are you saying that you are empirically not actually doing any IO
waits, or just that the IO capacity is theoretically sufficient?

Thanks,

Jeff
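Jeff's strace recipe can be captured and summarized along these lines (a sketch, not from the thread; the log name and the 10-second window are arbitrary choices). The flags do the work: -y resolves file descriptors to paths, -ttt prints absolute microsecond timestamps, -T prints the time spent in each syscall.

```shell
# Capture ~10 seconds of one suspect backend (shown, not run here):
#   sudo timeout 10 strace -o bad.log -y -ttt -T -p <pid>

# summarize: given `strace -ttt` output on stdin (timestamp first,
# then the syscall), count semop calls vs everything else.
summarize() {
    awk '{ key = ($2 ~ /^semop\(/) ? "semop" : "other"; n[key]++ }
         END { printf "semop=%d other=%d\n", n["semop"]+0, n["other"]+0 }'
}

# Usage: summarize < bad.log
```

A semop-dominated log points at lock waits rather than raw IO, which is exactly the distinction Jeff is asking about.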
On Sat, May 28, 2016 at 08:04:43AM +0200, Pavel Stehule wrote:
>>> You should install debug info - or compile with debug symbols.
>>
>> Installed debug info, and the problem stopped.

OK, so the problem is back. ps looked like this:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 9762 0.0 1.3 6816588 424512 ? S 05:49 0:09 /usr/lib/postgresql/9.3/bin/postgres -D /var/lib/postgresql/9.3/main -c config_file=/etc/postgresql/9.3/main/postgresql.conf
postgres 9763 1.6 20.3 6819980 6284520 ? Ss 05:49 11:49 \_ postgres: startup process recovering 0000000400007D61000000D2
postgres 10300 0.1 20.0 6827396 6190236 ? Ss 05:49 1:22 \_ postgres: checkpointer process
postgres 10301 0.1 19.9 6819736 6145144 ? Ss 05:49 0:55 \_ postgres: writer process
postgres 11111 0.0 0.0 104704 2076 ? Ss 05:49 0:11 \_ postgres: stats collector process
postgres 11148 0.7 0.0 6832848 4840 ? Ds 05:49 5:42 \_ postgres: wal receiver process streaming 7D61/D2CE4D90
postgres 43248 0.0 0.2 6823908 79072 ? Ss 06:41 0:00 \_ postgres: monitor dbname [local] idle
postgres 14802 4.0 18.5 6898688 5723652 ? Ss 09:57 19:54 \_ postgres: app_user dbname 10.1.15.177(44508) BIND
postgres 14804 2.9 16.3 6870980 5061020 ? Ss 09:57 14:25 \_ postgres: app_user dbname 10.1.15.177(44510) BIND
postgres 121044 6.2 16.8 6871756 5189808 ? Ss 13:19 17:52 \_ postgres: app_user dbname 10.1.14.77(58756) BIND
postgres 130506 5.0 15.6 6871536 4844704 ? Ss 13:38 13:22 \_ postgres: app_user dbname 10.1.15.177(47734) BIND
postgres 3767 2.8 13.6 6870876 4226928 ? Ss 13:45 7:20 \_ postgres: app_user dbname 10.1.14.77(59142) BIND
postgres 3768 3.7 16.0 6877232 4943916 ? Ss 13:45 9:35 \_ postgres: app_user dbname 10.1.14.77(59143) BIND
postgres 7168 0.0 0.0 6821248 8260 ? Ss 13:52 0:00 \_ postgres: pgbouncer dbname 10.1.16.158(50990) idle
postgres 7170 0.0 2.0 6863740 637368 ? Ss 13:52 0:01 \_ postgres: app_user dbname 10.1.16.158(50991) idle
postgres 23083 5.9 14.7 6872584 4552872 ? Ds 14:22 13:16 \_ postgres: app_user dbname 10.1.15.177(48363) BIND
postgres 70043 0.0 0.0 6821224 7548 ? Ss 15:52 0:00 \_ postgres: pgbouncer dbname 10.1.14.77(60980) idle
postgres 70719 0.0 0.0 6821248 7448 ? Ss 15:53 0:00 \_ postgres: pgbouncer dbname 10.1.15.177(49701) idle
postgres 71257 5.8 14.5 6870940 4479808 ? Ss 15:54 7:33 \_ postgres: app_user dbname 10.1.14.77(32783) BIND
postgres 77780 5.9 13.3 6871008 4119172 ? Ss 16:06 6:58 \_ postgres: app_user dbname 10.1.14.77(32951) BIND
postgres 84682 9.2 15.5 6877216 4806100 ? Ss 16:19 9:40 \_ postgres: app_user dbname 10.1.15.177(50074) BIND
postgres 100924 7.1 13.5 6870804 4176864 ? Ss 16:51 5:11 \_ postgres: app_user dbname 10.1.15.177(50537) BIND
postgres 118767 7.7 13.7 6870736 4251048 ? Ss 17:24 3:04 \_ postgres: app_user dbname 10.1.14.77(34090) BIND
postgres 118897 6.2 13.6 6870588 4219824 ? Ss 17:24 2:27 \_ postgres: app_user dbname 10.1.14.77(34094) BIND
postgres 118916 8.6 12.4 6869524 3852648 ? Ss 17:25 3:22 \_ postgres: app_user dbname 10.1.15.177(51026) BIND
postgres 118978 3.3 13.0 6868456 4033924 ? Ss 17:25 1:18 \_ postgres: app_user dbname 10.1.14.77(34100) BIND
postgres 119022 2.3 13.4 6829152 4148088 ? Ss 17:25 0:53 \_ postgres: app_user dbname 10.1.14.77(34101) BIND
postgres 119060 3.3 13.8 6869172 4265976 ? Ss 17:25 1:18 \_ postgres: app_user dbname 10.1.15.177(51034) BIND
postgres 119096 1.9 12.1 6828716 3754204 ? Ss 17:25 0:46 \_ postgres: app_user dbname 10.1.15.177(51037) BIND
postgres 119097 6.4 13.8 6870868 4263168 ? Ss 17:25 2:29 \_ postgres: app_user dbname 10.1.15.177(51038) BIND
postgres 119111 1.9 12.1 6826656 3763260 ? Ss 17:25 0:44 \_ postgres: app_user dbname 10.1.14.77(34105) BIND
postgres 119152 3.5 12.7 6869468 3921916 ? Ss 17:25 1:22 \_ postgres: app_user dbname 10.1.14.77(34107) BIND
postgres 119266 7.3 13.5 6868908 4193496 ? Ss 17:25 2:49 \_ postgres: app_user dbname 10.1.15.177(51041) BIND
postgres 119298 5.4 13.3 6870552 4107692 ? Ss 17:25 2:05 \_ postgres: app_user dbname 10.1.14.77(34110) BIND
postgres 119303 6.2 14.7 6870816 4553052 ? Ss 17:25 2:24 \_ postgres: app_user dbname 10.1.15.177(51043) BIND
postgres 119314 4.9 13.0 6869704 4024040 ? Ss 17:25 1:54 \_ postgres: app_user dbname 10.1.14.77(34113) BIND
postgres 119315 2.0 12.5 6869208 3881956 ? Ss 17:25 0:47 \_ postgres: app_user dbname 10.1.15.177(51045) BIND
postgres 119316 7.0 13.3 6870908 4136492 ? Ss 17:25 2:41 \_ postgres: app_user dbname 10.1.15.177(51046) BIND
postgres 119317 3.2 13.0 6870100 4018860 ? Ss 17:25 1:14 \_ postgres: app_user dbname 10.1.14.77(34114) BIND
postgres 119318 8.2 14.4 6870868 4457528 ? Ss 17:25 3:09 \_ postgres: app_user dbname 10.1.14.77(34115) BIND
postgres 119319 2.3 12.0 6827084 3724248 ? Ss 17:25 0:54 \_ postgres: app_user dbname 10.1.15.177(51047) BIND
postgres 119320 5.0 13.0 6870208 4037396 ? Ss 17:25 1:56 \_ postgres: app_user dbname 10.1.15.177(51048) BIND
postgres 119368 4.5 13.2 6868480 4085724 ? Ss 17:25 1:44 \_ postgres: app_user dbname 10.1.15.177(51050) BIND
postgres 119417 4.9 12.8 6869740 3981108 ? Ss 17:25 1:52 \_ postgres: app_user dbname 10.1.14.77(34117) BIND
postgres 119418 3.4 13.0 6869204 4027344 ? Ss 17:25 1:18 \_ postgres: app_user dbname 10.1.14.77(34118) BIND
postgres 119441 2.4 12.3 6828944 3826244 ? Ss 17:25 0:57 \_ postgres: app_user dbname 10.1.15.177(51051) BIND
postgres 119442 3.4 12.6 6869248 3896108 ? Ss 17:26 1:18 \_ postgres: app_user dbname 10.1.15.177(51052) BIND
postgres 119550 10.7 14.4 6870876 4448704 ? Ss 17:26 4:04 \_ postgres: app_user dbname 10.1.15.177(51062) BIND
postgres 119560 7.0 14.2 6870744 4402940 ? Ss 17:26 2:40 \_ postgres: app_user dbname 10.1.14.77(34125) BIND
postgres 119621 3.3 13.1 6870516 4052152 ? Ss 17:26 1:16 \_ postgres: app_user dbname 10.1.14.77(34128) BIND
postgres 119631 1.9 12.4 6827044 3858696 ? Ss 17:26 0:43 \_ postgres: app_user dbname 10.1.14.77(34129) BIND
postgres 130410 0.0 1.5 6824028 465040 ? Ss 17:47 0:00 \_ postgres: app_user dbname 10.1.15.177(51372) idle

I ran perf top and got:

Samples: 620K of event 'cpu-clock', Event count (approx.): 30018858044
 23.77%  postgres      [.] s_lock
 13.58%  postgres      [.] 0x000000000027668c
  8.00%  postgres      [.] LWLockAcquire
  6.45%  [kernel]      [k] _raw_spin_unlock_irqrestore
  6.13%  postgres      [.] hash_search_with_hash_value
  4.97%  postgres      [.] LWLockRelease
  4.12%  [kernel]      [k] finish_task_switch
  3.95%  postgres      [.] heap_page_prune_opt
  1.33%  postgres      [.] RecoveryInProgress
  1.19%  [kernel]      [k] copy_user_enhanced_fast_string
  1.09%  postgres      [.] _bt_checkkeys
  1.05%  postgres      [.] heap_hot_search_buffer
  1.02%  libc-2.19.so  [.] 0x0000000000160b96
  0.99%  postgres      [.] hash_any
...

Does that help us in any way?

Best regards,

depesz
hubert depesz lubaczewski <depesz@depesz.com> writes:
> Does that help us in any way?

Not terribly. That confirms that the processes are contending for a
spinlock, but we can't tell which one. Can you collect a few stack
traces from those processes?

regards, tom lane
On Sat, May 28, 2016 at 10:32:15AM -0700, Jeff Janes wrote:
>> any clues on where to start diagnosing it?
>
> I'd start by using strace (with -y -ttt -T) on one of the processes
> and see what it is doing. A lot of IO, and on what file? A lot of
> semops?

So, I did:

sudo strace -o bad.log -y -ttt -T -p $( ps uwwf -u postgres | grep BIND | awk '{print $2}' | head -n1 )

and killed it after 10 seconds, more or less. Results:

$ wc -l bad.log
6075 bad.log

$ grep -c semop bad.log
6018

The rest were reads, seeks, and a single open, on these files:

$ grep -v semop bad.log | grep -oE '/16421/[0-9.]*' | sort | uniq -c
      2 /16421/3062403236.20
      2 /16421/3062403236.8
     25 /16421/3222944583.49
     28 /16421/3251043620.60

Which are:

$ select oid::regclass from pg_class where relfilenode in (3062403236, 3222944583, 3251043620);
               oid
----------------------------------
 app_schema.s_table
 app_schema.v_table
 app_schema.m_table
(3 rows)

which are the 3 largest tables there are. But the logs don't show any
queries that would touch all 3 of them.

> If that wasn't informative, I'd attach to one of the processes with
> the gdb debugger and get a backtrace. (You might want to do that a
> few times, just in case the first one accidentally caught the code
> during a part of its execution which was not in the bottlenecked
> spot.)

I did:

for a in $( ps uww -U postgres | grep BIND | awk '{print $2}' ); do echo "bt" | gdb -p $a > $a.bt.log 2>&1; done

Since there is lots of output, I made a tarball with it, and put it on
https://depesz.com/various/all.bt.logs.tar.gz

The file is ~ 19kB.

>> So far we've:
>> 1. ruled out IO problems (enough io both in terms of bandwidth and iops)
>
> Are you saying that you are empirically not actually doing any IO
> waits, or just that the IO capacity is theoretically sufficient?

There are no iowaits per what iostat returns. Or there are, but very low.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
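The relfilenode-to-table lookup above can be scripted; this is a sketch (the helper name is made up, and the path shape assumes the usual base/<db-oid>/<relfilenode>.<segment> layout that the strace output above shows):

```shell
# strace_relfilenodes: read strace output on stdin and print the
# distinct relfilenode numbers seen in paths like .../16421/3062403236.20.
strace_relfilenodes() {
    grep -oE '/[0-9]+/[0-9]+(\.[0-9]+)?' |
        awk -F'[/.]' '{ print $3 }' | sort -un
}

# Live usage: feed the numbers into the catalog lookup, roughly:
#   psql -c "select oid::regclass from pg_class
#            where relfilenode in ($(grep -v semop bad.log |
#                                    strace_relfilenodes | paste -sd, -));"
```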
On Sat, May 28, 2016 at 02:15:07PM -0400, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
>> Does that help us in any way?
>
> Not terribly. That confirms that the processes are contending for a
> spinlock, but we can't tell which one. Can you collect a few stack
> traces from those processes?

Sure, tarballed them into https://depesz.com/various/all.bt.logs.tar.gz.

Best regards,

depesz
On Sat, May 28, 2016 at 11:32 AM, hubert depesz lubaczewski <depesz@depesz.com> wrote:
> On Sat, May 28, 2016 at 10:32:15AM -0700, Jeff Janes wrote:
>> If that wasn't informative, I'd attach to one of the processes with
>> the gdb debugger and get a backtrace. (You might want to do that a
>> few times, just in case the first one accidentally caught the code
>> during a part of its execution which was not in the bottlenecked
>> spot.)
>
> I did:
> for a in $( ps uww -U postgres | grep BIND | awk '{print $2}' ); do echo "bt" | gdb -p $a > $a.bt.log 2>&1; done
>
> Since there is lots of output, I made a tarball with it, and put it on
> https://depesz.com/various/all.bt.logs.tar.gz
>
> The file is ~ 19kB.

If you look at the big picture, it is what I thought: the planner
probing the index endpoints when planning a merge join. Although I
don't know how that turns into the low-level specifics you are seeing.

It looks like everyone becomes interested in the same disk page at the
same time. One process starts reading in that page, and all the others
queue up on a lock waiting for that one to finish. So what you see is
1 disk wait and N-1 semop waits. But if the page is that popular, why
is it not staying in cache? Either which page is popular is moving
around quickly (and it is hard to see how that would be plausible if it
represents the index endpoints), or there are so many simultaneously
popular pages that they can't all fit in cache.

So my theory is that you deleted a huge number of entries off from
either end of the index, that transaction committed, and that commit
became visible to all. Planning a merge join needs to dig through all
those tuples to probe the true endpoint. On the master, the index
entries quickly get marked as LP_DEAD so future probes don't have to do
all that work, but on the replicas those index hint bits are, for some
reason unknown to me, not getting set. So it has to scour all the heap
pages which might have the smallest/largest tuple, on every planning
cycle, and that list of pages is very large, leading to occasional IO
stalls.

Or perhaps the master realizes the deleting transaction is
committed-visible-to-all, but the replicas believe there are still some
transactions which could care about them, and that is the reason they
are not getting hinted?

>>> So far we've:
>>> 1. ruled out IO problems (enough io both in terms of bandwidth and iops)
>>
>> Are you saying that you are empirically not actually doing any IO
>> waits, or just that the IO capacity is theoretically sufficient?
>
> There are no iowaits per what iostat returns. Or there are, but very low.

If each IO wait has a pile-up of processes waiting behind it on semops,
then it could have a much larger effect than the raw numbers would
indicate.

Cheers,

Jeff
On Mon, May 30, 2016 at 11:05:17AM -0700, Jeff Janes wrote:
> So my theory is that you deleted a huge number of entries off from
> either end of the index, that transaction committed, and that commit
> became visible to all. Planning a merge join needs to dig through all
> those tuples to probe the true endpoint. On the master, the index
> entries quickly get marked as LP_DEAD so future probes don't have to
> do all that work, but on the replicas those index hint bits are, for
> some reason unknown to me, not getting set. So it has to scour all
> the heap pages which might have the smallest/largest tuple, on every
> planning cycle, and that list of pages is very large, leading to
> occasional IO stalls.

This I get, but why was the same backend reading data for all 3 largest
tables, while I know for sure (well, 99.9% sure) that no query touches
all of them?

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/