Discussion: Repeated requests for feedback in logical standby
I've been running some tests, and come across a weird behavior on shutdown with logical replication, 9.5.2. It seems that on shutdown, a lot of extra requests for feedback are sent from the walsender.

On my laptop I get about 3-15 requests (different each time), and on a server with a bit more power but still very low activity, the number is simply huge.

Basically what happens is:

* Replication works fine
* I have *zero* activity on the master since it got started, not even a single psql
* I issue a fast shutdown on the master
* Nothing happens at all for a while
* Once pg_recvlogical times out and sends a status message things wake up
* The server sends a whole bunch of requests for status, which pg_recvlogical responds to
* Eventually things shut down (on the proper server, I've had to kill things because it takes too long - I'm unsure if it ever shuts down)

Log output from pg_recvlogical:

$ ./pg_recvlogical -v -p 5495 -S trivialslot -o include_transaction=on -v -d postgres --start -f - -P test_decoding 2>&1 | ts '[%Y-%m-%d %H:%M:%S]'
[2016-04-12 17:14:31] pg_recvlogical: starting log streaming at 0/0 (slot trivialslot)
[2016-04-12 17:14:31] pg_recvlogical: streaming initiated
[2016-04-12 17:14:31] pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot trivialslot)
[2016-04-12 17:14:41] pg_recvlogical: confirming write up to 0/18F9DE88, flush to 0/18F9DE88 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DEF8, flush to 0/18F9DEF8 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0, flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0, flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0, flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0, flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0, flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: disconnected; waiting 5 seconds to try again

In this example, the server has:

2016-04-12 17:14:45 CEST LOG: received fast shutdown request
2016-04-12 17:14:45 CEST LOG: aborting any active transactions
2016-04-12 17:14:45 CEST LOG: autovacuum launcher shutting down
2016-04-12 17:14:45 CEST LOG: shutting down
2016-04-12 17:14:45 CEST LOG: database system is shut down

Note how it took a while (6 seconds) for pg_recvlogical to pick up that anything was happening - looks like something might not have been flushed?

When running on the server where there's more data it was very clear though - if I break the walsender with gdb, pg_recvlogical stops spitting out debugging messages, and as soon as I "cont" in gdb it starts spitting them out again, so there's a clear connection between the two. And it takes a *lot* more than a few seconds.

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
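[Editor's note: the steps above can be sketched roughly as follows. This is a hypothetical repro outline against a live 9.5 cluster, not runnable standalone; the port (5495), slot name (trivialslot), and plugin option come from the thread, while the data directory path is a placeholder. It assumes wal_level = logical and max_replication_slots > 0 on the master.]

```shell
# Create the logical slot (slot name and plugin as used in the thread):
pg_recvlogical -p 5495 -d postgres -S trivialslot \
    --create-slot -P test_decoding

# Terminal 1: stream changes, writing decoded output to stdout:
pg_recvlogical -v -p 5495 -d postgres -S trivialslot \
    -o include_transaction=on --start -f -

# Terminal 2: with zero other activity on the master, request a fast
# shutdown and watch how long the walsender takes to let go:
pg_ctl -D /path/to/datadir stop -m fast
```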
On 12 April 2016 at 16:18, Magnus Hagander <magnus@hagander.net> wrote:
>
> I've been running some tests, and come across a weird behavior on shutdown with logical replication, 9.5.2. It seems that on shutdown, a lot of extra requests for feedback are sent from the walsender.
>
> On my laptop I get about 3-15 requests (different each time), and on a server with a bit more power but still very low activity, the number is simply huge.
>
> Basically what happens is:
>
> * Replication works fine
> * I have *zero* activity on the master since it got started, not even a single psql
> * I issue a fast shutdown on the master
> * Nothing happens at all for a while
> * Once pg_recvlogical times out and sends a status message things wake up
> * The server sends a whole bunch of requests for status, which pg_recvlogical responds to
> * Eventually things shut down (on the proper server, I've had to kill things because it takes too long - I'm unsure if it ever shuts down)

That sounds similar to something I've seen with WAL-based replication:

http://www.postgresql.org/message-id/flat/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com#CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com

Nick
On Thu, Apr 14, 2016 at 7:34 AM, Nick Cleaton <nick@cleaton.net> wrote:
> On 12 April 2016 at 16:18, Magnus Hagander <magnus@hagander.net> wrote:
> >
> > I've been running some tests, and come across a weird behavior on shutdown with logical replication, 9.5.2. It seems that on shutdown, a lot of extra requests for feedback are sent from the walsender.
> >
> > On my laptop I get about 3-15 requests (different each time), and on a server with a bit more power but still very low activity, the number is simply huge.
> >
> > Basically what happens is:
> >
> > * Replication works fine
> > * I have *zero* activity on the master since it got started, not even a single psql
> > * I issue a fast shutdown on the master
> > * Nothing happens at all for a while
> > * Once pg_recvlogical times out and sends a status message things wake up
> > * The server sends a whole bunch of requests for status, which pg_recvlogical responds to
> > * Eventually things shut down (on the proper server, I've had to kill things because it takes too long - I'm unsure if it ever shuts down)
>
> That sounds similar to something I've seen with wal-based replication:
>
> http://www.postgresql.org/message-id/flat/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com#CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com

Agreed, that sounds very similar.

I don't know the code enough to say whether that's a proper fix though, or if it might have other side effects.

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Thu, Apr 14, 2016 at 12:30 PM, Magnus Hagander <magnus@hagander.net> wrote:
> On Thu, Apr 14, 2016 at 7:34 AM, Nick Cleaton <nick@cleaton.net> wrote:
>> On 12 April 2016 at 16:18, Magnus Hagander <magnus@hagander.net> wrote:
>> [...]
>>
>> That sounds similar to something I've seen with wal-based replication:
>>
>> http://www.postgresql.org/message-id/flat/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com#CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com
>
> Agreed, that sounds very similar.
>
> I don't know the code enough to say whether that's a proper fix though, or if it might have other side effects.

I should add that yes, this patch does seem to remove the problem for me as well. But I can't confirm whether it breaks something else. Hopefully someone who has worked more on that code can?

I also still see the problem that "pg_ctl stop -m fast" on the master does *nothing* until the client sends one message.
I tried increasing the feedback interval to 2 minutes, and that increased the shutdown time to 2 minutes as well. Which can't be right - clearly the shutdown is failing to actually make the walsender send the request and shut down.

This happens both with and without the suggested patch.

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
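[Editor's note: the message above doesn't show how the feedback interval was raised. One way to do it on the client side, assuming the pg_recvlogical setup from earlier in the thread, is pg_recvlogical's --status-interval (-s) option, which controls how often status/feedback packets are sent back to the server (default 10 seconds). A sketch, not runnable without a live cluster:]

```shell
# Raise the client's status-report interval from the default 10s to
# 2 minutes; port and slot name are the ones used earlier in the thread:
pg_recvlogical -v -p 5495 -d postgres -S trivialslot \
    --status-interval=120 --start -f -
```

With a shutdown that only completes once the client next reports status, this would be expected to stretch the observed shutdown time to roughly the new interval, matching the behavior described above.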
On 14/04/16 12:51, Magnus Hagander wrote:
> I should add that yes, this patch does seem to remove the problem for me as well. But I can't confirm if it breaks something else.

Did you also see whether it fixes the problem on the bigger box which seemed to not shut down at all?

.m
On Thu, Apr 14, 2016 at 1:14 PM, Marko Tiikkaja <marko@joh.to> wrote:
> On 14/04/16 12:51, Magnus Hagander wrote:
>> I should add that yes, this patch does seem to remove the problem for me as well. But I can't confirm if it breaks something else.
>
> Did you also see whether it fixes the problem on the bigger box which seemed to not shut down at all?

Not yet, but those symptoms sound even more like Nick's. I have confirmed that it fixes the problem on two different installations that I have locally here, though.

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 14 April 2016 at 11:30, Magnus Hagander <magnus@hagander.net> wrote:
>> http://www.postgresql.org/message-id/flat/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com#CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com
>
> Agreed, that sounds very similar.
>
> I don't know the code enough to say whether that's a proper fix though, or if it might have other side effects.

Certainly. My "this fixes it for me" was intended more as a piece of diagnostic information than as a proposed change to the code.
On Thu, Apr 14, 2016 at 8:54 PM, Nick Cleaton <nick@cleaton.net> wrote:
> Certainly. My "this fixes it for me" was intended more as a piece of
> diagnostic information than as a proposed change to the code.

Bumping this thread for the archives' sake: a fix has been pushed as 5945501 (the commit log references this thread as well).

-- 
Michael
On Thu, Apr 14, 2016 at 1:21 PM, Magnus Hagander <magnus@hagander.net> wrote:
> On Thu, Apr 14, 2016 at 1:14 PM, Marko Tiikkaja <marko@joh.to> wrote:
>> On 14/04/16 12:51, Magnus Hagander wrote:
>>> I should add that yes, this patch does seem to remove the problem for me as well. But I can't confirm if it breaks something else.
>>
>> Did you also see whether it fixes the problem on the bigger box which seemed to not shut down at all?
>
> Not yet, but those symptoms sound even more like those of Nick. I have it fixing two different installations that I have locally here though.

Finally got around to testing this properly. And no, it does not work - 9.5.3 with logical replication.

I captured a couple of stack traces, and they all look like this:

#0  0x00007f8fa933e6b0 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00005641a49aded3 in read (__nbytes=8192, __buf=0x5641a53d9090, __fd=6) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2  XLogRead (buf=buf@entry=0x5641a53d9090 "\207\320\002", startptr=startptr@entry=1056964608, count=count@entry=8192) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:2124
#3  0x00005641a49af0db in logical_read_xlog_page (state=<optimized out>, targetPagePtr=1056964608, reqLen=<optimized out>, targetRecPtr=<optimized out>, cur_page=0x5641a53d9090 "\207\320\002", pageTLI=<optimized out>) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:772
#4  0x00005641a480f4d2 in ReadPageInternal (state=state@entry=0x5641a53d6b08, pageptr=pageptr@entry=1062813696, reqLen=reqLen@entry=360) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/access/transam/xlogreader.c:524
#5  0x00005641a480fed1 in XLogReadRecord (state=0x5641a53d6b08, RecPtr=1062814032, errormsg=errormsg@entry=0x7ffe3bd8ee48) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/access/transam/xlogreader.c:243
#6  0x00005641a49ae17b in XLogSendLogical () at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:2437
#7  0x00005641a49afc63 in WalSndLoop (send_data=send_data@entry=0x5641a49ae150 <XLogSendLogical>) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:1852
#8  0x00005641a49b0aa2 in StartLogicalReplication (cmd=<optimized out>) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:1015
#9  exec_replication_command (cmd_string=cmd_string@entry=0x5641a53a3c50 "START_REPLICATION SLOT \"nineone\" LOGICAL 0/3F5244F8 (\"include_transaction\" 'on')") at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:1344
#10 0x00005641a49ed8b4 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x5641a5335118, dbname=0x5641a5335030 "db", username=<optimized out>) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/tcop/postgres.c:4028
#11 0x00005641a47925b8 in BackendRun (port=0x5641a5354bb0) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:4239
#12 BackendStartup (port=0x5641a5354bb0) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:3913
#13 ServerLoop () at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:1684
#14 0x00005641a4993e7b in PostmasterMain (argc=5, argv=<optimized out>) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:1292
#15 0x00005641a47932b2 in main (argc=5, argv=0x5641a53341f0) at /build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/main/main.c:228

They all come out at approximately the same place there.
-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
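[Editor's note: the backtraces above were presumably captured by attaching gdb to the walsender backend, as described earlier in the thread. A sketch of one way to do that against a live cluster; the port matches earlier messages and "<walsender_pid>" is a placeholder to fill in:]

```shell
# Find the walsender backend's PID on the master
# (pg_stat_replication shows one row per connected walsender):
psql -p 5495 -d postgres -c "SELECT pid, state FROM pg_stat_replication;"

# Attach, dump a backtrace, and detach so the walsender keeps running
# (stopping it in gdb pauses pg_recvlogical's output, as observed above):
gdb -p <walsender_pid> -batch \
    -ex "bt" \
    -ex "detach"
```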