Обсуждение: CREATE REPLICATION SLOT fails on a timeout
I am finding that my logical walsender connections are being terminated
due to a timeout on the CREATE REPLICATION SLOT command. with
"terminating walsender process due to replication timeout"
Below is the stack trace when this happens
#3 0x000000000067df28 in WalSndCheckTimeOut
(now=now@entry=453585463823871) at walsender.c:1748
#4 0x000000000067eedc in WalSndWaitForWal (loc=691727888) at
walsender.c:1216
#5 logical_read_xlog_page (state=<optimized out>,
targetPagePtr=691724288, reqLen=<optimized out>,
targetRecPtr=<optimized out>, cur_page=0x18476e0 "}\320\005",
pageTLI=<optimized out>) at walsender.c:741
#6 0x00000000004f41bf in ReadPageInternal (state=state@entry=0x17aa140,
pageptr=pageptr@entry=691724288,
reqLen=reqLen@entry=3600) at xlogreader.c:525
#7 0x00000000004f454a in XLogReadRecord (state=0x17aa140,
RecPtr=691727856, RecPtr@entry=0,
errormsg=errormsg@entry=0x7fff7f668c28) at xlogreader.c:228
#8 0x0000000000675e5c in DecodingContextFindStartpoint
(ctx=ctx@entry=0x17a0358) at logical.c:460
#9 0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at
walsender.c:800
#10 exec_replication_command (
cmd_string=cmd_string@entry=0x17f1478 "CREATE_REPLICATION_SLOT
\"slon_4_1\" LOGICAL \"slony1_funcs.2.2.0\"")
at walsender.c:1291
#11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x177db50, dbname=0x177db30 "test1",
(gdb) p last_reply_timestamp
$1 = 0
I propose the attached patch sets last_reply_timestamp to now() it
starts processing a command. Since receiving a command is hearing
something from the client.
Steve
Вложения
Hi,
On 2014-05-16 16:37:16 -0400, Steve Singer wrote:
> I am finding that my logical walsender connections are being terminated due
> to a timeout on the CREATE REPLICATION SLOT command. with "terminating
> walsender process due to replication timeout"
>
> Below is the stack trace when this happens
>
> #3 0x000000000067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871)
> at walsender.c:1748
> #4 0x000000000067eedc in WalSndWaitForWal (loc=691727888) at
> walsender.c:1216
> ...
> #9 0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at
> walsender.c:800
> #10 exec_replication_command () at walsender.c:1291
> #11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>,
> argv=argv@entry=0x177db50, dbname=0x177db30 "test1",
>
> (gdb) p last_reply_timestamp
> $1 = 0
>
>
> I propose the attached patch sets last_reply_timestamp to now() it starts
> processing a command. Since receiving a command is hearing something from
> the client.
Hm. Yes, that's a problem.
> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
> new file mode 100644
> index 5c11d68..56a2f10
> *** a/src/backend/replication/walsender.c
> --- b/src/backend/replication/walsender.c
> *************** exec_replication_command(const char *cmd
> *** 1276,1281 ****
> --- 1276,1282 ----
> parse_rc))));
>
> cmd_node = replication_parse_result;
> + last_reply_timestamp = GetCurrentTimestamp();
>
> switch (cmd_node->type)
> {
I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.
Greetings,
Andres Freund
On 05/16/2014 04:43 PM, Andres Freund wrote: > Hi, > > I don't think that's going to cut it though. The creation can take > longer than whatever wal_sender_timeout is set to (when there's lots of > longrunning transactions). I think checking whether last_reply_timestamp > = 0 during timeout checking is more robust. > > Greetings, > > Andres Freund > > That makes sense. A patch that does that is attached. Steve
Вложения
On 2014-05-16 17:02:33 -0400, Steve Singer wrote: > >I don't think that's going to cut it though. The creation can take > >longer than whatever wal_sender_timeout is set to (when there's lots of > >longrunning transactions). I think checking whether last_reply_timestamp > >= 0 during timeout checking is more robust. > That makes sense. > A patch that does that is attached. I've attached a heavily revised version of that patch. Didn't touch all the necessary places... Survives creation of logical replication slots under 'intensive pressure', with a wal_sender_timeout=10ms. Thanks for the bugreport! Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Вложения
Hi, On 2014-05-17 01:34:25 +0200, Andres Freund wrote: > On 2014-05-16 17:02:33 -0400, Steve Singer wrote: > > >I don't think that's going to cut it though. The creation can take > > >longer than whatever wal_sender_timeout is set to (when there's lots of > > >longrunning transactions). I think checking whether last_reply_timestamp > > >= 0 during timeout checking is more robust. > > > That makes sense. > > A patch that does that is attached. > > I've attached a heavily revised version of that patch. Didn't touch all > the necessary places... > > Survives creation of logical replication slots under 'intensive > pressure', with a wal_sender_timeout=10ms. > > Thanks for the bugreport! Pushed a fix for it. I am pretty sure it will, but could you still test that it fixes your problem? Thanks! Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 05/28/2014 06:41 PM, Andres Freund wrote: > Hi, > > > Pushed a fix for it. I am pretty sure it will, but could you still test > that it fixes your problem? > > Thanks! The fix seems to work (I am no longer getting the timeout on slot creation) Thanks > Andres Freund >