Обсуждение: CREATE REPLICATION SLOT fails on a timeout

Поиск
Список
Период
Сортировка

CREATE REPLICATION SLOT fails on a timeout

От
Steve Singer
Дата:
I am finding that my logical walsender connections are being terminated
due to a timeout on the CREATE REPLICATION SLOT command. with
"terminating walsender process due to replication timeout"

Below is the stack trace when this happens

#3  0x000000000067df28 in WalSndCheckTimeOut
(now=now@entry=453585463823871) at walsender.c:1748
#4  0x000000000067eedc in WalSndWaitForWal (loc=691727888) at
walsender.c:1216
#5  logical_read_xlog_page (state=<optimized out>,
targetPagePtr=691724288, reqLen=<optimized out>,
     targetRecPtr=<optimized out>, cur_page=0x18476e0 "}\320\005",
pageTLI=<optimized out>) at walsender.c:741
#6  0x00000000004f41bf in ReadPageInternal (state=state@entry=0x17aa140,
pageptr=pageptr@entry=691724288,
     reqLen=reqLen@entry=3600) at xlogreader.c:525
#7  0x00000000004f454a in XLogReadRecord (state=0x17aa140,
RecPtr=691727856, RecPtr@entry=0,
     errormsg=errormsg@entry=0x7fff7f668c28) at xlogreader.c:228
#8  0x0000000000675e5c in DecodingContextFindStartpoint
(ctx=ctx@entry=0x17a0358) at logical.c:460
#9  0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at
walsender.c:800
#10 exec_replication_command (
     cmd_string=cmd_string@entry=0x17f1478 "CREATE_REPLICATION_SLOT
\"slon_4_1\" LOGICAL \"slony1_funcs.2.2.0\"")
     at walsender.c:1291
#11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x177db50, dbname=0x177db30 "test1",

(gdb) p last_reply_timestamp
$1 = 0


I propose the attached patch sets last_reply_timestamp to now() it
starts processing a command.  Since receiving a command is hearing
something from the client.


Steve


Вложения

Re: CREATE REPLICATION SLOT fails on a timeout

От
Andres Freund
Дата:
Hi,

On 2014-05-16 16:37:16 -0400, Steve Singer wrote:
> I am finding that my logical walsender connections are being terminated due
> to a timeout on the CREATE REPLICATION SLOT command. with "terminating
> walsender process due to replication timeout"
> 
> Below is the stack trace when this happens
> 
> #3  0x000000000067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871)
> at walsender.c:1748
> #4  0x000000000067eedc in WalSndWaitForWal (loc=691727888) at
> walsender.c:1216
> ...
> #9  0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at
> walsender.c:800
> #10 exec_replication_command ()  at walsender.c:1291
> #11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>,
> argv=argv@entry=0x177db50, dbname=0x177db30 "test1",
> 
> (gdb) p last_reply_timestamp
> $1 = 0
> 
> 
> I propose the attached patch sets last_reply_timestamp to now() it starts
> processing a command.  Since receiving a command is hearing something from
> the client.

Hm. Yes, that's a problem.

> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
> new file mode 100644
> index 5c11d68..56a2f10
> *** a/src/backend/replication/walsender.c
> --- b/src/backend/replication/walsender.c
> *************** exec_replication_command(const char *cmd
> *** 1276,1281 ****
> --- 1276,1282 ----
>                                     parse_rc))));
>   
>       cmd_node = replication_parse_result;
> +     last_reply_timestamp = GetCurrentTimestamp();
>   
>       switch (cmd_node->type)
>       {

I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.

Greetings,

Andres Freund



Re: CREATE REPLICATION SLOT fails on a timeout

От
Steve Singer
Дата:
On 05/16/2014 04:43 PM, Andres Freund wrote:
> Hi,
>
> I don't think that's going to cut it though. The creation can take
> longer than whatever wal_sender_timeout is set to (when there's lots of
> longrunning transactions). I think checking whether last_reply_timestamp
> = 0 during timeout checking is more robust.
>
> Greetings,
>
> Andres Freund
>
>


That makes sense.
A patch that does that is attached.

Steve

Вложения

Re: CREATE REPLICATION SLOT fails on a timeout

От
Andres Freund
Дата:
On 2014-05-16 17:02:33 -0400, Steve Singer wrote:
> >I don't think that's going to cut it though. The creation can take
> >longer than whatever wal_sender_timeout is set to (when there's lots of
> >longrunning transactions). I think checking whether last_reply_timestamp
> >= 0 during timeout checking is more robust.

> That makes sense.
> A patch that does that is attached.

I've attached a heavily revised version of that patch. Didn't touch all
the necessary places...

Survives creation of logical replication slots under 'intensive
pressure', with a wal_sender_timeout=10ms.

Thanks for the bugreport!

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: CREATE REPLICATION SLOT fails on a timeout

От
andres@anarazel.de (Andres Freund)
Дата:
Hi,

On 2014-05-17 01:34:25 +0200, Andres Freund wrote:
> On 2014-05-16 17:02:33 -0400, Steve Singer wrote:
> > >I don't think that's going to cut it though. The creation can take
> > >longer than whatever wal_sender_timeout is set to (when there's lots of
> > >longrunning transactions). I think checking whether last_reply_timestamp
> > >= 0 during timeout checking is more robust.
> 
> > That makes sense.
> > A patch that does that is attached.
> 
> I've attached a heavily revised version of that patch. Didn't touch all
> the necessary places...
> 
> Survives creation of logical replication slots under 'intensive
> pressure', with a wal_sender_timeout=10ms.
> 
> Thanks for the bugreport!

Pushed a fix for it. I am pretty sure it will, but could you still test
that it fixes your problem?

Thanks!

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: CREATE REPLICATION SLOT fails on a timeout

От
Steve Singer
Дата:
On 05/28/2014 06:41 PM, Andres Freund wrote:
> Hi,
>
>
> Pushed a fix for it. I am pretty sure it will, but could you still test
> that it fixes your problem?
>
> Thanks!

The fix seems to work (I am no longer getting the timeout on slot creation)

Thanks



> Andres Freund
>