Обсуждение: [HACKERS] Multiple table synchronizations are processed serially

Поиск
Список
Период
Сортировка

[HACKERS] Multiple table synchronizations are processed serially

От
Masahiko Sawada
Дата:
Hi,

While testing table synchronization in logical replication, I found
that multiple table synchronizations for a subscription are processed
serially due to lock wait.

I setup pgbench tables on publisher (scalefactor = 1000) and on
subscriber, and truncated these tables on subscriber, and then created
publication for all pgbench tables. When I created a subscription I
got the following log messages.

2017-05-19 13:12:56.311 JST [54970] LOG:  logical replication apply
for subscription hoge_sub started
2017-05-19 13:12:56.314 JST [54970] LOG:  starting logical replication
worker for subscription "hoge_sub"
2017-05-19 13:12:56.317 JST [54972] LOG:  logical replication sync for
subscription hoge_sub, table pgbench_accounts started
2017-05-19 13:12:57.315 JST [54970] LOG:  starting logical replication
worker for subscription "hoge_sub"
2017-05-19 13:12:57.318 JST [54974] LOG:  logical replication sync for
subscription hoge_sub, table pgbench_branches started
2017-05-19 13:12:58.317 JST [54970] LOG:  starting logical replication
worker for subscription "hoge_sub"
2017-05-19 13:12:58.320 JST [54976] LOG:  logical replication sync for
subscription hoge_sub, table pgbench_history started
2017-05-19 13:12:59.319 JST [54970] LOG:  starting logical replication
worker for subscription "hoge_sub"
2017-05-19 13:12:59.322 JST [54978] LOG:  logical replication sync for
subscription hoge_sub, table pgbench_tellers started

Seems all four table sync workers are launched at the almost same
time, but three workers of them get stuck in idle transaction state
when creating replication slot. That is these waiting workers cannot
proceed its work until first connected table sync worker finishes. ps
command shows the followings.

54898 ?        Ss     0:00 postgres: bgworker: logical replication launcher
54970 ?        Ss     0:00 postgres: bgworker: logical replication
worker for subscription 16432
54971 ?        Ss     0:00 postgres: wal sender process masahiko [local] idle
54972 ?        Rs    10:38 postgres: bgworker: logical replication
worker for subscription 16432 sync 16395
54973 ?        Ss     1:52 postgres: wal sender process masahiko [local] COPY
54974 ?        Ss     0:00 postgres: bgworker: logical replication
worker for subscription 16432 sync 16398
54975 ?        Ss     0:00 postgres: wal sender process masahiko
[local] idle in transaction waiting
54976 ?        Ss     0:00 postgres: bgworker: logical replication
worker for subscription 16432 sync 16389
54977 ?        Ss     0:00 postgres: wal sender process masahiko
[local] idle in transaction waiting
54978 ?        Ss     0:00 postgres: bgworker: logical replication
worker for subscription 16432 sync 16392
54979 ?        Ss     0:00 postgres: wal sender process masahiko
[local] idle in transaction waiting

Also, here is backtrace of the waiting wal sender process.

#0  0x00007f24b4030783 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x000000000084c58d in WaitEventSetWaitBlock (set=0x25ed8e0,
cur_timeout=-1, occurred_events=0x7fff1ea322f0, nevents=1) at
latch.c:1045
#2  0x000000000084c454 in WaitEventSetWait (set=0x25ed8e0, timeout=-1,
occurred_events=0x7fff1ea322f0, nevents=1, wait_event_info=50331652)
at latch. c:997
#3  0x000000000084bbe5 in WaitLatchOrSocket (latch=0x7f24ad385024,
wakeEvents=1, sock=-1, timeout=-1, wait_event_info=50331652) at
latch.c:385
#4  0x000000000084bac6 in WaitLatch (latch=0x7f24ad385024,
wakeEvents=1, timeout=0, wait_event_info=50331652) at latch.c:339
#5  0x0000000000863632 in ProcSleep (locallock=0x25f2fb0,
lockMethodTable=0xbdb020) at proc.c:1255
#6  0x000000000085c83c in WaitOnLock (locallock=0x25f2fb0,
owner=0x2604dd8) at lock.c:1702
#7  0x000000000085b541 in LockAcquireExtended (locktag=0x7fff1ea32800,
lockmode=5, sessionLock=0 '\000', dontWait=0 '\000',
reportMemoryError=1 '\001\ ') at lock.c:998
#8  0x000000000085ab46 in LockAcquire (locktag=0x7fff1ea32800,
lockmode=5, sessionLock=0 '\000', dontWait=0 '\000') at lock.c:688
#9  0x0000000000859a16 in XactLockTableWait (xid=598, rel=0x0,
ctid=0x0, oper=XLTW_None) at lmgr.c:587
#10 0x000000000080d87a in SnapBuildWaitSnapshot (running=0x25ed8a8,
cutoff=599) at snapbuild.c:1400
#11 0x000000000080d64c in SnapBuildFindSnapshot (builder=0x26ab530,
lsn=14202458264, running=0x25ed8a8) at snapbuild.c:1311
#12 0x000000000080d1bb in SnapBuildProcessRunningXacts
(builder=0x26ab530, lsn=14202458264, running=0x25ed8a8) at
snapbuild.c:1106
#13 0x00000000007fb779 in DecodeStandbyOp (ctx=0x25f7090,
buf=0x7fff1ea32930) at decode.c:314
#14 0x00000000007fb28f in LogicalDecodingProcessRecord (ctx=0x25f7090,
record=0x25f7350) at decode.c:117
#15 0x00000000007ff0d5 in DecodingContextFindStartpoint
(ctx=0x25f7090) at logical.c:458
#16 0x00000000008141c8 in CreateReplicationSlot (cmd=0x25ecba8) at
walsender.c:941
#17 0x0000000000814ec5 in exec_replication_command
(cmd_string=0x26511e0 "CREATE_REPLICATION_SLOT
\"hoge_sub_16432_sync_16398\" TEMPORARY LOGICAL pgoutput
USE_SNAPSHOT") at walsender.c:1508
#18 0x000000000087d2d5 in PostgresMain (argc=1, argv=0x25fb2b8,
dbname=0x25fb100 "postgres", username=0x25fb0d8 "masahiko") at
postgres.c:4071
#19 0x00000000007e0ffd in BackendRun (port=0x25f37f0) at postmaster.c:4334
#20 0x00000000007e077d in BackendStartup (port=0x25f37f0) at postmaster.c:4006
#21 0x00000000007dcd86 in ServerLoop () at postmaster.c:1743
#22 0x00000000007dc44a in PostmasterMain (argc=5, argv=0x25cceb0) at
postmaster.c:1351
#23 0x00000000007184a2 in main (argc=5, argv=0x25cceb0) at main.c:228

It also blocks subscriptions that will be created later. I'd say it's
not an expected behavior. Any thoughts?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Multiple table synchronizations are processed serially

От
Peter Eisentraut
Дата:
On 5/19/17 01:01, Masahiko Sawada wrote:
> Seems all four table sync workers are launched at the almost same
> time, but three workers of them get stuck in idle transaction state
> when creating replication slot. That is these waiting workers cannot
> proceed its work until first connected table sync worker finishes. ps
> command shows the followings.

Creating a replication slot waits for all transactions to finish.  So if
one of those transactions is a table copy of another subscription, it
has to wait for that.

You can avoid that by creating all the slots first and then triggering
the initial table copy separately.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Multiple table synchronizations are processed serially

От
Petr Jelinek
Дата:
On 19/05/17 21:47, Peter Eisentraut wrote:
> On 5/19/17 01:01, Masahiko Sawada wrote:
>> Seems all four table sync workers are launched at the almost same
>> time, but three workers of them get stuck in idle transaction state
>> when creating replication slot. That is these waiting workers cannot
>> proceed its work until first connected table sync worker finishes. ps
>> command shows the followings.
> 
> Creating a replication slot waits for all transactions to finish.  So if
> one of those transactions is a table copy of another subscription, it
> has to wait for that.
> 

Well IMHO this concrete example is partially effect of the snapshort
builder fixes we did with Andres. Before those fixes the ondisk snapshot
may have been used in this situation. Makes me think if we want to mark
the exported snapshot as containing all transactions and using it if
that's the case.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] Multiple table synchronizations are processed serially

От
Masahiko Sawada
Дата:
On Sat, May 20, 2017 at 4:47 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 5/19/17 01:01, Masahiko Sawada wrote:
>> Seems all four table sync workers are launched at the almost same
>> time, but three workers of them get stuck in idle transaction state
>> when creating replication slot. That is these waiting workers cannot
>> proceed its work until first connected table sync worker finishes. ps
>> command shows the followings.
>
> Creating a replication slot waits for all transactions to finish.  So if
> one of those transactions is a table copy of another subscription, it
> has to wait for that.
>
> You can avoid that by creating all the slots first and then triggering
> the initial table copy separately.
>

Thank you for suggestion!
I understood the reason why subsequent processes have to wait. If some
of tables belonging to publication are very large, the idle
transaction will wait for a long time, which is not good. So as you
mentioned it seems to me that it's better to create all slots first
and then trigger the initial table copy separately especially in such
case.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center