Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

Поиск
Список
Период
Сортировка
От Peter Smith
Тема Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Дата
Msg-id CAHut+Pt4PyKQCwqzQ=EFF=bpKKJD7XKt_S23F6L20ayQNxg77A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Ответы Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-bugs
On Wed, Nov 18, 2020 at 1:29 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2020-Nov-04, Amit Kapila wrote:
>
> > On Thu, Oct 15, 2020 at 8:20 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > > * STREAM COMMIT bug?
> > >   In apply_handle_stream_commit, we do CommitTransactionCommand, but
> > >   apparently in a tablesync worker we shouldn't do it.
> >
> > In the tablesync stage, we don't allow streaming. See pgoutput_startup
> > where we disable streaming for the init phase. As far as I understand,
> > for tablesync we create the initial slot during which streaming will
> > be disabled then we will copy the table (here logical decoding won't
> > be used) and then allow the apply worker to get any other data which
> > is inserted in the meantime. Now, I might be missing something here
> > but if you can explain it a bit more or share some test to show how we
> > can reach here via tablesync worker then we can discuss the possible
> > solution.
>
> Hmm, okay, that sounds like there would be no bug then.  Maybe what we
> need is just an assert in apply_handle_stream_commit that
> !am_tablesync_worker(), as in the attached patch.  Passes tests.

Hi.

Using the same debugging technique described in a previous mail [1], I
have tested again but this time using a SUBSCRIPTION capable of
streaming.

While paused in the debugger (to force an unusual timing situation) I
can publish INSERTs en masse and cause streaming replication to occur.

To cut a long story short, a tablesync worker CAN in fact end up
processing (e.g. apply_dispatch) streaming messages.
So the tablesync worker CAN get into the apply_handle_stream_commit.
And this scenario, albeit rare, will crash.

For example,
---
Program received signal SIGABRT, Aborted.
0x00007f37570f22c7 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f37570f22c7 in raise () from /lib64/libc.so.6
#1  0x00007f37570f39b8 in abort () from /lib64/libc.so.6
#2  0x0000000000f2bf6a in ExceptionalCondition
(conditionName=0x11330a4 "IsTransactionState()", errorType=0x1133094
"FailedAssertion",
    fileName=0x1133088 "tablesync.c", lineNumber=273) at assert.c:69
#3  0x0000000000bb046e in process_syncing_tables_for_sync
(current_lsn=23984432) at tablesync.c:273
#4  0x0000000000bb1039 in process_syncing_tables
(current_lsn=23984432) at tablesync.c:535
#5  0x0000000000bb5c7a in apply_handle_stream_commit
(s=0x7ffd2d892fd0) at worker.c:1080
#6  0x0000000000bb8edb in apply_dispatch (s=0x7ffd2d892fd0) at worker.c:1966
#7  0x0000000000bb97c9 in LogicalRepApplyLoop (last_received=23984432)
at worker.c:2166
#8  0x0000000000bbca84 in ApplyWorkerMain (main_arg=1) at worker.c:3122
#9  0x0000000000b36f11 in StartBackgroundWorker () at bgworker.c:820
#10 0x0000000000b5cf3a in do_start_bgworker (rw=0x1f3da50) at postmaster.c:5837
#11 0x0000000000b5d661 in maybe_start_bgworkers () at postmaster.c:6062
#12 0x0000000000b5b560 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5223
#13 <signal handler called>
#14 0x00007f37571b10d3 in __select_nocancel () from /lib64/libc.so.6
#15 0x0000000000b52a40 in ServerLoop () at postmaster.c:1691
#16 0x0000000000b51fa8 in PostmasterMain (argc=3, argv=0x1f164f0) at
postmaster.c:1400
#17 0x00000000009ab55d in main (argc=3, argv=0x1f164f0) at main.c:209
---

Also, PSA v2 of my "helper" patch for debugging the tablesync worker.
(This patch has more logging than v1)

---

[1] - https://www.postgresql.org/message-id/CAHut%2BPsprtsa4o89wtNnKLxxwXeDKAX9nNsdghT1Pv63siz%2BAA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Fujii Masao
Дата:
Сообщение: Re: BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table.
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop