Re: BUG #17695: Failed Assert in logical replication snapbuild.

Поиск
Список
Период
Сортировка
От Masahiko Sawada
Тема Re: BUG #17695: Failed Assert in logical replication snapbuild.
Дата
Msg-id CAD21AoCsxvV3Mpzv8Q3hG7CdQZ7vSBWyEFLgY3QZARRDcgzVwA@mail.gmail.com
обсуждение исходный текст
Ответ на BUG #17695: Failed Assert in logical replication snapbuild.  (PG Bug reporting form <noreply@postgresql.org>)
Список pgsql-bugs
On Thu, Nov 24, 2022 at 7:28 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      17695
> Logged by:          施博文
> Email address:      zxwsbg@qq.com
> PostgreSQL version: 14.6
> Operating system:   centos
> Description:
>
> In PG14 or higher version, I notice that SnapBuildRestore don't set
> builder->next_phase_at=InvalidTransactionId .
>
> But in SnapBuildSerialize function, the assert check this condition.
>
> Assert(builder->next_phase_at == InvalidTransactionId);
>
> This would cause some problems, and I have repeat it with the perl test case
> which I would update later, the problem is following:
>
> TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId",
> File: "snapbuild.c", Line: 1604, PID: 29974)
> postgres: master: walsender postgres [local]
> START_REPLICATION(ExceptionalCondition+0xb9)[0xb1c9bd]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8f548d]
> postgres: master: walsender postgres [local]
> START_REPLICATION(SnapBuildProcessRunningXacts+0x55)[0x8f4c5c]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8dd8be]
> postgres: master: walsender postgres [local]
> START_REPLICATION(LogicalDecodingProcessRecord+0xd1)[0x8dd243]
> postgres: master: walsender postgres [local] START_REPLICATION[0x915eb1]
> postgres: master: walsender postgres [local] START_REPLICATION[0x91520c]
> postgres: master: walsender postgres [local] START_REPLICATION[0x913bdd]
> postgres: master: walsender postgres [local]
> START_REPLICATION(exec_replication_command+0x42c)[0x914593]
> postgres: master: walsender postgres [local]
> START_REPLICATION(PostgresMain+0x7be)[0x984ddf]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8c0d41]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8c06b3]
> postgres: master: walsender postgres [local] START_REPLICATION[0x8bc9c4]
> postgres: master: walsender postgres [local]
> START_REPLICATION(PostmasterMain+0x117a)[0x8bc29b]
> postgres: master: walsender postgres [local] START_REPLICATION[0x7bdaf9]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f815dd92555]
> postgres: master: walsender postgres [local] START_REPLICATION[0x488d09]
>

Thank you for reporting the issue!

I could reproduce this issue with a small change and the following steps:

1. Add sleep after setting SNAPBUILD_BUILDING_SNAPSHOT.

@@ -1409,6 +1409,9 @@ SnapBuildFindSnapshot(SnapBuild *builder,
XLogRecPtr lsn, xl_running_xacts *runn
                           running->xcnt, running->nextXid)));

        SnapBuildWaitSnapshot(running, running->nextXid);
+
+       elog(LOG, "sleep 10s");
+       pg_usleep(10 * 1000000);
    }

2. Create and start database instance and create a table and the
replication slot 's1'.
create table test (i int);
select pg_logical_replication_slot('s1', 'test_decoding');

3. Create another replication slot "s2" while a transaction is running.

tx-1:
begin;
insert into test values (1);

    tx-2:
    select pg_logical_replication_slot('s2', 'test_decoding');

checkpoint;
commit;

Note that replication slot creation will wait for 10 sec after
commiting the tx-1, please go to the next step once the slot has been
created.

4. Start decoding on slot 's2'.
select pg_logical_slot_get_changes('s2', null, null);

The logical decoding will sleep. Please go to the next step after
seeing the log "sleep 10s".

5. While the logical decoding started at step 4 is sleeping, start
decoding on slot 's1'.
select pg_logical_slot_get_changes('s1', null, null);

6. After the decoding on slot 's1' wakes up, it fails due to the
assertion check.

This scenario simulates the case where the logical decoding (re)starts
where RUNNING_XACTS record having a running transactions and it
restores the serialized snapshot when decoding the next RUNNING_XACTS
record. Since we don't reset builder->next_phase_at when restoring a
serialized snapshot, the assertion check in SnapBuildSerialize fails.

Regarding the proposed patch, I've confirmed it fixes this issue. But
I think it's better to reset builder->next_phase_at right after the
following assertion check:

   /* consistent snapshots have no next phase */
   Assert(ondisk.builder.next_phase_at == InvalidTransactionId);

I could not reproduce this issue with your script in my environment. I
think it's better to include the reproducible test case in the patch
but I'm not sure how to do that without adding sleep/gdb attach.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: BUG #17696: Creation of deferrable initially deferred constraint breaks before transaction commit
Следующее
От: "Hayato Kuroda (Fujitsu)"
Дата:
Сообщение: RE: BUG #17695: Failed Assert in logical replication snapbuild.