Discussion: BUG #17054: Memory corruption in logical replication worker when replicating into partitioned table

BUG #17054: Memory corruption in logical replication worker when replicating into partitioned table

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      17054
Logged by:          Sergey Bernikov
Email address:      sbernikov@gmail.com
PostgreSQL version: 13.3
Operating system:   Ubuntu 18.04.4
Description:

When logical replication target is a partitioned table then execution of any
DDL on source table leads to crash of target (subscriber) server.

Steps to reproduce:
1. in source DB: create table and add to publication
    create table test_replication (
      id int not null,
      value varchar(100),
      primary key (id)
    );
    create publication test_publication for table test_replication;

2. in target DB: create partitioned table and start replication
    create table test_replication (
      id int not null,
      value varchar(100),
      primary key (id)
    ) partition by range (id);
    create table test_replication_p_1 partition of test_replication
       for values from (0) to (10);
    create table test_replication_p_2 partition of test_replication
       for values from (10) to (20);
 
    create subscription test_subscription CONNECTION '...' publication test_publication;
 
3. in source DB: insert and update data
    insert into test_replication(id, value) values (1, 'a1');
    insert into test_replication(id, value) values (2, 'a1');
    insert into test_replication(id, value) values (3, 'a1');
    update test_replication set value = 'a2';

4. in source DB: execute any DDL on the table
    vacuum test_replication;

5. in source DB: update data
    update test_replication set value = 'a3';

Result: logical replication worker on target server crashes with error message:
    LOG:  background worker "logical replication worker" (PID 28356) was terminated by signal 11: Segmentation fault
    LOG:  terminating any other active server processes

Backtrace from core dump:
Core was generated by `postgres: 13/main: logical replication worker for subscription 781420         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000557026391fef in slot_modify_cstrings (slot=slot@entry=0x557026fa8298, srcslot=<optimized out>, rel=rel@entry=0x557026ff7370, values=values@entry=0x7ffff4135550,
    replaces=replaces@entry=0x7ffff4138950) at ./build/../src/backend/replication/logical/worker.c:434
434     ./build/../src/backend/replication/logical/worker.c: No such file or directory.
(gdb) bt
#0  0x0000557026391fef in slot_modify_cstrings (slot=slot@entry=0x557026fa8298, srcslot=<optimized out>, rel=rel@entry=0x557026ff7370, values=values@entry=0x7ffff4135550,
    replaces=replaces@entry=0x7ffff4138950) at ./build/../src/backend/replication/logical/worker.c:434
#1  0x0000557026392b9f in apply_handle_tuple_routing (relinfo=0x557026f80928, estate=estate@entry=0x557026fae108, remoteslot=remoteslot@entry=0x557026f813d8, newtup=newtup@entry=0x7ffff4135550,
    relmapentry=relmapentry@entry=0x557026f96d90, operation=operation@entry=CMD_UPDATE) at ./build/../src/backend/replication/logical/worker.c:1105
#2  0x00005570263934df in apply_handle_update (s=s@entry=0x7ffff41390a0) at ./build/../src/backend/replication/logical/worker.c:791
#3  0x00005570263941c1 in apply_dispatch (s=0x7ffff41390a0) at ./build/../src/backend/replication/logical/worker.c:1368
#4  LogicalRepApplyLoop (last_received=936525246824) at ./build/../src/backend/replication/logical/worker.c:1577
#5  ApplyWorkerMain (main_arg=<optimized out>) at ./build/../src/backend/replication/logical/worker.c:2123
#6  0x00005570263613ae in StartBackgroundWorker () at ./build/../src/backend/postmaster/bgworker.c:879
#7  0x000055702636d5a3 in do_start_bgworker (rw=0x557026ec9110) at ./build/../src/backend/postmaster/postmaster.c:5870
#8  maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:6095
#9  0x000055702636e035 in sigusr1_handler (postgres_signal_arg=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:5255
#10 <signal handler called>
#11 0x00007f4bb7bbcdd7 in __GI___select (nfds=nfds@entry=10, readfds=readfds@entry=0x7ffff4139870, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffff41397d0)
    at ../sysdeps/unix/sysv/linux/select.c:41
#12 0x000055702636e5f9 in ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1703
#13 0x0000557026370423 in PostmasterMain (argc=5, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1412
#14 0x00005570260c19f8 in main (argc=5, argv=0x557026e73fd0) at ./build/../src/backend/main/main.c:210
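
Frame #1 of the backtrace is the tuple-routing step, where the subscriber decides which partition an incoming row belongs to. As a rough illustration only (this is not PostgreSQL source), the sketch below routes an id to one of the two range partitions from the reproduction steps. The struct and function names are hypothetical; the one fact it relies on is that range partition bounds are inclusive at the lower end and exclusive at the upper end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one range partition covering [lower, upper). */
typedef struct SketchRangePartition {
    const char *name;
    int lower;   /* inclusive bound */
    int upper;   /* exclusive bound */
} SketchRangePartition;

/* Mirrors the two partitions created on the target in the report. */
static const SketchRangePartition partitions[] = {
    {"test_replication_p_1", 0, 10},
    {"test_replication_p_2", 10, 20},
};

/*
 * Return the index of the partition that accepts the given id,
 * or -1 when no partition covers it.
 */
static int route_tuple(int id)
{
    size_t n = sizeof(partitions) / sizeof(partitions[0]);

    for (size_t i = 0; i < n; i++) {
        /* each range must be non-empty */
        assert(partitions[i].lower < partitions[i].upper);
        if (id >= partitions[i].lower && id < partitions[i].upper)
            return (int) i;
    }
    return -1;
}
```

With these bounds the ids 1, 2 and 3 from the test case all route to test_replication_p_1, so every replicated UPDATE in the report goes through the same partition entry.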


PG Bug reporting form <noreply@postgresql.org> writes:
> When logical replication target is a partitioned table then execution of any
> DDL on source table leads to crash of target (subscriber) server.

Thanks for the report!  I duplicated the crash on v13 branch tip,
although it's hitting an assertion failure before reaching any segfault:

#2  0x00000000008f466a in ExceptionalCondition (
    conditionName=conditionName@entry=0xa5eae0 "natts == rel->attrmap->maplen",
    errorType=errorType@entry=0x948cc9 "FailedAssertion",
    fileName=fileName@entry=0xa52956 "worker.c",
    lineNumber=lineNumber@entry=490) at assert.c:67
#3  0x0000000000777741 in slot_modify_cstrings (slot=slot@entry=0x2ec6e40,
    srcslot=<optimized out>, rel=rel@entry=0x2eca918,
    values=values@entry=0x7fffb3506480, replaces=replaces@entry=0x7fffb3509880)
    at worker.c:490
#4  0x00000000007785e7 in apply_handle_tuple_routing (
    edata=edata@entry=0x2ea45a0, remoteslot=remoteslot@entry=0x2ea48a0,
    newtup=newtup@entry=0x7fffb3506480, operation=operation@entry=CMD_UPDATE)
    at worker.c:1153
#5  0x0000000000778e74 in apply_handle_update (s=s@entry=0x7fffb3509fa0)
    at worker.c:846
#6  0x000000000077963c in apply_dispatch (s=0x7fffb3509fa0) at worker.c:1415
#7  LogicalRepApplyLoop (last_received=254887792) at worker.c:1624
#8  ApplyWorkerMain (main_arg=<optimized out>) at worker.c:2171
#9  0x0000000000743ec9 in StartBackgroundWorker () at bgworker.c:890

Interestingly, the same test case does NOT crash for me on master.
So apparently we fixed something that should have been back-patched.

            regards, tom lane
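
The failed assertion above, "natts == rel->attrmap->maplen", says that the attribute map must have exactly one entry per column of the local tuple. The following is a minimal sketch of that invariant, assuming a simplified map in which entry i holds the remote column index for local column i (negative meaning "no remote column"). The types and function names are hypothetical stand-ins, not the structures used in worker.c, and a mismatch is reported with a return value here instead of an assertion so the behavior can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an attribute map: local column i -> remote column. */
typedef struct SketchAttrMap {
    const int *attnums;  /* remote column index per local column, < 0 if absent */
    int maplen;          /* number of entries in attnums */
} SketchAttrMap;

/* The invariant from the assertion: one map entry per local column. */
static bool attrmap_is_consistent(int natts, const SketchAttrMap *map)
{
    return map != NULL && natts == map->maplen;
}

/*
 * Copy remote values into the local columns they map to.
 * Returns the number of columns written, or -1 if the map does not
 * match the local column count (the case the assertion guards against).
 */
static int apply_remote_values(int *local, int natts,
                               const SketchAttrMap *map,
                               const int *remote, int nremote)
{
    assert(local != NULL && remote != NULL);

    if (!attrmap_is_consistent(natts, map))
        return -1;

    int written = 0;
    for (int i = 0; i < natts; i++) {
        int r = map->attnums[i];
        if (r < 0 || r >= nremote)
            continue;            /* column has no counterpart on the remote side */
        local[i] = remote[r];
        written++;
    }
    return written;
}
```

A map whose length no longer matches the tuple, as would happen if the map memory had been freed or rebuilt underneath the caller, is rejected before any column is touched. In a non-assert build there is no such check, which is consistent with the plain segfault in the original report.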



I wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
>> When logical replication target is a partitioned table then execution of any
>> DDL on source table leads to crash of target (subscriber) server.

> Thanks for the report!  I duplicated the crash on v13 branch tip,

I can't reproduce this anymore after commit b270713fd.  I think it's
probably the same thing I found while making a test for your other
report:

    logicalrep_partition_open() failed to ensure that the
    LogicalRepPartMapEntry it built for a partition was fully
    independent of that for the partition root, leading to
    trouble if the root entry was later freed or rebuilt.

My failure to see a crash on HEAD was probably an accidental
issue of memory reuse patterns.

            regards, tom lane
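
The commit message quoted above describes a partition entry that was not fully independent of the root entry. As a hedged illustration of that class of bug (not the actual logicalrep_partition_open() code, and without claiming which fields were shared), the sketch below contrasts a plain struct copy, which leaves the copy pointing at the root's heap memory, with a deep copy that owns its own allocation. All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache entry that owns a heap-allocated column map. */
typedef struct SketchEntry {
    int *attnums;
    int maplen;
} SketchEntry;

/* Build an entry with its own copy of the given column map. */
static SketchEntry entry_create(const int *attnums, int maplen)
{
    SketchEntry e;

    assert(maplen > 0);
    e.maplen = maplen;
    e.attnums = malloc(sizeof(int) * (size_t) maplen);
    assert(e.attnums != NULL);
    memcpy(e.attnums, attnums, sizeof(int) * (size_t) maplen);
    return e;
}

/* Release the entry's map; the entry must not be used afterwards. */
static void entry_free(SketchEntry *e)
{
    free(e->attnums);
    e->attnums = NULL;
    e->maplen = 0;
}

/*
 * Problematic pattern: a struct copy duplicates the pointer, not the
 * data, so the copy dangles once the root entry is freed or rebuilt.
 */
static SketchEntry entry_copy_shallow(const SketchEntry *root)
{
    return *root;
}

/* Independent copy: survives the root being freed or rebuilt. */
static SketchEntry entry_copy_deep(const SketchEntry *root)
{
    return entry_create(root->attnums, root->maplen);
}
```

After entry_free() on the root, only the deep copy may still be read. Reading through the shallow copy would be a use-after-free whose visible effect depends on whether the allocator has reused the memory, which fits the observation that the crash appeared on v13 but not on HEAD.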