BUG #16125: Crash of PostgreSQL's wal sender during logical replication

From    PG Bug reporting form
Subject BUG #16125: Crash of PostgreSQL's wal sender during logical replication
Msg-id  16125-a78dafa032d53f7e@postgresql.org
Replies Re: BUG #16125: Crash of PostgreSQL's wal sender during logical replication  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List    pgsql-bugs
The following bug has been logged on the website:

Bug reference:      16125
Logged by:          Andrey Salnikov
Email address:      andrey.salnikov@dataegret.com
PostgreSQL version: 10.11
Operating system:   Ubuntu 18.04.2 LTS
Description:

Hi,
PostgreSQL's wal sender has started crashing continuously during logical
replication. After each crash the restarted sender tries to decode the same
part of the WAL and crashes again and again.

The log contains the following lines:
2019-11-16 03:01:39.512 UTC 22235 @ from  [vxid: txid:0] [] LOG:  server
process (PID 26972) was terminated by signal 11: Segmentation fault
--
2019-11-16 03:03:16.240 UTC 8286 @ from  [vxid: txid:0] [] LOG:  checkpoint
complete: wrote 675790 buffers (32.2%); 0 WAL file(s) added, 198 removed, 0
recycled; write=68.768 s, sync=0.011 s, total=69.214 s; sync files=361,
longest=0.002 s, average=0.000 s; distance=3164435 kB, estimate=3164435 kB
2019-11-16 03:03:16.251 UTC 22235 @ from  [vxid: txid:0] [] LOG:  database
system is ready to accept connections
2019-11-16 03:03:16.533 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] LOG:  received replication command: IDENTIFY_SYSTEM
2019-11-16 03:03:16.533 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] LOG:  received replication command: START_REPLICATION SLOT "rep_slot"
LOGICAL 25EE/D6DD72D0 (proto_version '1', publication_names '"rep_pub"')
2019-11-16 03:03:16.534 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] LOG:  starting logical decoding for slot "rep_slot"
2019-11-16 03:03:16.534 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] DETAIL:  streaming transactions committing after 25EE/D614FC78,
reading WAL from 25EE/D614B580
2019-11-16 03:03:16.534 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] LOG:  logical decoding found consistent point at 25EE/D614B580
2019-11-16 03:03:16.534 UTC 10070 rep_usr@db from 1.1.1.1 [vxid:5/0 txid:0]
[idle] DETAIL:  There are no running transactions.
2019-11-16 03:03:16.732 UTC 22235 @ from  [vxid: txid:0] [] LOG:  server
process (PID 10070) was terminated by signal 11: Segmentation fault
--
2019-11-16 03:03:18.410 UTC 10072 @ from  [vxid: txid:0] [] LOG:  database
system was not properly shut down; automatic recovery in progress
2019-11-16 03:03:18.413 UTC 10072 @ from  [vxid: txid:0] [] LOG:  invalid
record length at 25EE/D6E155C8: wanted 24, got 0
2019-11-16 03:03:18.413 UTC 10072 @ from  [vxid: txid:0] [] LOG:  redo is
not required
2019-11-16 03:03:18.419 UTC 10072 @ from  [vxid: txid:0] [] LOG:  checkpoint
starting: end-of-recovery immediate
2019-11-16 03:03:18.467 UTC 10072 @ from  [vxid: txid:0] [] LOG:  checkpoint
complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0
recycled; write=0.028 s, sync=0.000 s, total=0.050 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
2019-11-16 03:03:18.476 UTC 22235 @ from  [vxid: txid:0] [] LOG:  database
system is ready to accept connections
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] LOG:  received replication command: IDENTIFY_SYSTEM
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] LOG:  received replication command: START_REPLICATION SLOT "rep_slot"
LOGICAL 25EE/D6DD72D0 (proto_version '1', publication_names '"rep_pub"')
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] LOG:  starting logical decoding for slot "rep_slot"
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] DETAIL:  streaming transactions committing after 25EE/D614FC78,
reading WAL from 25EE/D614B580
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] LOG:  logical decoding found consistent point at 25EE/D614B580
2019-11-16 03:03:18.754 UTC 10112 rep_usr@db from 1.1.1.1 [vxid:4/0 txid:0]
[idle] DETAIL:  There are no running transactions.
2019-11-16 03:03:18.949 UTC 22235 @ from  [vxid: txid:0] [] LOG:  server
process (PID 10112) was terminated by signal 11: Segmentation fault

The core dump contains:
Reading symbols from /usr/lib/postgresql/10/bin/postgres...Reading symbols
from
/usr/lib/debug/.build-id/92/b3b790a5135245f2cbb2a0a2f48fc7ff9f689a.debug...done.
done.
[New LWP 13678]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 10/main: wal sender process replication
1.1.1.1(43224) idle       '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  heap_deform_tuple (tuple=tuple@entry=0x8,
tupleDesc=tupleDesc@entry=0x555713b2e638,
values=values@entry=0x555713b7ca48, isnull=isnull@entry=0x555713b7cf98 "")
at ./build/../src/backend/access/common/heaptuple.c:936
936    ./build/../src/backend/access/common/heaptuple.c: No such file or
directory.
(gdb) bt
#0  heap_deform_tuple (tuple=tuple@entry=0x8,
tupleDesc=tupleDesc@entry=0x555713b2e638,
values=values@entry=0x555713b7ca48, isnull=isnull@entry=0x555713b7cf98 "")
at ./build/../src/backend/access/common/heaptuple.c:936
#1  0x00005557127c918b in ReorderBufferToastReplace (rb=0x555713b7b788,
txn=0x555713b982e8, relation=0x7f4b3fafa248, relation=0x7f4b3fafa248,
change=0x555713b9a108)
    at ./build/../src/backend/replication/logical/reorderbuffer.c:2985
#2  ReorderBufferCommit (rb=0x555713b7b788, xid=xid@entry=1667601527,
commit_lsn=commit_lsn@entry=41707737345720,
end_lsn=end_lsn@entry=41707737345792,
commit_time=commit_time@entry=627188499326201, 
    origin_id=origin_id@entry=0, origin_lsn=0) at
./build/../src/backend/replication/logical/reorderbuffer.c:1585
#3  0x00005557127bef32 in DecodeCommit (xid=1667601527,
parsed=0x7ffd288fe4a0, buf=<synthetic pointer>, ctx=0x555713a9ca08) at
./build/../src/backend/replication/logical/decode.c:614
#4  DecodeXactOp (buf=<synthetic pointer>, ctx=0x555713a9ca08) at
./build/../src/backend/replication/logical/decode.c:244
#5  LogicalDecodingProcessRecord (ctx=0x555713a9ca08, record=<optimized
out>) at ./build/../src/backend/replication/logical/decode.c:113
#6  0x00005557127d0b86 in XLogSendLogical () at
./build/../src/backend/replication/walsender.c:2797
#7  0x00005557127d232f in WalSndLoop
(send_data=send_data@entry=0x5557127d0b10 <XLogSendLogical>) at
./build/../src/backend/replication/walsender.c:2157
#8  0x00005557127d30aa in StartLogicalReplication (cmd=0x555713b499d8) at
./build/../src/backend/replication/walsender.c:1105
#9  exec_replication_command (cmd_string=cmd_string@entry=0x555713b4da38
"START_REPLICATION SLOT \"pgprod_office_basic\" LOGICAL 25EE/D6DD72D0
(proto_version '1', publication_names '\"pgprod_master_basic\"')")
    at ./build/../src/backend/replication/walsender.c:1536
#10 0x000055571281de3a in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x555713add8d8, dbname=<optimized out>, username=<optimized
out>) at ./build/../src/backend/tcop/postgres.c:4113
#11 0x000055571255c72c in BackendRun (port=0x555713ad56d0) at
./build/../src/backend/postmaster/postmaster.c:4408
#12 BackendStartup (port=0x555713ad56d0) at
./build/../src/backend/postmaster/postmaster.c:4080
#13 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1756
#14 0x00005557127ada21 in PostmasterMain (argc=9, argv=0x555713a7d190) at
./build/../src/backend/postmaster/postmaster.c:1364
#15 0x000055571255e422 in main (argc=9, argv=0x555713a7d190) at
./build/../src/backend/main/main.c:228
(gdb) quit
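
In frame #0 the tuple argument is 0x8, i.e. a garbage near-NULL pointer, so
the crash happens on the very first access to the tuple handed down from
ReorderBufferToastReplace. Below is a minimal standalone sketch of that
failure mode; the struct layout is a simplified stand-in for PostgreSQL's
HeapTupleData, not the verbatim definition from heaptuple.c:

    /*
     * Sketch only: simplified stand-ins for PostgreSQL's tuple structs,
     * just to illustrate why a HeapTuple of 0x8 faults immediately.
     */
    #include <stdio.h>

    typedef struct HeapTupleHeaderData
    {
        unsigned short t_infomask;      /* flag bits, e.g. HEAP_HASNULL */
        /* ... remaining header fields omitted in this sketch ... */
    } HeapTupleHeaderData;

    typedef struct HeapTupleData
    {
        unsigned int         t_len;     /* length of the tuple data */
        HeapTupleHeaderData *t_data;    /* -> tuple header and payload */
    } HeapTupleData;

    typedef HeapTupleData *HeapTuple;

    int
    main(void)
    {
        /* In the core dump the caller handed heap_deform_tuple() this value. */
        HeapTuple tuple = (HeapTuple) 0x8;

        printf("tuple = %p\n", (void *) tuple);

        /*
         * heap_deform_tuple() begins by reading the tuple header through this
         * pointer; with a bogus address like 0x8 that first read touches an
         * unmapped page and the process dies with SIGSEGV, matching frame #0
         * of the backtrace.  Uncommenting the next line reproduces the fault.
         */
        /* unsigned short infomask = tuple->t_data->t_infomask; */

        return 0;
    }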

This behaviour does not depend on the specific data in the tables, because we
see it in different databases with different sets of tables in the
publications. It looks like a real issue in logical replication.
I would be happy to provide additional information about this issue, but I
need to know what else should be collected to help solve the problem.
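
For example, against the existing core file I could run something along these
lines to capture more state from the failing frames (the variable names are
taken from frame #1 of the backtrace above; I have not run these yet, so this
is only a suggestion of what could be collected):

(gdb) bt full
(gdb) frame 1
(gdb) print *txn
(gdb) print *change
(gdb) print *relation->rd_rel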

Regards,
Andrey.

