Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
From | Ondřej Jirman
---|---
Subject | Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date |
Msg-id | 20191121173555.midj6vp4dmiiqs24@core.my.home
In reply to | Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses | Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker (Ondřej Jirman <ienieghapheoghaiwida@xff.cz>)
List | pgsql-bugs
On Thu, Nov 21, 2019 at 05:57:52PM +0100, Tomas Vondra wrote:
> On Thu, Nov 21, 2019 at 05:15:02PM +0100, Ondřej Jirman wrote:
> > On Thu, Nov 21, 2019 at 04:57:07PM +0100, Ondřej Jirman wrote:
> > >
> > > Maybe it has something to do with my upgrade method. I
> > > dumped/restored the replica with pg_dumpall, and then just proceeded
> > > to enable subscription and refresh publication with (copy_data=false)
> > > for all my subscriptions.
> >
> > OTOH, it may not. There are 2 more databases replicated the same way
> > from the same database cluster, and they don't crash the replica
> > server, and continue replicating. One of the other databases also
> > has bytea columns in some of the tables.
> >
> > It really just seems related to the machine restart (a regular one)
> > that I did on the primary; minutes later the replica crashed, and kept
> > crashing ever since whenever connecting to the primary for the hometv
> > database.
> >
>
> Hmmm. A restart of the primary certainly should not cause any such
> damage, that'd be a bug too. And it'd be a bit strange that it correctly
> sends the data and it crashes the replica. How exactly did you restart
> the primary? What mode - smart/fast/immediate?

Looks like Arch Linux uses fast mode - the service unit has KillSignal=SIGINT, and SIGINT is what triggers a fast shutdown.
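This is easy to verify, since systemctl cat prints the effective unit file (a quick check, assuming the stock Arch unit; SIGINT is the same signal that pg_ctl stop -m fast sends):

  $ systemctl cat postgresql.service | grep KillSignal
  KillSignal=SIGINT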
Primary:
--------

23:22:31 - primary machine booted, so the postgresql shutdown was ~15s prior
23:22:53 - primary cluster started (saying LOG: database system was shut down at 2019-11-20 23:22:50 CET)

(there was some RTC inaccuracy, corrected by NTP earlier during boot)

lis 20 23:22:36 l5 systemd-timesyncd[312]: System clock time unset or jumped backwards, restoring from recorded timestamp: Wed 2019-11-20 23:22:50 CET

postgres[355]: 2019-11-20 23:22:58.106 CET [386] LOG: starting logical decoding for slot "l5_hometv"

And this is when the replica crashed:

postgres[355]: 2019-11-20 23:28:52.489 CET [386] LOG: could not receive data from client: Spojení zrušeno druhou stranou
postgres[355]: 2019-11-20 23:28:52.489 CET [386] LOG: unexpected EOF on standby connection

("Spojení zrušeno druhou stranou" is Czech for "connection reset by peer")

Replica:
--------

2019-11-20 15:35:32.910 CET [1480049] LOG: logical replication apply worker for subscription "l3_weatherlog" has started
2019-11-20 15:35:32.921 CET [1480050] LOG: logical replication apply worker for subscription "l5_hometv" has started
2019-11-20 15:35:32.931 CET [1480051] LOG: logical replication apply worker for subscription "l5_ledger_dana" has started
2019-11-20 15:35:32.942 CET [1480052] LOG: logical replication apply worker for subscription "l5_news" has started
2019-11-20 23:22:50.315 CET [1480050] LOG: data stream from publisher has ended
2019-11-20 23:22:50.315 CET [1480050] ERROR: could not send end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.318 CET [178527] LOG: background worker "logical replication worker" (PID 1480050) exited with exit code 1
2019-11-20 23:22:50.320 CET [1480051] LOG: data stream from publisher has ended
2019-11-20 23:22:50.320 CET [1480051] ERROR: could not send end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.321 CET [178527] LOG: background worker "logical replication worker" (PID 1480051) exited with exit code 1
2019-11-20 23:22:50.321 CET [2195905] LOG: logical replication apply worker for subscription "l5_hometv" has started
2019-11-20 23:22:50.325 CET [1480052] LOG: data stream from publisher has ended
2019-11-20 23:22:50.325 CET [1480052] ERROR: could not send end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.326 CET [178527] LOG: background worker "logical replication worker" (PID 1480052) exited with exit code 1

("COPY se neprovádí" is Czech for "no COPY in progress")

2019-11-20 23:22:50.395 CET [2195905] ERROR: could not connect to the publisher: FATAL: the database system is shutting down
FATAL: the database system is shutting down
2019-11-20 23:22:50.396 CET [178527] LOG: background worker "logical replication worker" (PID 2195905) exited with exit code 1
2019-11-20 23:22:55.401 CET [2195919] LOG: logical replication apply worker for subscription "l5_hometv" has started

nothing in the log and then the crash:

Nov 20 23:28:47 core kernel: postgres[2195919]: segfault at 562d3bc56291 ip 00007faa0a760917 sp 00007ffd1f93f0c8 error 4 in libc-2.30.so (deleted)[7faa0a6dd000+14d000]

2019-11-20 23:28:52.488 CET [178527] LOG: background worker "logical replication worker" (PID 2195919) was terminated by signal 11: Neoprávněný přístup do pam>
2019-11-20 23:28:52.488 CET [178527] LOG: terminating any other active server processes
2019-11-20 23:28:52.488 CET [2195470] WARNING: terminating connection because of crash of another server process
2019-11-20 23:28:52.488 CET [2195470] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another >

("Neoprávněný přístup do pam>" is the Czech description of SIGSEGV, "invalid memory access", truncated in the journal)

.... then it restarts:

2019-11-20 23:28:52.505 CET [178527] LOG: all server processes terminated; reinitializing
2019-11-20 23:28:52.720 CET [2279954] LOG: database system was interrupted; last known up at 2019-11-20 23:27:33 CET
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 1 to 0/3BBF52C8
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 2 to 2/BBCE38C0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 3 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 4 to 2/BBD75448
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 5 to 0/55503B08
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 6 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 7 to 0/5561CE80
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 8 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 9 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 10 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: database system was not properly shut down; automatic recovery in progress

did a redo once:

2019-11-20 23:28:54.463 CET [2279954] LOG: redo starts at 8/6D794050
2019-11-20 23:28:54.464 CET [2279954] LOG: invalid record length at 8/6D795370: wanted 24, got 0
2019-11-20 23:28:54.464 CET [2279954] LOG: redo done at 8/6D795338

and went into an endless crash/restart loop.
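For the record, one way to pull an actual backtrace out of that segfault on Arch (a sketch, assuming systemd-coredump is enabled, which is the Arch default, and that debug symbols are available; note the "(deleted)" next to libc in the kernel line above - the library was upgraded under the running process, so resolved symbols may be unreliable):

  # find the core dump of the crashed apply worker (PID from the log)
  $ coredumpctl list postgres
  # open it in gdb and print the full backtrace
  $ coredumpctl gdb 2195919
  (gdb) bt full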
> > So maybe something's wrong with the replica database (maybe because the
> > connection got killed by the walsender at an unfortunate time), rather
> > than the original database, because I can replicate the original DB
> > afresh into a new copy just fine, and other databases continue
> > replicating just fine if I disable the crashing subscription.
> >
>
> Possibly, but what would be the damaged bit? The only thing I can think
> of is the replication slot info (i.e. snapshot), and I know there were
> some timing issues in the serialization.
>
> How far is the change from the restart point of the slot (visible in
> pg_replication_slots)? If there are many changes since then, that'd mean
> the corrupted snapshot is unlikely.

It's the same:

postgres=# select * from pg_replication_slots;
 slot_name  |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 l5_news    | pgoutput | logical   |  16386 | news     | f         | f      |            |      |        56463 | 2/BBF679B8  | 2/BBF67B78
 l5_hometv  | pgoutput | logical   |  19030 | hometv   | f         | f      |            |      |        56395 | 2/BBD86E68  | 2/BBD86E68
 l5_hometv2 | pgoutput | logical   |  19030 | hometv   | f         | f      |            |      |        56463 | 2/BBF67B40  | 2/BBF67B78

> There's a lot of moving parts in this - you're replicating between major
> versions, and from ARM to x86. All of that should work, of course, but
> maybe there's a bug somewhere. So it might take time to investigate and
> fix. Thanks for your patience ;-)

Yeah, I realize that. :) Thanks for your help.

One missing piece is what exactly the outstanding output from the pgoutput plugin contains - the output that the replica crashes on and doesn't apply. Are there any tools for inspecting the binary output from the pgoutput plugin? Maybe that can provide a clue. (One possible approach is sketched in the PS below.)

regards,
	o.

> regards
>
> --
> Tomas Vondra                      http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
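PS: A minimal sketch of one way to peek at the pending records without consuming the slot - pg_logical_slot_peek_binary_changes() returns the raw pgoutput bytes as bytea when run on the primary, connected to the slot's database ('mypub' below is a placeholder for whatever publication the subscription actually uses):

hometv=# SELECT lsn, xid, substr(data::text, 1, 120) AS data_hex
           FROM pg_logical_slot_peek_binary_changes(
                  'l5_hometv', NULL, NULL,
                  'proto_version', '1', 'publication_names', 'mypub');

The hex dump can then be decoded by hand against the pgoutput message format described in the protocol docs.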