Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
From | Ondřej Jirman
---|---
Subject | Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker
Date |
Msg-id | 20191121173555.midj6vp4dmiiqs24@core.my.home
In reply to | Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses | Re: BUG #16129: Segfault in tts_virtual_materialize in logical replication worker (Ondřej Jirman <ienieghapheoghaiwida@xff.cz>)
List | pgsql-bugs
On Thu, Nov 21, 2019 at 05:57:52PM +0100, Tomas Vondra wrote:
> On Thu, Nov 21, 2019 at 05:15:02PM +0100, Ondřej Jirman wrote:
> > On Thu, Nov 21, 2019 at 04:57:07PM +0100, Ondřej Jirman wrote:
> > >
> > > Maybe it has something to do with my upgrade method. I
> > > dumped/restored the replica with pg_dumpall, and then just proceeded
> > > to enable subscription and refresh publication with (copy_data=false)
> > > for all my subscriptions.
> >
> > OTOH, it may not. There are 2 more databases replicated the same way
> > from the same database cluster, and they don't crash the replica
> > server, and continue replicating. One of the other databases also
> > has bytea columns in some of the tables.
> >
> > It really just seems related to the machine restart (a regular one)
> > that I did on the primary; minutes later the replica crashed, and kept
> > crashing ever since whenever connecting to the primary for the hometv
> > database.
> >
>
> Hmmm. A restart of the primary certainly should not cause any such
> damage, that'd be a bug too. And it'd be a bit strange that it correctly
> sends the data and it crashes the replica. How exactly did you restart
> the primary? What mode - smart/fast/immediate?

Looks like Arch Linux uses fast mode - the service unit has KillSignal=SIGINT, and SIGINT is what triggers a fast shutdown.
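This is easy to verify, since systemctl cat prints the effective unit file (a quick check, assuming the stock Arch unit; SIGINT is the same signal that pg_ctl stop -m fast sends):

  $ systemctl cat postgresql.service | grep KillSignal
  KillSignal=SIGINT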
Primary:
--------

23:22:31 - primary machine booted, so the postgresql shutdown was ~15s prior
23:22:53 - primary cluster started (saying LOG: database system was shut down at 2019-11-20 23:22:50 CET)

(there was some RTC inaccuracy, corrected by NTP earlier during boot)

lis 20 23:22:36 l5 systemd-timesyncd[312]: System clock time unset or jumped backwards, restoring from recorded timestamp: Wed 2019-11-20 23:22:50 CET

postgres[355]: 2019-11-20 23:22:58.106 CET [386] LOG: starting logical decoding for slot "l5_hometv"

And this is when the replica crashed:

postgres[355]: 2019-11-20 23:28:52.489 CET [386] LOG: could not receive data from client: Spojení zrušeno druhou stranou
postgres[355]: 2019-11-20 23:28:52.489 CET [386] LOG: unexpected EOF on standby connection

("Spojení zrušeno druhou stranou" is Czech for "connection reset by peer")

Replica:
--------

2019-11-20 15:35:32.910 CET [1480049] LOG: logical replication apply worker for subscription "l3_weatherlog" has started
2019-11-20 15:35:32.921 CET [1480050] LOG: logical replication apply worker for subscription "l5_hometv" has started
2019-11-20 15:35:32.931 CET [1480051] LOG: logical replication apply worker for subscription "l5_ledger_dana" has started
2019-11-20 15:35:32.942 CET [1480052] LOG: logical replication apply worker for subscription "l5_news" has started
2019-11-20 23:22:50.315 CET [1480050] LOG: data stream from publisher has ended
2019-11-20 23:22:50.315 CET [1480050] ERROR: could not send end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.318 CET [178527] LOG: background worker "logical replication worker" (PID 1480050) exited with exit code 1
2019-11-20 23:22:50.320 CET [1480051] LOG: data stream from publisher has ended
2019-11-20 23:22:50.320 CET [1480051] ERROR: could not send end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.321 CET [178527] LOG: background worker "logical replication worker" (PID 1480051) exited with exit code 1
2019-11-20 23:22:50.321 CET [2195905] LOG: logical replication apply worker for subscription "l5_hometv" has started
2019-11-20 23:22:50.325 CET [1480052] LOG: data stream from publisher has ended
2019-11-20 23:22:50.325 CET [1480052] ERROR: could not send end-of-streaming message to primary: COPY se neprovádí
2019-11-20 23:22:50.326 CET [178527] LOG: background worker "logical replication worker" (PID 1480052) exited with exit code 1

("COPY se neprovádí" is Czech for "no COPY in progress")

2019-11-20 23:22:50.395 CET [2195905] ERROR: could not connect to the publisher: FATAL: the database system is shutting down
FATAL: the database system is shutting down
2019-11-20 23:22:50.396 CET [178527] LOG: background worker "logical replication worker" (PID 2195905) exited with exit code 1
2019-11-20 23:22:55.401 CET [2195919] LOG: logical replication apply worker for subscription "l5_hometv" has started

nothing in the log and then the crash:

Nov 20 23:28:47 core kernel: postgres[2195919]: segfault at 562d3bc56291 ip 00007faa0a760917 sp 00007ffd1f93f0c8 error 4 in libc-2.30.so (deleted)[7faa0a6dd000+14d000]

2019-11-20 23:28:52.488 CET [178527] LOG: background worker "logical replication worker" (PID 2195919) was terminated by signal 11: Neoprávněný přístup do pam>
2019-11-20 23:28:52.488 CET [178527] LOG: terminating any other active server processes
2019-11-20 23:28:52.488 CET [2195470] WARNING: terminating connection because of crash of another server process
2019-11-20 23:28:52.488 CET [2195470] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another >

("Neoprávněný přístup do pam>" is the Czech description of SIGSEGV, "invalid memory access", truncated in the journal)

.... then it restarts:

2019-11-20 23:28:52.505 CET [178527] LOG: all server processes terminated; reinitializing
2019-11-20 23:28:52.720 CET [2279954] LOG: database system was interrupted; last known up at 2019-11-20 23:27:33 CET
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 1 to 0/3BBF52C8
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 2 to 2/BBCE38C0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 3 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 4 to 2/BBD75448
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 5 to 0/55503B08
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 6 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 7 to 0/5561CE80
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 8 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 9 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: recovered replication state of node 10 to 0/0
2019-11-20 23:28:54.457 CET [2279954] LOG: database system was not properly shut down; automatic recovery in progress

did a redo once:

2019-11-20 23:28:54.463 CET [2279954] LOG: redo starts at 8/6D794050
2019-11-20 23:28:54.464 CET [2279954] LOG: invalid record length at 8/6D795370: wanted 24, got 0
2019-11-20 23:28:54.464 CET [2279954] LOG: redo done at 8/6D795338

and went into an endless crash/restart loop.
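For the record, one way to pull an actual backtrace out of that segfault on Arch (a sketch, assuming systemd-coredump is enabled, which is the Arch default, and that debug symbols are available; note the "(deleted)" next to libc in the kernel line above - the library was upgraded under the running process, so resolved symbols may be unreliable):

  # find the core dump of the crashed apply worker (PID from the log)
  $ coredumpctl list postgres
  # open it in gdb and print the full backtrace
  $ coredumpctl gdb 2195919
  (gdb) bt full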
> > So maybe something's wrong with the replica database (maybe because the
> > connection got killed by the walsender at an unfortunate time), rather
> > than the original database, because I can replicate the original DB
> > afresh into a new copy just fine, and other databases continue
> > replicating just fine if I disable the crashing subscription.
> >
>
> Possibly, but what would be the damaged bit? The only thing I can think
> of is the replication slot info (i.e. snapshot), and I know there were
> some timing issues in the serialization.
>
> How far is the change from the restart point of the slot (visible in
> pg_replication_slots)? If there are many changes since then, that'd mean
> the corrupted snapshot is unlikely.

It's the same:

postgres=# select * from pg_replication_slots;
 slot_name  |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 l5_news    | pgoutput | logical   |  16386 | news     | f         | f      |            |      |        56463 | 2/BBF679B8  | 2/BBF67B78
 l5_hometv  | pgoutput | logical   |  19030 | hometv   | f         | f      |            |      |        56395 | 2/BBD86E68  | 2/BBD86E68
 l5_hometv2 | pgoutput | logical   |  19030 | hometv   | f         | f      |            |      |        56463 | 2/BBF67B40  | 2/BBF67B78

> There's a lot of moving parts in this - you're replicating between major
> versions, and from ARM to x86. All of that should work, of course, but
> maybe there's a bug somewhere. So it might take time to investigate and
> fix. Thanks for your patience ;-)

Yeah, I realize that. :) Thanks for your help.

One missing piece is what exactly the outstanding output from the pgoutput plugin contains - the output that the replica crashes on and doesn't apply. Are there any tools for inspecting the binary output from the pgoutput plugin? Maybe that can provide a clue. (One possible approach is sketched in the PS below.)

regards,
	o.

> regards
>
> --
> Tomas Vondra                      http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
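PS: A minimal sketch of one way to peek at the pending records without consuming the slot - pg_logical_slot_peek_binary_changes() returns the raw pgoutput bytes as bytea when run on the primary, connected to the slot's database ('mypub' below is a placeholder for whatever publication the subscription actually uses):

hometv=# SELECT lsn, xid, substr(data::text, 1, 120) AS data_hex
           FROM pg_logical_slot_peek_binary_changes(
                  'l5_hometv', NULL, NULL,
                  'proto_version', '1', 'publication_names', 'mypub');

The hex dump can then be decoded by hand against the pgoutput message format described in the protocol docs.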