Re: WIP: Failover Slots
From | Craig Ringer |
---|---|
Subject | Re: WIP: Failover Slots |
Date | |
Msg-id | CAMsr+YEacvQiHb1HG1KVwGb2QUBSmku5FeaFT4RFgttoG4a1Rw@mail.gmail.com |
In reply to | Re: WIP: Failover Slots (Oleksii Kliukin <alexk@hintbits.com>) |
Responses | Re: WIP: Failover Slots (Craig Ringer <craig@2ndquadrant.com>) |
List | pgsql-hackers |
On 24 February 2016 at 03:53, Oleksii Kliukin <alexk@hintbits.com> wrote:
I found the following issue when shutting down a master with a connected replica that uses a physical failover slot:
2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"archived transaction log file ""000000010000000000000003""",,,,,,,,,""
2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07 CET,,0,PANIC,XX000,"concurrent transaction log activity while database system is shutting down",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07 CET,,0,LOG,00000,"checkpointer process (PID 54998) was terminated by signal 6: Abort trap",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07 CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,
Odd that I didn't see that in my testing. Thanks very much for this. I concur with your explanation.
Basically, the issue is that CreateCheckPoint calls CheckpointReplicationSlots, which currently produces WAL, and this violates the assumption at line xlog.c:8492:
    if (shutdown && checkPoint.redo != ProcLastRecPtr)
        ereport(PANIC,
                (errmsg("concurrent transaction log activity while database system is shutting down")));
Interesting problem.
It might be reasonably harmless to omit writing WAL for failover slots during a shutdown checkpoint. We're using WAL to move data to the replicas, but we don't really need it for local redo and correctness on the master. The trouble is that we do, of course, redo failover slot updates on the master, and we don't really want a slot to go backwards relative to its on-disk state before a crash. That's not too harmful, but it might lead to us losing a slot's catalog_xmin increase, so the slot thinks catalog rows are still readable when they could actually have been vacuumed away.
CheckpointReplicationSlots notes that:
* This needn't actually be part of a checkpoint, but it's a convenient
* location.
... and I suspect the answer there is simply to move the slot checkpoint to occur prior to the WAL checkpoint rather than during it. I'll investigate.
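To make the ordering issue concrete, here is a toy model in C. It is only a sketch and not the xlog.c code: ToyWal, toy_wal_insert, toy_checkpoint_slots and toy_shutdown_checkpoint are hypothetical names, and it assumes that a shutdown checkpoint takes its redo pointer from the current insert position and later expects its own checkpoint record to start exactly there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins only; these are NOT PostgreSQL's real types or functions. */
typedef uint64_t ToyLsn;

typedef struct ToyWal
{
    ToyLsn insert_pos;      /* where the next record would start */
    ToyLsn last_rec_start;  /* start of the last record written (plays the role of ProcLastRecPtr) */
} ToyWal;

/* Append a record of the given length and remember where it started. */
static ToyLsn
toy_wal_insert(ToyWal *wal, uint64_t len)
{
    wal->last_rec_start = wal->insert_pos;
    wal->insert_pos += len;
    return wal->last_rec_start;
}

/* Models a slot checkpoint that writes slot state to WAL, as failover slots do. */
static void
toy_checkpoint_slots(ToyWal *wal)
{
    (void) toy_wal_insert(wal, 32);
}

/*
 * Returns true if the shutdown sanity check passes, false where the
 * real server would PANIC.
 */
static bool
toy_shutdown_checkpoint(ToyWal *wal, bool slots_first)
{
    ToyLsn redo;

    if (slots_first)
        toy_checkpoint_slots(wal);  /* suggested ordering: slots before the WAL checkpoint */

    /* Assumption: the redo pointer is the current insert position. */
    redo = wal->insert_pos;

    if (!slots_first)
        toy_checkpoint_slots(wal);  /* problematic ordering: WAL written mid-checkpoint */

    /* Write the checkpoint record itself. */
    (void) toy_wal_insert(wal, 64);

    /* Mirrors the shape of: checkPoint.redo != ProcLastRecPtr */
    assert(wal->last_rec_start >= redo);
    return redo == wal->last_rec_start;
}
```

In this model the check fails when the slot checkpoint runs after the redo pointer is taken and passes when it runs first, which is the reordering suggested above. Whether the real fix is that simple still needs investigation.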
I really want to focus on the first patch, timeline following for logical slots. That part is much less invasive and is useful stand-alone. I'll move it to a separate CF entry and post it to a separate thread as I think it needs consideration independently of failover slots.
(BTW, the slot docs promise that slots will replay a change exactly once, but this is not correct and the client must keep track of replay position. I'll post a patch to correct it separately).
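As an illustration of what "keep track of replay position" could mean on the client side, here is a minimal sketch in C. ToyClient and toy_client_receive are hypothetical names, not part of any PostgreSQL client API, and the sketch assumes that changes arrive in commit order and that the client stores its position durably along with the data it applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical client-side state; not a PostgreSQL API. */
typedef struct ToyClient
{
    uint64_t applied_lsn;    /* highest commit position already applied;
                              * assumed to be stored durably with the data */
    int      applied_count;  /* number of transactions actually applied */
} ToyClient;

/*
 * Apply a transaction unless the client has already seen it.
 * Returns true if the transaction was applied, false if it was skipped
 * as a repeat (for example, one re-sent after a restart).
 */
static bool
toy_client_receive(ToyClient *c, uint64_t commit_lsn)
{
    if (commit_lsn <= c->applied_lsn)
        return false;        /* already applied: skip the duplicate */

    c->applied_count++;
    c->applied_lsn = commit_lsn;
    assert(c->applied_lsn == commit_lsn);
    return true;
}
```

With this kind of check the client tolerates a slot replaying a change more than once, instead of relying on the slot to deliver each change exactly once.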
There are a couple of incorrect comments
Thanks, will amend.