"have it fail over to using the archived WALs instead of full database restore" How do I configure this ?
With Postgres replication, this is configured in the recovery.conf file using the “restore_command” setting. It amounts to a script that connects to your backups and pulls the requested WAL file.
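As a rough illustration, here is a minimal sketch of such a script in Python. It assumes the archived WALs are reachable as plain files in a directory; the script name restore_wal.py, the WAL_ARCHIVE_DIR variable and the /mnt/backups/wal_archive path are all made up for the example, so swap in whatever your backup tool actually provides. Postgres substitutes %f with the WAL file name and %p with the path to copy it to, and the command has to exit nonzero when the file is not in the archive. On PostgreSQL 12 and later the same setting lives in postgresql.conf rather than recovery.conf.

```python
#!/usr/bin/env python3
"""Hypothetical restore_command helper: fetch one WAL file from an archive directory.

Example setting (the script path is a placeholder):
    restore_command = 'python3 /path/to/restore_wal.py %f %p'
"""
import os
import shutil
import sys

# Assumption: archived WAL files are plain files under this directory.
ARCHIVE_DIR = os.environ.get("WAL_ARCHIVE_DIR", "/mnt/backups/wal_archive")


def restore_wal(wal_name, dest_path, archive_dir=ARCHIVE_DIR):
    """Copy wal_name from archive_dir to dest_path and return an exit code."""
    # Accept bare file names only, so a request cannot escape the archive.
    if os.path.basename(wal_name) != wal_name:
        return 2
    src = os.path.join(archive_dir, wal_name)
    if not os.path.isfile(src):
        # Postgres also asks for files that were never archived;
        # a nonzero exit code tells it the file is not available here.
        return 1
    shutil.copyfile(src, dest_path)
    return 0


# Run only when called the way restore_command calls it: <script> %f %p
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(restore_wal(sys.argv[1], sys.argv[2]))
```

If your backups live on another host or in object storage, the copy step would be replaced by whatever fetch command your backup solution ships with; the exit code convention stays the same.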
When you say no firewall, that is a bit confusing, and I’m left assuming that the nodes are on the same subnet? I normally only use replication slots with either a backup solution or a replica that is going over a WAN. I am a bit perplexed as to why replication would fall that far behind on a local network (send lag, not replay lag). What is the interconnect, gigabit or 10G, and what is the volume of WALs being generated? You might have a network-related issue here.
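To put rough numbers on that question, here is a small back-of-the-envelope sketch in Python. It assumes the default 16 MiB WAL segment size and that you have counted how many segments were generated over some interval; the function names are just for illustration, and the result ignores protocol overhead and any other traffic on the link.

```python
# Assumption: default wal_segment_size of 16 MiB; adjust if your cluster differs.
SEGMENT_BYTES = 16 * 1024 * 1024


def wal_rate_mbit_per_s(segments, seconds, segment_bytes=SEGMENT_BYTES):
    """Average WAL generation rate in megabits per second."""
    return segments * segment_bytes * 8 / seconds / 1_000_000


def link_utilization(segments, seconds, link_mbit_per_s):
    """Fraction of the link's nominal bandwidth needed to ship that WAL."""
    return wal_rate_mbit_per_s(segments, seconds) / link_mbit_per_s
```

For example, 60 segments in 60 seconds works out to about 134 Mbit/s, which is roughly 13% of a gigabit link. If your real numbers are in that range and the replica still falls behind on sending, that points at the network rather than at WAL volume.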
I haven’t used repmgr; thus I can’t help there.