Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From: Achilleas Mantzios
Subject: Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
Msg-id: 356b8aba-6ce1-567c-e0fb-9660bdfc7ebe@matrix.gatewaynet.com
In reply to: Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device" (Rui DeSousa <rui@crazybean.net>)
List: pgsql-admin


On 16/11/18 5:29 PM, Rui DeSousa wrote:


On Nov 16, 2018, at 3:18 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

net.inet.tcp.always_keepalive=1

This setting is specific to FreeBSD. I tested changing it on PostgreSQL 11.1 running on FreeBSD 11.2-RELEASE-p3, and it had no effect at all on the PostgreSQL settings; all three of them remained at zero. This is irrelevant to my problem, but anyway.


That is what I stated: you don’t need it. The point is that on Linux the application has to enable it, and I don’t know of a Linux kernel setting like the one in FreeBSD.


You can read the PostgreSQL backend sources (grep for SO_KEEPALIVE); the code supports keepalive.
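To make the "application has to enable it" point concrete, here is a minimal sketch of how a program requests keepalive on its own socket. It is not taken from the PostgreSQL sources; the function name and the default values (60, 10, 5) are hypothetical, and the per-socket TCP_KEEP* options are only set where the platform exposes them.

```python
import socket


def make_keepalive_socket(idle=60, interval=10, count=5):
    """Create a TCP socket with keepalive requested (illustrative values)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The application must ask for keepalive explicitly on this socket.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Optional per-socket overrides of the kernel defaults. These constants
    # are platform-specific, so set only the ones available here.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            s.setsockopt(socket.IPPROTO_TCP, opt, value)
    return s


if __name__ == "__main__":
    sock = make_keepalive_socket()
    # A non-zero value means keepalive is enabled on this socket.
    print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    sock.close()
```

Reading the option back with getsockopt, as in the usage lines, only confirms what this process requested; whether probes actually reach the peer still has to be checked on the wire.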




A quick Google search suggests that Linux does not enable keepalive by default, whereas FreeBSD enables it globally by default, regardless of what the application requests. On Linux, Postgres will need to request it. You will need to set up the keepalive parameters in the Postgres configuration and restart the server.
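On the server side the relevant settings are tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count. On the connecting side, libpq accepts keepalives, keepalives_idle, keepalives_interval and keepalives_count in the connection string. The sketch below only assembles such a string; the helper name, host, database, user and the numeric values are hypothetical, and whether a given replication worker honours them should be verified rather than assumed.

```python
def keepalive_conninfo(host, dbname, user, idle=60, interval=10, count=5):
    """Build a libpq-style connection string with client-side keepalive options.

    Sketch only: values containing spaces or quotes are not escaped here.
    """
    parts = {
        "host": host,
        "dbname": dbname,
        "user": user,
        "keepalives": 1,                  # request TCP keepalive on the client socket
        "keepalives_idle": idle,          # idle seconds before the first probe
        "keepalives_interval": interval,  # seconds between probes
        "keepalives_count": count,        # lost probes before the connection is dropped
    }
    return " ".join(f"{key}={value}" for key, value in parts.items())


if __name__ == "__main__":
    # Hypothetical publisher host and credentials.
    print(keepalive_conninfo("publisher.example", "appdb", "repuser"))
```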

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
So according to the official Linux docs, three parameters govern TCP keepalive in Linux, and on both of the systems in question they are set as follows:
root@TEST-smadb:/var/lib/pgsql# sysctl -a | grep keep
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
root@TEST-smadb:/var/lib/pgsql#
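As a rough check of what these three values mean together, assuming the usual Linux semantics (tcp_keepalive_time is the idle period before the first probe, after which tcp_keepalive_probes unanswered probes spaced tcp_keepalive_intvl seconds apart mark the peer as dead), the time to detect a dead peer on a keepalive-enabled socket can be worked out as below. The function name is hypothetical.

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Seconds until a silent peer is declared dead on a keepalive-enabled socket."""
    # Idle period before the first probe, plus one interval per unanswered probe.
    return time_s + intvl_s * probes


if __name__ == "__main__":
    total = keepalive_detection_seconds(7200, 75, 9)
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    # 7200 + 9 * 75 = 7875 s = 2 h 11 min 15 s
    print(f"{total} s = {hours} h {minutes} min {seconds} s")
```

This applies only to sockets on which the application has actually requested keepalive; the sysctl values alone do not turn it on.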


That does not mean the connection has TCP keepalive enabled; it only means that if an application requests keepalive without providing its own values, those are the defaults it gets. Those settings would be too large anyway; you want to detect a broken connection much sooner than the roughly two hours (7200 s idle plus 9 probes at 75 s) those defaults imply.


I checked on a bare-minimum default installation (after lowering the kernel tunables, of course): keepalive messages are sent and ACKed at the specified intervals. I verified this with Wireshark on port 5432. You should test this yourself.




The keepalive setup will allow the WAL receiver to detect the broken connection, terminate it, and attempt to establish a new one.

So from the looks of this, keepalive is enabled. (Also, don't confuse the WAL receiver with the logical worker; they are different programs, albeit similar.)

I don’t believe it’s enabled; have you checked whether you are getting keepalive packets? If it were enabled, the connection would have been terminated after a little over two hours.


See above. In the meantime, it would be nice if one of the hackers would chime in to clear things up, just to be sure.

Which means that, since PostgreSQL *supports* keepalive and the logical worker carried on as if nothing had happened, I guess *something* was mocking the keepalive ACKs?



Is there any way (by network means?) to mock this behavior in order to fool the replication worker into thinking the sender is still there?

Put a firewall in-between the servers and drop the packets without sending resets.


Have a read here:

Section 4.2


The RFC states TCP keep alive should be off by default; FreeBSD changed that back in 1999 and I believe Linux still follows the RFC:


