Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"

Поиск

Список

Период

Сортировка

От	Achilleas Mantzios
Тема	Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"
Дата	18 ноября 2018 г. 10:03:05
Msg-id	48980eb7-3f28-9813-8ef9-9444b89b8422@matrix.gatewaynet.com обсуждение исходный текст
Ответ на	Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device" (Rui DeSousa <rui@crazybean.net>)
Список	pgsql-admin

Дерево обсуждения

On 17/11/18 10:47 μ.μ., Rui DeSousa wrote:

On Nov 17, 2018, at 6:07 AM, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:

You may read the PostgreSQL backend sources (grep for SO_KEEPALIVE), the code supports KEEPALIVE.

Postgres supports it; but the question is it on for the given connection?

As we both saw this is the default.

I checked on a bare minimal default installation, (after tweaking the kernel tunables to smaller values of course), keepalive msgs are sent and ACK'ed at the specified intervals, checked with wireshark, port 5432. You should test this yourself.

I just configured Postgres with streaming replication using the following versions and TCP keep alive was enabled by default for the WAL receiver connection and also psql connections.

Linux debian 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
PostgreSQL 10.6 (Debian 10.6-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

root@debian:~# netstat -anp --timers | grep -e Timer -e EST | grep -e Timer -e 5432
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 0 0 10.6.6.101:47546 10.6.6.100:5432 ESTABLISHED 989/telnet off (0.00/0/0)
tcp 0 0 10.6.6.101:47544 10.6.6.100:5432 ESTABLISHED 953/psql keepalive (7103.36/0/0)
tcp 0 0 10.6.6.101:47542 10.6.6.100:5432 ESTABLISHED 922/postgres: 10/ma keepalive (7088.03/0/0)

As you can see from above; telnet does not enable keep alive on the connection. I would check the troubled system with the above netstat command to verify that keep alive is in fact enabled on the WAL receiver connection.

On the very same system (with *no* configuration/nothing changed -the kernel sysctl keepalive tested settings was on another test machine-), just now :

root@TEST-smadb:/var/lib/pgsql# uname -aLinux TEST-smadb 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linuxroot@TEST-smadb:/var/lib/pgsql# root@TEST-smadb:/var/lib/pgsql# netstat -anp --timers | grep 10.9.0.20 | grep 5432tcp 0 0 10.9.0.77:51307 10.9.0.20:5432 ESTABLISHED 34400/postgres: bgw keepalive (5711.60/0/0)root@TEST-smadb:/var/lib/pgsql# root@TEST-smadb:/var/lib/pgsql# ps aux | grep 34400postgres 34400 0.5 17.1 19618680 5648984 ? Ssl Nov16 13:52 postgres: bgworker: logical replication worker for subscription 185231525 root 70384 0.0 0.0 12708 1620 pts/1 S+ 08:29 0:00 grep 34400root@TEST-smadb:/var/lib/pgsql# root@TEST-smadb:/var/lib/pgsql# su - postgres -c 'psql -c "select version()"' version ----------------------------------------------------------------------------------------- PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit(1 row)root@TEST-smadb:/var/lib/pgsql#

If it’s enabled the connection should have terminated after the 18 hours

Yes.

and hopefully less now with your new setting.

This was on a test node, the settings are still at the default.

I have no idea why it wouldn’t terminate and reconnect other than tcp keep live is either off or a bug in Linux/Postgres.

I may post the logs of both the primary and the subscriber (filtering out some sensitive business content of course). It is striking that the subscriber node immediately after restart, starts asking for WALs, so this is not network firewall related, at least not in a static fashion. I mean if a firewall was in between with some rules then it could have (catastrophic) effect for the whole course of replication, not just those specific hours. So I too think that there maybe a bug in PostgreSQL (but the logical worker shares code with the wal receiver, this has been tested for so long) or the OS (again unlikely) *or* something weird may have happened, like some VM migration, e.g. cloning, on the cloud provider side. I don't have direct contact with them, our admin guys say nothing abnormal happened, the cloud provider says nothing abnormal happened, however my boss told me that they were doing some migrations from one site to another. That might explain some things.

I will ask again. Thank you a lot!

В списке pgsql-admin по дате отправления:

Предыдущее

От: Rui DeSousa
Дата: 18 ноября 2018 г., 00:18:36
Сообщение: Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"

Следующее

От: Mariel Cherkassky
Дата: 18 ноября 2018 г., 12:29:24
Сообщение: postgresql history and timelines

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC inpg_wal "No space left on device"

Предыдущее

Следующее