Problem in starting postgresql on a probably slow machine:pg_ctlcluster ignores pg_ctl timeout

Поиск
Список
Период
Сортировка
От Giuseppe Sacco
Тема Problem in starting postgresql on a probably slow machine:pg_ctlcluster ignores pg_ctl timeout
Дата
Msg-id 831b8b45c4e1d4732aa39b41e276d49e3ea4bf59.camel@eppesuigoccas.homedns.org
обсуждение исходный текст
Ответы Re: Problem in starting postgresql on a probably slow machine:pg_ctlcluster ignores pg_ctl timeout  (Christoph Berg <myon@debian.org>)
Список pgsql-pkg-debian
Hello,
when I boot a machine that hosts a postgresql cluster, the cluster does
not start. The cluster is configured for starting automatically, but
this is what I find in the log via journalctl:

root@hostname:~# journalctl --unit postgresql@9.4-main.service
-- Logs begin at sab 2018-09-29 10:05:56 CEST, end at lun 2018-10-01 14:37:11 CEST. --
set 29 10:06:04 hostname systemd[1]: Starting PostgreSQL Cluster 9.4-main...
set 29 10:06:06 hostname postgresql@9.4-main[685]: Removed stale pid file.
set 29 10:06:13 hostname postgresql@9.4-main[685]: The PostgreSQL server failed to start. Please check the log output:
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:07 CEST [1034-1] LOG:  il database è stato
interrotto;l'ultimo segno di vita risale alle 2018-09-20 1
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:07 CEST [1058-1] [sconosciuto]@[sconosciuto] LOG:
pacchettodi avvio incompleto
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:08 CEST [1134-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:09 CEST [1137-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:09 CEST [1147-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:10 CEST [1151-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:10 CEST [1159-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:11 CEST [1180-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:11 CEST [1199-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:12 CEST [1224-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:12 CEST [1238-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:13 CEST [1265-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname postgresql@9.4-main[685]: 2018-09-29 10:06:13 CEST [1271-1] postgres@postgres FATALE:  il
databasesi sta avviando
 
set 29 10:06:13 hostname systemd[1]: postgresql@9.4-main.service: control process exited, code=exited status=1
set 29 10:06:15 hostname systemd[1]: Failed to start PostgreSQL Cluster 9.4-main.
set 29 10:06:15 hostname systemd[1]: Unit postgresql@9.4-main.service entered failed state.

I think this is a problema related to a timeout, so I investigated it.
I found a generic timeout in systemd unit files and a specific timeout
in pg_ctl. They are different so I leveled both of them at 5 minutes.

I created a directory named /etc/systemd/system/postgresql@.service.d
and I put there a file named timeout.conf with these three lines:

[Service]
Environment=PGCTLTIMEOUT=270
TimeoutSec=300s

the first line is supposed to change the environment variale for
pg_ctl, the second one is changing the timeout unit for systemd.

(I know I could probably have set pg_ctl timeout in
/etc/postgresql/9.4/main/pg_ctl.conf but I've preferred to keep both
timeounts in one file, and I don't want to set this file in every
cluster.
Systemd will use this file when starting the relevant units. Because
the path use the generic system unit name, it is applied to all
postgresql clusters.)


The problem has not been solved and I found a different problem in
/usr/bin/pg_ctlcluster. At about line 220, it does a loop for 60 times,
and in each loop it calls select() with a timeout of 0.5 seconds. If
the database does not start during this loop, pg_ctlcluster abort it
start operation.

[...]
    my $pidfile = $info{'pgdata'}.'/postmaster.pid';
    for (my $attempt = 0; $attempt < 60; $attempt++) {
        select (undef, undef, undef, 0.5);
        $currlogsize = (stat $info{'logfile'})[7] if $info{'logfile'} && -r $info{'logfile'};
[...]

So, beside the timeout I put in systemd and pg_ctl, pg_ctlcluster
always use its internal 30 seconds timeout. Please note that is
select() returns early, then its 60 iteration could last less than 30
seconds. In my case, the database seems started at 10:06:07 and it
gives up at 10:06:13.

Postgresql log contains some more lines that show the process receives
a smart stop request:

2018-09-29 10:06:13 CEST [1265-1] postgres@postgres FATALE:  il database si sta avviando
2018-09-29 10:06:13 CEST [1271-1] postgres@postgres FATALE:  il database si sta avviando
2018-09-29 10:06:13 CEST [866-1] LOG:  richiesta di arresto smart ricevuta
2018-09-29 10:06:14 CEST [1034-2] LOG:  il database non è stato arrestato correttamente; ripristino automatico in
corso
2018-09-29 10:06:15 CEST [1034-3] LOG:  record con lunghezza zero a 20/B9365820

and this is probably pg_ctlcluster (or is it systemd?) that tries to
stop it when the startup did not correctly work.

So, finally, my question: is pg_ctlcluster the root cause of my
problem? Why does it loop and wait if pg_ctl will exit anyway when the
database start or when its timeout is triggered? Should I open a bug
report for this?

Thank you,
Giuseppe


В списке pgsql-pkg-debian по дате отправления:

Предыдущее
От: apt.postgresql.org repository
Дата:
Сообщение: check-pgactivity updated to version 2.3-1.pgdg+1
Следующее
От: Christoph Berg
Дата:
Сообщение: Re: Problem in starting postgresql on a probably slow machine:pg_ctlcluster ignores pg_ctl timeout