Re: [GENERAL] wal_retrieve_retry_interval

Поиск
Список
Период
Сортировка
От Ludovic Vaugeois-Pepin
Тема Re: [GENERAL] wal_retrieve_retry_interval
Дата
Msg-id CAAJDx8ME=kcBoPmbPjY=WgQBj_BredNzJMVhXsmkE=AODjC0iA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [GENERAL] wal_retrieve_retry_interval  (Michael Paquier <michael.paquier@gmail.com>)
Ответы Re: [GENERAL] wal_retrieve_retry_interval
Список pgsql-general
On Tue, May 30, 2017 at 12:20 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sun, May 28, 2017 at 9:41 AM, Ludovic Vaugeois-Pepin
> <ludovicvp@gmail.com> wrote:
>> On Sun, May 28, 2017 at 9:49 AM, Ludovic Vaugeois-Pepin
>> <ludovicvp@gmail.com> wrote:
>>> It really is set at 3s on all servers (master and standbies) earlier in the "deployment" process at the same time
"listen_addresses","hot_standby", and others are set. This doesn't seem to happen every time I run tests. I increased
loggingto DEBUG1. This is what I got when the problem occurred again. Note that the empty line is there in the log
file.
>
> Aren't those logs coming from a standby that is being reconnected to a
> promoted standby? In short you restarted this standby with a fresh
> recovery.conf.

Almost: the promoted standby (host test1) hasn't been started (and
promoted) when the standby with the 15 seconds delay (host test2)
tries to connect to it (its fresh recovery.conf):

$ cat recovery.conf
standby_mode = 'on'
recovery_target_timeline = 'latest'
primary_conninfo = 'host=test1 port=5432 user=repl1'
primary_slot_name = 'test2'


>The WAL receiver would try to fetch history file data
> for validity checks when it notices that there is a timeline jump,
> when it needs to itself jump to a new timeline, and the logs you are
> showing highlight that.

Logs below seem to indicate that the 15s delay begins before the
promoted standby is started. Is that the problem? Some network
connection timeout?




The standby (below as host test2) is trying to reconnect to a to-be
started and promoted standby (below as host test1).


host test2: before host1 is fully started and promoted:

< 2017-05-29 22:42:17.022 CEST > LOG:  database system is ready to
accept read only connections
< 2017-05-29 22:42:17.022 CEST > DEBUG:  checkpointer updated shared
memory configuration values
< 2017-05-29 22:42:17.022 CEST > DEBUG:  find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-29 22:42:17.022 CEST > DEBUG:  find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-29 22:42:17.025 CEST > FATAL:  could not connect to the
primary server: FATAL:  the database system is starting up
...
< 2017-05-29 22:42:17.026 CEST > DEBUG:  switched WAL source from
stream to archive after failure
< 2017-05-29 22:42:17.026 CEST > DEBUG:  invalid record length at
0/8000360: wanted 24, got 0
< 2017-05-29 22:42:17.026 CEST > DEBUG:  switched WAL source from
archive to stream after failure
... (15 seconds later) ...
< 2017-05-29 22:42:32.042 CEST > DEBUG:  switched WAL source from
stream to archive after failure
< 2017-05-29 22:42:32.042 CEST > DEBUG:  invalid record length at
0/8000360: wanted 24, got 0
< 2017-05-29 22:42:32.042 CEST > DEBUG:  switched WAL source from
archive to stream after failure
< 2017-05-29 22:42:32.043 CEST > DEBUG:  find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-29 22:42:32.043 CEST > DEBUG:  find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-29 22:42:32.047 CEST > LOG:  fetching timeline history file
for timeline 4 from primary server
< 2017-05-29 22:42:32.050 CEST > LOG:  started streaming WAL from
primary at 0/8000000 on timeline 3



host test1: to-be-promoted server where the 500ms
wal_retrieve_retry_interval seems to be honored (with a "bogus"
primary_conninfo = 'host= port=5432 user=repl1'):

< 2017-05-29 22:42:17.099 CEST > LOG:  database system is ready to
accept read only connections
< 2017-05-29 22:42:17.102 CEST > DEBUG:  find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-29 22:42:17.103 CEST > DEBUG:  find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-29 22:42:17.104 CEST > FATAL:  could not connect to the
primary server: could not translate host name "port=5432" to address:
Name or service not known

< 2017-05-29 22:42:17.104 CEST > DEBUG:  shmem_exit(1): 1
before_shmem_exit callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:  shmem_exit(1): 4
on_shmem_exit callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:  proc_exit(1): 2 callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:  exit(1)
< 2017-05-29 22:42:17.104 CEST > DEBUG:  shmem_exit(-1): 0
before_shmem_exit callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:  shmem_exit(-1): 0
on_shmem_exit callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:  postgres child[5168]: starting with (
< 2017-05-29 22:42:17.104 CEST > DEBUG:  proc_exit(-1): 0 callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:         postgres
< 2017-05-29 22:42:17.104 CEST > DEBUG:  )
< 2017-05-29 22:42:17.104 CEST > DEBUG:  InitPostgres
< 2017-05-29 22:42:17.104 CEST > DEBUG:  my backend ID is 2
< 2017-05-29 22:42:17.104 CEST > DEBUG:  StartTransaction
< 2017-05-29 22:42:17.104 CEST > DEBUG:  name: unnamed; blockState:
   DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1,
children:
< 2017-05-29 22:42:17.104 CEST > DEBUG:  CommitTransaction
< 2017-05-29 22:42:17.104 CEST > DEBUG:  name: unnamed; blockState:
   STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1,
children:
< 2017-05-29 22:42:17.105 CEST > DEBUG:  reaping dead processes
< 2017-05-29 22:42:17.105 CEST > DEBUG:  received replication command:
IDENTIFY_SYSTEM
< 2017-05-29 22:42:17.105 CEST > DEBUG:  received replication command:
START_REPLICATION SLOT "test3" 0/8000000 TIMELINE 3
< 2017-05-29 22:42:17.105 CEST > DEBUG:  standby "walreceiver" has now
caught up with primary
< 2017-05-29 22:42:17.105 CEST > DEBUG:  write 0/8000360 flush
0/8000360 apply 0/8000360
< 2017-05-29 22:42:17.207 CEST > DEBUG:  hot standby feedback xmin 2535 epoch 0
< 2017-05-29 22:42:17.604 CEST > DEBUG:  switched WAL source from
stream to archive after failure
< 2017-05-29 22:42:17.605 CEST > DEBUG:  invalid record length at
0/8000360: wanted 24, got 0
< 2017-05-29 22:42:17.605 CEST > DEBUG:  switched WAL source from
archive to stream after failure
< 2017-05-29 22:42:17.605 CEST > DEBUG:  find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-29 22:42:17.605 CEST > DEBUG:  find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-29 22:42:17.607 CEST > FATAL:  could not connect to the
primary server: could not translate host name "port=5432" to address:
Name or service not known

< 2017-05-29 22:42:17.607 CEST > DEBUG:  shmem_exit(1): 1
before_shmem_exit callbacks to make
< 2017-05-29 22:42:17.607 CEST > DEBUG:  shmem_exit(1): 4
on_shmem_exit callbacks to make

(22:42:18)$ /usr/pgsql-9.6/bin/pg_ctl promote -D /var/lib/pgsql/9.6/data

< 2017-05-29 22:42:18.685 CEST > LOG:  received promote request
< 2017-05-29 22:42:18.685 CEST > LOG:  redo is not required
< 2017-05-29 22:42:18.685 CEST > DEBUG:  resetting unlogged relations:
cleanup 0 init 1
< 2017-05-29 22:42:18.686 CEST > LOG:  selected new timeline ID: 4
...


--
Ludovic


В списке pgsql-general по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: [GENERAL] wal_retrieve_retry_interval
Следующее
От: "Wetzel, Juergen (Juergen)"
Дата:
Сообщение: [GENERAL] Slow query plan used