Thread: [GENERAL] wal_retrieve_retry_interval


[GENERAL] wal_retrieve_retry_interval

From:
Ludovic Vaugeois-Pepin
Date:
Hi,

Say, with 9.6.2,  a hot_standby fails to connect to a replication slot:
    FATAL:  could not start WAL streaming: ERROR:  replication slot "test3" does not exist
or
    FATAL:  could not connect to the primary server: FATAL:  the database system is starting up

Is there a way to reduce the time it takes until the next attempt? I assumed, wrongly I think, that this would be wal_retrieve_retry_interval, but it doesn't seem to make a difference: I set it to 3s, yet the retry still takes about 15s. Here are two log samples:

1.

< 2017-05-27 20:12:12.137 CEST > LOG:  consistent recovery state reached at 0/8000328
< 2017-05-27 20:12:12.137 CEST > LOG:  invalid record length at 0/8000328: wanted 24, got 0
< 2017-05-27 20:12:12.137 CEST > LOG:  database system is ready to accept read only connections
< 2017-05-27 20:12:12.142 CEST > FATAL:  could not connect to the primary server: FATAL:  the database system is starting up

< 2017-05-27 20:12:27.208 CEST > LOG:  fetching timeline history file for timeline 4 from primary server
< 2017-05-27 20:12:27.212 CEST > LOG:  started streaming WAL from primary at 0/8000000 on timeline 3
< 2017-05-27 20:12:27.212 CEST > LOG:  replication terminated by primary server
< 2017-05-27 20:12:27.212 CEST > DETAIL:  End of WAL reached on timeline 3 at 0/8000328.


2.

< 2017-05-26 19:17:48.462 CEST > LOG:  database system was shut down in recovery at 2017-05-26 19:17:20 CEST
< 2017-05-26 19:17:48.462 CEST > LOG:  entering standby mode
< 2017-05-26 19:17:48.463 CEST > LOG:  consistent recovery state reached at 0/8000398
< 2017-05-26 19:17:48.463 CEST > LOG:  invalid record length at 0/8000398: wanted 24, got 0
< 2017-05-26 19:17:48.464 CEST > LOG:  database system is ready to accept read only connections
< 2017-05-26 19:17:48.470 CEST > FATAL:  could not start WAL streaming: ERROR:  replication slot "test3" does not exist

< 2017-05-26 19:18:03.495 CEST > LOG:  fetching timeline history file for timeline 4 from primary server
< 2017-05-26 19:18:03.498 CEST > LOG:  started streaming WAL from primary at 0/8000000 on timeline 3
< 2017-05-26 19:18:03.498 CEST > LOG:  replication terminated by primary server
< 2017-05-26 19:18:03.498 CEST > DETAIL:  End of WAL reached on timeline 3 at 0/8000398.



-bash-4.2$ psql
psql (9.6.2)
Type "help" for help.

postgres=# show wal_retrieve_retry_interval;
 wal_retrieve_retry_interval
-----------------------------
 3s
(1 row)
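
For reference, this is a sketch of one way the value can be applied, assuming superuser access (the deployment scripts may instead edit postgresql.conf directly; the parameter only needs a reload, not a restart):

ALTER SYSTEM SET wal_retrieve_retry_interval = '3s';
SELECT pg_reload_conf();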



Thanks

Ludovic

Re: [GENERAL] wal_retrieve_retry_interval

From:
Michael Paquier
Date:
On Sat, May 27, 2017 at 2:40 PM, Ludovic Vaugeois-Pepin
<ludovicvp@gmail.com> wrote:
> Say, with 9.6.2,  a hot_standby fails to connect to a replication slot:
>     FATAL:  could not start WAL streaming: ERROR:  replication slot "test3"
> does not exist
> or
>     FATAL:  could not connect to the primary server: FATAL:  the database
> system is starting up
>
> Is there a way to reduce the time it takes until the next attempt? I
> assumed, wrongly I think, that this would be wal_retrieve_retry_interval,
> but it doesn't seem to make a difference: I set it to 3s, yet the retry
> still takes about 15s. Here are two log samples:

Could you double-check your configuration? If I set wal_retrieve_retry_interval to 1s on a standby, I can see the WAL receiver attempt a connection at that interval in the case of repeated failures.
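
For instance, something like the following run on the standby (a sketch) shows the value actually in effect and where it came from (sourcefile and sourceline are only visible to superusers):

SELECT name, setting, unit, source, sourcefile, sourceline
FROM pg_settings
WHERE name = 'wal_retrieve_retry_interval';
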
--
Michael


Fwd: [GENERAL] wal_retrieve_retry_interval

From:
Ludovic Vaugeois-Pepin
Date:

On Sun, May 28, 2017 at 6:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sat, May 27, 2017 at 2:40 PM, Ludovic Vaugeois-Pepin
<ludovicvp@gmail.com> wrote:
> Say, with 9.6.2,  a hot_standby fails to connect to a replication slot:
>     FATAL:  could not start WAL streaming: ERROR:  replication slot "test3"
> does not exist
> or
>     FATAL:  could not connect to the primary server: FATAL:  the database
> system is starting up
>
> Is there a way to reduce the time it takes until the next attempt? I
> assumed, wrongly I think, that this would be wal_retrieve_retry_interval,
> but it doesn't seem to make a difference: I set it to 3s, yet the retry
> still takes about 15s. Here are two log samples:

Could you double-check your configuration? If I set wal_retrieve_retry_interval to 1s on a standby, I can see the WAL receiver attempt a connection at that interval in the case of repeated failures.

It really is set to 3s on all servers (master and standbys); it is set early in the "deployment" process, at the same time as "listen_addresses", "hot_standby", and other settings. The problem doesn't occur every time I run the tests. I increased logging to DEBUG1.
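
For reference, a minimal sketch of one way such a logging change can be applied, assuming superuser access (log_min_messages only needs a reload, not a restart):

ALTER SYSTEM SET log_min_messages = 'debug1';
SELECT pg_reload_conf();

This is what I got when the problem occurred again. Note that the empty line is present in the log file.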

< 2017-05-28 09:29:36.127 CEST > LOG:  database system was shut down in recovery at 2017-05-28 09:29:03 CEST
< 2017-05-28 09:29:36.127 CEST > LOG:  entering standby mode
< 2017-05-28 09:29:36.127 CEST > DEBUG:  checkpoint record is at 0/80002B8
< 2017-05-28 09:29:36.128 CEST > DEBUG:  redo record is at 0/80002B8; shutdown TRUE
< 2017-05-28 09:29:36.128 CEST > DEBUG:  next transaction ID: 0:2535; next OID: 18660
< 2017-05-28 09:29:36.128 CEST > DEBUG:  next MultiXactId: 1; next MultiXactOffset: 0
< 2017-05-28 09:29:36.128 CEST > DEBUG:  oldest unfrozen transaction ID: 1750, in database 1
< 2017-05-28 09:29:36.128 CEST > DEBUG:  oldest MultiXactId: 1, in database 1
< 2017-05-28 09:29:36.128 CEST > DEBUG:  commit timestamp Xid oldest/newest: 0/0
< 2017-05-28 09:29:36.128 CEST > DEBUG:  transaction ID wrap limit is 2147485397, limited by database with OID 1
< 2017-05-28 09:29:36.128 CEST > DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
< 2017-05-28 09:29:36.128 CEST > DEBUG:  starting up replication slots
< 2017-05-28 09:29:36.128 CEST > DEBUG:  resetting unlogged relations: cleanup 1 init 0
< 2017-05-28 09:29:36.129 CEST > DEBUG:  initializing for hot standby
< 2017-05-28 09:29:36.129 CEST > DEBUG:  recovery snapshots are now enabled
< 2017-05-28 09:29:36.129 CEST > LOG:  consistent recovery state reached at 0/8000328
< 2017-05-28 09:29:36.129 CEST > LOG:  invalid record length at 0/8000328: wanted 24, got 0
< 2017-05-28 09:29:36.129 CEST > LOG:  database system is ready to accept read only connections
< 2017-05-28 09:29:36.135 CEST > FATAL:  could not connect to the primary server: FATAL:  the database system is starting up

< 2017-05-28 09:29:36.135 CEST > DEBUG:  invalid record length at 0/8000328: wanted 24, got 0
< 2017-05-28 09:29:51.153 CEST > DEBUG:  invalid record length at 0/8000328: wanted 24, got 0
< 2017-05-28 09:29:51.158 CEST > LOG:  fetching timeline history file for timeline 4 from primary server
< 2017-05-28 09:29:51.160 CEST > LOG:  started streaming WAL from primary at 0/8000000 on timeline 3
< 2017-05-28 09:29:51.160 CEST > LOG:  replication terminated by primary server
< 2017-05-28 09:29:51.160 CEST > DETAIL:  End of WAL reached on timeline 3 at 0/8000328.
< 2017-05-28 09:29:51.161 CEST > DEBUG:  walreceiver ended streaming and awaits new instructions
< 2017-05-28 09:29:51.161 CEST > LOG:  new target timeline is 4
< 2017-05-28 09:29:51.161 CEST > DEBUG:  invalid record length at 0/8000328: wanted 24, got 0
< 2017-05-28 09:29:51.161 CEST > LOG:  restarted WAL streaming at 0/8000000 on timeline 4
< 2017-05-28 09:29:51.191 CEST > LOG:  redo starts at 0/8000328


 
--
Ludovic Vaugeois-Pepin

Re: [GENERAL] wal_retrieve_retry_interval

From:
Ludovic Vaugeois-Pepin
Date:
On Sun, May 28, 2017 at 9:49 AM, Ludovic Vaugeois-Pepin
<ludovicvp@gmail.com> wrote:
>
>
> On Sun, May 28, 2017 at 6:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>>
>> On Sat, May 27, 2017 at 2:40 PM, Ludovic Vaugeois-Pepin
>> <ludovicvp@gmail.com> wrote:
>> > Say, with 9.6.2,  a hot_standby fails to connect to a replication slot:
>> >     FATAL:  could not start WAL streaming: ERROR:  replication slot "test3"
>> > does not exist
>> > or
>> >     FATAL:  could not connect to the primary server: FATAL:  the database
>> > system is starting up
>> >
>> > Is there a way to reduce the time it takes until the next attempt? I
>> > assumed, wrongly I think, that this would be wal_retrieve_retry_interval,
>> > but it doesn't seem to make a difference: I set it to 3s, yet the retry
>> > still takes about 15s. Here are two log samples:
>>
>> Could you double-check your configuration? If I set wal_retrieve_retry_interval to 1s on a standby, I can see
>> the WAL receiver attempt a connection at that interval in the case of repeated failures.
>
>
> It really is set to 3s on all servers (master and standbys); it is set early in the "deployment" process, at the
> same time as "listen_addresses", "hot_standby", and other settings. The problem doesn't occur every time I run the
> tests. I increased logging to DEBUG1. This is what I got when the problem occurred again; note that the empty line
> is present in the log file.
>
> < 2017-05-28 09:29:36.127 CEST > LOG:  database system was shut down in recovery at 2017-05-28 09:29:03 CEST
> < 2017-05-28 09:29:36.127 CEST > LOG:  entering standby mode
> < 2017-05-28 09:29:36.127 CEST > DEBUG:  checkpoint record is at 0/80002B8
> < 2017-05-28 09:29:36.128 CEST > DEBUG:  redo record is at 0/80002B8; shutdown TRUE
> < 2017-05-28 09:29:36.128 CEST > DEBUG:  next transaction ID: 0:2535; next OID: 18660
> < 2017-05-28 09:29:36.128 CEST > DEBUG:  next MultiXactId: 1; next MultiXactOffset: 0
> < 2017-05-28 09:29:36.128 CEST > DEBUG:  oldest unfrozen transaction ID: 1750, in database 1
> < 2017-05-28 09:29:36.128 CEST > DEBUG:  oldest MultiXactId: 1, in database 1
> < 2017-05-28 09:29:36.128 CEST > DEBUG:  commit timestamp Xid oldest/newest: 0/0
> < 2017-05-28 09:29:36.128 CEST > DEBUG:  transaction ID wrap limit is 2147485397, limited by database with OID 1
> < 2017-05-28 09:29:36.128 CEST > DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
> < 2017-05-28 09:29:36.128 CEST > DEBUG:  starting up replication slots
> < 2017-05-28 09:29:36.128 CEST > DEBUG:  resetting unlogged relations: cleanup 1 init 0
> < 2017-05-28 09:29:36.129 CEST > DEBUG:  initializing for hot standby
> < 2017-05-28 09:29:36.129 CEST > DEBUG:  recovery snapshots are now enabled
> < 2017-05-28 09:29:36.129 CEST > LOG:  consistent recovery state reached at 0/8000328
> < 2017-05-28 09:29:36.129 CEST > LOG:  invalid record length at 0/8000328: wanted 24, got 0
> < 2017-05-28 09:29:36.129 CEST > LOG:  database system is ready to accept read only connections
> < 2017-05-28 09:29:36.135 CEST > FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
>
> < 2017-05-28 09:29:36.135 CEST > DEBUG:  invalid record length at 0/8000328: wanted 24, got 0
> < 2017-05-28 09:29:51.153 CEST > DEBUG:  invalid record length at 0/8000328: wanted 24, got 0
> < 2017-05-28 09:29:51.158 CEST > LOG:  fetching timeline history file for timeline 4 from primary server
> < 2017-05-28 09:29:51.160 CEST > LOG:  started streaming WAL from primary at 0/8000000 on timeline 3
> < 2017-05-28 09:29:51.160 CEST > LOG:  replication terminated by primary server
> < 2017-05-28 09:29:51.160 CEST > DETAIL:  End of WAL reached on timeline 3 at 0/8000328.
> < 2017-05-28 09:29:51.161 CEST > DEBUG:  walreceiver ended streaming and awaits new instructions
> < 2017-05-28 09:29:51.161 CEST > LOG:  new target timeline is 4
> < 2017-05-28 09:29:51.161 CEST > DEBUG:  invalid record length at 0/8000328: wanted 24, got 0
> < 2017-05-28 09:29:51.161 CEST > LOG:  restarted WAL streaming at 0/8000000 on timeline 4
> < 2017-05-28 09:29:51.191 CEST > LOG:  redo starts at 0/8000328


And this is the log with log_min_messages DEBUG5:


< 2017-05-28 14:48:10.108 CEST > LOG:  invalid record length at 0/8000398: wanted 24, got 0
< 2017-05-28 14:48:10.108 CEST > DEBUG:  switched WAL source from archive to stream after failure
< 2017-05-28 14:48:10.110 CEST > DEBUG:  checkpointer updated shared memory configuration values
< 2017-05-28 14:48:10.110 CEST > LOG:  database system is ready to accept read only connections
< 2017-05-28 14:48:10.112 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-28 14:48:10.112 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-28 14:48:10.126 CEST > FATAL:  could not connect to the primary server: FATAL:  the database system is starting up

< 2017-05-28 14:48:10.126 CEST > DEBUG:  shmem_exit(1): 1 before_shmem_exit callbacks to make
< 2017-05-28 14:48:10.126 CEST > DEBUG:  shmem_exit(1): 4 on_shmem_exit callbacks to make
< 2017-05-28 14:48:10.127 CEST > DEBUG:  proc_exit(1): 2 callbacks to make
< 2017-05-28 14:48:10.127 CEST > DEBUG:  exit(1)
< 2017-05-28 14:48:10.127 CEST > DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
< 2017-05-28 14:48:10.127 CEST > DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
< 2017-05-28 14:48:10.127 CEST > DEBUG:  proc_exit(-1): 0 callbacks to make
< 2017-05-28 14:48:10.127 CEST > DEBUG:  switched WAL source from stream to archive after failure
< 2017-05-28 14:48:10.127 CEST > DEBUG:  invalid record length at 0/8000398: wanted 24, got 0
< 2017-05-28 14:48:10.127 CEST > DEBUG:  switched WAL source from archive to stream after failure
< 2017-05-28 14:48:10.130 CEST > DEBUG:  reaping dead processes
< 2017-05-28 14:48:11.062 CEST > DEBUG:  forked new backend, pid=3184 socket=12
< 2017-05-28 14:48:11.063 CEST > DEBUG:  postgres child[3184]: starting with (
< 2017-05-28 14:48:11.063 CEST > DEBUG:         postgres
< 2017-05-28 14:48:11.063 CEST > DEBUG:  )
< 2017-05-28 14:48:11.063 CEST > DEBUG:  InitPostgres
< 2017-05-28 14:48:11.063 CEST > DEBUG:  my backend ID is 2
< 2017-05-28 14:48:11.064 CEST > DEBUG:  StartTransaction
< 2017-05-28 14:48:11.064 CEST > DEBUG:  name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
< 2017-05-28 14:48:11.075 CEST > DEBUG:  CommitTransaction
< 2017-05-28 14:48:11.075 CEST > DEBUG:  name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
< 2017-05-28 14:48:11.077 CEST > DEBUG:  shmem_exit(0): 1 before_shmem_exit callbacks to make
< 2017-05-28 14:48:11.077 CEST > DEBUG:  shmem_exit(0): 6 on_shmem_exit callbacks to make
< 2017-05-28 14:48:11.077 CEST > DEBUG:  proc_exit(0): 3 callbacks to make
< 2017-05-28 14:48:11.077 CEST > DEBUG:  exit(0)
< 2017-05-28 14:48:11.077 CEST > DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
< 2017-05-28 14:48:11.077 CEST > DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
< 2017-05-28 14:48:11.077 CEST > DEBUG:  proc_exit(-1): 0 callbacks to make
< 2017-05-28 14:48:11.081 CEST > DEBUG:  reaping dead processes
< 2017-05-28 14:48:11.081 CEST > DEBUG:  server process (PID 3184) exited with exit code 0
< 2017-05-28 14:48:11.095 CEST > DEBUG:  forked new backend, pid=3186 socket=12
< 2017-05-28 14:48:11.096 CEST > DEBUG:  postgres child[3186]: starting with (
< 2017-05-28 14:48:11.096 CEST > DEBUG:         postgres
< 2017-05-28 14:48:11.096 CEST > DEBUG:  )

...

< 2017-05-28 14:48:24.985 CEST > DEBUG:  postgres child[3430]: starting with (
< 2017-05-28 14:48:24.985 CEST > DEBUG:         postgres
< 2017-05-28 14:48:24.985 CEST > DEBUG:  )
< 2017-05-28 14:48:24.985 CEST > DEBUG:  InitPostgres
< 2017-05-28 14:48:24.985 CEST > DEBUG:  my backend ID is 2
< 2017-05-28 14:48:24.986 CEST > DEBUG:  StartTransaction
< 2017-05-28 14:48:24.986 CEST > DEBUG:  name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
< 2017-05-28 14:48:24.989 CEST > DEBUG:  CommitTransaction
< 2017-05-28 14:48:24.989 CEST > DEBUG:  name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
< 2017-05-28 14:48:24.989 CEST > DEBUG:  StartTransactionCommand
< 2017-05-28 14:48:24.989 CEST > DEBUG:  StartTransaction
< 2017-05-28 14:48:24.989 CEST > DEBUG:  name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
< 2017-05-28 14:48:24.992 CEST > DEBUG:  CommitTransactionCommand
< 2017-05-28 14:48:24.992 CEST > DEBUG:  CommitTransaction
< 2017-05-28 14:48:24.992 CEST > DEBUG:  name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
< 2017-05-28 14:48:24.993 CEST > DEBUG:  shmem_exit(0): 1 before_shmem_exit callbacks to make
< 2017-05-28 14:48:24.993 CEST > DEBUG:  shmem_exit(0): 6 on_shmem_exit callbacks to make
< 2017-05-28 14:48:24.994 CEST > DEBUG:  proc_exit(0): 3 callbacks to make
< 2017-05-28 14:48:24.994 CEST > DEBUG:  exit(0)
< 2017-05-28 14:48:24.994 CEST > DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
< 2017-05-28 14:48:24.994 CEST > DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
< 2017-05-28 14:48:24.994 CEST > DEBUG:  proc_exit(-1): 0 callbacks to make
< 2017-05-28 14:48:24.996 CEST > DEBUG:  reaping dead processes
< 2017-05-28 14:48:24.996 CEST > DEBUG:  server process (PID 3430) exited with exit code 0
< 2017-05-28 14:48:25.144 CEST > DEBUG:  switched WAL source from stream to archive after failure
< 2017-05-28 14:48:25.144 CEST > DEBUG:  invalid record length at 0/8000398: wanted 24, got 0
< 2017-05-28 14:48:25.144 CEST > DEBUG:  switched WAL source from archive to stream after failure
< 2017-05-28 14:48:25.147 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-28 14:48:25.147 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-28 14:48:25.161 CEST > LOG:  fetching timeline history file for timeline 4 from primary server
< 2017-05-28 14:48:25.166 CEST > LOG:  started streaming WAL from primary at 0/8000000 on timeline 3
< 2017-05-28 14:48:25.166 CEST > DEBUG:  sendtime 2017-05-28 14:48:25.16509+02 receipttime 2017-05-28 14:48:25.166539+02 replication apply delay (N/A) transfer latency 1 ms
< 2017-05-28 14:48:25.166 CEST > LOG:  replication terminated by primary server
< 2017-05-28 14:48:25.166 CEST > DETAIL:  End of WAL reached on timeline 3 at 0/8000398.
< 2017-05-28 14:48:25.166 CEST > DEBUG:  sending write 0/8000398 flush 0/8000398 apply 0/8000398
< 2017-05-28 14:48:25.167 CEST > DEBUG:  walreceiver ended streaming and awaits new instructions
< 2017-05-28 14:48:25.168 CEST > LOG:  new target timeline is 4
< 2017-05-28 14:48:25.168 CEST > DEBUG:  switched WAL source from stream to archive after failure
< 2017-05-28 14:48:25.168 CEST > DEBUG:  invalid record length at 0/8000398: wanted 24, got 0
< 2017-05-28 14:48:25.168 CEST > DEBUG:  switched WAL source from archive to stream after failure
< 2017-05-28 14:48:25.169 CEST > LOG:  restarted WAL streaming at 0/8000000 on timeline 4
< 2017-05-28 14:48:25.170 CEST > DEBUG:  sendtime 2017-05-28 14:48:25.168651+02 receipttime 2017-05-28 14:48:25.170063+02 replication apply delay (N/A) transfer latency 1 ms
< 2017-05-28 14:48:25.170 CEST > DEBUG:  creating and filling new WAL file
< 2017-05-28 14:48:25.225 CEST > DEBUG:  done creating and filling new WAL file
< 2017-05-28 14:48:25.225 CEST > DEBUG:  sending write 0/80006C8 flush 0/8000398 apply 0/8000398
< 2017-05-28 14:48:25.226 CEST > DEBUG:  sending write 0/80006C8 flush 0/80006C8 apply 0/8000398
< 2017-05-28 14:48:25.227 CEST > DEBUG:  sending hot standby feedback xmin 2535 epoch 0


Re: [GENERAL] wal_retrieve_retry_interval

From:
Michael Paquier
Date:
On Sun, May 28, 2017 at 9:41 AM, Ludovic Vaugeois-Pepin
<ludovicvp@gmail.com> wrote:
> On Sun, May 28, 2017 at 9:49 AM, Ludovic Vaugeois-Pepin
> <ludovicvp@gmail.com> wrote:
>> It really is set to 3s on all servers (master and standbys); it is set early in the "deployment" process, at the
>> same time as "listen_addresses", "hot_standby", and other settings. The problem doesn't occur every time I run the
>> tests. I increased logging to DEBUG1. This is what I got when the problem occurred again; note that the empty line
>> is present in the log file.

Aren't those logs coming from a standby that is being reconnected to a promoted standby? In short, you restarted this standby with a fresh recovery.conf. The WAL receiver tries to fetch history file data for validity checks when it notices a timeline jump, i.e. when it needs to jump to a new timeline itself, and the logs you are showing highlight that.
--
Michael


Re: [GENERAL] wal_retrieve_retry_interval

From:
Ludovic Vaugeois-Pepin
Date:
On Tue, May 30, 2017 at 12:20 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sun, May 28, 2017 at 9:41 AM, Ludovic Vaugeois-Pepin
> <ludovicvp@gmail.com> wrote:
>> On Sun, May 28, 2017 at 9:49 AM, Ludovic Vaugeois-Pepin
>> <ludovicvp@gmail.com> wrote:
>>> It really is set to 3s on all servers (master and standbys); it is set early in the "deployment" process, at the
>>> same time as "listen_addresses", "hot_standby", and other settings. The problem doesn't occur every time I run
>>> the tests. I increased logging to DEBUG1. This is what I got when the problem occurred again; note that the empty
>>> line is present in the log file.
>
> Aren't those logs coming from a standby that is being reconnected to a promoted standby? In short, you restarted
> this standby with a fresh recovery.conf.

Almost: the promoted standby (host test1) hasn't yet been started (and promoted) when the standby showing the 15-second delay (host test2) tries to connect to it (with a fresh recovery.conf):

$ cat recovery.conf
standby_mode = 'on'
recovery_target_timeline = 'latest'
primary_conninfo = 'host=test1 port=5432 user=repl1'
primary_slot_name = 'test2'
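
Side note on the "replication slot ... does not exist" failures seen earlier in the thread: a physical slot must exist on the primary before a standby names it in primary_slot_name. A minimal sketch, run on the primary, using the slot name from this recovery.conf:

SELECT pg_create_physical_replication_slot('test2');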


> The WAL receiver tries to fetch history file data for validity checks when it notices a timeline jump, i.e. when
> it needs to jump to a new timeline itself, and the logs you are showing highlight that.

The logs below seem to indicate that the 15s delay begins before the promoted standby is started. Is that the problem? Some network connection timeout?
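
One way to watch that window from the standby side is the pg_stat_wal_receiver view that 9.6 added; a sketch (the view returns no row while no WAL receiver process is running, which makes the gap visible when polled):

SELECT pid, status, received_lsn, last_msg_receipt_time
FROM pg_stat_wal_receiver;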




The standby (host test2 below) is trying to reconnect to a standby (host test1 below) that is about to be started and promoted.


host test2: before test1 is fully started and promoted:

< 2017-05-29 22:42:17.022 CEST > LOG:  database system is ready to accept read only connections
< 2017-05-29 22:42:17.022 CEST > DEBUG:  checkpointer updated shared memory configuration values
< 2017-05-29 22:42:17.022 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-29 22:42:17.022 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-29 22:42:17.025 CEST > FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
...
< 2017-05-29 22:42:17.026 CEST > DEBUG:  switched WAL source from stream to archive after failure
< 2017-05-29 22:42:17.026 CEST > DEBUG:  invalid record length at 0/8000360: wanted 24, got 0
< 2017-05-29 22:42:17.026 CEST > DEBUG:  switched WAL source from archive to stream after failure
... (15 seconds later) ...
< 2017-05-29 22:42:32.042 CEST > DEBUG:  switched WAL source from stream to archive after failure
< 2017-05-29 22:42:32.042 CEST > DEBUG:  invalid record length at 0/8000360: wanted 24, got 0
< 2017-05-29 22:42:32.042 CEST > DEBUG:  switched WAL source from archive to stream after failure
< 2017-05-29 22:42:32.043 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-29 22:42:32.043 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-29 22:42:32.047 CEST > LOG:  fetching timeline history file for timeline 4 from primary server
< 2017-05-29 22:42:32.050 CEST > LOG:  started streaming WAL from primary at 0/8000000 on timeline 3



host test1: the to-be-promoted server, where the 500ms wal_retrieve_retry_interval seems to be honored (with a "bogus" primary_conninfo = 'host= port=5432 user=repl1'):

< 2017-05-29 22:42:17.099 CEST > LOG:  database system is ready to accept read only connections
< 2017-05-29 22:42:17.102 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-29 22:42:17.103 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-29 22:42:17.104 CEST > FATAL:  could not connect to the primary server: could not translate host name "port=5432" to address: Name or service not known

< 2017-05-29 22:42:17.104 CEST > DEBUG:  shmem_exit(1): 1 before_shmem_exit callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:  shmem_exit(1): 4 on_shmem_exit callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:  proc_exit(1): 2 callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:  exit(1)
< 2017-05-29 22:42:17.104 CEST > DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:  postgres child[5168]: starting with (
< 2017-05-29 22:42:17.104 CEST > DEBUG:  proc_exit(-1): 0 callbacks to make
< 2017-05-29 22:42:17.104 CEST > DEBUG:         postgres
< 2017-05-29 22:42:17.104 CEST > DEBUG:  )
< 2017-05-29 22:42:17.104 CEST > DEBUG:  InitPostgres
< 2017-05-29 22:42:17.104 CEST > DEBUG:  my backend ID is 2
< 2017-05-29 22:42:17.104 CEST > DEBUG:  StartTransaction
< 2017-05-29 22:42:17.104 CEST > DEBUG:  name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
< 2017-05-29 22:42:17.104 CEST > DEBUG:  CommitTransaction
< 2017-05-29 22:42:17.104 CEST > DEBUG:  name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
< 2017-05-29 22:42:17.105 CEST > DEBUG:  reaping dead processes
< 2017-05-29 22:42:17.105 CEST > DEBUG:  received replication command: IDENTIFY_SYSTEM
< 2017-05-29 22:42:17.105 CEST > DEBUG:  received replication command: START_REPLICATION SLOT "test3" 0/8000000 TIMELINE 3
< 2017-05-29 22:42:17.105 CEST > DEBUG:  standby "walreceiver" has now caught up with primary
< 2017-05-29 22:42:17.105 CEST > DEBUG:  write 0/8000360 flush 0/8000360 apply 0/8000360
< 2017-05-29 22:42:17.207 CEST > DEBUG:  hot standby feedback xmin 2535 epoch 0
< 2017-05-29 22:42:17.604 CEST > DEBUG:  switched WAL source from stream to archive after failure
< 2017-05-29 22:42:17.605 CEST > DEBUG:  invalid record length at 0/8000360: wanted 24, got 0
< 2017-05-29 22:42:17.605 CEST > DEBUG:  switched WAL source from archive to stream after failure
< 2017-05-29 22:42:17.605 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-29 22:42:17.605 CEST > DEBUG:  find_in_dynamic_libpath: trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-29 22:42:17.607 CEST > FATAL:  could not connect to the primary server: could not translate host name "port=5432" to address: Name or service not known

< 2017-05-29 22:42:17.607 CEST > DEBUG:  shmem_exit(1): 1 before_shmem_exit callbacks to make
< 2017-05-29 22:42:17.607 CEST > DEBUG:  shmem_exit(1): 4 on_shmem_exit callbacks to make

(22:42:18)$ /usr/pgsql-9.6/bin/pg_ctl promote -D /var/lib/pgsql/9.6/data

< 2017-05-29 22:42:18.685 CEST > LOG:  received promote request
< 2017-05-29 22:42:18.685 CEST > LOG:  redo is not required
< 2017-05-29 22:42:18.685 CEST > DEBUG:  resetting unlogged relations: cleanup 0 init 1
< 2017-05-29 22:42:18.686 CEST > LOG:  selected new timeline ID: 4
...


--
Ludovic


Re: [GENERAL] wal_retrieve_retry_interval

From:
Michael Paquier
Date:
On Mon, May 29, 2017 at 3:58 PM, Ludovic Vaugeois-Pepin
<ludovicvp@gmail.com> wrote:
> < 2017-05-29 22:42:17.026 CEST > DEBUG:  switched WAL source from archive to stream after failure
> ... (15 seconds later) ...
> < 2017-05-29 22:42:32.042 CEST > DEBUG:  switched WAL source from stream to archive after failure

Ludovic, is there some event happening between those two log entries? Something that could explain a longer delay is the time it takes for a WAL receiver to be considered as started (see WALRCV_STARTUP_TIMEOUT).
-- 
Michael

