Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

Поиск
Список
Период
Сортировка
От Michael Paquier
Тема Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Дата
Msg-id CAB7nPqQ4iZzjDrJc5VBvHY6uG7wGe9A8pjPSEfj3H7_NfNiu9Q@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code  (Andres Freund <andres@2ndquadrant.com>)
Ответы Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code  (Fujii Masao <masao.fujii@gmail.com>)
Список pgsql-hackers
On Tue, Jan 20, 2015 at 1:56 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-01-19 17:16:11 +0900, Michael Paquier wrote:
>> On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> > On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> Not this patch's fault, but I'm getting a bit tired seeing the above open
>> >> coded. How about adding a function that does the sleeping based on a
>> >> timestamptz and a ms interval?
>> > You mean in plugins, right? I don't recall seeing similar patterns in
>> > other code paths of backend. But I think that we can use something
>> > like that in timestamp.c then because we need to leverage that between
>> > two timestamps, the last failure and now():
>> > TimestampSleepDifference(start_time, stop_time, internal_ms);
>> > Perhaps you have something else in mind?
>> >
>> > Attached is an updated patch.
>
>> Actually I came with better than last patch by using a boolean flag as
>> return value of TimestampSleepDifference and use
>> TimestampDifferenceExceeds directly inside it.
>
>> Subject: [PATCH] Add wal_availability_check_interval to control WAL fetching
>>  on failure
>
> I think that name isn't a very good. And its isn't very accurate
> either.
> How about wal_retrieve_retry_interval? Not very nice, but I think it's
> still better than the above.
Fine for me.

>> +     <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
>> +      <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
>> +      <indexterm>
>> +        <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
>> +      </indexterm>
>> +      </term>
>> +      <listitem>
>> +       <para>
>> +        This parameter specifies the amount of time to wait when
>> +        WAL is not available for a node in recovery. Default value is
>> +        <literal>5s</>.
>> +       </para>
>> +       <para>
>> +        A node in recovery will wait for this amount of time if
>> +        <varname>restore_command</> returns nonzero exit status code when
>> +        fetching new WAL segment files from archive or when a WAL receiver
>> +        is not able to fetch a WAL record when using streaming replication.
>> +       </para>
>> +      </listitem>
>> +     </varlistentry>
>> +
>>      </variablelist>
>
> Walreceiver doesn't wait that amount, but rather how long the connection
> is intact. And restore_command may or may not retry.
I changed this paragraph as follows:
+        This parameter specifies the amount of time to wait when a failure
+        occurred when reading WAL from a source (be it via streaming
+        replication, local <filename>pg_xlog</> or WAL archive) for a node
+        in standby mode, or when WAL is expected from a source. Default
+        value is <literal>5s</>.
Pondering more about this patch, I am getting the feeling that it is
not that much a win to explain in details in the doc each failure
depending on the source from which WAL is taken. But it is essential
to mention that the wait is done only for a node using standby mode.

Btw, I noticed two extra things that I think should be done for consistency:
- We should use as well wal_retrieve_retry_interval when waiting for
WAL to arrive after checking for the failures:
                                        /*
-                                        * Wait for more WAL to
arrive. Time out after 5 seconds,
-                                        * like when polling the
archive, to react to a trigger
-                                        * file promptly.
+                                        * Wait for more WAL to
arrive. Time out after
+                                        * wal_retrieve_retry_interval
ms, like when polling the archive
+                                        * to react to a trigger file promptly.
                                         */
                                        WaitLatch(&XLogCtl->recoveryWakeupLatch,
                                                          WL_LATCH_SET
| WL_TIMEOUT,
-                                                         5000L);
+
wal_retrieve_retry_interval * 1L);
-

Updated patch attached.
--
Michael

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: tim_wilson
Дата:
Сообщение: Re: Inaccuracy in VACUUM's tuple count estimates
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: B-Tree support function number 3 (strxfrm() optimization)