Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

Поиск
Список
Период
Сортировка
От Abhijit Menon-Sen
Тема Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
Дата
Msg-id 20130116074817.GA5011@toroid.org
обсуждение исходный текст
Ответ на Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown  (Boszormenyi Zoltan <zb@cybertec.at>)
Список pgsql-hackers
Hi.

This patch was marked "Needs review" with no reviewers in the ongoing
CF, so I decided to take a look at it. I see that Zoltan has posted a
review, so I've added him to the list.

But I took a look at the latest patch in any case. Here are some
comments, mostly cosmetic ones.

> diff -dcrpN postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml postgresql/doc/src/sgml/ref/pg_basebackup.sgml
> *** postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml    2013-01-05 17:34:30.742135371 +0100
> --- postgresql/doc/src/sgml/ref/pg_basebackup.sgml    2013-01-07 15:11:40.787007890 +0100
> *************** PostgreSQL documentation
> *** 400,405 ****
> --- 400,425 ----
>        </varlistentry>
>
>        <varlistentry>
> +       <term><option>-r <replaceable class="parameter">interval</replaceable></option></term>
> +       <term><option>--recvtimeout=<replaceable class="parameter">interval</replaceable></option></term>
> +       <listitem>
> +        <para>
> +         time that receiver waits for communication from server (in seconds).
> +        </para>
> +       </listitem>
> +      </varlistentry>

I would reword this as "The maximum time (in seconds) to wait for data
from the server (default: wait forever)".

> +      <varlistentry>
> +       <term><option>-t <replaceable class="parameter">interval</replaceable></option></term>
> +       <term><option>--conntimeout=<replaceable class="parameter">interval</replaceable></option></term>
> +       <listitem>
> +        <para>
> +         time that client wait for connection to establish with server (in seconds).
> +        </para>
> +       </listitem>
> +      </varlistentry>

Likewise, "The maximum time (in seconds) to wait for a connection to the
server to succeed (default: wait forever)".

Same thing in pg_receivexlog.sgml. Also, there's trailing whitespace in
various places in these files (and elsewhere in the patch), which should
be fixed.

> diff -dcrpN postgresql.orig/src/bin/pg_basebackup/pg_basebackup.c postgresql/src/bin/pg_basebackup/pg_basebackup.c
> *** postgresql.orig/src/bin/pg_basebackup/pg_basebackup.c    2013-01-05 17:34:30.778135625 +0100
> --- postgresql/src/bin/pg_basebackup/pg_basebackup.c    2013-01-07 15:16:24.610037886 +0100
> *************** bool        streamwal = false;
> *** 45,50 ****
> --- 45,54 ----
>   bool        fastcheckpoint = false;
>   bool        writerecoveryconf = false;
>   int            standby_message_timeout = 10 * 1000;        /* 10 sec = default */
> + int        standby_recv_timeout = 60*1000;        /* 60 sec = default */
> + char        *standby_connect_timeout = NULL;

I don't really like standby_recv_timeout being an int and
standby_connect_timeout being a char *. I understand that it's so that
it can be assigned to "values[i]" in GetConnection(), but that reason is
very distant, and not obvious from this code at all.

That said, I don't know if it's really worth bothering with.

> + #define NAPTIME_PER_CYCLE 100    /* max sleep time between cycles (100ms) */

This probably needs a better comment. Why are we sleeping between
cycles? What cycles?

> +     printf(_("  -r, --recvtimeout=INTERVAL time that receiver waits for communication from\n"
> +            "                             server (in seconds)\n"));
> +     printf(_("  -t, --conntimeout=INTERVAL time that client wait for connection to establish\n"
> +            "                             with server (in seconds)\n"));

Same comments about wording apply, but perhaps there's no need to
mention the default.

> !             if (r == 0 || (r < 0 && errno == EINTR))
> !             {
> !                 /*
> !                  * Got a timeout or signal. Before Continuing the loop, check for timeout.
> !                  */
> !                 if (standby_recv_timeout > 0)
> !                 {
> !                     now = localGetCurrentTimestamp();

I'd make "now" local to this block, and get rid of the comment. The two
"if"s are perfectly clear. This applies to the same pattern in other
places in the patch as well.

> !                     if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
> !                     {
> !                         fprintf(stderr, _("%s: terminating DB File receive due to timeout\n"),

Better wording? "DB File receive" is confusing. Even something like
"Closing connection due to read timeout" would be better. Or perhaps
you can make it like the following message, slightly lower:

> !             if (PQconsumeInput(conn) == 0)
> !             {
> !                 fprintf(stderr,
> !                         _("%s: could not receive data from WAL Sender: %s"),
> !                         progname, PQerrorMessage(conn));

…and in the former case, say "read timeout" instead of PQerrorMessage().

> !             /* Set the last reply timestamp */
> !             last_recv_timestamp = localGetCurrentTimestamp();
> !
> !             /* Some data is received, so go back read them in buffer*/
> !             continue;

No need for these comments.

> +     /* Set the last reply timestamp */
> +     last_recv_timestamp = localGetCurrentTimestamp();

Likewise (in various places).

>       /*
> !      * Connect in replication mode to the server, Sending connect_timeout
> !      * as configured, there is no need for rw_timeout.
>        */
> !     conn = GetConnection(standby_connect_timeout);

This comment is pretty confusing.

>    * Connect to the server. Returns a valid PGconn pointer if connected,
>    * or NULL on non-permanent error. On permanent error, the function will
>    * call exit(1) directly.
> +  * Set conn_timeout to PGconn structure if their value
> +  * is not NULL.
>    */
>   PGconn *
> ! GetConnection(char *conn_timeout)

And this comment is just wrong.

The patch looks OK otherwise. Zoltan indicated that his tests were
successful, so I didn't retest. Marking "Waiting on author" again.

-- Abhijit



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Parallel query execution
Следующее
От: Abhijit Menon-Sen
Дата:
Сообщение: CF3+4 (was Re: Parallel query execution)