Discussion: Synch Rep: direct transfer of WAL file from the primary to the standby
Hi,

http://archives.postgresql.org/message-id/496B9495.4010902@enterprisedb.com
> IMHO, the synchronous replication isn't in such good shape, I'm afraid.
> I've said this before, but I'm not happy with the "built from spare parts"
> nature of it. You shouldn't have to configure an archive, file-based log
> shipping using rsync or whatever, and pg_standby. All that is in addition
> to the direct connection between master and slave. The slave really should
> be able to just connect to the master, and download all the WAL it needs
> directly. That's a huge usability issue if left as is, but requires very large
> architectural changes to fix.

One of the major problems with Synch Rep was that WAL files generated before replication starts were not automatically transferred to the standby server. Those files had to be shipped by hand or via the warm-standby mechanism, which degraded the usability of Synch Rep.

So, I'd like to propose a capability whereby the startup process automatically restores a missing file (WAL file, backup history file or timeline history file) from the primary server. Specifically, the startup process tries to retrieve the file in the following order:

1) from the archive in the standby server
2) from the primary server <--- New Feature!
3) from pg_xlog in the standby server

This means that users no longer need extra copy operations to set up replication.

Implementation
--------------------

The main part of this capability is a new function to read the specified WAL file. Its definition is:

pg_read_xlogfile (filename text [, restore bool]) returns setof bytea

- filename: name of the file to read
- restore: indicates whether to try to restore the file from the archive
- returns: the content of the specified file (the max size of one row is 8KB, i.e. this function returns 2,048 rows when a 16MB WAL file is requested)

If restore=true, this function first tries to retrieve the file from the archive.
This requires restore_command to be specified in postgresql.conf. If that restore fails or restore=false, it tries to retrieve the file from pg_xlog. In this case, WAL files or a backup history file might be removed from pg_xlog by a concurrent checkpoint or pg_stop_backup, respectively, so ControlFileLock must be held to read them. On the other hand, we should not send (return) any read data while holding the lock; otherwise, a network outage would seriously block processing which requires the lock. So, the WAL file or backup history file in pg_xlog is copied to a temporary file while holding the lock, then read and sent (returned) after releasing it.

In the standby server, if a missing file is found, the startup process connects to the primary server as a normal client and retrieves the binary contents of the WAL file using the following SQL. The restored file is then written to pg_xlog and applied.

COPY (SELECT pg_read_xlogfile('filename', true)) TO STDOUT WITH BINARY

The attached latest patch provides this capability. You can easily set up Synch Rep according to the following procedure.
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#How_to_set_up_Synch_Rep

Comments? Do you have another, better approach?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
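[Editor's note: the 8KB-row framing described above implies a simple reassembly loop on the receiving side. The following is an illustrative sketch of that arithmetic in plain Python — it stands in for the real COPY ... TO STDOUT path, and all function names here are hypothetical, not part of the patch.]

```python
# Sketch of the row-chunking described above: pg_read_xlogfile() returns a
# 16MB WAL segment as a setof bytea in rows of at most 8KB, which the
# standby reassembles in order.

WAL_SEG_SIZE = 16 * 1024 * 1024   # 16MB WAL segment
ROW_SIZE = 8 * 1024               # max bytea content per returned row

def split_into_rows(segment: bytes, row_size: int = ROW_SIZE):
    """Server side: yield the file content in row-sized chunks."""
    for off in range(0, len(segment), row_size):
        yield segment[off:off + row_size]

def reassemble(rows) -> bytes:
    """Standby side: concatenate the rows back into the full segment."""
    return b"".join(rows)

segment = bytes(WAL_SEG_SIZE)           # stand-in for a real WAL segment
rows = list(split_into_rows(segment))
assert len(rows) == 2048                # 16MB / 8KB rows, as stated above
assert reassemble(rows) == segment
```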
Attachments
On Tue, Jun 16, 2009 at 2:13 AM, Fujii Masao<masao.fujii@gmail.com> wrote:
> The attached latest patch provides this capability. You can easily set up the
> synch rep according to the following procedure.
> http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#How_to_set_up_Synch_Rep

This patch no longer applies cleanly. Can you rebase and resubmit it for the upcoming CommitFest? It might also be good to go through and clean up the various places where you have trailing whitespace and/or spaces preceding tabs.

It seems this will be one of the "big" patches for the upcoming CommitFest. Hot Standby seems to be off the table, because Simon has indicated that he thinks Synch Rep should go first, and Heikki has indicated that he's willing to review and commit, but not also play lead developer.

http://archives.postgresql.org/pgsql-hackers/2009-07/msg00005.php
http://archives.postgresql.org/pgsql-hackers/2009-06/msg01534.php

Given that this is a substantial patch, I have a couple of questions about strategy. First, I am wondering whether this patch should be reviewed (and committed) as a whole, or whether there are distinct chunks of it that should be reviewed and committed separately - particularly the signal handling piece, which AIUI is independently useful. I note that it seems to be included in the tarball as a separate patch file, which is very useful.

Second, I am wondering whether Heikki feels that it would be useful to assign round-robin reviewers for this patch, or whether he's going to be the principal reviewer himself. We could assign either a reviewer (or reviewers) to the whole patch, or we could assign reviewers to particular chunks of the patch, such as the signal handling piece.

Thanks,

...Robert
On Thu, Jul 2, 2009 at 10:02 PM, Robert Haas<robertmhaas@gmail.com> wrote: > Second, I am wondering whether Heikki feels that it would be useful to > assign round-robin reviewers for this patch, or whether he's going to > be the principal reviewer himself. We could assign either a reviewer > (or reviewers) to the whole patch, or we could assign reviewers to > particular chunks of the patch, such as the signal handling piece. Hmm, taking a look at the wiki, I see that Simon's name is listed for this patch as a reviewer already. Assuming that's a point of view that Simon agrees with and not the result of his name having been added by someone else, I guess the question is whether we need additional reviewers here beyond Heikki and Simon. ...Robert
Robert Haas escribió: > Second, I am wondering whether Heikki feels that it would be useful to > assign round-robin reviewers for this patch, or whether he's going to > be the principal reviewer himself. We could assign either a reviewer > (or reviewers) to the whole patch, or we could assign reviewers to > particular chunks of the patch, such as the signal handling piece. WRT the signal handling piece, I remember something in that area being committed and then reverted because it had issues. Does this version fix those issues? (Assuming it's the same patch) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Hi, On Fri, Jul 3, 2009 at 11:02 AM, Robert Haas<robertmhaas@gmail.com> wrote: > On Tue, Jun 16, 2009 at 2:13 AM, Fujii Masao<masao.fujii@gmail.com> wrote: >> The attached latest patch provides this capability. You can easily set up the >> synch rep according to the following procedure. >> http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#How_to_set_up_Synch_Rep > > This patch no longer applies cleanly. Can you rebase and resubmit it > for the upcoming CommitFest? It might also be good to go through and > clean up the various places where you have trailing whitespace and/or > spaces preceding tabs. Sure. I'll resubmit the patch after fixing some bugs and finishing the documents. > Given that this is a substantial patch, I have a couple of questions > about strategy. First, I am wondering whether this patch should be > reviewed (and committed) as a whole, or whether there are distinct > chunks of it that should be reviewed and committed separately - > particularly the signal handling piece, which AIUI is independently > useful. I note that it seems to be included in the tarball as a > separate patch file, which is very useful. I think that the latter strategy makes more sense. At least, the signal handling piece and non-blocking pqcomm (communication between a frontend and a backend) can be reviewed independently of synch rep itself. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi, On Fri, Jul 3, 2009 at 11:59 AM, Alvaro Herrera<alvherre@commandprompt.com> wrote: > WRT the signal handling piece, I remember something in that area being > committed and then reverted because it had issues. Does this version > fix those issues? (Assuming it's the same patch) Yes. After the patch was reverted, Heikki and I fixed the problems. The problem which was pointed out is: http://archives.postgresql.org/message-id/14969.1228835521@sss.pgh.pa.us Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Jul 3, 2009 at 12:32 AM, Fujii Masao<masao.fujii@gmail.com> wrote:
> Hi,
>
> On Fri, Jul 3, 2009 at 11:02 AM, Robert Haas<robertmhaas@gmail.com> wrote:
>> On Tue, Jun 16, 2009 at 2:13 AM, Fujii Masao<masao.fujii@gmail.com> wrote:
>>> The attached latest patch provides this capability. You can easily set up the
>>> synch rep according to the following procedure.
>>> http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#How_to_set_up_Synch_Rep
>>
>> This patch no longer applies cleanly. Can you rebase and resubmit it
>> for the upcoming CommitFest? It might also be good to go through and
>> clean up the various places where you have trailing whitespace and/or
>> spaces preceding tabs.
>
> Sure. I'll resubmit the patch after fixing some bugs and finishing
> the documents.
>
>> Given that this is a substantial patch, I have a couple of questions
>> about strategy. First, I am wondering whether this patch should be
>> reviewed (and committed) as a whole, or whether there are distinct
>> chunks of it that should be reviewed and committed separately -
>> particularly the signal handling piece, which AIUI is independently
>> useful. I note that it seems to be included in the tarball as a
>> separate patch file, which is very useful.
>
> I think that the latter strategy makes more sense. At least, the signal
> handling piece and non-blocking pqcomm (communication between
> a frontend and a backend) can be reviewed independently of synch rep
> itself.

My preference for ease of CommitFest management would be one thread on -hackers for each chunk that can be separately reviewed and committed. So if there are three severable chunks here, send a patch for each one with a descriptive subject line, and mention the dependencies in the body of the email ("before applying this patch, you must first apply blah blah <link to archives>").
That way, we can keep the discussion of each topic separate, have separate entries on the CommitFest page with subjects that match the email thread, etc. Thanks, ...Robert
Hi, On Fri, Jul 3, 2009 at 2:01 PM, Robert Haas<robertmhaas@gmail.com> wrote: > My preference for ease of CommitFest management would be one thread on > -hackers for each chunk that can be separately reviewed and committed. > So if there are three severable chunks here, send a patch for each > one with a descriptive subject line, and mention the dependencies in > the body of the email ("before applying this patch, you must first > apply blah blah <link to archives>"). That way, we can keep the > discussion of each topic separate, have separate entries on the > CommitFest page with subjects that match the email thread, etc. That sounds good. I'll submit the patches separately. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi,

On Tue, Jun 16, 2009 at 3:13 PM, Fujii Masao<masao.fujii@gmail.com> wrote:
> The main part of this capability is the new function to read the specified
> WAL file. The following is the definition of it.
>
> pg_read_xlogfile (filename text [, restore bool]) returns setof bytea
>
> - filename: name of file to read
> - restore: indicates whether to try to restore the file from the archive
>
> - returns the content of the specified file
> (max size of one row is 8KB, i.e. this function returns 2,048 rows when
> WAL file whose size is 16MB is requested.)
>
> If restore=true, this function tries to retrieve the file from the
> archive at first.
> This requires restore_command which needs to be specified in postgresql.conf.

In order for the primary server (i.e. a normal backend) to read an archived file, restore_command needs to be specified in postgresql.conf as well. In this case, how should we handle restore_command in recovery.conf?

1) Delete restore_command from recovery.conf. In this case, a user has to specify it in postgresql.conf instead of recovery.conf when doing PITR. This is simple, but tempts me to merge the two configuration files. I'm not sure why the parameters for recovery should be set apart from postgresql.conf.

2) Leave restore_command in recovery.conf; it can be set in both or either of the two configuration files. We put recovery.conf before postgresql.conf only during recovery if it's in both. After recovery, we prioritize postgresql.conf. In this case, recovery.conf also needs to be re-read during recovery when SIGHUP arrives. This might be complicated for a user.

3) Separate restore_command into two parameters:
- normal_restore_command: used by a normal backend
- recovery_restore_command: used by the startup process for PITR
In this case, it's bothersome that the same command must be set in both of the two configuration files.

I'm leaning toward 1), i.e. restore_command is simply moved from recovery.conf to postgresql.conf. What's your opinion?
Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
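[Editor's note: for concreteness, option 1) above would amount to something like the following postgresql.conf fragment. This is a hypothetical sketch of the proposed parameter placement, not committed syntax; the path is an example.]

```ini
# postgresql.conf (primary) -- hypothetical placement under option 1):
# restore_command moves here from recovery.conf, so both a normal backend
# (serving pg_read_xlogfile) and the startup process doing PITR can use it.
restore_command = 'cp /mnt/archive/%f "%p"'
```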
Fujii Masao <masao.fujii@gmail.com> writes: > In order for the primary server (ie. a normal backend) to read an archived file, > restore_command needs to be specified in also postgresql.conf. In this case, > how should we handle restore_command in recovery.conf? I confess to not having paid much attention to this thread so far, but ... what is the rationale for having such a capability at all? It seems to me to be exposing implementation details that we do not need to expose, as well as making assumptions that we shouldn't make (like there is exactly one archive and the primary server has read access to it). regards, tom lane
Hi,

Thanks for the comment!

On Tue, Jul 7, 2009 at 12:16 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> In order for the primary server (ie. a normal backend) to read an archived file,
>> restore_command needs to be specified in also postgresql.conf. In this case,
>> how should we handle restore_command in recovery.conf?
>
> I confess to not having paid much attention to this thread so far, but ...
> what is the rationale for having such a capability at all?

If the XLOG files which are required for recovery exist only in the primary server, the standby server has to read them in some way. For example, when the latest XLOG file of the primary server is 09 and the standby server has only 01, the missing files (02-08) have to be read for recovery by the standby server. In this case, the XLOG records in 09 or later are shipped to the standby server in real time by the synchronous replication feature.

The problem which I'd like to solve is how to make the standby server read the files (XLOG file, backup history file and timeline history file) which exist only in the primary server. In the previous patch, we had to manually copy those missing files to the archive of the standby server or use the warm-standby mechanism. This would decrease the usability of synchronous replication. So, I proposed one of the solutions which makes the standby server read those missing files automatically: introducing a new function pg_read_xlogfile() which reads the specified XLOG file.

Is this solution in the right direction? Do you have another reasonable solution?

> It seems to
> me to be exposing implementation details that we do not need to expose,
> as well as making assumptions that we shouldn't make (like there is
> exactly one archive and the primary server has read access to it).

You mean that one archive is shared between the two servers? If so, no. I attached a picture of the environment which I assume. Please feel free to comment.
Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachments
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From: Heikki Linnakangas
Fujii Masao wrote: > On Tue, Jul 7, 2009 at 12:16 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Fujii Masao <masao.fujii@gmail.com> writes: >>> In order for the primary server (ie. a normal backend) to read an archived file, >>> restore_command needs to be specified in also postgresql.conf. In this case, >>> how should we handle restore_command in recovery.conf? >> I confess to not having paid much attention to this thread so far, but ... >> what is the rationale for having such a capability at all? > > If the XLOG files which are required for recovery exist only in the > primary server, > the standby server has to read them in some way. For example, when the latest > XLOG file of the primary server is 09 and the standby server has only 01, the > missing files (02-08) has to be read for recovery by the standby server. In this > case, the XLOG records in 09 or later are shipped to the standby server in real > time by synchronous replication feature. > > The problem which I'd like to solve is how to make the standby server read the > XLOG files (XLOG file, backup history file and timeline history) which > exist only > in the primary server. In the previous patch, we had to manually copy those > missing files to the archive of the standby server or use the warm-standby > mechanism. This would decrease the usability of synchronous replication. So, > I proposed one of the solutions which makes the standby server read those > missing files automatically: introducing new function pg_read_xlogfile() which > reads the specified XLOG file. pg_read_xlogfile() feels like a quite hacky way to implement that. Do we require the master to always have read access to the PITR archive? And indeed, to have a PITR archive configured to begin with. If you need to set up archiving just because of the standby server, how do old files that are no longer required by the standby get cleaned up? 
I feel that the master needs to explicitly know what is the oldest WAL file the standby might still need, and refrain from deleting those files. IOW, keep enough history in pg_xlog. Then we have the risk of running out of disk space on pg_xlog if the connection to the standby is lost for a long time, so we'll need some cap on that, after which the master declares the standby as dead and deletes the old WAL anyway. Nevertheless, I think that would be much simpler to implement, and simpler for admins. And if the standby can read old WAL segments from the PITR archive, in addition to requesting them from the primary, it is just as safe.

I'd like to see a description of the proposed master/slave protocol for replication. If I understood correctly, you're proposing that the standby server connects to the master with libpq like any client, authenticates as usual, and then sends a message indicating that it wants to switch to "replication mode". In replication mode, normal FE/BE messages are not accepted, but there's a different set of message types for transferring XLOG data.

I'd like to see a more formal description of that protocol and the new message types. Some examples of how they would be used in different scenarios, like when the standby server connects to the master for the first time and needs to catch up.

Looking at the patch briefly, it seems to assume that there is only one WAL sender active at any time. What happens when a new WAL sender connects and one is active already? While supporting multiple slaves isn't a priority, I think we should support multiple WAL senders right from the start. It shouldn't be much harder, and otherwise we need to ensure that the switch from the old WAL sender to a new one is clean, which seems non-trivial.
Or not accept a new WAL sender while old one is still active, but then a dead WAL sender process (because the standby suddenly crashed, for example) would inhibit a new standby from connecting, possibly for several minutes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From: Andrew Dunstan
Heikki Linnakangas wrote: > While supporting multiple slaves > isn't a priority, > Really? I should have thought it was a basic requirement. At the very least we need to design with it in mind. cheers andrew
Hi,

Thanks for the comment!

On Tue, Jul 7, 2009 at 5:07 PM, Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
> pg_read_xlogfile() feels like a quite hacky way to implement that. Do we
> require the master to always have read access to the PITR archive? And
> indeed, to have a PITR archive configured to begin with. If you need to
> set up archiving just because of the standby server, how do old files
> that are no longer required by the standby get cleaned up?
>
> I feel that the master needs to explicitly know what is the oldest WAL
> file the standby might still need, and refrain from deleting files the
> standby might still need. IOW, keep enough history in pg_xlog. Then we
> have the risk of running out of disk space on pg_xlog if the connection
> to the standby is lost for a long time, so we'll need some cap on that,
> after which the master declares the standby as dead and deletes the old
> WAL anyway. Nevertheless, I think that would be much simpler to
> implement, and simpler for admins. And if the standby can read old WAL
> segments from the PITR archive, in addition to requesting them from the
> primary, it is just as safe.

I'm thinking of making pg_read_xlogfile() read the XLOG files from pg_xlog when restore_command is not specified or returns a non-zero code (i.e. failure). So, pg_read_xlogfile() with the following settings might already cover the case you described:

- checkpoint_segments = N (big number)
- restore_command = ''

In this case, we can expect that the XLOG files which are required for the standby exist in pg_xlog because of the big checkpoint_segments, and pg_read_xlogfile() reads them only from pg_xlog. checkpoint_segments would play the role of the cap and determine the maximum disk size of pg_xlog. The overflow files which might no longer be required for the standby are removed safely by postgres. OTOH, if there is not enough disk space for pg_xlog, we can specify restore_command and decrease checkpoint_segments.
This is a more flexible approach, I think. But if the primary should not restore any archived file at any time, should I just get rid of the code by which pg_read_xlogfile() restores it?

> I'd like to see a description of the proposed master/slave protocol for
> replication. If I understood correctly, you're proposing that the
> standby server connects to the master with libpq like any client,
> authenticates as usual, and then sends a message indicating that it
> wants to switch to "replication mode". In replication mode, normal FE/BE
> messages are not accepted, but there's a different set of message types
> for transferring XLOG data.

http://archives.postgresql.org/message-id/4951108A.5040608@enterprisedb.com
> I don't think we need or should
> allow running regular queries before entering "replication mode". the
> backend should become a walsender process directly after authentication.

I changed the protocol according to your suggestion. Here is the current protocol:

On start-up, the standby calls PQstartReplication(), which is a new libpq function. It sends the startup packet with a special code for replication to the primary, like a cancel request. The backend which receives this code becomes a walsender directly. Authentication is performed as normal. Then, walsender switches the XLOG file and sends the ReplicationStart message 'l', which includes the timeline ID and the replication start XLOG position.

ReplicationStart (B)
  Byte1('l'): Identifies the message as a replication-start indicator.
  Int32(17): Length of message contents in bytes, including self.
  Int32: The timeline ID.
  Int32: The start log file of replication.
  Int32: The start byte offset of replication.

After that, walsender sends the XLogData message 'w', which includes the XLOG records, a flag (e.g. indicating whether the records should be fsynced or not), and the XLOG position, in real time. The standby receives the message using PQgetXLogData(), which is a new libpq function.
OTOH, after writing or fsyncing the records, the standby sends the XLogResponse message 'r', which includes the flag and the position of the written/fsynced records, using PQputXLogRecPtr(), which is a new libpq function.

XLogData (B)
  Byte1('w'): Identifies the message as XLOG records.
  Int32: Length of message contents in bytes, including self.
  Int8: Flag bits indicating how the records should be treated.
  Int32: The log file number of the records.
  Int32: The byte offset of the records.
  Byte n: The XLOG records.

XLogResponse (F)
  Byte1('r'): Identifies the message as ACK for XLOG records.
  Int32: Length of message contents in bytes, including self.
  Int8: Flag bits indicating how the records were treated.
  Int32: The log file number of the records.
  Int32: The byte offset of the records.

Normal exit of walsender (e.g. by smart shutdown) sends the ReplicationEnd message 'z'. OTOH, normal exit of walreceiver sends the existing Terminate message 'X'. The above protocol is used between walsender and walreceiver.

> I'd like to see a more formal description of that protocol and the new
> message types. Some examples of how they would be in different
> scenarios, like when standby server connects to the master for the first
> time and needs to catch up.

If there is a missing XLOG file which is required for recovery, the startup process connects to the primary as a normal client and receives the binary contents of the file using the following SQL. This has nothing to do with the above protocol, so the transfer of the missing file and synchronous XLOG streaming are performed concurrently.

COPY (SELECT pg_read_xlogfile('filename', true)) TO STDOUT WITH BINARY

If no missing files are found (i.e. recovery of the standby has reached the replication start position), the file transfer drops out of use.

> Looking at the patch briefly, it seems to assume that there is only one
> WAL sender active at any time. What happens when a new WAL sender
> connects and one is active already?
The new request is refused because of existing walsender. > While supporting multiple slaves > isn't a priority, I think we should support multiple WAL senders right > from the start. It shouldn't be much harder, and otherwise we need to > ensure that the switch from old WAL sender to a new one is clean, which > seems non-trivial. Or not accept a new WAL sender while old one is still > active, Yeah, the current patch doesn't accept a new walsender while old one is still active. > but then a dead WAL sender process (because the standby suddenly > crashed, for example) would inhibit a new standby from connecting, > possibly for several minutes. Yes, new standby cannot start walsender until walsender detects the death of old standby. You can shorten the time to detect it by setting some timeout (replication_timeout and some keepalive parameters). I don't think that it's a problem that walsender cannot start for a short time. You think that walsender must *always* be able to start? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
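[Editor's note: the XLogData/XLogResponse framing described in the message above can be sketched with explicit byte packing. This is an illustrative stand-in in plain Python, using network byte order as in the FE/BE protocol; the field layout is taken from the definitions above, and the reading of "length including self" as counting everything after the type byte is an assumption.]

```python
import struct

# Illustrative packing of the XLogData ('w') message described above:
# type byte, Int32 length (counting itself, as in the FE/BE protocol),
# Int8 flags, Int32 log file number, Int32 byte offset, raw XLOG records.
HEADER = "!IBII"  # length, flags, logfile, offset (network byte order)

def pack_xlogdata(flags: int, logfile: int, offset: int, records: bytes) -> bytes:
    # Length covers the header fields (including the length field) + payload.
    length = struct.calcsize(HEADER) + len(records)
    return b"w" + struct.pack(HEADER, length, flags, logfile, offset) + records

def unpack_xlogdata(msg: bytes):
    assert msg[:1] == b"w", "not an XLogData message"
    length, flags, logfile, offset = struct.unpack_from(HEADER, msg, 1)
    records = msg[1 + struct.calcsize(HEADER): 1 + length]
    return flags, logfile, offset, records

msg = pack_xlogdata(flags=1, logfile=9, offset=0x20000, records=b"\x00" * 32)
assert unpack_xlogdata(msg) == (1, 9, 0x20000, b"\x00" * 32)
```

The XLogResponse ('r') message would be packed the same way minus the payload bytes.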
Fujii Masao <masao.fujii@gmail.com> writes: > On Tue, Jul 7, 2009 at 12:16 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> I confess to not having paid much attention to this thread so far, but ... >> what is the rationale for having such a capability at all? > If the XLOG files which are required for recovery exist only in the > primary server, > the standby server has to read them in some way. For example, when the latest > XLOG file of the primary server is 09 and the standby server has only 01, the > missing files (02-08) has to be read for recovery by the standby server. In this > case, the XLOG records in 09 or later are shipped to the standby server in real > time by synchronous replication feature. > The problem which I'd like to solve is how to make the standby server read the > XLOG files (XLOG file, backup history file and timeline history) which > exist only > in the primary server. In the previous patch, we had to manually copy those > missing files to the archive of the standby server or use the warm-standby > mechanism. This would decrease the usability of synchronous replication. So, > I proposed one of the solutions which makes the standby server read those > missing files automatically: introducing new function pg_read_xlogfile() which > reads the specified XLOG file. > Is this solution in the right direction? Do you have another > reasonable solution? This design seems totally wrong to me. It's confusing the master's pg_xlog directory with the archive. We should *not* use pg_xlog as a long-term archive area; that's terrible from both a performance and a reliability perspective. Performance because pg_xlog has to be fairly high-speed storage, which conflicts with it needing to hold a lot of stuff; and reliability because the entire point of all this is to survive a master server crash, and you're probably not going to have its pg_xlog anymore after that. 
If slaves need to be able to get at past WAL, they should be getting it from a separate archive server that is independent of the master DB. regards, tom lane
On Tue, Jul 7, 2009 at 4:49 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> This design seems totally wrong to me. It's confusing the master's
> pg_xlog directory with the archive. We should *not* use pg_xlog as
> a long-term archive area; that's terrible from both a performance
> and a reliability perspective. Performance because pg_xlog has to be
> fairly high-speed storage, which conflicts with it needing to hold
> a lot of stuff; and reliability because the entire point of all this
> is to survive a master server crash, and you're probably not going to
> have its pg_xlog anymore after that.

Hm, those are all good points.

> If slaves need to be able to get at past WAL, they should be getting
> it from a separate archive server that is independent of the master DB.

But this conflicts with earlier discussions where we were concerned about the length of the path WAL has to travel between the master and the slaves. We want slaves to be able to be brought up using a simple, robust configuration, and to be able to respond quickly to transactions committed on the master for synchronous operation.

Having WAL be written to the master's xlog directory, copied to the archive, then copied from the archive to the slave's WAL directory, and then finally reread and replayed on the slave means a lot of extra complicated configuration which can be set up wrong and which might not be apparent until things fall apart. And it means a huge latency before the WAL files are finally replayed on the slave, which will make transitioning to synchronous mode -- with a whole other mode of operation to configure -- quite tricky and potentially slow.

I'm not sure how to reconcile these two sets of priorities though. Your points above are perfectly valid as well.

How do other databases handle log shipping? Do they depend on archived logs to bring the slaves up to speed? Is there a separate log management daemon?

--
greg
http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> writes:
> On Tue, Jul 7, 2009 at 4:49 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>> This design seems totally wrong to me.
>> ...
> But this conflicts with earlier discussions where we were concerned
> about the length of the path wal has to travel between the master and
> the slaves. We want slaves to be able to be turned on simply using a
> simple robust configuration and to be able to respond quickly to
> transactions that are committed in the master for synchronous
> operation.

Well, the problem I've really got with this is that if you want sync replication, couching it in terms of WAL files in the first place seems like getting off on fundamentally the wrong foot. That still leaves you with all the BS about having to force WAL file switches (and eat LSN space) for all sorts of undesirable reasons. I think we want the API to operate more like a WAL stream.

I would envision the slaves connecting to the master's replication port and asking "feed me WAL beginning at LSN position thus-and-so", with no notion of WAL file boundaries exposed anyplace. The point about not wanting to archive lots of WAL on the master would imply that the master reserves the right to fail if the requested starting position is too old, whereupon the slave needs some way to resync --- but that probably involves something close to taking a fresh base backup to copy to the slave. You either have the master not recycle its WAL while the backup is going on (so the slave can start reading afterwards), or expect the slave to absorb and buffer the WAL stream while the backup is going on. In neither case is there any reason to have an API that involves fetching arbitrary chunks of past WAL, and certainly not one that is phrased as fetching specific WAL segment files.

There are still some interesting questions in this about exactly how you switch over from "catchup mode" to following the live WAL broadcast.
With the above design it would be the master's responsibility to manage that, since presumably the requested start position will almost always be somewhat behind the live end of WAL. It might be nicer to push that complexity to the slave side, but then you do need two data paths somehow (ie, retrieving the slightly-stale WAL is separated from tracking live events). Which is what you're saying we should avoid, and I do see the point there. regards, tom lane
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From
Heikki Linnakangas
Date:
Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: >> On Tue, Jul 7, 2009 at 4:49 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >>> This design seems totally wrong to me. >>> ... > >> But this conflicts with earlier discussions where we were concerned >> about the length of the path wal has to travel between the master and >> the slaves. We want slaves to be able to be turned on simply using a >> simple robust configuration and to be able to respond quickly to >> transactions that are committed in the master for synchronous >> operation. > > Well, the problem I've really got with this is that if you want sync > replication, couching it in terms of WAL files in the first place seems > like getting off on fundamentally the wrong foot. That still leaves you > with all the BS about having to force WAL file switches (and eat LSN > space) for all sorts of undesirable reasons. I think we want the > API to operate more like a WAL stream. I think we all agree on that. > I would envision the slaves > connecting to the master's replication port and asking "feed me WAL > beginning at LSN position thus-and-so", with no notion of WAL file > boundaries exposed anyplace. Yep, that's the way I envisioned it to work in my protocol suggestion that Fujii adopted (http://archives.postgresql.org/message-id/4951108A.5040608@enterprisedb.com). The <begin> and <end> values are XLogRecPtrs, not WAL filenames. > The point about not wanting to archive > lots of WAL on the master would imply that the master reserves the right > to fail if the requested starting position is too old, whereupon the > slave needs some way to resync --- but that probably involves something > close to taking a fresh base backup to copy to the slave. Works for me, except that people will want the ability to use a PITR archive for the catchup, if available. The master should have no business peeking into the archive, however. That should be implemented entirely in the slave. 
And I'm sure people will want the option to retain WAL longer in the master, to avoid an expensive resync if the slave falls behind. It would be simple to provide a GUC option for "always retain X GB of old WAL in pg_xlog". > There are still some interesting questions in this about exactly how you > switch over from "catchup mode" to following the live WAL broadcast. > With the above design it would be the master's responsibility to manage > that, since presumably the requested start position will almost always > be somewhat behind the live end of WAL. It might be nicer to push that > complexity to the slave side, but then you do need two data paths > somehow (ie, retrieving the slightly-stale WAL is separated from > tracking live events). Which is what you're saying we should avoid, > and I do see the point there. Yeah, that logic belongs to the master. We'll want to send a message from the master to the slave when the catchup is done, so that the slave knows it's up-to-date. For logging, if for no other reason. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > And I'm sure people will want the option to retain WAL longer in the > master, to avoid an expensive resync if the slave falls behind. It would > be simple to provide a GUC option for "always retain X GB of old WAL in > pg_xlog". Right, we would want to provide some more configurability on the when-to-recycle-WAL decision than there is now. But the basic point is that I don't see the master pg_xlog as being a long-term archive. The amount of back WAL that you'd want to keep there is measured in minutes or hours, not weeks or months. (If nothing else, there is no point in keeping so much WAL that catching up by scanning it would take longer than taking a fresh base backup. My impression from recent complaints about our WAL-reading speed is that that might be a pretty tight threshold ...) regards, tom lane
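Tom's break-even argument can be made concrete with a toy calculation. The function and all the rates below are illustrative assumptions, not measurements of PostgreSQL:

```python
# Hypothetical break-even estimate for how much WAL is worth retaining:
# replaying retained WAL only beats a fresh base backup while
# (wal_bytes / replay_rate) < (db_size / base_backup_rate).
# All figures below are illustrative assumptions, not measurements.

def max_useful_wal_bytes(db_size_bytes, replay_rate_bps, backup_rate_bps):
    """Largest WAL backlog for which replay is still faster than a base backup."""
    base_backup_seconds = db_size_bytes / backup_rate_bps
    return base_backup_seconds * replay_rate_bps

# 100 GB database, 20 MB/s WAL replay, 100 MB/s base-backup copy
limit = max_useful_wal_bytes(100 * 1024**3, 20 * 1024**2, 100 * 1024**2)
print(limit / 1024**3)  # ~20 GB of WAL is the break-even backlog
```

Note how this also reflects Greg's later point: the threshold depends on two independent variables, the transaction (replay) rate and the total database size.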
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From
Dimitri Fontaine
Date:
On Jul 7, 2009, at 21:12, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> And I'm sure people will want the option to retain WAL longer in the >> master, to avoid an expensive resync if the slave falls behind. It >> would >> be simple to provide a GUC option for "always retain X GB of old >> WAL in >> pg_xlog". > > Right, we would want to provide some more configurability on the > when-to-recycle-WAL decision than there is now. But the basic point > is that I don't see the master pg_xlog as being a long-term archive. > The amount of back WAL that you'd want to keep there is measured in > minutes or hours, not weeks or months. Could we add yet another specialized postmaster child to handle the archive, which would be like a default archive_command implemented in core? This separate process could then be responsible for feeding the slave(s) with the WAL history for any LSN not available in pg_xlog anymore. The bonus would be to have a good reliable WAL archiving default setup for simple PITR and simple replication setups. One of the reasons PITR looks so difficult is that it involves reading a lot of documentation and then hand-writing scripts even in the simple default case. > (If nothing else, there is no point in keeping so much WAL that > catching > up by scanning it would take longer than taking a fresh base backup. > My impression from recent complaints about our WAL-reading speed is > that > that might be a pretty tight threshold ...) Could the design above make it so that your later PITR backup is always an option for setting up a WAL shipping slave? Regards, -- dim
Dimitri Fontaine <dfontaine@hi-media.com> writes: > Could we add yet another postmaster specialized child to handle the > archive, which would be like a default archive_command implemented in > core. I think this fails the basic sanity check: do you need it to still work when the master is dead. It's reasonable to ask the master to supply a few gigs of very-recent WAL, but as soon as the word "archive" enters the conversation, you should be thinking in terms of a different machine. Or at least a design that easily scales to put the archive on a different machine. regards, tom lane
On Tue, Jul 7, 2009 at 8:12 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > (If nothing else, there is no point in keeping so much WAL that catching > up by scanning it would take longer than taking a fresh base backup. > My impression from recent complaints about our WAL-reading speed is that > that might be a pretty tight threshold ...) Well those are two independent variables. The time taken to scan WAL is dependent on the transaction rate and the time to take a fresh backup is dependent on the total database size. There are plenty of low transaction rate humungous databases where it would be faster to replay weeks of transactions than try to take a fresh base backup. -- greg http://mit.edu/~gsstark/resume.pdf
Hi, On Wed, Jul 8, 2009 at 12:49 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > This design seems totally wrong to me. It's confusing the master's > pg_xlog directory with the archive. We should *not* use pg_xlog as > a long-term archive area; that's terrible from both a performance > and a reliability perspective. Performance because pg_xlog has to be > fairly high-speed storage, which conflicts with it needing to hold > a lot of stuff; and reliability because the entire point of all this > is to survive a master server crash, and you're probably not going to > have its pg_xlog anymore after that. Yeah, I agree that pg_xlog is not a long-term archive area. So, in my design, the primary server tries to read the old XLOG file from not only pg_xlog but also an archive if available, and transfers it. > If slaves need to be able to get at past WAL, they should be getting > it from a separate archive server that is independent of the master DB. You assume that restore_command which retrieves the old XLOG file from a separate archive server is specified in the standby? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi, On Wed, Jul 8, 2009 at 4:00 AM, Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> wrote: >> I would envision the slaves >> connecting to the master's replication port and asking "feed me WAL >> beginning at LSN position thus-and-so", with no notion of WAL file >> boundaries exposed anyplace. > > Yep, that's the way I envisioned it to work in my protocol suggestion > that Fujii adopted > (http://archives.postgresql.org/message-id/4951108A.5040608@enterprisedb.com). > The <begin> and <end> values are XLogRecPtrs, not WAL filenames. If <begin> indicates the middle of an XLOG file, the file written to the standby is partial. Is this OK? After both servers have failed, the XLOG file containing <begin> might still be required for crash recovery of the standby server. But, since it's partial, the crash recovery would fail. I think that any XLOG file should be written to the standby so that it can be replayed by a normal recovery. >> The point about not wanting to archive >> lots of WAL on the master would imply that the master reserves the right >> to fail if the requested starting position is too old, whereupon the >> slave needs some way to resync --- but that probably involves something >> close to taking a fresh base backup to copy to the slave. What if the XLOG file required for recovery is gone by the time a resync of a large amount of data finishes? In this case, the standby might never start because the requested starting position is always too old. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi, Thanks for the brilliant comments! On Wed, Jul 8, 2009 at 4:00 AM, Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> wrote: >> There are still some interesting questions in this about exactly how you >> switch over from "catchup mode" to following the live WAL broadcast. >> With the above design it would be the master's responsibility to manage >> that, since presumably the requested start position will almost always >> be somewhat behind the live end of WAL. It might be nicer to push that >> complexity to the slave side, but then you do need two data paths >> somehow (ie, retrieving the slightly-stale WAL is separated from >> tracking live events). Which is what you're saying we should avoid, >> and I do see the point there. > > Yeah, that logic belongs to the master. > > We'll want to send message from the master to the slave when the catchup > is done, so that the slave knows it's up-to-date. For logging, if for no > other reason. This seems to be the main difference between us. You and Tom think that the catchup (transferring the old XLOG files) and WAL streaming (shipping the latest XLOG records continuously) should be performed serially over the same connection. I think they should be performed in parallel, using more than one connection. I'd like to build consensus on which design should be chosen. If my design is worse, I'll change the patch according to the other design. In my design, WAL streaming is performed between walsender and walreceiver. In parallel with that, the startup process requests the old XLOG file from a normal backend if it's not found during recovery. If the startup process has reached the WAL streaming start position, it's guaranteed that all the XLOG files required for recovery exist in the standby, which means that it's up-to-date. After that, the startup process replays only the records shipped by WAL streaming. The advantage of my design is: - It's guaranteed that the standby can catch up with the primary within a reasonable period. 
- We can keep walsender simple. It has only to take care of the latest XLOG records (ie. doesn't need to control the old records and some history files). And, it doesn't need to calculate whether the standby is already up-to-date or not by comparing some LSNs. - In the future, in order to make the standby catch up more quickly, we can easily extend the mechanism so that two or more old XLOG files might be transferred concurrently by using multiple connections. What is your opinion? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
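The parallel catchup-plus-streaming flow Fujii describes can be sketched as a toy Python model (this is not PostgreSQL code; the function, file names, and record names are all illustrative):

```python
# Toy sketch of the two-connection design: old WAL files are fetched over
# one connection while new records stream in over another; replay switches
# to the stream once the fetched files reach the streaming start position.
import queue
import threading

def catch_up_and_stream(missing_files, streamed_records, stream_start):
    replayed = []
    stream_q = queue.Queue()

    # "walreceiver": buffers live records while catchup runs in parallel.
    def receive():
        for rec in streamed_records:
            stream_q.put(rec)
        stream_q.put(None)  # end-of-stream marker

    t = threading.Thread(target=receive)
    t.start()

    # "startup process": restore missing files up to the stream start.
    for fname in missing_files:
        replayed.append(("file", fname))
        if fname == stream_start:
            break

    # Caught up: from here on, replay only streamed records.
    while (rec := stream_q.get()) is not None:
        replayed.append(("stream", rec))
    t.join()
    return replayed

result = catch_up_and_stream(
    ["000000010000000000000001", "000000010000000000000002"],
    ["rec1", "rec2"],
    stream_start="000000010000000000000002",
)
print(result)
```

The point of the sketch is the ordering guarantee: file restoration and live streaming overlap in time, but nothing from the stream is replayed before the catchup reaches the stream's start position.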
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From
Dimitri Fontaine
Date:
Hi, Tom Lane <tgl@sss.pgh.pa.us> writes: > I think this fails the basic sanity check: do you need it to still work > when the master is dead. I don't get it. Why would we want to set up a slave against a dead master? The way I understand the current design of Synch Rep, when you start a new slave the following happens: 1. init: slave asks the master the current LSN and starts streaming WAL 2. setup: slave asks the master for missing WALs from its current position to the LSN it just got, and applies them all to reach the initial LSN (this happens in parallel to 1.) 3. catchup: slave has replayed missing WALs and now is replaying the stream it received in parallel, which applies from the init LSN (just reached) 4. sync: slave is no more lagging, it's applying the stream as it gets it, either as part of the master transaction or not depending on the GUC settings So, what I'm understanding you're saying is that the slave still should be able to set up properly when the master died before it synced. What I'm saying is that if the master dies before any synced slave exists, you get to start from backups (filesystem snapshot + archives for example, PITR recovery etc), as there's no slave. Regards, -- dim
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From
"Kevin Grittner"
Date:
Dimitri Fontaine <dfontaine@hi-media.com> wrote: > 4. sync: slave is no more lagging, it's applying the stream as it > gets it, either as part of the master transaction or not > depending on the GUC settings I think the interesting bit is when you're at this point and the connection between the master and slave goes down for a couple days. How do you handle that? -Kevin
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From
Dimitri Fontaine
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Dimitri Fontaine <dfontaine@hi-media.com> wrote: > >> 4. sync: slave is no more lagging, it's applying the stream as it >> gets it, either as part of the master transaction or not >> depending on the GUC settings > > I think the interesting bit is when you're at this point and the > connection between the master and slave goes down for a couple days. > How do you handle that? Maybe how londiste handles this case could help us here: http://skytools.projects.postgresql.org/doc/londiste.ref.html#toc18

State                | Owner  | What is done
---------------------+--------+--------------------
NULL                 | replay | Changes state to "in-copy", launches londiste.py copy process, continues with its work
in-copy              | copy   | drops indexes, truncates, copies data in, restores indexes, changes state to "catching-up"
catching-up          | copy   | replay events for that table only until no more batches (means current moment), change state to "wanna-sync:<tick_id>" and wait for state to change
wanna-sync:<tick_id> | replay | catch up to given tick_id, change state to "do-sync:<tick_id>" and wait for state to change
do-sync:<tick_id>    | copy   | catch up to given tick_id, both replay and copy must now be at same position. change state to "ok" and exit
ok                   | replay | synced table, events can be applied

Such state changes must guarantee that any process can die at any time and by just restarting it can continue where it left off. "subscriber add" registers table with NULL state. "subscriber add --expect-sync" registers table with ok state. "subscriber resync" sets table state to NULL. Regards, -- dim
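For illustration, the quoted londiste lifecycle reduces to a small transition map. This is a sketch only; the state and owner names are copied from the quoted londiste documentation, and the code itself is hypothetical:

```python
# The londiste table-copy lifecycle as a transition map: each state has one
# owning process and one successor state. Restarting a dead process simply
# re-reads the current state and continues, which is the crash-safety
# property the message describes.
TRANSITIONS = {
    None: ("replay", "in-copy"),
    "in-copy": ("copy", "catching-up"),
    "catching-up": ("copy", "wanna-sync:<tick_id>"),
    "wanna-sync:<tick_id>": ("replay", "do-sync:<tick_id>"),
    "do-sync:<tick_id>": ("copy", "ok"),
}

def advance(state):
    owner, nxt = TRANSITIONS[state]
    return nxt

# Walk a freshly added table (NULL state) through to "ok".
s = None
path = []
while s != "ok":
    s = advance(s)
    path.append(s)
print(path)
```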
On 07/08/2009 09:59 AM, Kevin Grittner wrote: > Dimitri Fontaine <dfontaine@hi-media.com> wrote: >> 4. sync: slave is no more lagging, it's applying the stream as it >> gets it, either as part of the master transaction or not >> depending on the GUC settings > I think the interesting bit is when you're at this point and the > connection between the master and slave goes down for a couple days. > How do you handle that? Been following with great interest... If the updates are not performed at a regular enough interval, the slave is not truly a functioning standby. I think it's a different problem domain, probably best served by the existing pg_standby support? If the slave can be out of touch with the master for an extended period of time, near real time logs provide no additional benefit over just shipping the archived WAL logs and running the standby in continuous recovery mode? Cheers, mark -- Mark Mielke <mark@mielke.cc>
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From
Heikki Linnakangas
Date:
Mark Mielke wrote: > On 07/08/2009 09:59 AM, Kevin Grittner wrote: >> I think the interesting bit is when you're at this point and the >> connection between the master and slave goes down for a couple days. >> How do you handle that? > > Been following with great interest... > > If the updates are not performed at a regular enough interval, the slave > is not truly a functioning standby. I think it's a different problem > domain, probably best served by the existing pg_standby support? If the > slave can be out of touch with the master for an extended period of > time, near real time logs provide no additional benefit over just > shipping the archived WAL logs and running the standby in continuous > recovery mode? Might be easier to set up than pg_standby. But more importantly, it can happen by accident. Someone trips on the power plug of the slave on Friday, and it goes unnoticed until Monday when the DBA comes to work. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From
"Kevin Grittner"
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > But more importantly, it can happen by accident. Someone trips on > the power plug of the slave on Friday, and it goes unnoticed until > Monday when DBA comes to work. We've had people unplug things by accident exactly that way. :-/ We've also had replication across part of our WAN go down for the better part of a day because a beaver chewed through a fiber optic cable where it ran through a marsh. Our (application framework based) replication just picks up where it left off, without any intervention, when connectivity is restored. I think it would be a mistake to design something less robust than that. By the way, we don't use any state transitions for this, other than keeping track of when we seem to have a working connection. The client side knows what it last got, and when its reconnection attempts eventually succeed it makes a request of the server side to provide a stream of transactions from that point on. The response to that request continues indefinitely, as long as the connection is up, which can be months at a time. -Kevin "Everything should be made as simple as possible, but no simpler." - Albert Einstein
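The resume-from-last-position behaviour Kevin describes can be sketched as a toy model (hypothetical names throughout; this is not their application framework's code):

```python
# Sketch of the reconnect behaviour described above: the client remembers
# the last position it applied, and on every (re)connection simply asks
# the server for everything after that point.
def replicate(server_log, client):
    """One connection attempt: stream everything past the client's position."""
    start = client["last_applied"]
    for pos, txn in server_log:
        if pos > start:
            client["applied"].append(txn)
            client["last_applied"] = pos

server_log = [(1, "t1"), (2, "t2"), (3, "t3")]
client = {"last_applied": 0, "applied": []}

replicate(server_log[:2], client)   # connection drops after position 2
replicate(server_log, client)       # reconnect: resumes from position 2
print(client["applied"])  # ['t1', 't2', 't3'] -- no gaps, no duplicates
```

Because the request is always phrased as "everything after position X", the outage duration is irrelevant as long as the server still has the history past X, which is exactly the retention question debated earlier in the thread.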
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From
Heikki Linnakangas
Date:
Fujii Masao wrote: > On Wed, Jul 8, 2009 at 4:00 AM, Heikki > Linnakangas<heikki.linnakangas@enterprisedb.com> wrote: >>> I would envision the slaves >>> connecting to the master's replication port and asking "feed me WAL >>> beginning at LSN position thus-and-so", with no notion of WAL file >>> boundaries exposed anyplace. >> Yep, that's the way I envisioned it to work in my protocol suggestion >> that Fujii adopted >> (http://archives.postgresql.org/message-id/4951108A.5040608@enterprisedb.com). >> The <begin> and <end> values are XLogRecPtrs, not WAL filenames. > > If <begin> indicates the middle of the XLOG file, the file written to the > standby is partial. Is this OK? After two server failed, the XLOG file > including <begin> might still be required for crash recovery of the > standby server. But, since it's partial, the crash recovery would fail. > I think that any XLOG file should be written to the standby as it can > be replayed by a normal recovery. The standby can store the streamed WAL to files in pg_xlog of the standby, to facilitate crash recovery, but it doesn't need to be exposed in the protocol. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
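One way the standby could lay out a stream that begins mid-segment is to map each LSN back to a segment file and offset; a rough sketch, assuming the 16MB segment size mentioned earlier in the thread (the function name is illustrative, not a PostgreSQL API):

```python
# Illustrative sketch of what the standby could do with a WAL stream that
# starts mid-segment: write each streamed chunk at the segment and offset
# computed from its LSN, so pg_xlog on the standby stays laid out like
# ordinary WAL even though the protocol never mentions file names.
WAL_SEG_SIZE = 16 * 1024 * 1024  # 16MB segments, as in the discussion

def segment_position(lsn_bytes):
    """Map a byte position in the WAL stream to (segment number, offset)."""
    return lsn_bytes // WAL_SEG_SIZE, lsn_bytes % WAL_SEG_SIZE

seg, off = segment_position(16 * 1024 * 1024 + 8192)
print(seg, off)  # segment 1, offset 8192
```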
Hi, On Wed, Jul 8, 2009 at 10:59 PM, Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote: > Dimitri Fontaine <dfontaine@hi-media.com> wrote: > >> 4. sync: slave is no more lagging, it's applying the stream as it >> gets it, either as part of the master transaction or not >> depending on the GUC settings > > I think the interesting bit is when you're at this point and the > connection between the master and slave goes down for a couple days. > How do you handle that? In the current design of synch rep, you have only to restart the standby after repairing the network. The startup process of the standby would restart an archive recovery from the last restart point, and request any missing file from the primary. On the other hand, WAL streaming would start from the current XLOG position of the primary, which is performed by walsender and walreceiver. If a file required for the archive recovery has gone from the primary (pg_xlog and archive) during those couple of days, and now exists only in a separate archive server, the archive recovery by the standby would fail. In this case, you need to copy the missing files from the archive server to the standby before restarting the standby. Otherwise you need to make a new base backup of the primary, and start the setup of the standby from the beginning. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi, On Tue, Jul 7, 2009 at 8:51 PM, Fujii Masao<masao.fujii@gmail.com> wrote: > http://archives.postgresql.org/message-id/4951108A.5040608@enterprisedb.com >> I don't think we need or should >> allow running regular queries before entering "replication mode". the >> backend should become a walsender process directly after authentication. > > I changed the protocol according to your suggestion. > Here is the current protocol: Just for the record, I'd like to explain the correspondence between Heikki's protocol and mine. > ReplicationStart (B) > Byte1('l'): Identifies the message as a replication-start indicator. > Int32(17): Length of message contents in bytes, including self. > Int32: The timeline ID > Int32: The start log file of replication > Int32: The start byte offset of replication This corresponds to "StartReplication <begin>". But this is sent from the primary to the standby, though "StartReplication" is sent in the opposite direction. So, in the current design, the primary determines the WAL streaming start position, which indicates the head of the XLOG file following the file switched by walsender. > XLogData (B) > Byte1('w'): Identifies the message as XLOG records. > Int32: Length of message contents in bytes, including self. > Int8: Flag bits indicating how the records should be treated. > Int32: The log file number of the records. > Int32: The byte offset of the records. > Byte n: The XLOG records. This corresponds to "WALRange <begin> <end> <data>". But XLogData doesn't have <begin>, in order to reduce the wire traffic, because it can be calculated from <end> and the length of the records. > XLogResponse (F) > Byte1('r'): Identifies the message as ACK for XLOG records. > Int32: Length of message contents in bytes, including self. > Int8: Flag bits indicating how the records were treated. > Int32: The log file number of the records. > Int32: The byte offset of the records. This corresponds to "ReplicatedUpTo <end>". They are almost the same. 
> If there is a missing XLOG file which is required for recovery, the > startup process connects to the primary as a normal client, and > receives the binary contents of the file by using the following SQL. > This has nothing to do with the above protocol. So, the transfer of > the missing file and synchronous XLOG streaming are performed > concurrently. > > COPY (SELECT pg_read_xlogfile('filename', true)) TO STDOUT WITH BINARY This corresponds to "RequestWAL <begin> <end>". Since the XLOG file written to the standby has to be recoverable, I use the filename instead of an XLogRecPtr here, and make the primary send the whole file. Also, this filename can indicate not only an XLOG file but also a history file. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
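For illustration, the XLogData layout quoted above could be packed like this. This is a sketch using big-endian fields as in the FE/BE protocol, not actual server code, and the function name is made up:

```python
# Sketch of the XLogData ('w') message layout quoted above, packed with
# Python's struct module (big-endian). The length field counts the message
# contents including itself, per the quoted description.
import struct

def pack_xlogdata(flags, log_id, offset, records):
    # contents = Int32 len + Int8 flags + Int32 log + Int32 off + records
    length = 4 + 1 + 4 + 4 + len(records)
    return b"w" + struct.pack("!IbII", length, flags, log_id, offset) + records

msg = pack_xlogdata(flags=0, log_id=0, offset=8192, records=b"\x00" * 16)
print(len(msg))  # 1 (type byte) + 13 (header) + 16 (payload) = 30 bytes
```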
Re: Re: Synch Rep: direct transfer of WAL file from the primary to the standby
From
"Kevin Grittner"
Date:
Fujii Masao <masao.fujii@gmail.com> wrote: > Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote: >> I think the interesting bit is when you're at this point and the >> connection between the master and slave goes down for a couple >> days. How do you handle that? > > In the current design of synch rep, you have only to restart the > standby after repairing the network. How long does the interruption need to last to require manual intervention? Would an automated retry make sense? (I'd bet that more days than not we lose connectivity to at least one of our remote sites for at least a few minutes.) -Kevin
Hi, On Thu, Jul 9, 2009 at 11:13 PM, Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote: > Fujii Masao <masao.fujii@gmail.com> wrote: >> Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote: > >>> I think the interesting bit is when you're at this point and the >>> connection between the master and slave goes down for a couple >>> days. How do you handle that? >> >> In the current design of synch rep, you have only to restart the >> standby after repairing the network. > > How long does the interruption need to last to require manual > intervention? It depends on when the files required for the standby's recovery disappear from the primary. If they remain in the primary's archive forever, there would be no need to copy them by hand. > Would an automated retry make sense? (I'd bet that > more days than not we lose connectivity to at least one of our remote > sites for at least a few minutes.) Yes, but I think that it's not postgres' but the clusterware's job. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center