Обсуждение: streaming replication does not work across datacenter with 20ms latency?

Поиск
Список
Период
Сортировка

streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
I was doing postgresql streaming replication, which was fine when two
machine in the same datecenter. but recently I was planning to deploy
new slave at a different datecent, the latency between the master and
slave is 20ms;
below is the related configurateion:
Both master and slave have below configuration:
hot_standby = on
wal_level = hot_standby
max_wal_senders = 5

checkpoint_segments = 64
wal_keep_segments = 128

I am using pgpool to automation but the method is similar to the
method described here:
http://wiki.postgresql.org/wiki/Streaming_Replication

the data dir size is about 30G,  I have tried many times but every
time after the sync was over and slave was started,  postgresql  is
just hanging there with error message(attached below), while trying to
connect it returns error message "psql: FATAL:  the database system is
starting up"


the strange part is with same configuration, other slaves in the same
datacenter works fine...



what does invalid record length and invalid magic number  normally
means? xlog  corrupted?
Thanks for any further help!
the log message with debug5 level was like this(just clips, I could
upload full log file if necessary):



17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ LOG:  database
system was interrupted; last known up at 2011-07-23 07:07:57 CDT
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:  forked
new backend, pid=17998 socket=8
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:  forked
new backend, pid=17999 socket=8
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]FATAL:  the database system is starting up
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG:  shmem_exit(1): 0 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG:  proc_exit(1): 1 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG:  exit(1)
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG:  shmem_exit(-1): 0 callbacks to make
17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@postgres [local]DEBUG:  proc_exit(-1): 0 callbacks to make
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
reaping dead processes
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:  server
process (PID 17999) exited with exit code 1
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)FATAL:  the database system is
starting up
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG:  shmem_exit(1): 0
callbacks to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG:  proc_exit(1): 1 callbacks
to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG:  exit(1)
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG:  shmem_exit(-1): 0
callbacks to make
17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT
postgres@template1 10.28.53.11(33647)DEBUG:  proc_exit(-1): 0
callbacks to make
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
reaping dead processes
17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:  server
process (PID 17998) exited with exit code 1
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
standby_mode = 'on'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
primary_conninfo = 'host=jefferson port=5432 user=postgres'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:
trigger_file = '/var/log/pgpool/trigger/trigger_file1'
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ LOG:  entering
standby mode
17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG:  could
not open file "pg_xlog/0000000300000054000000DB" (log file 84, segment
219): No such file or directory


17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG:  record
known xact 36933672 latestObservedXid 36933674
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT:  xlog
redo commit: 2011-07-23 06:41:41.264405-05
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG:  remove
KnownAssignedXid 36933672
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT:  xlog
redo commit: 2011-07-23 06:41:41.264405-05
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG:  record
known xact 36933674 latestObservedXid 36933674
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT:  xlog
redo insert: rel 1663/16386/17404; tid 18378/37
17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ LOG:  invalid
record length at 54/DDFE4010

17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG:  remove
KnownAssignedXid 36929085
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT:  xlog
redo commit: 2011-07-23 06:33:29.760915-05
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG:  record
known xact 36929100 latestObservedXid 36929102
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT:  xlog
redo insert: rel 1663/16386/16436; tid 88370/2
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG:  record
known xact 36929109 latestObservedXid 36929102
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT:  xlog
redo insert: rel 1663/16386/16436; tid 88370/3
17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ LOG:  invalid
magic number 0000 in log file 84, segment 219, offset 7733248


--

Re: streaming replication does not work across datacenter with 20ms latency?

От
Scott Ribe
Дата:
On Jul 23, 2011, at 6:50 AM, Yan Chunlu wrote:

> what does invalid record length and invalid magic number  normally
> means? xlog  corrupted?
> Thanks for any further help!

It means your build settings for pg are not compatible across the 2 machines. For instance, one machine is 32-bit and
theother is 64-bit, or one machine is big-endian and the other is little-endian... 

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice





Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
thanks for the help!
are there any other possible reasons?

both system are using Debian amd64, as uname -a shows:
Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
x86_64 GNU/Linux

and using the following program it tells both of them are little-endian
#include <stdio.h>
#include <stdbool.h>


bool isBigEndian()
{
    int no = 1;
    char *chk = (char *)&no;

    if (chk[0] == 1)
    {
        return 0;
    }
    else
    {
        return 1;
    }
}

main()
{
    printf("this is %d \n",(int)isBigEndian());
    return 0;
}
~

On Sat, Jul 23, 2011 at 11:55 PM, Scott Ribe
<scott_ribe@elevated-dev.com> wrote:
> On Jul 23, 2011, at 6:50 AM, Yan Chunlu wrote:
>
>> what does invalid record length and invalid magic number  normally
>> means? xlog  corrupted?
>> Thanks for any further help!
>
> It means your build settings for pg are not compatible across the 2 machines. For instance, one machine is 32-bit and
theother is 64-bit, or one machine is big-endian and the other is little-endian... 
>
> --
> Scott Ribe
> scott_ribe@elevated-dev.com
> http://www.elevated-dev.com/
> (303) 722-0567 voice
>
>
>
>
>



--
闫春路

Re: streaming replication does not work across datacenter with 20ms latency?

От
"Tomas Vondra"
Дата:
On 23 Červenec 2011, 18:14, Yan Chunlu wrote:
> thanks for the help!
> are there any other possible reasons?
>
> both system are using Debian amd64, as uname -a shows:
> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
> x86_64 GNU/Linux

It is not just about the architecture, it means the PostgreSQL was
configured somehow differently during the build. E.g. a different block
size or WAL block size would make such problems.

Or maybe one of the buils might be 32-bit for some reason (you can run
32-bit system in a 64-bit environment). You can do this

$ less postgres | grep Class

to check this (ELF32 => 32bit, ELF64 => 64bit).

Did you use the same binary packages or have you built the server yourself?

Tomas


Re: streaming replication does not work across datacenter with 20ms latency?

От
"Joshua D. Drake"
Дата:
On 07/23/2011 10:55 AM, Tomas Vondra wrote:
> On 23 Červenec 2011, 18:14, Yan Chunlu wrote:
>> thanks for the help!
>> are there any other possible reasons?
>>
>> both system are using Debian amd64, as uname -a shows:
>> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
>> x86_64 GNU/Linux
>
> It is not just about the architecture, it means the PostgreSQL was
> configured somehow differently during the build. E.g. a different block
> size or WAL block size would make such problems.
>
> Or maybe one of the buils might be 32-bit for some reason (you can run
> 32-bit system in a 64-bit environment). You can do this
>
> $ less postgres | grep Class
>
> to check this (ELF32 =>  32bit, ELF64 =>  64bit).
>
> Did you use the same binary packages or have you built the server yourself?

Run a pg_config and compare the differences.


>
> Tomas
>
>


--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579

Re: streaming replication does not work across datacenter with 20ms latency?

От
Scott Marlowe
Дата:
On Sat, Jul 23, 2011 at 11:55 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 23 Červenec 2011, 18:14, Yan Chunlu wrote:
>> thanks for the help!
>> are there any other possible reasons?
>>
>> both system are using Debian amd64, as uname -a shows:
>> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
>> x86_64 GNU/Linux
>
> It is not just about the architecture, it means the PostgreSQL was
> configured somehow differently during the build. E.g. a different block
> size or WAL block size would make such problems.
>
> Or maybe one of the buils might be 32-bit for some reason (you can run
> 32-bit system in a 64-bit environment). You can do this
>
> $ less postgres | grep Class
>
> to check this (ELF32 => 32bit, ELF64 => 64bit).
>
> Did you use the same binary packages or have you built the server yourself?

Different date formats too.

Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
I used apt-get to install postgresql, running pg_config showing they
are exactly the same...


running on master:
https://gist.github.com/1102148


running on slave:
https://gist.github.com/1102151

On Sun, Jul 24, 2011 at 2:44 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Sat, Jul 23, 2011 at 11:55 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 23 Červenec 2011, 18:14, Yan Chunlu wrote:
>>> thanks for the help!
>>> are there any other possible reasons?
>>>
>>> both system are using Debian amd64, as uname -a shows:
>>> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
>>> x86_64 GNU/Linux
>>
>> It is not just about the architecture, it means the PostgreSQL was
>> configured somehow differently during the build. E.g. a different block
>> size or WAL block size would make such problems.
>>
>> Or maybe one of the buils might be 32-bit for some reason (you can run
>> 32-bit system in a 64-bit environment). You can do this
>>
>> $ less postgres | grep Class
>>
>> to check this (ELF32 => 32bit, ELF64 => 64bit).
>>
>> Did you use the same binary packages or have you built the server yourself?
>
> Different date formats too.
>

Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
less postgres  didn't showing anything... cause it's binary, I tried with -a
 less postgres |grep -a ELF

less postgres | grep -a Class

nothing related to (ELF32 => 32bit, ELF64 => 64bit).





On Sun, Jul 24, 2011 at 1:55 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 23 Červenec 2011, 18:14, Yan Chunlu wrote:
>> thanks for the help!
>> are there any other possible reasons?
>>
>> both system are using Debian amd64, as uname -a shows:
>> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
>> x86_64 GNU/Linux
>
> It is not just about the architecture, it means the PostgreSQL was
> configured somehow differently during the build. E.g. a different block
> size or WAL block size would make such problems.
>
> Or maybe one of the buils might be 32-bit for some reason (you can run
> 32-bit system in a 64-bit environment). You can do this
>
> $ less postgres | grep Class
>
> to check this (ELF32 => 32bit, ELF64 => 64bit).
>
> Did you use the same binary packages or have you built the server yourself?
>
> Tomas
>
>



--

Re: streaming replication does not work across datacenter with 20ms latency?

От
Adrian Klaver
Дата:
On Saturday, July 23, 2011 7:43:56 pm Yan Chunlu wrote:
> I used apt-get to install postgresql, running pg_config showing they
> are exactly the same...
>
>
> running on master:
> https://gist.github.com/1102148
>
>
> running on slave:
> https://gist.github.com/1102151
>

Are you sure there is only one instance of Postgres running on each machine?
--
Adrian Klaver
adrian.klaver@gmail.com

Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
the system date formats? looks the same:

master:
#date
Sat Jul 23 21:53:34 CDT 2011

slave:
#date
Sat Jul 23 21:52:56 CDT 2011



On Sun, Jul 24, 2011 at 2:44 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Sat, Jul 23, 2011 at 11:55 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 23 Červenec 2011, 18:14, Yan Chunlu wrote:
>>> thanks for the help!
>>> are there any other possible reasons?
>>>
>>> both system are using Debian amd64, as uname -a shows:
>>> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010
>>> x86_64 GNU/Linux
>>
>> It is not just about the architecture, it means the PostgreSQL was
>> configured somehow differently during the build. E.g. a different block
>> size or WAL block size would make such problems.
>>
>> Or maybe one of the buils might be 32-bit for some reason (you can run
>> 32-bit system in a 64-bit environment). You can do this
>>
>> $ less postgres | grep Class
>>
>> to check this (ELF32 => 32bit, ELF64 => 64bit).
>>
>> Did you use the same binary packages or have you built the server yourself?
>
> Different date formats too.
>



--
闫春路

Re: streaming replication does not work across datacenter with 20ms latency?

От
Scott Marlowe
Дата:
On Sat, Jul 23, 2011 at 8:53 PM, Yan Chunlu <springrider@gmail.com> wrote:
> the system date formats? looks the same:

hehe, no, the internal formats.  There's a floating point and an
integer method.  They have to be the same and according to your output
of pg_config they are, with this config flag listed for both:

--enable-integer-datetimes

btw, integer is preferred over floating point for date types, at least
as far as I know.

Re: streaming replication does not work across datacenter with 20ms latency?

От
Scott Ribe
Дата:
On Jul 23, 2011, at 8:43 PM, Yan Chunlu wrote:

> I used apt-get to install postgresql, running pg_config showing they
> are exactly the same...

BTW, forgot to mention this in my first message: I run streaming replication across the country with latency well over
100msand no problems. 

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice





Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
thanks for all the help!

@Adrian:  yes, only one instance on each machine

not the slave finally started and could be connect, replication didn't
begin, just following errors:
https://gist.github.com/1102225





BTW: is that possible that rsync has finished but the data didn't
flush to disk, so when postgresql started it was seeing corrupted
files?


On Sun, Jul 24, 2011 at 11:23 AM, Scott Ribe
<scott_ribe@elevated-dev.com> wrote:
> On Jul 23, 2011, at 8:43 PM, Yan Chunlu wrote:
>
>> I used apt-get to install postgresql, running pg_config showing they
>> are exactly the same...
>
> BTW, forgot to mention this in my first message: I run streaming replication across the country with latency well
over100ms and no problems. 
>
> --
> Scott Ribe
> scott_ribe@elevated-dev.com
> http://www.elevated-dev.com/
> (303) 722-0567 voice
>
>
>
>
>

Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
is there anything special you have configured on master and slave?

could I see the related configuration on your master and slave? such
as wal_keep_segments,checkpoint_segments or any other might be
related? thanks a lot!

On Sun, Jul 24, 2011 at 11:23 AM, Scott Ribe
<scott_ribe@elevated-dev.com> wrote:
> On Jul 23, 2011, at 8:43 PM, Yan Chunlu wrote:
>
>> I used apt-get to install postgresql, running pg_config showing they
>> are exactly the same...
>
> BTW, forgot to mention this in my first message: I run streaming replication across the country with latency well
over100ms and no problems. 
>
> --
> Scott Ribe
> scott_ribe@elevated-dev.com
> http://www.elevated-dev.com/
> (303) 722-0567 voice
>
>
>
>
>



--
闫春路

Re: streaming replication does not work across datacenter with 20ms latency?

От
"Tomas Vondra"
Дата:
On 24 Červenec 2011, 6:09, Yan Chunlu wrote:
> thanks for all the help!
>
> @Adrian:  yes, only one instance on each machine
>
> not the slave finally started and could be connect, replication didn't
> begin, just following errors:
> https://gist.github.com/1102225

These errors just mean the master already removed WAL segments, so the
slave can't actually start the replication because there'd be a gap. This
usually happens with enough write activity (inserts, updates) when the
slave is being setup.

Whaht is your wal_keep_segments value? Increase it or set up WAL
archiving, so that the slave can get the data.

Tomas


Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
checkpoint_segments = 64
wal_keep_segments = 128

On Sun, Jul 24, 2011 at 8:25 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 24 Červenec 2011, 6:09, Yan Chunlu wrote:
>> thanks for all the help!
>>
>> @Adrian:  yes, only one instance on each machine
>>
>> not the slave finally started and could be connect, replication didn't
>> begin, just following errors:
>> https://gist.github.com/1102225
>
> These errors just mean the master already removed WAL segments, so the
> slave can't actually start the replication because there'd be a gap. This
> usually happens with enough write activity (inserts, updates) when the
> slave is being setup.
>
> Whaht is your wal_keep_segments value? Increase it or set up WAL
> archiving, so that the slave can get the data.
>
> Tomas
>
>

pgpool HA not working

От
Sanjay Rao
Дата:
Hi

I am working on pgpool /postgresql/linux first time. All these three are new to me.

I am running pgpool-HA using pacemaker-corosync.
I am having following error in my setup. Does any body have any idea about the warning highlighted in logs ? Any type of help would be appreciated a lot...

Jul 24 18:40:41 squarepant attrd: [3772]: info: attrd_perform_update: Sent update 1010: probe_complete=true
Jul 24 18:40:41 squarepant cib: [3770]: WARN: cib_process_request: Operation complete: op cib_modify for section status (origin=local/attrd/1010, version=0.52.352): The object/attribute does not exist (rc=-22)
Jul 24 18:40:41 squarepant attrd: [3772]: WARN: attrd_cib_callback: Update 1010 for probe_complete=true failed: The object/attribute does not exist
Jul 24 18:40:41 squarepant pgpool[7635]: INFO: pgpoolRA: request stop, but not running.
Jul 24 18:40:41 squarepant crmd: [3896]: info: process_lrm_event: LRM operation pgpool_stop_0 (call=339, rc=0, cib-update=472, confirmed=true) ok
Jul 24 18:40:41 squarepant crmd: [3896]: info: match_graph_event: Action pgpool_stop_0 (2) confirmed on squarepant (rc=0)
Jul 24 18:40:41 squarepant crmd: [3896]: info: te_rsc_command: Initiating action 12: start pgpool_start_0 on squarepant (local)
Jul 24 18:40:41 squarepant crmd: [3896]: info: do_lrm_rsc_op: Performing key=12:112:0:6d68acf3-ab99-409f-b686-8533e4b24ca0 op=pgpool_start_0 )
Jul 24 18:40:41 squarepant lrmd: [3771]: info: rsc:pgpool:340: start
Jul 24 18:40:41 squarepant crmd: [3896]: info: te_pseudo_action: Pseudo action 5 fired and confirmed


ouput of crm_mon command is


============
Last updated: Sun Jul 24 18:15:59 2011
Stack: Heartbeat
Current DC: squarepant (85f4f3d6-650e-4620-8cbb-edb1bc9d389c) - partition with quorum
Version: 1.1.5-1.fc14-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
1 Nodes configured, unknown expected votes
3 Resources configured.
============

Online: [ squarepant ]

 DBIP   (ocf::heartbeat:IPaddr2):       Started squarepant
 postgresql     (ocf::heartbeat:pgsql): Started squarepant
 pgpool (ocf::heartbeat:pgpool):        Started squarepant FAILED

Failed actions:
    pgpool_monitor_30000 (node=squarepant, call=1874, rc=7, status=complete): not running


Thanks & Regards
Sanjay

Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
I did the SR procedure again, still no luck:

is that normal that after start slave postgresql, the first line of log is:
database system was interrupted; last known up at 2011-07-24 10:53:38 CDT??



4760 2011-07-24 10:55:58 CDT 2011-07-24 10:55:58 CDT @ LOG:  database
system was interrupted; last known up at 2011-07-24 10:53:38 CDT
4760 2011-07-24 10:55:58 CDT 2011-07-24 10:55:58 CDT @ LOG:  entering
standby mode
4762 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4761 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT @ LOG:  streaming
replication successfully connected to primary
4764 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT postgres@postgres
10.28.53.11(53442)FATAL:  the database system is starting up
4770 2011-07-24 10:56:00 CDT 2011-07-24 10:56:00 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4802 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG:  redo
starts at 57/6B002028
4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG:  invalid
record length at 57/6B20E010
4761 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ FATAL:
terminating walreceiver process due to administrator command
4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG:  invalid
magic number 0000 in log file 87, segment 107, offset 2490368
4847 2011-07-24 10:56:02 CDT 2011-07-24 10:56:02 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4850 2011-07-24 10:56:02 CDT 2011-07-24 10:56:02 CDT postgres@postgres
10.28.53.11(53443)FATAL:  the database system is starting up
4851 2011-07-24 10:56:03 CDT 2011-07-24 10:56:03 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4860 2011-07-24 10:56:04 CDT 2011-07-24 10:56:04 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4865 2011-07-24 10:56:05 CDT 2011-07-24 10:56:05 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4859 2011-07-24 10:56:05 CDT 2011-07-24 10:56:05 CDT @ LOG:  streaming
replication successfully connected to primary
4874 2011-07-24 10:56:06 CDT 2011-07-24 10:56:06 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4869 2011-07-24 10:56:06 CDT 2011-07-24 10:56:06 CDT
postgres@template1 10.28.53.11(53444)FATAL:  the database system is
starting up
4879 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4760 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ LOG:  invalid
record length at 57/6B2BA010
4859 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ FATAL:
terminating walreceiver process due to administrator command
4760 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ LOG:  invalid
magic number 0000 in log file 87, segment 107, offset 2883584
4887 2011-07-24 10:56:08 CDT 2011-07-24 10:56:08 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4888 2011-07-24 10:56:08 CDT 2011-07-24 10:56:08 CDT @ LOG:  streaming
replication successfully connected to primary
4892 2011-07-24 10:56:09 CDT 2011-07-24 10:56:09 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4896 2011-07-24 10:56:09 CDT 2011-07-24 10:56:09 CDT
postgres@template1 10.28.53.11(53445)FATAL:  the database system is
starting up
4901 2011-07-24 10:56:10 CDT 2011-07-24 10:56:10 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4906 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT postgres@postgres
[local]FATAL:  the database system is starting up
4760 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ LOG:  invalid
record length at 57/6B486010
4888 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ FATAL:
terminating walreceiver process due to administrator command
4760 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ LOG:  invalid
magic number 0000 in log file 87, segment 107, offset 4849664



On Sun, Jul 24, 2011 at 8:46 PM, Yan Chunlu <springrider@gmail.com> wrote:
> checkpoint_segments = 64
> wal_keep_segments = 128
>
> On Sun, Jul 24, 2011 at 8:25 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> On 24 Červenec 2011, 6:09, Yan Chunlu wrote:
>>> thanks for all the help!
>>>
>>> @Adrian:  yes, only one instance on each machine
>>>
>>> not the slave finally started and could be connect, replication didn't
>>> begin, just following errors:
>>> https://gist.github.com/1102225
>>
>> These errors just mean the master already removed WAL segments, so the
>> slave can't actually start the replication because there'd be a gap. This
>> usually happens with enough write activity (inserts, updates) when the
>> slave is being setup.
>>
>> Whaht is your wal_keep_segments value? Increase it or set up WAL
>> archiving, so that the slave can get the data.
>>
>> Tomas
>>
>>
>

Re: streaming replication does not work across datacenter with 20ms latency?

От
Tomas Vondra
Дата:
Dne 24.7.2011 14:46, Yan Chunlu napsal(a):
> checkpoint_segments = 64
> wal_keep_segments = 128

This information alone is not sufficient - we don't know how much write
activity is on the primary system, so we can't say if those number are
sufficient or not. You have to tune them according to write activity on
the primary server.

For example let's suppose the current WAL segment on the primary is "1"
and that it's configured with wal_keep_segments = 5 (i.e. about 80MB of
data).

Before you prepare and start the slave machine, someone writes 100MB of
data to the primary database (one big insert/update or a lot of small
ones, doesn't matter). 100MB is about 6 WAL segments, so the current WAL
segment on the primary is 7, and because of wal_keep_segments there are
segments 3,4,5,6,7 available.

But when the slave connects, it asks for segment no. 2 and it's not
available. It's not possible to skip that segment so the replication
fails to start.

If the primary only received 60MB of data, it'd probably worked (there'd
be enough segments kept on the primary).

Those 128 segments is about 2GB of data. How much data is written on the
primary between creating a filesystem copy and starting the slave?

You don't neet to keep the files on the master, you can set up archiving
and keep them somewhere else (on a different system etc.).

Tomas


Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
 checkpoint_segments = 64
wal_keep_segments = 128
this setting seems is for 5GB capacity, I think there is noway I would
ever write 5GB data during the rsync progress.


I think the problem is still "invalid record length" and "invalid
magic number", it start showing right after I complete sync data and
start slave.  If I stop slave later and restart, yes it could show
xlog not found and can not catch master. but why the "invalid" things
in the first place?


On Mon, Jul 25, 2011 at 4:28 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Dne 24.7.2011 14:46, Yan Chunlu napsal(a):
>> checkpoint_segments = 64
>> wal_keep_segments = 128
>
> This information alone is not sufficient - we don't know how much write
> activity is on the primary system, so we can't say if those number are
> sufficient or not. You have to tune them according to write activity on
> the primary server.
>
> For example let's suppose the current WAL segment on the primary is "1"
> and that it's configured with wal_keep_segments = 5 (i.e. about 80MB of
> data).
>
> Before you prepare and start the slave machine, someone writes 100MB of
> data to the primary database (one big insert/update or a lot of small
> ones, doesn't matter). 100MB is about 6 WAL segments, so the current WAL
> segment on the primary is 7, and because of wal_keep_segments there are
> segments 3,4,5,6,7 available.
>
> But when the slave connects, it asks for segment no. 2 and it's not
> available. It's not possible to skip that segment so the replication
> fails to start.
>
> If the primary only received 60MB of data, it'd probably worked (there'd
> be enough segments kept on the primary).
>
> Those 128 segments is about 2GB of data. How much data is written on the
> primary between creating a filesystem copy and starting the slave?
>
> You don't neet to keep the files on the master, you can set up archiving
> and keep them somewhere else (on a different system etc.).
>
> Tomas
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

Re: streaming replication does not work across datacenter with 20ms latency?

От
Fujii Masao
Дата:
On Mon, Jul 25, 2011 at 10:56 AM, Yan Chunlu <springrider@gmail.com> wrote:
> I think the problem is still "invalid record length" and "invalid
> magic number", it start showing right after I complete sync data and
> start slave.  If I stop slave later and restart, yes it could show
> xlog not found and can not catch master. but why the "invalid" things
> in the first place?

You might have the same problem which was reported before.
http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php

That problem was fixed, and the fix will be included in next minor
update (i.e., 9.0.5).
http://archives.postgresql.org/pgsql-committers/2011-06/msg00101.php

Of course, you can avoid the problem by building PostgreSQL with
gcc != 4.6.0, I think.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: streaming replication does not work across datacenter with 20ms latency?

От
Simon Riggs
Дата:
On Mon, Jul 25, 2011 at 8:38 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Jul 25, 2011 at 10:56 AM, Yan Chunlu <springrider@gmail.com> wrote:
>> I think the problem is still "invalid record length" and "invalid
>> magic number", it start showing right after I complete sync data and
>> start slave.  If I stop slave later and restart, yes it could show
>> xlog not found and can not catch master. but why the "invalid" things
>> in the first place?
>
> You might have the same problem which was reported before.
> http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php

That's what it looks like to me also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
I am using debian ant apt-get to install postgresql, dpkg list shows
they are the same?  is there anyway to tell what's version it is
compiled from? thanks!

Master# dpkg -l |grep post
ii  postgresql-9.0                      9.0.4-1+b1
  object-relational SQL database, version 9.0 server
ii  postgresql-client-9.0               9.0.4-1+b1
  front-end programs for PostgreSQL 9.0
ii  postgresql-client-common            118
  manager for multiple PostgreSQL client versions
ii  postgresql-common                   118
  PostgreSQL database-cluster manager
ii  postgresql-contrib-9.0              9.0.4-1+b1
  additional facilities for PostgreSQL
ii  postgresql-server-dev-9.0           9.0.4-1+b1
  development files for PostgreSQL 9.0 server-side programming


Slave $ dpkg -l |grep post
ii  postgresql-9.0                      9.0.4-1+b1
object-relational SQL database, version 9.0 server
ii  postgresql-client-9.0               9.0.4-1+b1
front-end programs for PostgreSQL 9.0
ii  postgresql-client-common            118
manager for multiple PostgreSQL client versions
ii  postgresql-common                   118
PostgreSQL database-cluster manager
ii  postgresql-contrib-9.0              9.0.4-1+b1
additional facilities for PostgreSQL



On Mon, Jul 25, 2011 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Jul 25, 2011 at 10:56 AM, Yan Chunlu <springrider@gmail.com> wrote:
>> I think the problem is still "invalid record length" and "invalid
>> magic number", it start showing right after I complete sync data and
>> start slave.  If I stop slave later and restart, yes it could show
>> xlog not found and can not catch master. but why the "invalid" things
>> in the first place?
>
> You might have the same problem which was reported before.
> http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php
>
> That problem was fixed, and the fix will be included in next minor
> update (i.e., 9.0.5).
> http://archives.postgresql.org/pgsql-committers/2011-06/msg00101.php
>
> Of course, you can avoid the problem by building PostgreSQL with
> gcc != 4.6.0, I think.
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>

Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
sorry for the typo, "debian and apt-get"

On Mon, Jul 25, 2011 at 5:39 PM, Yan Chunlu <springrider@gmail.com> wrote:
> I am using debian and apt-get to install postgresql, dpkg list shows
> they are the same?  is there anyway to tell what's version it is
> compiled from? thanks!
>
> Master# dpkg -l |grep post
> ii  postgresql-9.0                      9.0.4-1+b1
>  object-relational SQL database, version 9.0 server
> ii  postgresql-client-9.0               9.0.4-1+b1
>  front-end programs for PostgreSQL 9.0
> ii  postgresql-client-common            118
>  manager for multiple PostgreSQL client versions
> ii  postgresql-common                   118
>  PostgreSQL database-cluster manager
> ii  postgresql-contrib-9.0              9.0.4-1+b1
>  additional facilities for PostgreSQL
> ii  postgresql-server-dev-9.0           9.0.4-1+b1
>  development files for PostgreSQL 9.0 server-side programming
>
>
> Slave $ dpkg -l |grep post
> ii  postgresql-9.0                      9.0.4-1+b1
> object-relational SQL database, version 9.0 server
> ii  postgresql-client-9.0               9.0.4-1+b1
> front-end programs for PostgreSQL 9.0
> ii  postgresql-client-common            118
> manager for multiple PostgreSQL client versions
> ii  postgresql-common                   118
> PostgreSQL database-cluster manager
> ii  postgresql-contrib-9.0              9.0.4-1+b1
> additional facilities for PostgreSQL
>
>
>
> On Mon, Jul 25, 2011 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Mon, Jul 25, 2011 at 10:56 AM, Yan Chunlu <springrider@gmail.com> wrote:
>>> I think the problem is still "invalid record length" and "invalid
>>> magic number", it start showing right after I complete sync data and
>>> start slave.  If I stop slave later and restart, yes it could show
>>> xlog not found and can not catch master. but why the "invalid" things
>>> in the first place?
>>
>> You might have the same problem which was reported before.
>> http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php
>>
>> That problem was fixed, and the fix will be included in next minor
>> update (i.e., 9.0.5).
>> http://archives.postgresql.org/pgsql-committers/2011-06/msg00101.php
>>
>> Of course, you can avoid the problem by building PostgreSQL with
>> gcc != 4.6.0, I think.
>>
>> Regards,
>>
>> --
>> Fujii Masao
>> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
>> NTT Open Source Software Center
>>
>

Re: streaming replication does not work across datacenter with 20ms latency?

От
"Tomas Vondra"
Дата:
On 25 Červenec 2011, 11:39, Yan Chunlu wrote:
> I am using debian ant apt-get to install postgresql, dpkg list shows
> they are the same?  is there anyway to tell what's version it is
> compiled from? thanks!

AFAIK there's no way to find out which compiler was used to build
PostgreSQL binaries (IIRC this is not in pg_config). So you can do this

gcc --version

That should print version of the default compiler, and it's very likely it
was used to build the PostgreSQL binaries. If it's 4.6.0, you've been hit
by the problem Fujii Masao described. Use a different gcc version and it
should work fine.

Tomas


Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
gcc compiler on my machine is 4.6.1, but I didn't compile it myself.
just installed the binary from apt-get. I will try to compile it by
myself to see what will happen

On Mon, Jul 25, 2011 at 8:18 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 25 Červenec 2011, 11:39, Yan Chunlu wrote:
>> I am using debian ant apt-get to install postgresql, dpkg list shows
>> they are the same?  is there anyway to tell what's version it is
>> compiled from? thanks!
>
> AFAIK there's no way to find out which compiler was used to build
> PostgreSQL binaries (IIRC this is not in pg_config). So you can do this
>
> gcc --version
>
> That should print version of the default compiler, and it's very likely it
> was used to build the PostgreSQL binaries. If it's 4.6.0, you've been hit
> by the problem Fujii Masao described. Use a different gcc version and it
> should work fine.
>
> Tomas
>
>

Re: streaming replication does not work across datacenter with 20ms latency?

От
Tom Lane
Дата:
"Tomas Vondra" <tv@fuzzy.cz> writes:
> On 25 Červenec 2011, 11:39, Yan Chunlu wrote:
>> I am using debian ant apt-get to install postgresql, dpkg list shows
>> they are the same?  is there anyway to tell what's version it is
>> compiled from? thanks!

> AFAIK there's no way to find out which compiler was used to build
> PostgreSQL binaries (IIRC this is not in pg_config).

No, but "SELECT version();" should tell the tale.

            regards, tom lane

Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
seems the Master server is compiled using 4.6.0:
                                                          version

----------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.6.real
(Debian 4.6.0-6) 4.6.1 20110428 (prerelease), 64-bit


and there is no way  to know what slave is using since I have remove it.

I am using a compiled version(gcc 4.3.2)  of postgresql as slave,
seems working now.

I think the problem maybe is like Fujii said, does that bug only
effect hot-stanby server?  seems master is okay.


On Mon, Jul 25, 2011 at 10:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> On 25 Červenec 2011, 11:39, Yan Chunlu wrote:
>>> I am using debian ant apt-get to install postgresql, dpkg list shows
>>> they are the same?  is there anyway to tell what's version it is
>>> compiled from? thanks!
>
>> AFAIK there's no way to find out which compiler was used to build
>> PostgreSQL binaries (IIRC this is not in pg_config).
>
> No, but "SELECT version();" should tell the tale.
>
>                        regards, tom lane
>

Re: streaming replication does not work across datacenter with 20ms latency?

От
Tom Lane
Дата:
Yan Chunlu <springrider@gmail.com> writes:
> seems the Master server is compiled using 4.6.0:

>  PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.6.real
> (Debian 4.6.0-6) 4.6.1 20110428 (prerelease), 64-bit

Hmm.  Given the datestamp, that version of gcc almost certainly does
have the bug.  I wonder whether Martin Pitt knows about this issue
and the workaround we put in --- I'd have thought he'd push updated
.debs with a workaround, as I did for Fedora ...

Martin: see
http://archives.postgresql.org/pgsql-hackers/2011-06/msg00890.php
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c2ba0121c73b7461331104a46d140156e847572a

> and there is no way  to know what slave is using since I have remove it.

If it was installed from the same .deb then it'd be the same build.

> I think the problem maybe is like Fujii said, does that bug only
> effect hot-stanby server?  seems master is okay.

Well, actually, the bug affects WAL replay of any sort, which means
if your master were to crash and restart you'd be at risk of data
corruption on the master.  I'd replace the master build too, ASAP.

            regards, tom lane

Re: streaming replication does not work across datacenter with 20ms latency?

От
Tom Lane
Дата:
Yan Chunlu <springrider@gmail.com> writes:
> how about 4.3.2?

Yes, pre-4.6 gcc should be fine.

            regards, tom lane

Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
how about 4.3.2?

 I have gcc 4.3.2 compiled postgresql 9.0.4 as slave, is that okay
that I turn the slave into master?  so the switch will be a lot more
smooth.

On Tue, Jul 26, 2011 at 12:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yan Chunlu <springrider@gmail.com> writes:
>> oh god...thanks a lot for the tip. I did actually lost some data, the
>> master server has crashed two times. every time it comes back, the
>> index were broken. I need to reindex it. I have already set fsync=on.
>>   just thought it was normal behavior....
>
> Uh, no.
>
>> about gcc version, only 4.6.0 effected?    4.6.1 is okay? then I could
>> compile my own version with 4.6.1
>
> It's supposed to be fixed in 4.6.1 according to
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390
>
> Personally I'd use 4.6.1 *and* apply the Postgres patch.
>
>                        regards, tom lane
>

Re: streaming replication does not work across datacenter with 20ms latency?

От
Tom Lane
Дата:
Yan Chunlu <springrider@gmail.com> writes:
> oh god...thanks a lot for the tip. I did actually lost some data, the
> master server has crashed two times. every time it comes back, the
> index were broken. I need to reindex it. I have already set fsync=on.
>   just thought it was normal behavior....

Uh, no.

> about gcc version, only 4.6.0 effected?    4.6.1 is okay? then I could
> compile my own version with 4.6.1

It's supposed to be fixed in 4.6.1 according to
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390

Personally I'd use 4.6.1 *and* apply the Postgres patch.

            regards, tom lane

Re: streaming replication does not work across datacenter with 20ms latency?

От
Yan Chunlu
Дата:
oh god...thanks a lot for the tip. I did actually lost some data, the
master server has crashed two times. every time it comes back, the
index were broken. I need to reindex it. I have already set fsync=on.
  just thought it was normal behavior....

about gcc version, only 4.6.0 effected?    4.6.1 is okay? then I could
compile my own version with 4.6.1

but how about the data? from the bug information, the data file seems
not compatible....
I need to do pg_dump and pg_restore?   what a nightmare.....

On Mon, Jul 25, 2011 at 10:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yan Chunlu <springrider@gmail.com> writes:
>> seems the Master server is compiled using 4.6.0:
>
>>  PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.6.real
>> (Debian 4.6.0-6) 4.6.1 20110428 (prerelease), 64-bit
>
> Hmm.  Given the datestamp, that version of gcc almost certainly does
> have the bug.  I wonder whether Martin Pitt knows about this issue
> and the workaround we put in --- I'd have thought he'd push updated
> .debs with a workaround, as I did for Fedora ...
>
> Martin: see
> http://archives.postgresql.org/pgsql-hackers/2011-06/msg00890.php
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c2ba0121c73b7461331104a46d140156e847572a
>
>> and there is no way  to know what slave is using since I have remove it.
>
> If it was installed from the same .deb then it'd be the same build.
>
>> I think the problem maybe is like Fujii said, does that bug only
>> effect hot-stanby server?  seems master is okay.
>
> Well, actually, the bug affects WAL replay of any sort, which means
> if your master were to crash and restart you'd be at risk of data
> corruption on the master.  I'd replace the master build too, ASAP.
>
>                        regards, tom lane
>

Re: streaming replication does not work across datacenter with 20ms latency?

От
Michael Nolan
Дата:


On Mon, Jul 25, 2011 at 7:18 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 25 Červenec 2011, 11:39, Yan Chunlu wrote:
> I am using debian ant apt-get to install postgresql, dpkg list shows
> they are the same?  is there anyway to tell what's version it is
> compiled from? thanks!

AFAIK there's no way to find out which compiler was used to build
PostgreSQL binaries

You can do a strings on a binary file (eg, postmaster) and search for GCC in the output.
--
Mike Nolan