Обсуждение: streaming replication does not work across datacenter with 20ms latency?
I was doing postgresql streaming replication, which was fine when two machine in the same datecenter. but recently I was planning to deploy new slave at a different datecent, the latency between the master and slave is 20ms; below is the related configurateion: Both master and slave have below configuration: hot_standby = on wal_level = hot_standby max_wal_senders = 5 checkpoint_segments = 64 wal_keep_segments = 128 I am using pgpool to automation but the method is similar to the method described here: http://wiki.postgresql.org/wiki/Streaming_Replication the data dir size is about 30G, I have tried many times but every time after the sync was over and slave was started, postgresql is just hanging there with error message(attached below), while trying to connect it returns error message "psql: FATAL: the database system is starting up" the strange part is with same configuration, other slaves in the same datacenter works fine... what does invalid record length and invalid magic number normally means? xlog corrupted? Thanks for any further help! the log message with debug5 level was like this(just clips, I could upload full log file if necessary): 17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ LOG: database system was interrupted; last known up at 2011-07-23 07:07:57 CDT 17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: forked new backend, pid=17998 socket=8 17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: forked new backend, pid=17999 socket=8 17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@postgres [local]FATAL: the database system is starting up 17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@postgres [local]DEBUG: shmem_exit(1): 0 callbacks to make 17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@postgres [local]DEBUG: proc_exit(1): 1 callbacks to make 17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@postgres [local]DEBUG: exit(1) 17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@postgres [local]DEBUG: shmem_exit(-1): 0 callbacks to make 17999 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@postgres [local]DEBUG: proc_exit(-1): 0 callbacks to make 17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: reaping dead processes 17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: server process (PID 17999) exited with exit code 1 17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@template1 10.28.53.11(33647)FATAL: the database system is starting up 17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@template1 10.28.53.11(33647)DEBUG: shmem_exit(1): 0 callbacks to make 17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@template1 10.28.53.11(33647)DEBUG: proc_exit(1): 1 callbacks to make 17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@template1 10.28.53.11(33647)DEBUG: exit(1) 17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@template1 10.28.53.11(33647)DEBUG: shmem_exit(-1): 0 callbacks to make 17998 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT postgres@template1 10.28.53.11(33647)DEBUG: proc_exit(-1): 0 callbacks to make 17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: reaping dead processes 17828 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: server process (PID 17998) exited with exit code 1 17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: standby_mode = 'on' 17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: primary_conninfo = 'host=jefferson port=5432 user=postgres' 17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: trigger_file = '/var/log/pgpool/trigger/trigger_file1' 17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ LOG: entering standby mode 17997 2011-07-23 07:12:19 CDT 2011-07-23 07:12:19 CDT @ DEBUG: could not open file "pg_xlog/0000000300000054000000DB" (log file 84, segment 219): No such file or directory 17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG: record known xact 36933672 latestObservedXid 36933674 17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT: xlog redo commit: 2011-07-23 06:41:41.264405-05 17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG: remove KnownAssignedXid 36933672 17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT: xlog redo commit: 2011-07-23 06:41:41.264405-05 17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ DEBUG: record known xact 36933674 latestObservedXid 36933674 17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ CONTEXT: xlog redo insert: rel 1663/16386/17404; tid 18378/37 17997 2011-07-23 07:27:32 CDT 2011-07-23 07:27:32 CDT @ LOG: invalid record length at 54/DDFE4010 17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG: remove KnownAssignedXid 36929085 17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT: xlog redo commit: 2011-07-23 06:33:29.760915-05 17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG: record known xact 36929100 latestObservedXid 36929102 17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT: xlog redo insert: rel 1663/16386/16436; tid 88370/2 17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ DEBUG: record known xact 36929109 latestObservedXid 36929102 17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ CONTEXT: xlog redo insert: rel 1663/16386/16436; tid 88370/3 17997 2011-07-23 07:13:26 CDT 2011-07-23 07:13:26 CDT @ LOG: invalid magic number 0000 in log file 84, segment 219, offset 7733248 --
On Jul 23, 2011, at 6:50 AM, Yan Chunlu wrote: > what does invalid record length and invalid magic number normally > means? xlog corrupted? > Thanks for any further help! It means your build settings for pg are not compatible across the 2 machines. For instance, one machine is 32-bit and theother is 64-bit, or one machine is big-endian and the other is little-endian... -- Scott Ribe scott_ribe@elevated-dev.com http://www.elevated-dev.com/ (303) 722-0567 voice
thanks for the help! are there any other possible reasons? both system are using Debian amd64, as uname -a shows: Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 x86_64 GNU/Linux and using the following program it tells both of them are little-endian #include <stdio.h> #include <stdbool.h> bool isBigEndian() { int no = 1; char *chk = (char *)&no; if (chk[0] == 1) { return 0; } else { return 1; } } main() { printf("this is %d \n",(int)isBigEndian()); return 0; } ~ On Sat, Jul 23, 2011 at 11:55 PM, Scott Ribe <scott_ribe@elevated-dev.com> wrote: > On Jul 23, 2011, at 6:50 AM, Yan Chunlu wrote: > >> what does invalid record length and invalid magic number normally >> means? xlog corrupted? >> Thanks for any further help! > > It means your build settings for pg are not compatible across the 2 machines. For instance, one machine is 32-bit and theother is 64-bit, or one machine is big-endian and the other is little-endian... > > -- > Scott Ribe > scott_ribe@elevated-dev.com > http://www.elevated-dev.com/ > (303) 722-0567 voice > > > > > -- 闫春路
Re: streaming replication does not work across datacenter with 20ms latency?
От
"Tomas Vondra"
Дата:
On 23 Červenec 2011, 18:14, Yan Chunlu wrote: > thanks for the help! > are there any other possible reasons? > > both system are using Debian amd64, as uname -a shows: > Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 > x86_64 GNU/Linux It is not just about the architecture, it means the PostgreSQL was configured somehow differently during the build. E.g. a different block size or WAL block size would make such problems. Or maybe one of the buils might be 32-bit for some reason (you can run 32-bit system in a 64-bit environment). You can do this $ less postgres | grep Class to check this (ELF32 => 32bit, ELF64 => 64bit). Did you use the same binary packages or have you built the server yourself? Tomas
Re: streaming replication does not work across datacenter with 20ms latency?
От
"Joshua D. Drake"
Дата:
On 07/23/2011 10:55 AM, Tomas Vondra wrote: > On 23 Červenec 2011, 18:14, Yan Chunlu wrote: >> thanks for the help! >> are there any other possible reasons? >> >> both system are using Debian amd64, as uname -a shows: >> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 >> x86_64 GNU/Linux > > It is not just about the architecture, it means the PostgreSQL was > configured somehow differently during the build. E.g. a different block > size or WAL block size would make such problems. > > Or maybe one of the buils might be 32-bit for some reason (you can run > 32-bit system in a 64-bit environment). You can do this > > $ less postgres | grep Class > > to check this (ELF32 => 32bit, ELF64 => 64bit). > > Did you use the same binary packages or have you built the server yourself? Run a pg_config and compare the differences. > > Tomas > > -- Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development The PostgreSQL Conference - http://www.postgresqlconference.org/ @cmdpromptinc - @postgresconf - 509-416-6579
On Sat, Jul 23, 2011 at 11:55 AM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 23 Červenec 2011, 18:14, Yan Chunlu wrote: >> thanks for the help! >> are there any other possible reasons? >> >> both system are using Debian amd64, as uname -a shows: >> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 >> x86_64 GNU/Linux > > It is not just about the architecture, it means the PostgreSQL was > configured somehow differently during the build. E.g. a different block > size or WAL block size would make such problems. > > Or maybe one of the buils might be 32-bit for some reason (you can run > 32-bit system in a 64-bit environment). You can do this > > $ less postgres | grep Class > > to check this (ELF32 => 32bit, ELF64 => 64bit). > > Did you use the same binary packages or have you built the server yourself? Different date formats too.
I used apt-get to install postgresql, running pg_config showing they are exactly the same... running on master: https://gist.github.com/1102148 running on slave: https://gist.github.com/1102151 On Sun, Jul 24, 2011 at 2:44 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Sat, Jul 23, 2011 at 11:55 AM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 23 Červenec 2011, 18:14, Yan Chunlu wrote: >>> thanks for the help! >>> are there any other possible reasons? >>> >>> both system are using Debian amd64, as uname -a shows: >>> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 >>> x86_64 GNU/Linux >> >> It is not just about the architecture, it means the PostgreSQL was >> configured somehow differently during the build. E.g. a different block >> size or WAL block size would make such problems. >> >> Or maybe one of the buils might be 32-bit for some reason (you can run >> 32-bit system in a 64-bit environment). You can do this >> >> $ less postgres | grep Class >> >> to check this (ELF32 => 32bit, ELF64 => 64bit). >> >> Did you use the same binary packages or have you built the server yourself? > > Different date formats too. >
less postgres didn't showing anything... cause it's binary, I tried with -a less postgres |grep -a ELF less postgres | grep -a Class nothing related to (ELF32 => 32bit, ELF64 => 64bit). On Sun, Jul 24, 2011 at 1:55 AM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 23 Červenec 2011, 18:14, Yan Chunlu wrote: >> thanks for the help! >> are there any other possible reasons? >> >> both system are using Debian amd64, as uname -a shows: >> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 >> x86_64 GNU/Linux > > It is not just about the architecture, it means the PostgreSQL was > configured somehow differently during the build. E.g. a different block > size or WAL block size would make such problems. > > Or maybe one of the buils might be 32-bit for some reason (you can run > 32-bit system in a 64-bit environment). You can do this > > $ less postgres | grep Class > > to check this (ELF32 => 32bit, ELF64 => 64bit). > > Did you use the same binary packages or have you built the server yourself? > > Tomas > > --
On Saturday, July 23, 2011 7:43:56 pm Yan Chunlu wrote: > I used apt-get to install postgresql, running pg_config showing they > are exactly the same... > > > running on master: > https://gist.github.com/1102148 > > > running on slave: > https://gist.github.com/1102151 > Are you sure there is only one instance of Postgres running on each machine? -- Adrian Klaver adrian.klaver@gmail.com
the system date formats? looks the same: master: #date Sat Jul 23 21:53:34 CDT 2011 slave: #date Sat Jul 23 21:52:56 CDT 2011 On Sun, Jul 24, 2011 at 2:44 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Sat, Jul 23, 2011 at 11:55 AM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 23 Červenec 2011, 18:14, Yan Chunlu wrote: >>> thanks for the help! >>> are there any other possible reasons? >>> >>> both system are using Debian amd64, as uname -a shows: >>> Linux washington 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 >>> x86_64 GNU/Linux >> >> It is not just about the architecture, it means the PostgreSQL was >> configured somehow differently during the build. E.g. a different block >> size or WAL block size would make such problems. >> >> Or maybe one of the buils might be 32-bit for some reason (you can run >> 32-bit system in a 64-bit environment). You can do this >> >> $ less postgres | grep Class >> >> to check this (ELF32 => 32bit, ELF64 => 64bit). >> >> Did you use the same binary packages or have you built the server yourself? > > Different date formats too. > -- 闫春路
On Sat, Jul 23, 2011 at 8:53 PM, Yan Chunlu <springrider@gmail.com> wrote: > the system date formats? looks the same: hehe, no, the internal formats. There's a floating point and an integer method. They have to be the same and according to your output of pg_config they are, with this config flag listed for both: --enable-integer-datetimes btw, integer is preferred over floating point for date types, at least as far as I know.
On Jul 23, 2011, at 8:43 PM, Yan Chunlu wrote: > I used apt-get to install postgresql, running pg_config showing they > are exactly the same... BTW, forgot to mention this in my first message: I run streaming replication across the country with latency well over 100msand no problems. -- Scott Ribe scott_ribe@elevated-dev.com http://www.elevated-dev.com/ (303) 722-0567 voice
thanks for all the help! @Adrian: yes, only one instance on each machine not the slave finally started and could be connect, replication didn't begin, just following errors: https://gist.github.com/1102225 BTW: is that possible that rsync has finished but the data didn't flush to disk, so when postgresql started it was seeing corrupted files? On Sun, Jul 24, 2011 at 11:23 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote: > On Jul 23, 2011, at 8:43 PM, Yan Chunlu wrote: > >> I used apt-get to install postgresql, running pg_config showing they >> are exactly the same... > > BTW, forgot to mention this in my first message: I run streaming replication across the country with latency well over100ms and no problems. > > -- > Scott Ribe > scott_ribe@elevated-dev.com > http://www.elevated-dev.com/ > (303) 722-0567 voice > > > > >
is there anything special you have configured on master and slave? could I see the related configuration on your master and slave? such as wal_keep_segments,checkpoint_segments or any other might be related? thanks a lot! On Sun, Jul 24, 2011 at 11:23 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote: > On Jul 23, 2011, at 8:43 PM, Yan Chunlu wrote: > >> I used apt-get to install postgresql, running pg_config showing they >> are exactly the same... > > BTW, forgot to mention this in my first message: I run streaming replication across the country with latency well over100ms and no problems. > > -- > Scott Ribe > scott_ribe@elevated-dev.com > http://www.elevated-dev.com/ > (303) 722-0567 voice > > > > > -- 闫春路
Re: streaming replication does not work across datacenter with 20ms latency?
От
"Tomas Vondra"
Дата:
On 24 Červenec 2011, 6:09, Yan Chunlu wrote: > thanks for all the help! > > @Adrian: yes, only one instance on each machine > > not the slave finally started and could be connect, replication didn't > begin, just following errors: > https://gist.github.com/1102225 These errors just mean the master already removed WAL segments, so the slave can't actually start the replication because there'd be a gap. This usually happens with enough write activity (inserts, updates) when the slave is being setup. Whaht is your wal_keep_segments value? Increase it or set up WAL archiving, so that the slave can get the data. Tomas
checkpoint_segments = 64 wal_keep_segments = 128 On Sun, Jul 24, 2011 at 8:25 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 24 Červenec 2011, 6:09, Yan Chunlu wrote: >> thanks for all the help! >> >> @Adrian: yes, only one instance on each machine >> >> not the slave finally started and could be connect, replication didn't >> begin, just following errors: >> https://gist.github.com/1102225 > > These errors just mean the master already removed WAL segments, so the > slave can't actually start the replication because there'd be a gap. This > usually happens with enough write activity (inserts, updates) when the > slave is being setup. > > Whaht is your wal_keep_segments value? Increase it or set up WAL > archiving, so that the slave can get the data. > > Tomas > >
Hi
I am working on pgpool /postgresql/linux first time. All these three are new to me.
I am running pgpool-HA using pacemaker-corosync.
I am having following error in my setup. Does any body have any idea about the warning highlighted in logs ? Any type of help would be appreciated a lot...
Jul 24 18:40:41 squarepant attrd: [3772]: info: attrd_perform_update: Sent update 1010: probe_complete=true
Jul 24 18:40:41 squarepant cib: [3770]: WARN: cib_process_request: Operation complete: op cib_modify for section status (origin=local/attrd/1010, version=0.52.352): The object/attribute does not exist (rc=-22)
Jul 24 18:40:41 squarepant attrd: [3772]: WARN: attrd_cib_callback: Update 1010 for probe_complete=true failed: The object/attribute does not exist
Jul 24 18:40:41 squarepant pgpool[7635]: INFO: pgpoolRA: request stop, but not running.
Jul 24 18:40:41 squarepant crmd: [3896]: info: process_lrm_event: LRM operation pgpool_stop_0 (call=339, rc=0, cib-update=472, confirmed=true) ok
Jul 24 18:40:41 squarepant crmd: [3896]: info: match_graph_event: Action pgpool_stop_0 (2) confirmed on squarepant (rc=0)
Jul 24 18:40:41 squarepant crmd: [3896]: info: te_rsc_command: Initiating action 12: start pgpool_start_0 on squarepant (local)
Jul 24 18:40:41 squarepant crmd: [3896]: info: do_lrm_rsc_op: Performing key=12:112:0:6d68acf3-ab99-409f-b686-8533e4b24ca0 op=pgpool_start_0 )
Jul 24 18:40:41 squarepant lrmd: [3771]: info: rsc:pgpool:340: start
Jul 24 18:40:41 squarepant crmd: [3896]: info: te_pseudo_action: Pseudo action 5 fired and confirmed
ouput of crm_mon command is
============
Last updated: Sun Jul 24 18:15:59 2011
Stack: Heartbeat
Current DC: squarepant (85f4f3d6-650e-4620-8cbb-edb1bc9d389c) - partition with quorum
Version: 1.1.5-1.fc14-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
1 Nodes configured, unknown expected votes
3 Resources configured.
============
Online: [ squarepant ]
DBIP (ocf::heartbeat:IPaddr2): Started squarepant
postgresql (ocf::heartbeat:pgsql): Started squarepant
pgpool (ocf::heartbeat:pgpool): Started squarepant FAILED
Failed actions:
pgpool_monitor_30000 (node=squarepant, call=1874, rc=7, status=complete): not running
Thanks & Regards
Sanjay
I am working on pgpool /postgresql/linux first time. All these three are new to me.
I am running pgpool-HA using pacemaker-corosync.
I am having following error in my setup. Does any body have any idea about the warning highlighted in logs ? Any type of help would be appreciated a lot...
Jul 24 18:40:41 squarepant attrd: [3772]: info: attrd_perform_update: Sent update 1010: probe_complete=true
Jul 24 18:40:41 squarepant cib: [3770]: WARN: cib_process_request: Operation complete: op cib_modify for section status (origin=local/attrd/1010, version=0.52.352): The object/attribute does not exist (rc=-22)
Jul 24 18:40:41 squarepant attrd: [3772]: WARN: attrd_cib_callback: Update 1010 for probe_complete=true failed: The object/attribute does not exist
Jul 24 18:40:41 squarepant pgpool[7635]: INFO: pgpoolRA: request stop, but not running.
Jul 24 18:40:41 squarepant crmd: [3896]: info: process_lrm_event: LRM operation pgpool_stop_0 (call=339, rc=0, cib-update=472, confirmed=true) ok
Jul 24 18:40:41 squarepant crmd: [3896]: info: match_graph_event: Action pgpool_stop_0 (2) confirmed on squarepant (rc=0)
Jul 24 18:40:41 squarepant crmd: [3896]: info: te_rsc_command: Initiating action 12: start pgpool_start_0 on squarepant (local)
Jul 24 18:40:41 squarepant crmd: [3896]: info: do_lrm_rsc_op: Performing key=12:112:0:6d68acf3-ab99-409f-b686-8533e4b24ca0 op=pgpool_start_0 )
Jul 24 18:40:41 squarepant lrmd: [3771]: info: rsc:pgpool:340: start
Jul 24 18:40:41 squarepant crmd: [3896]: info: te_pseudo_action: Pseudo action 5 fired and confirmed
ouput of crm_mon command is
============
Last updated: Sun Jul 24 18:15:59 2011
Stack: Heartbeat
Current DC: squarepant (85f4f3d6-650e-4620-8cbb-edb1bc9d389c) - partition with quorum
Version: 1.1.5-1.fc14-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
1 Nodes configured, unknown expected votes
3 Resources configured.
============
Online: [ squarepant ]
DBIP (ocf::heartbeat:IPaddr2): Started squarepant
postgresql (ocf::heartbeat:pgsql): Started squarepant
pgpool (ocf::heartbeat:pgpool): Started squarepant FAILED
Failed actions:
pgpool_monitor_30000 (node=squarepant, call=1874, rc=7, status=complete): not running
Thanks & Regards
Sanjay
I did the SR procedure again, still no luck: is that normal that after start slave postgresql, the first line of log is: database system was interrupted; last known up at 2011-07-24 10:53:38 CDT?? 4760 2011-07-24 10:55:58 CDT 2011-07-24 10:55:58 CDT @ LOG: database system was interrupted; last known up at 2011-07-24 10:53:38 CDT 4760 2011-07-24 10:55:58 CDT 2011-07-24 10:55:58 CDT @ LOG: entering standby mode 4762 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT postgres@postgres [local]FATAL: the database system is starting up 4761 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT @ LOG: streaming replication successfully connected to primary 4764 2011-07-24 10:55:59 CDT 2011-07-24 10:55:59 CDT postgres@postgres 10.28.53.11(53442)FATAL: the database system is starting up 4770 2011-07-24 10:56:00 CDT 2011-07-24 10:56:00 CDT postgres@postgres [local]FATAL: the database system is starting up 4802 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT postgres@postgres [local]FATAL: the database system is starting up 4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG: redo starts at 57/6B002028 4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG: invalid record length at 57/6B20E010 4761 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ FATAL: terminating walreceiver process due to administrator command 4760 2011-07-24 10:56:01 CDT 2011-07-24 10:56:01 CDT @ LOG: invalid magic number 0000 in log file 87, segment 107, offset 2490368 4847 2011-07-24 10:56:02 CDT 2011-07-24 10:56:02 CDT postgres@postgres [local]FATAL: the database system is starting up 4850 2011-07-24 10:56:02 CDT 2011-07-24 10:56:02 CDT postgres@postgres 10.28.53.11(53443)FATAL: the database system is starting up 4851 2011-07-24 10:56:03 CDT 2011-07-24 10:56:03 CDT postgres@postgres [local]FATAL: the database system is starting up 4860 2011-07-24 10:56:04 CDT 2011-07-24 10:56:04 CDT postgres@postgres [local]FATAL: the database system is starting up 4865 2011-07-24 10:56:05 CDT 2011-07-24 10:56:05 CDT postgres@postgres [local]FATAL: the database system is starting up 4859 2011-07-24 10:56:05 CDT 2011-07-24 10:56:05 CDT @ LOG: streaming replication successfully connected to primary 4874 2011-07-24 10:56:06 CDT 2011-07-24 10:56:06 CDT postgres@postgres [local]FATAL: the database system is starting up 4869 2011-07-24 10:56:06 CDT 2011-07-24 10:56:06 CDT postgres@template1 10.28.53.11(53444)FATAL: the database system is starting up 4879 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT postgres@postgres [local]FATAL: the database system is starting up 4760 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ LOG: invalid record length at 57/6B2BA010 4859 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ FATAL: terminating walreceiver process due to administrator command 4760 2011-07-24 10:56:07 CDT 2011-07-24 10:56:07 CDT @ LOG: invalid magic number 0000 in log file 87, segment 107, offset 2883584 4887 2011-07-24 10:56:08 CDT 2011-07-24 10:56:08 CDT postgres@postgres [local]FATAL: the database system is starting up 4888 2011-07-24 10:56:08 CDT 2011-07-24 10:56:08 CDT @ LOG: streaming replication successfully connected to primary 4892 2011-07-24 10:56:09 CDT 2011-07-24 10:56:09 CDT postgres@postgres [local]FATAL: the database system is starting up 4896 2011-07-24 10:56:09 CDT 2011-07-24 10:56:09 CDT postgres@template1 10.28.53.11(53445)FATAL: the database system is starting up 4901 2011-07-24 10:56:10 CDT 2011-07-24 10:56:10 CDT postgres@postgres [local]FATAL: the database system is starting up 4906 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT postgres@postgres [local]FATAL: the database system is starting up 4760 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ LOG: invalid record length at 57/6B486010 4888 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ FATAL: terminating walreceiver process due to administrator command 4760 2011-07-24 10:56:11 CDT 2011-07-24 10:56:11 CDT @ LOG: invalid magic number 0000 in log file 87, segment 107, offset 4849664 On Sun, Jul 24, 2011 at 8:46 PM, Yan Chunlu <springrider@gmail.com> wrote: > checkpoint_segments = 64 > wal_keep_segments = 128 > > On Sun, Jul 24, 2011 at 8:25 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> On 24 Červenec 2011, 6:09, Yan Chunlu wrote: >>> thanks for all the help! >>> >>> @Adrian: yes, only one instance on each machine >>> >>> not the slave finally started and could be connect, replication didn't >>> begin, just following errors: >>> https://gist.github.com/1102225 >> >> These errors just mean the master already removed WAL segments, so the >> slave can't actually start the replication because there'd be a gap. This >> usually happens with enough write activity (inserts, updates) when the >> slave is being setup. >> >> Whaht is your wal_keep_segments value? Increase it or set up WAL >> archiving, so that the slave can get the data. >> >> Tomas >> >> >
Dne 24.7.2011 14:46, Yan Chunlu napsal(a): > checkpoint_segments = 64 > wal_keep_segments = 128 This information alone is not sufficient - we don't know how much write activity is on the primary system, so we can't say if those number are sufficient or not. You have to tune them according to write activity on the primary server. For example let's suppose the current WAL segment on the primary is "1" and that it's configured with wal_keep_segments = 5 (i.e. about 80MB of data). Before you prepare and start the slave machine, someone writes 100MB of data to the primary database (one big insert/update or a lot of small ones, doesn't matter). 100MB is about 6 WAL segments, so the current WAL segment on the primary is 7, and because of wal_keep_segments there are segments 3,4,5,6,7 available. But when the slave connects, it asks for segment no. 2 and it's not available. It's not possible to skip that segment so the replication fails to start. If the primary only received 60MB of data, it'd probably worked (there'd be enough segments kept on the primary). Those 128 segments is about 2GB of data. How much data is written on the primary between creating a filesystem copy and starting the slave? You don't neet to keep the files on the master, you can set up archiving and keep them somewhere else (on a different system etc.). Tomas
checkpoint_segments = 64 wal_keep_segments = 128 this setting seems is for 5GB capacity, I think there is noway I would ever write 5GB data during the rsync progress. I think the problem is still "invalid record length" and "invalid magic number", it start showing right after I complete sync data and start slave. If I stop slave later and restart, yes it could show xlog not found and can not catch master. but why the "invalid" things in the first place? On Mon, Jul 25, 2011 at 4:28 AM, Tomas Vondra <tv@fuzzy.cz> wrote: > Dne 24.7.2011 14:46, Yan Chunlu napsal(a): >> checkpoint_segments = 64 >> wal_keep_segments = 128 > > This information alone is not sufficient - we don't know how much write > activity is on the primary system, so we can't say if those number are > sufficient or not. You have to tune them according to write activity on > the primary server. > > For example let's suppose the current WAL segment on the primary is "1" > and that it's configured with wal_keep_segments = 5 (i.e. about 80MB of > data). > > Before you prepare and start the slave machine, someone writes 100MB of > data to the primary database (one big insert/update or a lot of small > ones, doesn't matter). 100MB is about 6 WAL segments, so the current WAL > segment on the primary is 7, and because of wal_keep_segments there are > segments 3,4,5,6,7 available. > > But when the slave connects, it asks for segment no. 2 and it's not > available. It's not possible to skip that segment so the replication > fails to start. > > If the primary only received 60MB of data, it'd probably worked (there'd > be enough segments kept on the primary). > > Those 128 segments is about 2GB of data. How much data is written on the > primary between creating a filesystem copy and starting the slave? > > You don't neet to keep the files on the master, you can set up archiving > and keep them somewhere else (on a different system etc.). > > Tomas > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general >
On Mon, Jul 25, 2011 at 10:56 AM, Yan Chunlu <springrider@gmail.com> wrote: > I think the problem is still "invalid record length" and "invalid > magic number", it start showing right after I complete sync data and > start slave. If I stop slave later and restart, yes it could show > xlog not found and can not catch master. but why the "invalid" things > in the first place? You might have the same problem which was reported before. http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php That problem was fixed, and the fix will be included in next minor update (i.e., 9.0.5). http://archives.postgresql.org/pgsql-committers/2011-06/msg00101.php Of course, you can avoid the problem by building PostgreSQL with gcc != 4.6.0, I think. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Jul 25, 2011 at 8:38 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Jul 25, 2011 at 10:56 AM, Yan Chunlu <springrider@gmail.com> wrote: >> I think the problem is still "invalid record length" and "invalid >> magic number", it start showing right after I complete sync data and >> start slave. If I stop slave later and restart, yes it could show >> xlog not found and can not catch master. but why the "invalid" things >> in the first place? > > You might have the same problem which was reported before. > http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php That's what it looks like to me also. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
I am using debian ant apt-get to install postgresql, dpkg list shows they are the same? is there anyway to tell what's version it is compiled from? thanks! Master# dpkg -l |grep post ii postgresql-9.0 9.0.4-1+b1 object-relational SQL database, version 9.0 server ii postgresql-client-9.0 9.0.4-1+b1 front-end programs for PostgreSQL 9.0 ii postgresql-client-common 118 manager for multiple PostgreSQL client versions ii postgresql-common 118 PostgreSQL database-cluster manager ii postgresql-contrib-9.0 9.0.4-1+b1 additional facilities for PostgreSQL ii postgresql-server-dev-9.0 9.0.4-1+b1 development files for PostgreSQL 9.0 server-side programming Slave $ dpkg -l |grep post ii postgresql-9.0 9.0.4-1+b1 object-relational SQL database, version 9.0 server ii postgresql-client-9.0 9.0.4-1+b1 front-end programs for PostgreSQL 9.0 ii postgresql-client-common 118 manager for multiple PostgreSQL client versions ii postgresql-common 118 PostgreSQL database-cluster manager ii postgresql-contrib-9.0 9.0.4-1+b1 additional facilities for PostgreSQL On Mon, Jul 25, 2011 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Jul 25, 2011 at 10:56 AM, Yan Chunlu <springrider@gmail.com> wrote: >> I think the problem is still "invalid record length" and "invalid >> magic number", it start showing right after I complete sync data and >> start slave. If I stop slave later and restart, yes it could show >> xlog not found and can not catch master. but why the "invalid" things >> in the first place? > > You might have the same problem which was reported before. > http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php > > That problem was fixed, and the fix will be included in next minor > update (i.e., 9.0.5). > http://archives.postgresql.org/pgsql-committers/2011-06/msg00101.php > > Of course, you can avoid the problem by building PostgreSQL with > gcc != 4.6.0, I think. > > Regards, > > -- > Fujii Masao > NIPPON TELEGRAPH AND TELEPHONE CORPORATION > NTT Open Source Software Center >
sorry for the typo, "debian and apt-get" On Mon, Jul 25, 2011 at 5:39 PM, Yan Chunlu <springrider@gmail.com> wrote: > I am using debian and apt-get to install postgresql, dpkg list shows > they are the same? is there anyway to tell what's version it is > compiled from? thanks! > > Master# dpkg -l |grep post > ii postgresql-9.0 9.0.4-1+b1 > object-relational SQL database, version 9.0 server > ii postgresql-client-9.0 9.0.4-1+b1 > front-end programs for PostgreSQL 9.0 > ii postgresql-client-common 118 > manager for multiple PostgreSQL client versions > ii postgresql-common 118 > PostgreSQL database-cluster manager > ii postgresql-contrib-9.0 9.0.4-1+b1 > additional facilities for PostgreSQL > ii postgresql-server-dev-9.0 9.0.4-1+b1 > development files for PostgreSQL 9.0 server-side programming > > > Slave $ dpkg -l |grep post > ii postgresql-9.0 9.0.4-1+b1 > object-relational SQL database, version 9.0 server > ii postgresql-client-9.0 9.0.4-1+b1 > front-end programs for PostgreSQL 9.0 > ii postgresql-client-common 118 > manager for multiple PostgreSQL client versions > ii postgresql-common 118 > PostgreSQL database-cluster manager > ii postgresql-contrib-9.0 9.0.4-1+b1 > additional facilities for PostgreSQL > > > > On Mon, Jul 25, 2011 at 3:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Mon, Jul 25, 2011 at 10:56 AM, Yan Chunlu <springrider@gmail.com> wrote: >>> I think the problem is still "invalid record length" and "invalid >>> magic number", it start showing right after I complete sync data and >>> start slave. If I stop slave later and restart, yes it could show >>> xlog not found and can not catch master. but why the "invalid" things >>> in the first place? >> >> You might have the same problem which was reported before. >> http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php >> >> That problem was fixed, and the fix will be included in next minor >> update (i.e., 9.0.5). >> http://archives.postgresql.org/pgsql-committers/2011-06/msg00101.php >> >> Of course, you can avoid the problem by building PostgreSQL with >> gcc != 4.6.0, I think. >> >> Regards, >> >> -- >> Fujii Masao >> NIPPON TELEGRAPH AND TELEPHONE CORPORATION >> NTT Open Source Software Center >> >
Re: streaming replication does not work across datacenter with 20ms latency?
От
"Tomas Vondra"
Дата:
On 25 Červenec 2011, 11:39, Yan Chunlu wrote: > I am using debian ant apt-get to install postgresql, dpkg list shows > they are the same? is there anyway to tell what's version it is > compiled from? thanks! AFAIK there's no way to find out which compiler was used to build PostgreSQL binaries (IIRC this is not in pg_config). So you can do this gcc --version That should print version of the default compiler, and it's very likely it was used to build the PostgreSQL binaries. If it's 4.6.0, you've been hit by the problem Fujii Masao described. Use a different gcc version and it should work fine. Tomas
gcc compiler on my machine is 4.6.1, but I didn't compile it myself. just installed the binary from apt-get. I will try to compile it by myself to see what will happen On Mon, Jul 25, 2011 at 8:18 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 25 Červenec 2011, 11:39, Yan Chunlu wrote: >> I am using debian ant apt-get to install postgresql, dpkg list shows >> they are the same? is there anyway to tell what's version it is >> compiled from? thanks! > > AFAIK there's no way to find out which compiler was used to build > PostgreSQL binaries (IIRC this is not in pg_config). So you can do this > > gcc --version > > That should print version of the default compiler, and it's very likely it > was used to build the PostgreSQL binaries. If it's 4.6.0, you've been hit > by the problem Fujii Masao described. Use a different gcc version and it > should work fine. > > Tomas > >
"Tomas Vondra" <tv@fuzzy.cz> writes: > On 25 Červenec 2011, 11:39, Yan Chunlu wrote: >> I am using debian ant apt-get to install postgresql, dpkg list shows >> they are the same? is there anyway to tell what's version it is >> compiled from? thanks! > AFAIK there's no way to find out which compiler was used to build > PostgreSQL binaries (IIRC this is not in pg_config). No, but "SELECT version();" should tell the tale. regards, tom lane
seems the Master server is compiled using 4.6.0: version ---------------------------------------------------------------------------------------------------------------------------- PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.6.real (Debian 4.6.0-6) 4.6.1 20110428 (prerelease), 64-bit and there is no way to know what slave is using since I have remove it. I am using a compiled version(gcc 4.3.2) of postgresql as slave, seems working now. I think the problem maybe is like Fujii said, does that bug only effect hot-stanby server? seems master is okay. On Mon, Jul 25, 2011 at 10:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Tomas Vondra" <tv@fuzzy.cz> writes: >> On 25 Červenec 2011, 11:39, Yan Chunlu wrote: >>> I am using debian ant apt-get to install postgresql, dpkg list shows >>> they are the same? is there anyway to tell what's version it is >>> compiled from? thanks! > >> AFAIK there's no way to find out which compiler was used to build >> PostgreSQL binaries (IIRC this is not in pg_config). > > No, but "SELECT version();" should tell the tale. > > regards, tom lane >
Yan Chunlu <springrider@gmail.com> writes: > seems the Master server is compiled using 4.6.0: > PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.6.real > (Debian 4.6.0-6) 4.6.1 20110428 (prerelease), 64-bit Hmm. Given the datestamp, that version of gcc almost certainly does have the bug. I wonder whether Martin Pitt knows about this issue and the workaround we put in --- I'd have thought he'd push updated .debs with a workaround, as I did for Fedora ... Martin: see http://archives.postgresql.org/pgsql-hackers/2011-06/msg00890.php http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c2ba0121c73b7461331104a46d140156e847572a > and there is no way to know what slave is using since I have remove it. If it was installed from the same .deb then it'd be the same build. > I think the problem maybe is like Fujii said, does that bug only > effect hot-stanby server? seems master is okay. Well, actually, the bug affects WAL replay of any sort, which means if your master were to crash and restart you'd be at risk of data corruption on the master. I'd replace the master build too, ASAP. regards, tom lane
Yan Chunlu <springrider@gmail.com> writes: > how about 4.3.2? Yes, pre-4.6 gcc should be fine. regards, tom lane
how about 4.3.2? I have gcc 4.3.2 compiled postgresql 9.0.4 as slave, is that okay that I turn the slave into master? so the switch will be a lot more smooth. On Tue, Jul 26, 2011 at 12:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yan Chunlu <springrider@gmail.com> writes: >> oh god...thanks a lot for the tip. I did actually lost some data, the >> master server has crashed two times. every time it comes back, the >> index were broken. I need to reindex it. I have already set fsync=on. >> just thought it was normal behavior.... > > Uh, no. > >> about gcc version, only 4.6.0 effected? 4.6.1 is okay? then I could >> compile my own version with 4.6.1 > > It's supposed to be fixed in 4.6.1 according to > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390 > > Personally I'd use 4.6.1 *and* apply the Postgres patch. > > regards, tom lane >
Yan Chunlu <springrider@gmail.com> writes: > oh god...thanks a lot for the tip. I did actually lost some data, the > master server has crashed two times. every time it comes back, the > index were broken. I need to reindex it. I have already set fsync=on. > just thought it was normal behavior.... Uh, no. > about gcc version, only 4.6.0 effected? 4.6.1 is okay? then I could > compile my own version with 4.6.1 It's supposed to be fixed in 4.6.1 according to http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390 Personally I'd use 4.6.1 *and* apply the Postgres patch. regards, tom lane
oh god...thanks a lot for the tip. I did actually lost some data, the master server has crashed two times. every time it comes back, the index were broken. I need to reindex it. I have already set fsync=on. just thought it was normal behavior.... about gcc version, only 4.6.0 effected? 4.6.1 is okay? then I could compile my own version with 4.6.1 but how about the data? from the bug information, the data file seems not compatible.... I need to do pg_dump and pg_restore? what a nightmare..... On Mon, Jul 25, 2011 at 10:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yan Chunlu <springrider@gmail.com> writes: >> seems the Master server is compiled using 4.6.0: > >> PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.6.real >> (Debian 4.6.0-6) 4.6.1 20110428 (prerelease), 64-bit > > Hmm. Given the datestamp, that version of gcc almost certainly does > have the bug. I wonder whether Martin Pitt knows about this issue > and the workaround we put in --- I'd have thought he'd push updated > .debs with a workaround, as I did for Fedora ... > > Martin: see > http://archives.postgresql.org/pgsql-hackers/2011-06/msg00890.php > http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c2ba0121c73b7461331104a46d140156e847572a > >> and there is no way to know what slave is using since I have remove it. > > If it was installed from the same .deb then it'd be the same build. > >> I think the problem maybe is like Fujii said, does that bug only >> effect hot-stanby server? seems master is okay. > > Well, actually, the bug affects WAL replay of any sort, which means > if your master were to crash and restart you'd be at risk of data > corruption on the master. I'd replace the master build too, ASAP. > > regards, tom lane >
On Mon, Jul 25, 2011 at 7:18 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 25 Červenec 2011, 11:39, Yan Chunlu wrote:AFAIK there's no way to find out which compiler was used to build
> I am using debian ant apt-get to install postgresql, dpkg list shows
> they are the same? is there anyway to tell what's version it is
> compiled from? thanks!
PostgreSQL binaries
You can do a strings on a binary file (eg, postmaster) and search for GCC in the output.
--
Mike Nolan