Discussion: Streaming Replication Randomly Locking Up
I'm having an issue where streaming replication just randomly stops working. I haven't been able to find anything in the logs that points to an issue, but the Postgres process shows a "waiting" status on the slave:
postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting
postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process
postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process
postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730
I've never seen this happen. Looks like you might be using 9.1? Are
you up to date on all the 9.1.x releases?
Do you have just 1 slave syncing from the master?
Which OS are you using?
Did you verify that there aren't any network problems between the
slave & master?
Or hardware problems (like the NIC dying, or dropping packets)?
On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hello,
>
> I'm having an issue where streaming replication just randomly stops working.
> I haven't been able to find anything in the logs which point to an issue,
> but the Postgres process shows a "waiting" status on the slave:
>
> postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres:
> startup process recovering 000000010000053D0000003F waiting
> postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres:
> writer process
> postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres:
> stats collector process
> postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres:
> wal receiver process streaming 549/216B3730
>
> The replication works great for days, but randomly seems to lock up and
> replication halts. I verified that the two databases were out of sync with
> a query on both of them. Has anyone experienced this issue before?
>
> Here are some relevant config settings:
>
> Master:
>
> wal_level = hot_standby
> checkpoint_segments = 32
> checkpoint_completion_target = 0.9
> archive_mode = on
> archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f
> </dev/null'
> max_wal_senders = 2
> wal_keep_segments = 32
>
> Slave:
>
> wal_level = hot_standby
> checkpoint_segments = 32
> #checkpoint_completion_target = 0.5
> hot_standby = on
> max_standby_archive_delay = -1
> max_standby_streaming_delay = -1
> #wal_receiver_status_interval = 10s
> #hot_standby_feedback = off
>
> Thank you for any help you can provide!
>
> Andrew
>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman netllama@gmail.com
LlamaLand https://netllama.linux-sxs.org
Are you certain that there are no relevant errors in the database logs
(on both master & slave)? Also, are you sure that you didn't
misconfigure logging such that errors wouldn't appear?
On Thu, Aug 15, 2013 at 11:45 AM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hi Lonni,
>
> Yes, I am using PG 9.1.9.
> Yes, 1 slave syncing from the master
> CentOS 6.4
> I don't see any network or hardware issues (e.g. NIC) but will look more
> into this. They are communicating on a private network and switch.
>
> I forgot to mention that after I restart the slave, everything syncs right
> back up and all is working again, so if it is a network issue, the
> replication is just stopping after some hiccup instead of retrying and
> resuming when things are back up.
>
> Thanks!
>
>
>
> On Thu, Aug 15, 2013 at 11:32 AM, Lonni J Friedman <netllama@gmail.com>
> wrote:
>>
>> I've never seen this happen. Looks like you might be using 9.1? Are
>> you up to date on all the 9.1.x releases?
>>
>> Do you have just 1 slave syncing from the master?
>> Which OS are you using?
>> Did you verify that there aren't any network problems between the
>> slave & master?
>> Or hardware problems (like the NIC dying, or dropping packets)?
>>
>>
>> On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@gmail.com> wrote:
>> > Hello,
>> >
>> > I'm having an issue where streaming replication just randomly stops
>> > working.
>> > I haven't been able to find anything in the logs which point to an
>> > issue,
>> > but the Postgres process shows a "waiting" status on the slave:
>> >
>> > postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54
>> > postgres:
>> > startup process recovering 000000010000053D0000003F waiting
>> > postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30
>> > postgres:
>> > writer process
>> > postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03
>> > postgres:
>> > stats collector process
>> > postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31
>> > postgres:
>> > wal receiver process streaming 549/216B3730
>> >
>> > The replication works great for days, but randomly seems to lock up and
>> > replication halts. I verified that the two databases were out of sync
>> > with
>> > a query on both of them. Has anyone experienced this issue before?
>> >
>> > Here are some relevant config settings:
>> >
>> > Master:
>> >
>> > wal_level = hot_standby
>> > checkpoint_segments = 32
>> > checkpoint_completion_target = 0.9
>> > archive_mode = on
>> > archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f
>> > </dev/null'
>> > max_wal_senders = 2
>> > wal_keep_segments = 32
>> >
>> > Slave:
>> >
>> > wal_level = hot_standby
>> > checkpoint_segments = 32
>> > #checkpoint_completion_target = 0.5
>> > hot_standby = on
>> > max_standby_archive_delay = -1
>> > max_standby_streaming_delay = -1
>> > #wal_receiver_status_interval = 10s
>> > #hot_standby_feedback = off
>> >
>> > Thank you for any help you can provide!
>> >
>> > Andrew
>> >
I'd suggest enhancing your logging to include time/datestamps for
every entry, and also the client hostname. That will help to rule
in/out those 'unexpected EOF' errors.
On Thu, Aug 15, 2013 at 12:22 PM, Andrew Berman <rexxe98@gmail.com> wrote:
> The only thing I see that is a possibility for the issue is in the slave
> log:
>
> LOG: unexpected EOF on client connection
> LOG: could not receive data from client: Connection reset by peer
>
> I don't know if that's related or not as it could just be somebody running a
> query. The log file does seem to be riddled with these but the replication
> failures don't happen constantly.
>
> As far as I know I'm not swallowing any errors. The logging is all set as
> the default:
>
> log_destination = 'stderr'
> logging_collector = on
> #client_min_messages = notice
> #log_min_messages = warning
> #log_min_error_statement = error
> #log_min_duration_statement = -1
> #log_checkpoints = off
> #log_connections = off
> #log_disconnections = off
> #log_error_verbosity = default
>
> I'm going to have a look at the NICs to make sure there's no issue there.
>
> Thanks again for your help!
>
>
> On Thu, Aug 15, 2013 at 11:51 AM, Lonni J Friedman <netllama@gmail.com>
> wrote:
>>
>> Are you certain that there are no relevant errors in the database logs
>> (on both master & slave)? Also, are you sure that you didn't
>> misconfigure logging such that errors wouldn't appear?
>>
>> On Thu, Aug 15, 2013 at 11:45 AM, Andrew Berman <rexxe98@gmail.com> wrote:
>> > Hi Lonni,
>> >
>> > Yes, I am using PG 9.1.9.
>> > Yes, 1 slave syncing from the master
>> > CentOS 6.4
>> > I don't see any network or hardware issues (e.g. NIC) but will look more
>> > into this. They are communicating on a private network and switch.
>> >
>> > I forgot to mention that after I restart the slave, everything syncs
>> > right
>> > back up and all is working again, so if it is a network issue, the
>> > replication is just stopping after some hiccup instead of retrying and
>> > resuming when things are back up.
>> >
>> > Thanks!
>> >
>> >
>> >
>> > On Thu, Aug 15, 2013 at 11:32 AM, Lonni J Friedman <netllama@gmail.com>
>> > wrote:
>> >>
>> >> I've never seen this happen. Looks like you might be using 9.1? Are
>> >> you up to date on all the 9.1.x releases?
>> >>
>> >> Do you have just 1 slave syncing from the master?
>> >> Which OS are you using?
>> >> Did you verify that there aren't any network problems between the
>> >> slave & master?
>> >> Or hardware problems (like the NIC dying, or dropping packets)?
>> >>
>> >>
>> >> On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@gmail.com>
>> >> wrote:
>> >> > Hello,
>> >> >
>> >> > I'm having an issue where streaming replication just randomly stops
>> >> > working.
>> >> > I haven't been able to find anything in the logs which point to an
>> >> > issue,
>> >> > but the Postgres process shows a "waiting" status on the slave:
>> >> >
>> >> > postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54
>> >> > postgres:
>> >> > startup process recovering 000000010000053D0000003F waiting
>> >> > postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30
>> >> > postgres:
>> >> > writer process
>> >> > postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03
>> >> > postgres:
>> >> > stats collector process
>> >> > postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31
>> >> > postgres:
>> >> > wal receiver process streaming 549/216B3730
>> >> >
>> >> > The replication works great for days, but randomly seems to lock up
>> >> > and
>> >> > replication halts. I verified that the two databases were out of
>> >> > sync
>> >> > with
>> >> > a query on both of them. Has anyone experienced this issue before?
>> >> >
>> >> > Here are some relevant config settings:
>> >> >
>> >> > Master:
>> >> >
>> >> > wal_level = hot_standby
>> >> > checkpoint_segments = 32
>> >> > checkpoint_completion_target = 0.9
>> >> > archive_mode = on
>> >> > archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f
>> >> > </dev/null'
>> >> > max_wal_senders = 2
>> >> > wal_keep_segments = 32
>> >> >
>> >> > Slave:
>> >> >
>> >> > wal_level = hot_standby
>> >> > checkpoint_segments = 32
>> >> > #checkpoint_completion_target = 0.5
>> >> > hot_standby = on
>> >> > max_standby_archive_delay = -1
>> >> > max_standby_streaming_delay = -1
>> >> > #wal_receiver_status_interval = 10s
>> >> > #hot_standby_feedback = off
>> >> >
>> >> > Thank you for any help you can provide!
>> >> >
>> >> > Andrew
>> >> >
>
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman netllama@gmail.com
LlamaLand https://netllama.linux-sxs.org
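The logging suggestion above (timestamps on every entry plus the client host) could look roughly like the following in postgresql.conf. This is a minimal sketch: the exact prefix format and the choice to enable connection logging are assumptions for illustration, not settings anyone in the thread posted.

```conf
# postgresql.conf: illustrative logging settings, not taken from the thread
# %t = timestamp, %p = process ID, %r = remote host and port
log_line_prefix = '%t [%p] %r '
# Record when clients connect and disconnect, so that
# "unexpected EOF on client connection" lines can be matched to a host and time
log_connections = on
log_disconnections = on
```

With a prefix like this, each "unexpected EOF" entry carries a time and a client address, which makes it easier to check whether those entries line up with the moments replication stalls.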
On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hello,
>
> I'm having an issue where streaming replication just randomly stops working.
> I haven't been able to find anything in the logs which point to an issue,
> but the Postgres process shows a "waiting" status on the slave:
>
> postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres:
> startup process recovering 000000010000053D0000003F waiting
There is a recovery conflict which it is waiting to go away. In other
words, you have a long-running (or long-idle) transaction on the slave
which is blocking recovery.
> max_standby_archive_delay = -1
> max_standby_streaming_delay = -1
...and you are willing to wait forever.
Cheers,
Jeff
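Jeff's point about the two -1 settings can be made concrete with a small standby-side sketch. The values below are assumptions for illustration (30s is the documented default for both parameters), not something recommended in the thread. With a finite delay, the standby cancels conflicting queries or sessions once the delay expires instead of pausing WAL replay indefinitely.

```conf
# postgresql.conf on the standby: illustrative values, not from the thread
# -1 means "wait forever" for conflicting standby queries; a finite value
# bounds how long WAL replay may be held up before the conflict is resolved
max_standby_archive_delay = 30s
max_standby_streaming_delay = 30s
# Possible alternative trade-off (assumption, untested here): let the standby
# report its oldest snapshot to the master, which can reduce some conflicts
# at the cost of extra bloat on the master
#hot_standby_feedback = on
```

The trade-off is between replication lag and query cancellation: a short delay keeps the standby current but may abort long reports, while -1 protects queries and lets a single idle transaction stall replay.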
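Hi Jeff,
Here is the full process list at the time it stopped working (I have changed the actual username, db and IP for security). Would the idle in transaction process be the culprit?
postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting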
postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process
postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process
postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730
postgres 10403 0.0 0.2 3430372 25920 ? Ss Aug14 0:31 postgres: user db x.x.x.x(61656) idle in transaction
postgres 19933 0.0 0.4 3426604 49564 ? S Aug05 0:06 /usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/9.1/data
postgres 19935 0.0 0.0 175288 396 ? Ss Aug05 0:13 postgres: logger process
postgres 21133 0.0 0.2 3443600 30680 ? Ss 09:28 0:00 postgres: user db x.x.x.x(64430) idle
postgres 21134 0.4 0.2 3430160 27656 ? Ss 09:28 0:16 postgres: user db x.x.x.x(64431) idle
root 21529 0.0 0.0 103240 844 pts/0 S+ 10:33 0:00 grep --color postgres
Thanks,
Andrew
On Aug 15, 2013, at 1:07 PM, Andrew Berman <rexxe98@gmail.com> wrote:
> I'm having an issue where streaming replication just randomly stops working. I haven't been able to find anything in the logs which point to an issue, but the Postgres process shows a "waiting" status on the slave:
>
> postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting
> postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process
> postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process
> postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730
>
> The replication works great for days, but randomly seems to lock up and replication halts. I verified that the two databases were out of sync with a query on both of them. Has anyone experienced this issue before?
>
> Here are some relevant config settings:
>
> Master:
>
> wal_level = hot_standby
> checkpoint_segments = 32
> checkpoint_completion_target = 0.9
> archive_mode = on
> archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
> max_wal_senders = 2
> wal_keep_segments = 32
I recently posted about the same thing -- replication just stops after working OK for days or weeks, no errors in the logs on master or slave.
It appears I solved it by adding --timeout=30 to my rsync command. My guess was some kind of network hang and then rsync would just wait forever and never return.
John DeSoi, Ph.D.
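As a sketch of what John describes, here is the --timeout=30 flag applied to the archive_command quoted earlier in the thread. John did not post his own command, so using Andrew's line as the base is an assumption made purely for illustration.

```conf
# postgresql.conf on the master: Andrew's quoted command with John's flag added
# --timeout=30 makes rsync exit with an error after 30 seconds without I/O
# instead of hanging on a stalled connection
archive_command = 'rsync -a --timeout=30 %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
```

If rsync exits with a non-zero status, PostgreSQL treats the archive attempt as failed and retries the same segment later, so a timeout turns a silent hang into a visible, retried failure.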
On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hi Jeff,
>
> Here is the full process list at the time it stopped working (I have changed
> the actual username, db and IP for security). Would the idle in transaction
> process be the culprit?
Most likely, yes. You should be able to dig into pg_locks to verify.
Cheers,
Jeff
On Fri, Aug 16, 2013 at 9:45 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:
>> Hi Jeff,
>>
>> Here is the full process list at the time it stopped working (I have changed
>> the actual username, db and IP for security). Would the idle in transaction
>> process be the culprit?
>
> Most likely, yes. You should be able to dig into pg_locks to verify.
Actually, you can't. The waiting doesn't show up in pg_locks, because
it polls in a sleep-loop, rather than doing a normal wait on the lock.
Still, that idle in transaction process is almost surely the culprit.
Cheers,
Jeff