streaming replication timeout error

Поиск

Список

Период

Сортировка

От	高健
Тема	streaming replication timeout error
Дата	9 октября 2013 г. 02:58:45
Msg-id	CAL454F067o+D67T0MjvQVS2quQdkLF6xRVBKoPynVB9wWtgbfg@mail.gmail.com обсуждение исходный текст
Ответы	Re: streaming replication timeout error Re: streaming replication timeout error
Список	pgsql-general

Дерево обсуждения

Hello:

My customer encountered some connection timeout, while using one primary-one standby streaming replication.

The original log is japanese, because there are no error-code like oracle's ora-xxx,

I tried to translate the japanese information into English, But that might be not correct English for PG.

The most important part is:

2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not receive data from WAL stream: could not receive data from server: connection timeout

scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory

I was asked about:

In what occasion will the above fatal error occur?

I looked into the postgresql.conf file for the primary and standby server.

And made some experiences.

I found:

Senario I:

If the wal file wanted is removed manually：

Both in primary and standby, log will be like this:

FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000011 has already been removed

Senario II:

If replication_timeout=60, wal_receiver_status_interval=0,

log will be like this:

On primary side：

LOG: terminating walsender process due to replication timeout

On standby side：

FATAL: could not receive data from WAL stream:

LOG: streaming replication successfully connected to primary

But they are not the same to my customer's case.

And I also noticed that my customer's recovery.conf:

In it it has primary_conninfo as this:

host=DB1 port=5432 application_name=testpg user=postgres connect_timeout=10 keepalives_idle=5 keepalives_interval=1

That is to say:

keepalives_idle=5

keepalives_interval=1

keepalives_count=3

With that setting, I think that For every 5 seconds,

In order to see whether the connection is alive, 3 keepalives package will be sent,

And if there is no repsonse in 1 seconds, then he connection will be invalid.

If my unserstanding is right,

when master is busy at network,

the three keepalives_ parameter will also cause the connection time out error.

But I haven't found a good explanation fitting the logs' FATAL error.

Can anybody give me some info?

2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95024][0][00000]LOG: Recovery replay at point 1F1/45EF8E20

2013-09-22 09:40:42 JST[27987][412d33c5.2d23-95025][0][00000]DETAIL: Last transaction log time 2013-09-22 09:40:42.673062+09

2013-09-22 09:44:00 JST[27987][412d33c5.2d23-95026][0][00000]LOG: restartpoint starting: time

2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95027][0][00000]LOG: restartpoint complete: wrote 1051 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=104.773 s, sync=0.003 s, total=104.781 s; sync files=90, longest=0.001 s, average=0.000 s

2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95028][0][00000]LOG: Recovery replay at point 1F1/45EF8E20

2013-09-22 09:45:45 JST[27987][412d33c5.2d23-95029][0][00000]DETAIL: Last transaction log time 2013-09-22 09:45:44.77683+09

2013-09-22 09:49:00 JST[27987][412d33c5.2d23-95030][0][00000]LOG: restartpoint starting: time

2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95031][0][00000]LOG: restartpoint complete: wrote 963 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=96.096 s, sync=0.006 s, total=96.106 s; sync files=98, longest=0.002 s, average=0.000 s

2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95032][0][00000]LOG: Recovery replay at point 1F1/46CE8EA0

2013-09-22 09:50:36 JST[27987][412d33c5.2d23-95033][0][00000]DETAIL: Last transaction log time 2013-09-22 09:50:36.564966+09

2013-09-22 09:52:47 JST[28297][51d1fbcb.6e89-2][0][XX000]FATAL: Could not receive data from WAL stream: could not receive data from server: connection timeout

scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory

2013-09-22 09:52:50 JST[27806][51d1fbc5.6c9e-12][0][00000]LOG: Log file 497、Segment 74、Offset 12263424 Page address 1F0/8DBB2000 is not expected

scp: /opt/PostgresPlus/9.2AS/data/arch/000000AC000001F10000004A: No such file or directory

scp: /opt/PostgresPlus/9.2AS/data/arch/000000AD.history: No such file or directory

2013-09-22 09:52:51 JST[2878][523e3f63.b3e-1][0][00000]LOG: Streaming replication connected to primary server successfully

2013-09-22 09:54:00 JST[27987][412d33c5.2d23-95034][0][00000]LOG: restartpoint starting: time

2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95035][0][00000]LOG: restartpoint complete: wrote 9929 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 10 recycled; write=149.436 s, sync=0.032 s, total=149.472 s; sync files=90, longest=0.012 s, average=0.000 s

2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95036][0][00000]LOG: Recovery replay at point 1F1/48E9D238

2013-09-22 09:56:30 JST[27987][412d33c5.2d23-95037][0][00000]DETAIL: Last transaction log time 2013-09-22 09:56:29.015514+09

Thanks in advance

jian gao

В списке pgsql-general по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

streaming replication timeout error