Re: Re: [pgadmin-support] Issue with a hanging apply process on the replica db after vacuum works on primary

From: Sergey Shchukin
Subject: Re: Re: [pgadmin-support] Issue with a hanging apply process on the replica db after vacuum works on primary
Msg-id: 55080063.7020104@gmail.com
In response to: Re: Re: [pgadmin-support] Issue with a hanging apply process on the replica db after vacuum works on primary  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Responses: Re: Re: [pgadmin-support] Issue with a hanging apply process on the replica db after vacuum works on primary  (Sergey Shchukin <shchukin.s.a@gmail.com>)
List: pgsql-general
On 05.03.2015 11:25, Jim Nasby wrote:
On 2/27/15 5:11 AM, Sergey Shchukin wrote:

show max_standby_streaming_delay;
  max_standby_streaming_delay
-----------------------------
  30s

We both need to be more clear about which server we're talking about (master or replica).

What are max_standby_streaming_delay and max_standby_archive_delay set to *on the replica*?

My hope is that one or both of those is set to somewhere around 8 minutes on the replica. That would explain everything.

If that's not the case then I suspect what's happening is there's something running on the replica that isn't checking for interrupts frequently enough. That would also explain it.

When replication hangs, is the replication process using a lot of CPU? Or is it just sitting there? What does the process status for the replay process show?

Can you get a trace of the replay process on the replica when this is happening to see where it's spending all its time?

How are you generating these log lines?
 Tue Feb 24 15:05:07 MSK 2015 Stream: MASTER-masterdb:79607161592048 SLAVE:79607161550576 Replay:79607160986064 :: REPLAY 592 KBytes (00:00:00.398376 seconds)

Do you see the confl_* fields in pg_stat_database_conflicts on the *replica* increasing?

Hi Jim,

max_standby_streaming_delay and max_standby_archive_delay are both set to 30s on the master and the replica DBs.
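
(For completeness, both settings can be confirmed on the replica in one go; this is only a sketch, reusing the connection variables from the lag script further down:)

# Sketch: confirm both standby delay settings on the replica.
# $psql, $s_host, $s_port and $p_db are the same variables as in the
# lag script below; adjust to your environment.
$psql -U monitor -h$s_host -p$s_port -A -t -c \
  "SELECT name, setting, unit FROM pg_settings
    WHERE name IN ('max_standby_streaming_delay', 'max_standby_archive_delay')" $p_db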

I don't see any specific or heavy workload during this issue with the hanging apply process, just normal queries as usual.

But I do see increased disk activity while the apply issue is ongoing:

DSK |          sdc  |              | busy     61%  | read   11511 |               | write   4534 | KiB/r     46  |              |  KiB/w      4 | MBr/s  52.78 |               | MBw/s   1.88 |  avq     1.45 |              |  avio 0.38 ms |
DSK |          sde  |              | busy     60%  | read   11457 |               | write   4398 | KiB/r     46  |              |  KiB/w      4 | MBr/s  51.97 |               | MBw/s   1.83 |  avq     1.47 |              |  avio 0.38 ms |
DSK |          sdd  |              | busy     60%  | read    9673 |               | write   4538 | KiB/r     61  |              |  KiB/w      4 | MBr/s  58.24 |               | MBw/s   1.88 |  avq     1.47 |              |  avio 0.42 ms |
DSK |          sdj  |              | busy     59%  | read    9576 |               | write   4177 | KiB/r     63  |              |  KiB/w      4 | MBr/s  59.30 |               | MBw/s   1.75 |  avq     1.48 |              |  avio 0.43 ms |
DSK |          sdh  |              | busy     59%  | read    9615 |               | write   4305 | KiB/r     63  |              |  KiB/w      4 | MBr/s  59.23 |               | MBw/s   1.80 |  avq     1.48 |              |  avio 0.42 ms |
DSK |          sdf  |              | busy     59%  | read    9483 |               | write   4404 | KiB/r     63  |              |  KiB/w      4 | MBr/s  59.11 |               | MBw/s   1.83 |  avq     1.47 |              |  avio 0.42 ms |
DSK |          sdi  |              | busy     59%  | read   11273 |               | write   4173 | KiB/r     46  |              |  KiB/w      4 | MBr/s  51.50 |               | MBw/s   1.75 |  avq     1.43 |              |  avio 0.38 ms |
DSK |          sdg  |              | busy     59%  | read   11406 |               | write   4297 | KiB/r     46  |              |  KiB/w      4 | MBr/s  51.66 |               | MBw/s   1.80 |  avq     1.46 |              |  avio 0.37 ms |


It does not seem to be hitting an upper I/O limit, though.

Normally the disks are busy at 20-45%:

DSK |          sde  |              | busy     29%  | read    6524 |               | write  14426 | KiB/r     26  |              |  KiB/w      5 | MBr/s  17.08 |               | MBw/s   7.78 |  avq    10.46 |              |  avio 0.14 ms |
DSK |          sdi  |              | busy     29%  | read    6590 |               | write  14391 | KiB/r     26  |              |  KiB/w      5 | MBr/s  17.19 |               | MBw/s   7.76 |  avq     8.75 |              |  avio 0.14 ms |
DSK |          sdg  |              | busy     29%  | read    6547 |               | write  14401 | KiB/r     26  |              |  KiB/w      5 | MBr/s  16.94 |               | MBw/s   7.60 |  avq     7.28 |              |  avio 0.14 ms |
DSK |          sdc  |              | busy     29%  | read    6835 |               | write  14283 | KiB/r     27  |              |  KiB/w      5 | MBr/s  18.08 |               | MBw/s   7.74 |  avq     8.77 |              |  avio 0.14 ms |
DSK |          sdf  |              | busy     23%  | read    3808 |               | write  14391 | KiB/r     36  |              |  KiB/w      5 | MBr/s  13.49 |               | MBw/s   7.78 |  avq    12.88 |              |  avio 0.13 ms |
DSK |          sdd  |              | busy     23%  | read    3747 |               | write  14229 | KiB/r     33  |              |  KiB/w      5 | MBr/s  12.32 |               | MBw/s   7.74 |  avq    10.07 |              |  avio 0.13 ms |
DSK |          sdj  |              | busy     23%  | read    3737 |               | write  14336 | KiB/r     36  |              |  KiB/w      5 | MBr/s  13.16 |               | MBw/s   7.76 |  avq    10.48 |              |  avio 0.13 ms |
DSK |          sdh  |              | busy     23%  | read    3793 |               | write  14362 | KiB/r     35  |              |  KiB/w      5 | MBr/s  13.29 |               | MBw/s   7.60 |  avq     8.61 |              |  avio 0.13 ms |



Also, during the issue, perf shows [k] copy_user_generic_string among the top entries:
 14.09%     postmaster  postgres               [.] 0x00000000001b4569
 10.25%     postmaster  [kernel.kallsyms]      [k] copy_user_generic_string
  4.15%     postmaster  postgres               [.] hash_search_with_hash_value
  2.08%     postmaster  postgres               [.] SearchCatCache
  1.79%     postmaster  postgres               [.] LWLockAcquire
  1.18%     postmaster  libc-2.12.so           [.] memcpy
  1.12%     postmaster  postgres               [.] mdnblocks


The issue starts at 19:43:
Mon Mar 16 19:43:04 MSK 2015 Stream: MASTER-rdb04d:70837172337784 SLAVE:70837172314864 Replay:70837172316512 :: REPLAY 21 KBytes (00:00:00.006225 seconds)
Mon Mar 16 19:43:09 MSK 2015 Stream: MASTER-rdb04d:70837177455624 SLAVE:70837177390968 Replay:70837176794376 :: REPLAY 646 KBytes (00:00:00.367305 seconds)
Mon Mar 16 19:43:14 MSK 2015 Stream: MASTER-rdb04d:70837185005120 SLAVE:70837184961280 Replay:70837183253896 :: REPLAY 1710 KBytes (00:00:00.827881 seconds)
Mon Mar 16 19:43:19 MSK 2015 Stream: MASTER-rdb04d:70837190417984 SLAVE:70837190230232 Replay:70837183253896 :: REPLAY 6996 KBytes (00:00:05.873169 seconds)
Mon Mar 16 19:43:24 MSK 2015 Stream: MASTER-rdb04d:70837198538232 SLAVE:70837198485000 Replay:70837183253896 :: REPLAY 14926 KBytes (00:00:11.025561 seconds)
Mon Mar 16 19:43:29 MSK 2015 Stream: MASTER-rdb04d:70837209961192 SLAVE:70837209869384 Replay:70837183253896 :: REPLAY 26081 KBytes (00:00:16.068014 seconds)

We see [k] copy_user_generic_string at the top:

 12.90%     postmaster  [kernel.kallsyms]      [k] copy_user_generic_string
 11.49%     postmaster  postgres               [.] 0x00000000001f40c1
  4.74%     postmaster  postgres               [.] hash_search_with_hash_value
  1.86%     postmaster  postgres               [.] mdnblocks
  1.73%     postmaster  postgres               [.] LWLockAcquire
  1.67%     postmaster  postgres               [.] SearchCatCache


 25.71%     postmaster  [kernel.kallsyms]      [k] copy_user_generic_string
  7.89%     postmaster  postgres               [.] hash_search_with_hash_value
  4.66%     postmaster  postgres               [.] 0x00000000002108da
  4.51%     postmaster  postgres               [.] mdnblocks
  3.36%     postmaster  [kernel.kallsyms]      [k] put_page


The issue stops at 19:51:39:
Mon Mar 16 19:51:24 MSK 2015 Stream: MASTER-rdb04d:70838904179344 SLAVE:70838903934392 Replay:70837183253896 :: REPLAY 1680591 KBytes (00:08:10.384679 seconds)
Mon Mar 16 19:51:29 MSK 2015 Stream: MASTER-rdb04d:70838929994336 SLAVE:70838929873624 Replay:70837183253896 :: REPLAY 1705801 KBytes (00:08:15.428773 seconds)
Mon Mar 16 19:51:34 MSK 2015 Stream: MASTER-rdb04d:70838951993624 SLAVE:70838951899768 Replay:70837183253896 :: REPLAY 1727285 KBytes (00:08:20.472567 seconds)
Mon Mar 16 19:51:39 MSK 2015 Stream: MASTER-rdb04d:70838975297912 SLAVE:70838975180384 Replay:70837208050872 :: REPLAY 1725827 KBytes (00:08:10.256935 seconds)
Mon Mar 16 19:51:44 MSK 2015 Stream: MASTER-rdb04d:70839001502160 SLAVE:70839001412616 Replay:70837260116984 :: REPLAY 1700572 KBytes (00:07:49.849511 seconds)
Mon Mar 16 19:51:49 MSK 2015 Stream: MASTER-rdb04d:70839022866760 SLAVE:70839022751184 Replay:70837276732880 :: REPLAY 1705209 KBytes (00:07:42.307364 seconds)


And copy_user_generic_string goes down
+  13.43%       postmaster  postgres                 [.] 0x000000000023dc9a
+   3.71%       postmaster  [kernel.kallsyms]        [k] copy_user_generic_string
+   2.46%             init  [kernel.kallsyms]        [k] intel_idle
+   2.30%       postmaster  postgres                 [.] hash_search_with_hash_value
+   2.01%       postmaster  postgres                 [.] SearchCatCache



Could you clarify what type of trace you meant? GDB?
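
(If a GDB backtrace of the startup/recovery process is what's wanted, a minimal sketch would be something like the following; again, the pgrep pattern is an assumption about the process title:)

# Sketch: attach to the recovery ("startup") process and dump its stack.
STARTUP_PID=$(pgrep -f 'startup process')
gdb -p "$STARTUP_PID" -batch -ex 'bt' -ex 'detach'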

To calculate the slave (receive) and apply lag I use the following script on the replica side:

# Bytes of WAL received by the replica so far (relative to 0/0)
slave_lag=$($psql -U monitor -h$s_host -p$s_port -A -t -c "SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS receive" $p_db)
# Bytes of WAL replayed on the replica so far
replay_lag=$($psql -U monitor -h$s_host -p$s_port -A -t -c "SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/0') AS replay" $p_db)
# Time since the last replayed transaction
replay_timediff=$($psql -U monitor -h$s_host -p$s_port -A -t -c "SELECT NOW() - pg_last_xact_replay_timestamp() AS replication_delay" $p_db)
# Current WAL write position on the master
master_lag=$($psql -U monitor -h$p_host -p$p_port -A -t -c "SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS offset" $p_db)
echo "$(date) Stream: MASTER-$p_host:$master_lag SLAVE:$slave_lag Replay:$replay_lag :: REPLAY $(bc <<< $master_lag/1024-$replay_lag/1024) KBytes (${replay_timediff} seconds)"


--
Best regards,
Sergey Shchukin 
