Hanging startup process on the replica after vacuuming on master

From: Vladimir Borodin
Subject: Hanging startup process on the replica after vacuuming on master
Date:
Msg-id: FE82A9A7-0D52-41B5-A9ED-967F6927CB8A@simply.name
List: pgsql-admin
Hi all.

We have already described [0] our problem on pgsql-general@ and written to pgsql-performance@, but the problem was described in a rather confusing way, so let me ask for help one more time in a clearer way.

We have a database where one table contains 3+ billion records. The definition of the table is quite simple:

rpopdb04d/rpopdb M # \dS+ rpop.rpop_uidl
                               Table "rpop.rpop_uidl"
 Column |          Type          | Modifiers | Storage  | Stats target | Description
--------+------------------------+-----------+----------+--------------+-------------
 popid  | bigint                 | not null  | plain    |              |
 uidl   | character varying(200) | not null  | extended |              |
Indexes:
    "pk_rpop_uidl" PRIMARY KEY, btree (popid, uidl)

rpopdb04d/rpopdb M #

When vacuuming of this table finishes on the master, we see replay_location on all replicas hang at a specific WAL position for up to 15 minutes. The other positions (sent_location, write_location, flush_location) in the pg_stat_replication view continue increasing as expected.

rpopdb04d/postgres M # select client_hostname, sent_location, write_location, flush_location, replay_location from pg_stat_replication; \watch 1
<...>
                           Watch every 1s       Fri Mar 20 14:15:16 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 413F/CD8223D8 | 413F/CD81EDF8  | 413F/CD81EDF8  | 413F/CD81D4C0
 rpopdb04e.domain.com | 413F/CD8223D8 | 413F/CD81D4C0  | 413F/CD81D4C0  | 413F/CD81D3E0
(2 rows)

                           Watch every 1s       Fri Mar 20 14:15:17 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 413F/CD9887C0 | 413F/CD9884C0  | 413F/CD9881F0  | 413F/CD988058
 rpopdb04e.domain.com | 413F/CD9887C0 | 413F/CD988570  | 413F/CD9884C0  | 413F/CD9884C0
(2 rows)

                           Watch every 1s       Fri Mar 20 14:15:18 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 413F/CDA89F38 | 413F/CDA89F38  | 413F/CDA89EA0  | 413F/CDA131C8
 rpopdb04e.domain.com | 413F/CDA89F38 | 413F/CDA89EA0  | 413F/CDA89EA0  | 413F/CDA131C8
(2 rows)

                           Watch every 1s       Fri Mar 20 14:15:19 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 413F/CDB8DA20 | 413F/CDB8D7C0  | 413F/CDB8D630  | 413F/CDA131C8
 rpopdb04e.domain.com | 413F/CDB8DA20 | 413F/CDB8D630  | 413F/CDB8D630  | 413F/CDA131C8

<...>
<9 minutes of stuck replay_location>
<...>
                           Watch every 1s       Fri Mar 20 14:24:11 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 4140/1F7E28C8 | 4140/1F7E2838  | 4140/1F7E2838  | 413F/CDA131C8
 rpopdb04e.domain.com | 4140/1F7E28C8 | 4140/1F7E2550  | 4140/1F7E2550  | 413F/CDA131C8
(2 rows)

                           Watch every 1s       Fri Mar 20 14:24:12 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 4140/1F9F6A28 | 4140/1F9F6508  | 4140/1F9F6508  | 413F/CDA131C8
 rpopdb04e.domain.com | 4140/1F9F6A28 | 4140/1F9F6508  | 4140/1F9F6508  | 413F/CE258D48
(2 rows)

                           Watch every 1s       Fri Mar 20 14:24:13 2015

      client_hostname      | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
 rpopdb04g.domain.com | 4140/203BA108 | 4140/203BA108  | 4140/203BA108  | 413F/CDA131C8
 rpopdb04e.domain.com | 4140/203BA108 | 4140/203BA108  | 4140/203BA108  | 413F/CE853C38
(2 rows)

The content of this WAL position according to pg_xlogdump is the following (actually this output comes from a different occurrence of the problem, so the WAL positions differ, but the essence is the same):
rmgr: Btree       len (rec/tot):     20/    52, tx:          0, lsn: 4115/56126DC0, prev 4115/56126D90, bkp: 0000, desc: vacuum: rel 1663/16420/16796; blk 31222118, lastBlockVacuumed 0
These OIDs (database 16420, relfilenode 16796) belong to the primary key of this table.
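
For reference, the relfilenode in the WAL record can be mapped back to a relation with a query like this (a sketch, run in the affected database; 16796 is the relfilenode from the record above):

select oid::regclass as index_name, relfilenode
  from pg_class
 where relfilenode = 16796;   -- relfilenode from the pg_xlogdump record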

We have also noticed that while replay_location is stuck, the startup process on the replica consumes a lot of I/O, reading lots of data from the array with $PGDATA (we keep xlogs on a separate array). Most backtraces of the startup process during this time look like this:

0x00007f54a71444c0 in __read_nocancel () from /lib64/libc.so.6
#0  0x00007f54a71444c0 in __read_nocancel () from /lib64/libc.so.6
#1  0x000000000065d2f5 in FileRead (file=<value optimized out>, buffer=0x7f53ac0dba20 ";9", amount=8192) at fd.c:1286
#2  0x000000000067acad in mdread (reln=<value optimized out>, forknum=<value optimized out>, blocknum=12063036, buffer=0x7f53ac0dba20 ";9") at md.c:679
#3  0x0000000000659b4e in ReadBuffer_common (smgr=<value optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=12063036, mode=RBM_NORMAL_NO_LOG, strategy=0x0, hit=0x7fff898a912f "") at bufmgr.c:476
#4  0x000000000065a61b in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM, blockNum=12063036, mode=<value optimized out>, strategy=<value optimized out>) at bufmgr.c:287
#5  0x00000000004cfb78 in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=12063036, mode=RBM_NORMAL_NO_LOG) at xlogutils.c:324
#6  0x00000000004a3651 in btree_xlog_vacuum (lsn=71744181579864, record=0x1e49dc0) at nbtxlog.c:522
#7  btree_redo (lsn=71744181579864, record=0x1e49dc0) at nbtxlog.c:1144
#8  0x00000000004c903a in StartupXLOG () at xlog.c:6827
#9  0x000000000062f8bf in StartupProcessMain () at startup.c:224
#10 0x00000000004d3e9a in AuxiliaryProcessMain (argc=2, argv=0x7fff898a98a0) at bootstrap.c:416
#11 0x000000000062a99c in StartChildProcess (type=StartupProcess) at postmaster.c:5146
#12 0x000000000062e9e2 in PostmasterMain (argc=3, argv=<value optimized out>) at postmaster.c:1237
#13 0x00000000005c7d68 in main (argc=3, argv=0x1e22910) at main.c:228

Investigating the source code of the btree_xlog_vacuum function, we found that it calls XLogReadBufferExtended with mode=RBM_NORMAL_NO_LOG in only one case:

    /*
     * If queries might be active then we need to ensure every leaf page is
     * unpinned between the lastBlockVacuumed and the current block, if there
     * are any.  This prevents replay of the VACUUM from reaching the stage of
     * removing heap tuples while there could still be indexscans "in flight"
     * to those particular tuples (see nbtree/README).
     *
     * It might be worth checking if there are actually any backends running;
     * if not, we could just skip this.
     *
     * Since VACUUM can visit leaf pages out-of-order, it might issue records
     * with lastBlockVacuumed >= block; that's not an error, it just means
     * nothing to do now.
     *
     * Note: since we touch all pages in the range, we will lock non-leaf
     * pages, and also any empty (all-zero) pages that may be in the index. It
     * doesn't seem worth the complexity to avoid that.  But it's important
     * that HotStandbyActiveInReplay() will not return true if the database
     * isn't yet consistent; so we need not fear reading still-corrupt blocks
     * here during crash recovery.
     */
    if (HotStandbyActiveInReplay())
    {
        BlockNumber blkno;

        for (blkno = xlrec->lastBlockVacuumed + 1; blkno < xlrec->block; blkno++)
        {
            /*
             * We use RBM_NORMAL_NO_LOG mode because it's not an error
             * condition to see all-zero pages.  The original btvacuumpage
             * scan would have skipped over all-zero pages, noting them in FSM
             * but not bothering to initialize them just yet; so we mustn't
             * throw an error here.  (We could skip acquiring the cleanup lock
             * if PageIsNew, but it's probably not worth the cycles to test.)
             *
             * XXX we don't actually need to read the block, we just need to
             * confirm it is unpinned. If we had a special call into the
             * buffer manager we could optimise this so that if the block is
             * not in shared_buffers we confirm it as unpinned.
             */
            buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno,
                                            RBM_NORMAL_NO_LOG);

And if I understand everything correctly, while replaying this single WAL record the startup process reads 31222118 blocks * 8 KB / 1024 / 1024 ≈ 238 GB of data from disk (which seems to be the full size of the index).
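
For comparison, the on-disk size of the index can be checked directly with something like this (a sketch):

select pg_size_pretty(pg_relation_size('rpop.pk_rpop_uidl'));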

Below are some details about our environment, but my question is: what can we do to avoid the replicas falling behind like this? Any help would be really appreciated.



All hosts (master and two replicas) run RHEL 6.6 with PostgreSQL 9.4.1 (upgraded from 9.3.6 where we had the same problem). We use built-in streaming replication and the same configuration on all hosts:

rpopdb04g/postgres R # select name, setting from pg_settings where setting != boot_val and source = 'configuration file';
              name               |                                                                      setting
---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------
 archive_command                 | rsync -a --contimeout=1 --timeout=5 --password-file=/etc/barman.passwd %p rsync://barman@pg-backup04h.domain.com:/barman/rpopdb04/incoming/%f
 archive_mode                    | on
 autovacuum_analyze_scale_factor | 0.001
 autovacuum_max_workers          | 10
 autovacuum_naptime              | 15
 autovacuum_vacuum_cost_delay    | 5
 autovacuum_vacuum_cost_limit    | 2000
 autovacuum_vacuum_scale_factor  | 1e-05
 autovacuum_vacuum_threshold     | 200
 bgwriter_delay                  | 50
 bgwriter_lru_maxpages           | 250
 checkpoint_completion_target    | 0.8
 checkpoint_segments             | 384
 checkpoint_timeout              | 600
 checkpoint_warning              | 420
 DateStyle                       | ISO, DMY
 default_text_search_config      | pg_catalog.english
 effective_cache_size            | 13107200
 hot_standby                     | on
 hot_standby_feedback            | on
 lc_messages                     | en_US.UTF-8
 lc_monetary                     | en_US.UTF-8
 lc_numeric                      | en_US.UTF-8
 lc_time                         | en_US.UTF-8
 listen_addresses                | *
 log_autovacuum_min_duration     | 0
 log_checkpoints                 | on
 log_hostname                    | on
 log_line_prefix                 | < %m >
 log_lock_waits                  | on
 log_rotation_size               | 0
 log_statement                   | ddl
 log_timezone                    | Europe/Moscow
 log_truncate_on_rotation        | on
 logging_collector               | on
 maintenance_work_mem            | 2097152
 max_connections                 | 4000
 max_wal_senders                 | 10
 pg_stat_statements.max          | 10000
 pg_stat_statements.track        | all
 shared_buffers                  | 524288
 shared_preload_libraries        | pg_stat_statements
 TimeZone                        | Europe/Moscow
 track_functions                 | all
 track_io_timing                 | on
 update_process_title            | off
 vacuum_defer_cleanup_age        | 200000
 wal_buffers                     | 8192
 wal_keep_segments               | 64
 wal_level                       | hot_standby
 wal_receiver_status_interval    | 1
 wal_sender_timeout              | 3000
 work_mem                        | 16384
(53 rows)

Time: 1.955 ms
rpopdb04g/postgres R #

On the replicas we issue SELECT queries that do index-only scans on this table.
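
For illustration, the queries are roughly of this shape (the literal value is hypothetical), touching only the primary key columns so they can be satisfied by an index-only scan:

select popid, uidl
  from rpop.rpop_uidl
 where popid = 12345;   -- hypothetical value, for illustration only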

We would probably partition this table to reduce the size of each index so that the btree_xlog_vacuum function would work faster. But perhaps there is something else we can do right now?
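
Roughly, we have in mind an inheritance-based scheme along these lines (a sketch with hypothetical names; since 9.4 has no declarative partitioning, inserts would have to be routed by a trigger or by the application):

-- one of e.g. 16 partitions, each with its own much smaller primary key index
create table rpop.rpop_uidl_p00 (
    check (popid % 16 = 0)
) inherits (rpop.rpop_uidl);

alter table rpop.rpop_uidl_p00
    add constraint pk_rpop_uidl_p00 primary key (popid, uidl);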

Thanks in advance.


--
May the force be with you…
