Re: Hot Backup with rsync fails at pg_clog if under load

From: Chris Redekop
Subject: Re: Hot Backup with rsync fails at pg_clog if under load
Date:
Msg-id: CAC2SuRLssGbdMBVwCwoQTJ9jt8L9oEvVPai++dR3d6EuTznw7g@mail.gmail.com
In reply to: Re: Hot Backup with rsync fails at pg_clog if under load  (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: Hot Backup with rsync fails at pg_clog if under load
List: pgsql-hackers
> Chris, can you rearrange the backup so you copy the pg_control file as
> the first act after the pg_start_backup?

I tried this and it doesn't seem to make any difference.  I also tried the patch, and I can no longer reproduce the subtrans error; however, now it starts up but never gets to the point where it'll accept connections.  If I try to do anything I always get "FATAL:  the database system is starting up"...even if the load is removed from the primary, the standby still never finishes "starting up".  Attached below is a log of one of these startup attempts.  In my testing with the patch applied, approx 3 in 10 attempts start up successfully and 7 in 10 go into the "db is starting up" state.  The pg_clog error is still there, but seems much harder to reproduce now...I've seen it only once since applying the patch (out of probably 50 or 60 under-load startup attempts).  It does seem to be "moody" like that though...it will be very difficult to reproduce for a while, and then it will happen damn-near every time for a while...weirdness.

On a bit of a side note, I've been thinking of changing my scripts so that they perform an initial rsync prior to the startbackup-rsync-stopbackup sequence, just so the second rsync is faster and the backup is in progress for a shorter period of time (while it is running it stops other standbys from starting up).  This shouldn't cause any issues, eh?
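Roughly what I'm picturing is something like the following -- just a sketch, where the paths, destination host, and label are made up, and I'm assuming that if pg_control is copied first (per Simon's suggestion) it should then be left out of the catch-up pass so the early copy is the one that sticks:

#!/bin/sh
# Sketch only: two-pass base backup so the start/stop backup window stays short.
# PRIMARY_PGDATA, STANDBY, and the backup label are placeholders.
set -e

PRIMARY_PGDATA=/var/lib/pgsql/data
STANDBY=standby-host:/var/lib/pgsql/data

# Pass 1: bulk copy outside the backup window, just to prime the destination.
rsync -a --delete --exclude=pg_xlog "$PRIMARY_PGDATA/" "$STANDBY/"

# Pass 2: the real backup, kept as short as possible.
psql -c "SELECT pg_start_backup('standby_clone', true)"

# Copy pg_control first, right after pg_start_backup, and leave it out of the
# catch-up rsync below (assumption: the early copy is the one we want to keep).
rsync -a "$PRIMARY_PGDATA/global/pg_control" "$STANDBY/global/pg_control"
rsync -a --delete --exclude=pg_xlog --exclude=global/pg_control \
    "$PRIMARY_PGDATA/" "$STANDBY/"

psql -c "SELECT pg_stop_backup()"

The second pass only has to transfer whatever changed since pass 1, which is the whole point.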


2011-10-25 13:43:24.035 MDT [15072]: [1-1] LOG:  database system was interrupted; last known up at 2011-10-25 13:43:11 MDT
2011-10-25 13:43:24.035 MDT [15072]: [2-1] LOG:  creating missing WAL directory "pg_xlog/archive_status"
2011-10-25 13:43:24.037 MDT [15072]: [3-1] LOG:  entering standby mode
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION 2/CF000000
2011-10-25 13:43:24.041 MDT [15073]: [1-1] LOG:  streaming replication successfully connected to primary
2011-10-25 13:43:24.177 MDT [15092]: [1-1] FATAL:  the database system is starting up
2011-10-25 13:43:24.781 MDT [15072]: [4-1] DEBUG:  checkpoint record is at 2/CF81A478
2011-10-25 13:43:24.781 MDT [15072]: [5-1] DEBUG:  redo record is at 2/CF000020; shutdown FALSE
2011-10-25 13:43:24.781 MDT [15072]: [6-1] DEBUG:  next transaction ID: 0/4634700; next OID: 1188228
2011-10-25 13:43:24.781 MDT [15072]: [7-1] DEBUG:  next MultiXactId: 839; next MultiXactOffset: 1686
2011-10-25 13:43:24.781 MDT [15072]: [8-1] DEBUG:  oldest unfrozen transaction ID: 1669, in database 1
2011-10-25 13:43:24.781 MDT [15072]: [9-1] DEBUG:  transaction ID wrap limit is 2147485316, limited by database with OID 1
2011-10-25 13:43:24.783 MDT [15072]: [10-1] DEBUG:  resetting unlogged relations: cleanup 1 init 0
2011-10-25 13:43:24.791 MDT [15072]: [11-1] DEBUG:  initializing for hot standby
2011-10-25 13:43:24.791 MDT [15072]: [12-1] LOG:  consistent recovery state reached at 2/CF81A4D0
2011-10-25 13:43:24.791 MDT [15072]: [13-1] LOG:  redo starts at 2/CF000020
2011-10-25 13:43:25.019 MDT [15072]: [14-1] LOG:  consistent state delayed because recovery snapshot incomplete
2011-10-25 13:43:25.019 MDT [15072]: [15-1] CONTEXT:  xlog redo  running xacts:
nextXid 4634700 latestCompletedXid 4634698 oldestRunningXid 4634336; 130 xacts:
4634336 4634337 4634338 4634339 4634340 4634341 4634342 4634343 4634344 4634345
4634346 4634347 4634348 4634349 4634350 4634351 4634352 4634353 4634354 4634355
4634356 4634357 4634358 4634359 4634360 4634361 4634362 4634363 4634364 4634365
4634366 4634367 4634368 4634369 4634370 4634371 4634515 4634516 4634517 4634518
4634519 4634520 4634521 4634522 4634523 4634524 4634525 4634526 4634527 4634528
4634529 4634530 4634531 4634532 4634533 4634534 4634535 4634536 4634537 4634538
4634539 4634540 4634541 4634542 4634543 4634385 4634386 4634387 4634388 4634389
4634390 4634391 4634392 4634393 4634394 4634395 4634396 4634397 4634398 4634399
4634400 4634401 4634402 4634403 4634404 4634405 4634406 4634407 4634408 4634409
4634410 4634411 4634412 4634413 4634414 4634415 4634416 4634417 4634418 4634419
4634420 4634579 4634580 4634581 4634582 4634583 4634584 4634585 4634586 4634587
4634588 4634589 4634590 4634591 4634592 4634593 4634594 4634595 4634596 4634597
4634598 4634599 4634600 4634601 4634602 4634603 4634604 4634605 4634606 4634607;
 subxid ovf
2011-10-25 13:43:25.240 MDT [15130]: [1-1] FATAL:  the database system is starting up
DEBUG:  standby "sync_rep_test" has now caught up with primary
2011-10-25 13:43:26.304 MDT [15167]: [1-1] FATAL:  the database system is starting up
2011-10-25 13:43:27.366 MDT [15204]: [1-1] FATAL:  the database system is starting up
2011-10-25 13:43:28.426 MDT [15241]: [1-1] FATAL:  the database system is starting up
2011-10-25 13:43:29.461 MDT [15275]: [1-1] FATAL:  the database system is starting up
and so on...


On Tue, Oct 25, 2011 at 6:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug <fgp@phlo.org> wrote:

> What I don't understand is how this affects the CLOG. How does oldestActiveXID
> factor into CLOG initialization?

It is an entirely different error.

Chris' clog error was caused by a file read error. The file was
opened, we did a seek within the file and then the call to read()
failed to return a complete page from the file.

The xid shown is 22811359, which is the nextxid in the control file.

So StartupClog() must have failed trying to read the clog page from disk.

That isn't a Hot Standby problem or a recovery problem, nor is it
certain it's a PostgreSQL problem.

OTOH SlruPhysicalReadPage() does cope gracefully with missing clog
files during recovery, so maybe we can think of a way to make recovery
cope with a SLRU_READ_FAILED error gracefully also. Any ideas?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
