Discussion: Hot Backup with rsync fails at pg_clog if under load
Hello,

* Context *

I'm observing problems when provisioning a standby from the master by following the basic, documented "Making a Base Backup" [1] procedure with rsync if, in the meantime, heavy load is applied on the master. After searching the archives, the closest similar issue I found was hit by Daniel Farina in a thread "hot backups: am I doing it wrong, or do we have a problem with pg_clog?" [2], but, it seems, the issue was discarded because of a non-standard backup procedure Daniel used. However, I'm observing the same error with a simple procedure, hence this message.

* Details *

Procedure:

1. Start load generator on the master (WAL archiving enabled).
2. Prepare a Streaming Replication standby (accepting WAL files too):
2.1. pg_switch_xlog() on the master;
2.2. pg_start_backup('backup_under_load') on the master (this will take a while as the master is loaded up);
2.3. rsync data/global/pg_control to the standby;
2.4. rsync all other data/ (without pg_xlog) to the standby;
2.5. pg_stop_backup() on the master;
2.6. Wait to receive all WAL files, generated during the backup, on the standby;
2.7. Start the standby PG instance.
The last step will, usually, fail with an error similar to:

2011-09-21 13:41:05 CEST LOG: database system was interrupted; last known up at 2011-09-21 13:40:50 CEST
Restoring 00000014.history
mv: cannot stat `/opt/PostgreSQL/9.1/archive/00000014.history': No such file or directory
Restoring 00000013.history
2011-09-21 13:41:05 CEST LOG: restored log file "00000013.history" from archive
2011-09-21 13:41:05 CEST LOG: entering standby mode
Restoring 0000001300000006000000DC
2011-09-21 13:41:05 CEST LOG: restored log file "0000001300000006000000DC" from archive
Restoring 0000001300000006000000DB
2011-09-21 13:41:05 CEST LOG: restored log file "0000001300000006000000DB" from archive
2011-09-21 13:41:05 CEST FATAL: could not access status of transaction 1188673
2011-09-21 13:41:05 CEST DETAIL: Could not read from file "pg_clog/0001" at offset 32768: Success.
2011-09-21 13:41:05 CEST LOG: startup process (PID 13819) exited with exit code 1
2011-09-21 13:41:05 CEST LOG: aborting startup due to startup process failure

The procedure works very reliably if there is little or no load on the master, but fails very often with the pg_clog error when the load generator (a few thousand SELECTs, ~60 INSERTs, ~60 DELETEs and ~60 UPDATEs per second) is started up.

I assumed that a file system backup taken between pg_start_backup and pg_stop_backup is guaranteed to be consistent and that missing pieces will be taken from the WAL files generated & shipped during the backup, but is it really? Is this procedure missing some steps? Or maybe this is a known issue?

Thank you,
Linas

[1] http://www.postgresql.org/docs/current/static/continuous-archiving.html
[2] http://archives.postgresql.org/pgsql-hackers/2011-04/msg01132.php
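The odd-looking ": Success" at the end of the DETAIL line is itself a clue: a read past end-of-file is not an OS error — it returns zero bytes and leaves errno at 0, so strerror(0) prints "Success". In other words, pg_clog/0001 exists but is shorter than the offset recovery tries to read. A minimal sketch of that behavior (assuming Linux/glibc errno semantics; this is not PostgreSQL code):

```python
import os
import tempfile

# Create an 8 kB file, then read a page at offset 32768, the way
# recovery reads a clog page the backup never captured.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 8192)       # the file is only 8192 bytes long

data = os.pread(fd, 8192, 32768)   # reading past EOF: zero bytes, no error
print(len(data))                   # 0
print(os.strerror(0))              # "Success" on Linux/glibc

os.close(fd)
os.unlink(path)
```

That is why the DETAIL line reports a "failure" whose errno text is "Success": the short read, not an I/O error, is what recovery trips over.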
On 21-09-2011 11:44, Linas Virbalas wrote:

[This question doesn't belong to -hackers. Please post it in -general or -admin]

> Procedure:
>
> 1. Start load generator on the master (WAL archiving enabled).
> 2. Prepare a Streaming Replication standby (accepting WAL files too):
> 2.1. pg_switch_xlog() on the master;

You don't need this.

> 2.2. pg_start_backup('backup_under_load') on the master (this will take a
> while as the master is loaded up);

No. If you use pg_start_backup('foo', true) it will be fast. Check the manual.

> 2.3. rsync data/global/pg_control to the standby;

Why are you doing this? If ...

> 2.4. rsync all other data/ (without pg_xlog) to the standby;

you will copy it again, or not? I don't understand your point.

> 2.5. pg_stop_backup() on the master;
> 2.6. Wait to receive all WAL files, generated during the backup, on the
> standby;
> 2.7. Start the standby PG instance.
>
> The last step will, usually, fail with a similar error:

The problem could be that the minimum recovery point (step 2.3) is different from the end of rsync if you are under load.

--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
On Sep 21, 2011, at 16:44, Linas Virbalas wrote:

> After searching the archives, the closest similar issue I found was hit
> by Daniel Farina in a thread "hot backups: am I doing it wrong, or do we
> have a problem with pg_clog?" [2], but, it seems, the issue was discarded
> because of a non-standard backup procedure Daniel used.

That's not the way I read that thread. In fact, Robert Haas confirmed that Daniel's backup procedure was sound in theory. The open question was whether the error occurred because of a bug in Daniel's backup code or in PostgreSQL's restore code. The thread then petered out without that question being answered.

> Procedure:
>
> 1. Start load generator on the master (WAL archiving enabled).
> 2. Prepare a Streaming Replication standby (accepting WAL files too):
> 2.1. pg_switch_xlog() on the master;
> 2.2. pg_start_backup('backup_under_load') on the master (this will take a
> while as the master is loaded up);
> 2.3. rsync data/global/pg_control to the standby;
> 2.4. rsync all other data/ (without pg_xlog) to the standby;
> 2.5. pg_stop_backup() on the master;
> 2.6. Wait to receive all WAL files, generated during the backup, on the
> standby;
> 2.7. Start the standby PG instance.

Looks good. (2.1) and (2.3) seem redundant (as Euler already noticed), but shouldn't cause any errors. Could you provide us with the exact rsync version and parameters you use?
> The last step will, usually, fail with a similar error:
>
> 2011-09-21 13:41:05 CEST LOG: database system was interrupted; last known
> up at 2011-09-21 13:40:50 CEST
> Restoring 00000014.history
> mv: cannot stat `/opt/PostgreSQL/9.1/archive/00000014.history': No such file
> or directory
> Restoring 00000013.history
> 2011-09-21 13:41:05 CEST LOG: restored log file "00000013.history" from
> archive
> 2011-09-21 13:41:05 CEST LOG: entering standby mode
> Restoring 0000001300000006000000DC
> 2011-09-21 13:41:05 CEST LOG: restored log file "0000001300000006000000DC"
> from archive
> Restoring 0000001300000006000000DB
> 2011-09-21 13:41:05 CEST LOG: restored log file "0000001300000006000000DB"
> from archive
> 2011-09-21 13:41:05 CEST FATAL: could not access status of transaction
> 1188673
> 2011-09-21 13:41:05 CEST DETAIL: Could not read from file "pg_clog/0001" at
> offset 32768: Success.

What's the size of the file (pg_clog/0001) on both the master and the slave?

best regards,
Florian Pflug
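The failing file and offset can be cross-checked from the transaction id alone. With clog's layout (2 status bits per transaction, 8 kB pages, 32 pages per segment file — constants taken from PostgreSQL's clog.c and slru.c), a short sketch:

```python
# Map a transaction id to the clog segment file and byte offset that
# hold its 2-bit commit status (constants from clog.c / slru.c).
XACTS_PER_PAGE = 8192 * 4     # 4 transactions per byte, 8 kB pages
PAGES_PER_SEGMENT = 32        # each SLRU segment file spans 32 pages

def clog_location(xid):
    page = xid // XACTS_PER_PAGE
    segment = f"{page // PAGES_PER_SEGMENT:04X}"   # file name, e.g. "0001"
    offset = (page % PAGES_PER_SEGMENT) * 8192     # byte offset in that file
    return segment, offset

# The failing transaction from the log maps to exactly the reported location:
print(clog_location(1188673))   # ('0001', 32768)
```

So the error says recovery needs at least 40960 bytes of pg_clog/0001 (the full page starting at offset 32768), which is why the sizes on master and standby matter.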
On Wed, Sep 21, 2011 at 12:22 PM, Euler Taveira de Oliveira <euler@timbira.com> wrote:

> [This question doesn't belong to -hackers. Please post it in -general or
> -admin]

-hackers or -bugs seems appropriate to me; I think this is a bug.

>> 2.2. pg_start_backup('backup_under_load') on the master (this will take a
>> while as the master is loaded up);
>
> No. If you use pg_start_backup('foo', true) it will be fast. Check the
> manual.

If the server is sufficiently heavily loaded that a checkpoint takes a nontrivial amount of time, the OP is correct that this will not be fast, regardless of whether you choose to force an immediate checkpoint.

>> 2.3. rsync data/global/pg_control to the standby;
>
> Why are you doing this? If ...
>
>> 2.4. rsync all other data/ (without pg_xlog) to the standby;
>
> you will copy it again, or not? I don't understand your point.

His point is that exercising the bug depends on doing the copying in a certain order. Any order of copying the data theoretically ought to be OK, as long as it's all between starting the backup and stopping the backup, but apparently it isn't.

> The problem could be that the minimum recovery point (step 2.3) is different
> from the end of rsync if you are under load.

It seems pretty clear that some relevant chunk of WAL isn't getting replayed, but it's not at all clear to me why not. It seems like it would be useful to compare the LSN returned by pg_start_backup() with the location at which replay begins when you fire up the clone.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
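For that comparison, the LSN text returned by pg_start_backup() can be mapped to a WAL segment file name by hand — pg_xlogfile_name() does the same thing on the server. A sketch assuming the 9.x-era layout (16 MB segments, LSN written as xlogid/xrecoff):

```python
def wal_segment_name(timeline, lsn):
    """Map an LSN string like '6/DC000020' to its 16 MB WAL segment
    file name (timeline + xlogid + segment, 8 hex digits each)."""
    xlogid, xrecoff = (int(part, 16) for part in lsn.split("/"))
    seg = xrecoff // (16 * 1024 * 1024)      # which 16 MB segment in this xlogid
    return f"{timeline:08X}{xlogid:08X}{seg:08X}"

# A hypothetical LSN on timeline 0x13 lands in one of the segment files
# restored in the failing log above:
print(wal_segment_name(0x13, "6/DC000020"))   # 0000001300000006000000DC
```

Comparing this name for the pg_start_backup() LSN against the first file recovery restores would show immediately whether replay is starting where the backup says it should.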
>>> 2.2. pg_start_backup('backup_under_load') on the master (this will take a
>>> while as the master is loaded up);
>>
>> No. If you use pg_start_backup('foo', true) it will be fast. Check the
>> manual.
>
> If the server is sufficiently heavily loaded that a checkpoint takes a
> nontrivial amount of time, the OP is correct that this will not be
> fast, regardless of whether you choose to force an immediate
> checkpoint.

In order to check more cases, I have changed the procedure to force an immediate checkpoint, i.e. pg_start_backup('backup_under_load', true). With the same load generator running, pg_start_backup returned almost instantaneously compared to how long it took previously.

Most importantly, after making this change, I cannot reproduce the pg_clog error message anymore. In other words, with an immediate checkpoint the hot backup succeeds under this load!

>>> 2.3. rsync data/global/pg_control to the standby;
>>
>> Why are you doing this? If ...
>>
>>> 2.4. rsync all other data/ (without pg_xlog) to the standby;
>>
>> you will copy it again, or not? I don't understand your point.
>
> His point is that exercising the bug depends on doing the copying in a
> certain order. Any order of copying the data theoretically ought to
> be OK, as long as it's all between starting the backup and stopping
> the backup, but apparently it isn't.

Please note that in the past I was able to reproduce the same pg_clog error even when taking a single rsync of the whole data/ folder (i.e. without splitting it into two steps).

>> The problem could be that the minimum recovery point (step 2.3) is different
>> from the end of rsync if you are under load.

Do you have any ideas why the hot backup operation with pg_start_backup('backup_under_load', true) succeeds while pg_start_backup('backup_under_load') fails under the same load? Originally, I was using pg_start_backup('backup_under_load') in order not to clog the master server with the I/O required for the checkpoint.
Of course, now, it seems, this should be sacrificed for the sake of a successful backup under load.

> It seems pretty clear that some relevant chunk of WAL isn't getting
> replayed, but it's not at all clear to me why not. It seems like it
> would be useful to compare the LSN returned by pg_start_backup() with

If needed, I could do that, if I had the exact procedure... Currently, during the start of the backup I take the following information:

pg_xlogfile_name(pg_start_backup(...))

> the location at which replay begins when you fire up the clone.

As you have seen in my original message, in the pg_log I get only the restored WAL file names after starting up the standby. Can I tune postgresql.conf to include the location at which replay begins in the log?

> Could you provide us with the exact rsync version and parameters you use?

rsync -azv
version 2.6.8 protocol version 29

--
Sincerely,
Linas Virbalas
http://flyingclusters.blogspot.com/
On 22-09-2011 11:24, Linas Virbalas wrote:

> In order to check more cases, I have changed the procedure to force an
> immediate checkpoint, i.e. pg_start_backup('backup_under_load', true). With
> the same load generator running, pg_start_backup returned almost
> instantaneously compared to how long it took previously.
>
> Most importantly, after making this change, I cannot reproduce the pg_clog
> error message anymore. In other words, with an immediate checkpoint the hot
> backup succeeds under this load!

Interesting. I remember someone reporting this same problem, but it was not reproducible by some of us.

> Do you have any ideas why the hot backup operation with
> pg_start_backup('backup_under_load', true) succeeds while
> pg_start_backup('backup_under_load') fails under the same load?

I don't, but if you show us the output of the steps above...

> If needed, I could do that, if I had the exact procedure... Currently,
> during the start of the backup I take the following information:

Just show us the output of pg_start_backup and the part of the standby log with the message 'redo starts at' and the subsequent messages up to the failure.

--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
2011/9/22 Euler Taveira de Oliveira <euler@timbira.com>:

> On 22-09-2011 11:24, Linas Virbalas wrote:
>>
>> In order to check more cases, I have changed the procedure to force an
>> immediate checkpoint, i.e. pg_start_backup('backup_under_load', true).
>> With the same load generator running, pg_start_backup returned almost
>> instantaneously compared to how long it took previously.
>>
>> Most importantly, after making this change, I cannot reproduce the pg_clog
>> error message anymore. In other words, with an immediate checkpoint the
>> hot backup succeeds under this load!
>
> Interesting. I remember someone reporting this same problem, but it was not
> reproducible by some of us.

So maybe there's some action that has to happen between the time the redo pointer is set and the time the checkpoint is WAL-logged to tickle the bug. Like... CLOG extension, maybe?

*grep grep grep*

OK, so ExtendCLOG() just zeroes the page in memory, writes the WAL record, and calls it good. All the interesting stuff is done while holding CLogControlLock. So, at checkpoint time, we'd better make sure to flush those pages out to disk before writing the checkpoint record. Otherwise, the redo pointer might advance past the CLOG-extension record before the corresponding page hits the disk. That's the job of CheckPointCLOG(), which is called from CheckPointGuts(), which is called from CreateCheckPoint() just after setting the redo pointer.

Now, there is some funny business with the locking here as we're writing the dirty pages (CheckPointCLOG() calls SimpleLruFlush()). We release and reacquire the control lock many times. But I don't see how that can cause a problem, because it's all being done after the redo pointer has already been set. We could end up having buffers get dirtied again after they are flushed, but that shouldn't matter either, as long as each buffer is written out at least once. And if the write fails we throw an error. So I don't see any holes there.
Anybody else have an idea? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9/22/11 6:59 PM, "Euler Taveira de Oliveira" <euler@timbira.com> wrote:

>> If needed, I could do that, if I had the exact procedure... Currently,
>> during the start of the backup I take the following information:
>>
> Just show us the output of pg_start_backup and the part of the standby log with
> the message 'redo starts at' and the subsequent messages up to the
> failure.

Unfortunately, that's impossible, because the error message "Could not read from file "pg_clog/0001" at offset 32768: Success" is shown (and startup aborted) before the turn of the "redo starts at" message arrives. For comparison purposes, here is the log from the original message:

2011-09-21 13:41:05 CEST LOG: database system was interrupted; last known up at 2011-09-21 13:40:50 CEST
Restoring 00000014.history
mv: cannot stat `/opt/PostgreSQL/9.1/archive/00000014.history': No such file or directory
Restoring 00000013.history
2011-09-21 13:41:05 CEST LOG: restored log file "00000013.history" from archive
2011-09-21 13:41:05 CEST LOG: entering standby mode
Restoring 0000001300000006000000DC
2011-09-21 13:41:05 CEST LOG: restored log file "0000001300000006000000DC" from archive
Restoring 0000001300000006000000DB
2011-09-21 13:41:05 CEST LOG: restored log file "0000001300000006000000DB" from archive
2011-09-21 13:41:05 CEST FATAL: could not access status of transaction 1188673
2011-09-21 13:41:05 CEST DETAIL: Could not read from file "pg_clog/0001" at offset 32768: Success.
2011-09-21 13:41:05 CEST LOG: startup process (PID 13819) exited with exit code 1
2011-09-21 13:41:05 CEST LOG: aborting startup due to startup process failure

As you can see, there is no "redo starts at" message. If we compare this to a healthy standby startup, we can see that the "pg_clog" error appears after "entering standby mode" but before "redo starts at", hence the latter is never reached.
Healthy log example:

2011-09-23 09:52:31 CEST LOG: database system was interrupted; last known up at 2011-09-23 09:52:15 CEST
Restoring 00000007.history
mv: cannot stat `/opt/PostgreSQL/9.1/archive/00000007.history': No such file or directory
Restoring 00000006.history
2011-09-23 09:52:31 CEST LOG: restored log file "00000006.history" from archive
2011-09-23 09:52:31 CEST LOG: entering standby mode
Restoring 000000060000000000000065
2011-09-23 09:52:31 CEST LOG: restored log file "000000060000000000000065" from archive
2011-09-23 09:52:31 CEST LOG: redo starts at 0/65000020
2011-09-23 09:52:31 CEST LOG: consistent recovery state reached at 0/66000000
2011-09-23 09:52:31 CEST LOG: database system is ready to accept read only connections
Restoring 000000060000000000000066
mv: cannot stat `/opt/PostgreSQL/9.1/archive/000000060000000000000066': No such file or directory
Restoring 00000007.history
mv: cannot stat `/opt/PostgreSQL/9.1/archive/00000007.history': No such file or directory
2011-09-23 09:52:31 CEST LOG: streaming replication successfully connected to primary

--
Best regards,
Linas Virbalas
http://flyingclusters.blogspot.com/
On 23.09.2011 11:02, Linas Virbalas wrote:

> On 9/22/11 6:59 PM, "Euler Taveira de Oliveira"<euler@timbira.com> wrote:
>
>>> If needed, I could do that, if I had the exact procedure... Currently,
>>> during the start of the backup I take the following information:
>>>
>> Just show us the output of pg_start_backup and the part of the standby log with
>> the message 'redo starts at' and the subsequent messages up to the
>> failure.
>
> Unfortunately, that's impossible, because the error message "Could not read
> from file "pg_clog/0001" at offset 32768: Success" is shown (and startup
> aborted) before the turn of the "redo starts at" message arrives.

It looks to me that pg_clog/0001 exists, but it's shorter than recovery expects. Which shouldn't happen, of course, because the start-backup checkpoint should flush all the clog that's needed by recovery to disk before the backup procedure begins to copy it.

Can you do "ls -l pg_clog" on the master and the backup, and pg_controldata on the backup dir, and post the results, please?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Sep 23, 2011, at 10:41, Heikki Linnakangas wrote:

> On 23.09.2011 11:02, Linas Virbalas wrote:
>> On 9/22/11 6:59 PM, "Euler Taveira de Oliveira"<euler@timbira.com> wrote:
>>
>>>> If needed, I could do that, if I had the exact procedure... Currently,
>>>> during the start of the backup I take the following information:
>>>>
>>> Just show us the output of pg_start_backup and the part of the standby log with
>>> the message 'redo starts at' and the subsequent messages up to the
>>> failure.
>>
>> Unfortunately, that's impossible, because the error message "Could not read
>> from file "pg_clog/0001" at offset 32768: Success" is shown (and startup
>> aborted) before the turn of the "redo starts at" message arrives.
>
> It looks to me that pg_clog/0001 exists, but it's shorter than recovery
> expects. Which shouldn't happen, of course, because the start-backup
> checkpoint should flush all the clog that's needed by recovery to disk
> before the backup procedure begins to copy it.

Yeah. What confuses me, though, is that we fail while trying to read from the clog. When do we do that during normal (non-standby) recovery?

One other possibility would be that the problem is somehow triggered by vacuum running while the start-backup checkpoint is commencing. That'd explain why the problem goes away with immediate checkpoints, because those make the window much smaller. But I haven't got a concrete theory of what's happening...

best regards,
Florian Pflug
On 23.09.2011 11:48, Florian Pflug wrote:

> On Sep 23, 2011, at 10:41, Heikki Linnakangas wrote:
>> On 23.09.2011 11:02, Linas Virbalas wrote:
>>> Unfortunately, that's impossible, because the error message "Could not read
>>> from file "pg_clog/0001" at offset 32768: Success" is shown (and startup
>>> aborted) before the turn of the "redo starts at" message arrives.
>>
>> It looks to me that pg_clog/0001 exists, but it's shorter than recovery
>> expects. Which shouldn't happen, of course, because the start-backup
>> checkpoint should flush all the clog that's needed by recovery to disk
>> before the backup procedure begins to copy it.
>
> Yeah. What confuses me, though, is that we fail while trying to read from the
> clog. When do we do that during normal (non-standby) recovery?

I believe the OP is setting up a standby. He didn't say if it's a hot standby, though. That changes the startup sequence a bit. At the end of recovery, we read the "last" clog page, containing the next XID that will be assigned to a transaction, and zero out the future part of that clog file (StartupCLOG). In hot standby, we do that earlier, after reading the checkpoint record but before we start replaying records.

> One other possibility would be that the problem is somehow triggered by
> vacuum running while the start-backup checkpoint is commencing. That'd
> explain why the problem goes away with immediate checkpoints, because those
> make the window much smaller. But I haven't got a concrete theory of what's
> happening...

Hmm, that's a good theory.
The logic in SimpleLruPhysicalReadPage() tolerates non-existent files, because the old clog files might've been truncated away by a vacuum after the backup started. That's OK, because they will be recreated, and later deleted again, during the WAL replay. But what if something like this happens:

0. pg_start_backup().
1. rsync starts. It gets a list of all files. It notes that pg_clog/0001 exists.
2. vacuum runs, and deletes pg_clog/0001.
3. rsync starts to copy pg_clog/0001. Oops, it's gone. Instead of leaving it out altogether, it includes it as an empty file.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
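That race is easy to reproduce outside rsync with a copier that snapshots the file list up front. The sketch below models the failure mode described in steps 0-3 (it is a simulation with made-up file names, not rsync's actual behavior):

```python
import os
import tempfile

src = tempfile.mkdtemp()   # stands in for the master's pg_clog/
dst = tempfile.mkdtemp()   # stands in for the standby's pg_clog/

# 0.-1. Backup starts; the copier snapshots the file list while 0001 exists.
with open(os.path.join(src, "0001"), "wb") as f:
    f.write(b"\x00" * 16384)
listing = os.listdir(src)

# 2. vacuum truncates away the old clog segment mid-backup.
os.unlink(os.path.join(src, "0001"))

# 3. A copier that creates the destination file before noticing the source
#    is gone leaves a zero-length 0001 in the backup -- the file exists
#    but is shorter than recovery expects.
for name in listing:
    with open(os.path.join(dst, name), "wb") as out:      # created empty first
        try:
            with open(os.path.join(src, name), "rb") as f:
                out.write(f.read())
        except FileNotFoundError:
            pass          # source vanished; the empty destination file remains

print(os.path.getsize(os.path.join(dst, "0001")))   # 0
```

Whether rsync 2.6.8 actually behaves this way for vanished files would need to be checked against its documentation; the point is only that a list-then-copy tool can produce exactly the truncated-file state seen on the standby.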
On 9/23/11 12:05 PM, "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com> wrote:

>>> It looks to me that pg_clog/0001 exists, but it's shorter than recovery
>>> expects. Which shouldn't happen, of course, because the start-backup
>>> checkpoint should flush all the clog that's needed by recovery to disk
>>> before the backup procedure begins to copy it.
>>
>> Yeah. What confuses me, though, is that we fail while trying to read from the
>> clog. When do we do that during normal (non-standby) recovery?
>
> I believe the OP is setting up a standby. He didn't say if it's a hot
> standby, though. That changes the startup sequence a bit.

I confirm that it is a Hot Standby with Streaming Replication.

>> One other possibility would be that the problem is somehow triggered by
>> vacuum running while the start-backup checkpoint is commencing. That'd
>> explain why the problem goes away with immediate checkpoints, because those
>> make the window much smaller. But I haven't got a concrete theory of what's
>> happening...
>
> Hmm, that's a good theory. The logic in SimpleLruPhysicalReadPage()
> tolerates non-existent files, because the old clog files might've been
> truncated away by a vacuum after the backup started. That's OK, because
> they will be recreated, and later deleted again, during the WAL replay.
> But what if something like this happens:
>
> 0. pg_start_backup().
> 1. rsync starts. It gets a list of all files. It notes that pg_clog/0001
> exists.
> 2. vacuum runs, and deletes pg_clog/0001.
> 3. rsync starts to copy pg_clog/0001. Oops, it's gone. Instead of
> leaving it out altogether, it includes it as an empty file.

I cannot confirm that. I have tried this scenario one more time and was observing how pg_clog/ looked on the master and standby. Here's an overview:

1. pg_start_backup('base_backup') - waiting...

2.
Checking the size of pg_clog periodically on the master:

-bash-3.2$ ls -l pg_clog
total 8
-rw------- 1 postgres postgres 8192 Sep 23 14:31 0000
-bash-3.2$ ls -l pg_clog
total 8
-rw------- 1 postgres postgres 8192 Sep 23 14:31 0000

3. Somewhere just before step (1) releases and we get into the rsync phase, the pg_clog size changes:

-bash-3.2$ ls -l pg_clog
total 16
-rw------- 1 postgres postgres 16384 Sep 23 14:33 0000
-bash-3.2$ ls -l pg_clog
total 16
-rw------- 1 postgres postgres 16384 Sep 23 14:33 0000
-bash-3.2$ ls -l pg_clog
total 16
-rw------- 1 postgres postgres 16384 Sep 23 14:33 0000

4. The rsync does transfer the file:

...
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,881 INFO management.script.ScriptPlugin base/16394/16405_fsm
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,881 INFO management.script.ScriptPlugin base/16394/pg_internal.init
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,886 INFO management.script.ScriptPlugin global/
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,886 INFO management.script.ScriptPlugin global/12691
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,886 INFO management.script.ScriptPlugin global/12696
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,886 INFO management.script.ScriptPlugin global/12697
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,886 INFO management.script.ScriptPlugin global/pg_internal.init
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,887 INFO management.script.ScriptPlugin pg_clog/0000
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,887 INFO management.script.ScriptPlugin pg_multixact/offsets/0000
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,887 INFO management.script.ScriptPlugin pg_notify/
...
But on the standby its size is the old one (thus, it seems, that the size changed after the rsync transfer and before pg_stop_backup() was called):

ls -l pg_clog/
total 8
-rw------- 1 postgres postgres 8192 Sep 23 14:31 0000

5. Hence, the server doesn't start and pg_log complains:

2011-09-23 14:33:45 CEST LOG: streaming replication successfully connected to primary
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: could not access status of transaction 37206
2011-09-23 14:33:46 CEST DETAIL: Could not read from file "pg_clog/0000" at offset 8192: Success.

6. Now, if I do something that, of course, should never be done, and copy this 0000 file from master to the standby soon enough (without even starting/stopping backup), the standby starts up successfully.

--
Hope this helps,
Linas Virbalas
http://flyingclusters.blogspot.com/
On Fri, Sep 23, 2011 at 8:47 AM, Linas Virbalas <linas.virbalas@continuent.com> wrote:

> But on the standby its size is the old one (thus, it seems, that the size
> changed after the rsync transfer and before pg_stop_backup() was
> called):

Now that seems pretty weird - I don't think that file should ever shrink. Are you sure that the rsync isn't starting until after pg_start_backup() completes? Because if you were starting it just a tiny bit early, that would explain the observed symptoms perfectly, I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>> But on the standby its size is the old one (thus, it seems, that the size
>> changed after the rsync transfer and before pg_stop_backup() was
>> called):
>
> Now that seems pretty weird - I don't think that file should ever shrink.

It seems I was not clear in my last example. The pg_clog file doesn't shrink. On the contrary, when rsync kicks in it is still small, but after it completes, somewhere around the pg_stop_backup() call, the pg_clog file grows on the master. Hence, it is left smaller on the standby.

> Are you sure that the rsync isn't starting until after
> pg_start_backup() completes?

100% - the procedure is scripted in a single-threaded application, so rsync is started only after pg_start_backup(...) returns.

> Because if you were starting it just a
> tiny bit early, that would explain the observed symptoms perfectly, I
> think.

I agree, but that is not the case.
On Fri, Sep 23, 2011 at 4:41 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

>> Unfortunately, that's impossible, because the error message "Could not read
>> from file "pg_clog/0001" at offset 32768: Success" is shown (and startup
>> aborted) before the turn of the "redo starts at" message arrives.
>
> It looks to me that pg_clog/0001 exists, but it's shorter than recovery
> expects. Which shouldn't happen, of course, because the start-backup
> checkpoint should flush all the clog that's needed by recovery to disk
> before the backup procedure begins to copy it.

I think the point here is that recovery *never starts*. Something in the standby startup is looking for a value in a clog block that recovery hadn't had a chance to replay (produce) yet.

So the standby is looking into the data directory *before* recovery has had a chance to run, and, based on that, goes to look for something in a clog page that wasn't guaranteed to exist at the start of the backup period, and bombs out before recovery has a chance to start replaying WAL and write the new clog page.

a.

--
Aidan Van Dyk
Create like a god, aidan@highrise.ca
command like a king, http://www.highrise.ca/
work like a slave.
Excerpts from Linas Virbalas's message of vie sep 23 09:47:20 -0300 2011:

> On 9/23/11 12:05 PM, "Heikki Linnakangas"
> <heikki.linnakangas@enterprisedb.com> wrote:
>
> But on the standby its size is the old one (thus, it seems, that the size
> changed after the rsync transfer and before pg_stop_backup() was
> called):
>
> ls -l pg_clog/
> total 8
> -rw------- 1 postgres postgres 8192 Sep 23 14:31 0000

Sounds like rsync is caching the file size at the start of the run, and then copying that many bytes, ignoring the growth that occurred after it started.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
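This theory is likewise easy to model: a copier that stats the source first and then transfers only that many bytes leaves the destination short when the source grows mid-copy. A simulation (not rsync's actual algorithm) that reproduces the exact 8192-vs-16384 sizes Linas observed:

```python
import os
import tempfile

src = os.path.join(tempfile.mkdtemp(), "0000")   # master's clog segment
dst = os.path.join(tempfile.mkdtemp(), "0000")   # standby's copy

with open(src, "wb") as f:              # segment is 8 kB at backup start
    f.write(b"\x00" * 8192)

size_at_start = os.path.getsize(src)    # copier records the size first...

with open(src, "ab") as f:              # ...the segment grows under load...
    f.write(b"\x00" * 8192)

with open(src, "rb") as f, open(dst, "wb") as out:
    out.write(f.read(size_at_start))    # ...but only the recorded bytes are copied

print(os.path.getsize(src), os.path.getsize(dst))   # 16384 8192
```

As Heikki notes later in the thread, this particular truncation is actually harmless: the copy is equivalent to one taken at the moment the size was recorded, and WAL replay extends the file.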
On Sep 23, 2011 5:59 PM, "Alvaro Herrera" <alvherre@commandprompt.com> wrote:
>
> Excerpts from Linas Virbalas's message of vie sep 23 09:47:20 -0300 2011:
> > On 9/23/11 12:05 PM, "Heikki Linnakangas"
> > <heikki.linnakangas@enterprisedb.com> wrote:
>
> > But on the standby its size is the old one (thus, it seems, that the size
> > changed after the rsync transfer and before pg_stop_backup() was
> > called):
> >
> > ls -l pg_clog/
> > total 8
> > -rw------- 1 postgres postgres 8192 Sep 23 14:31 0000
>
> Sounds like rsync is caching the file size at the start of the run, and
> then copying that many bytes, ignoring the growth that occurred after it
> started.

That pretty much matches what Daniel saw when he got the same failure case.

Is this not allowed? Shouldn't WAL replay fix this?

/Magnus
Hi, On Wednesday 21 Sep 2011 16:44:30 Linas Virbalas wrote: > 2011-09-21 13:41:05 CEST DETAIL: Could not read from file "pg_clog/0001" > at offset 32768: Success. Any chance you can attach gdb to the startup process and provide a backtrace from the place where this message is printed? Greetings, Andres
On 23.09.2011 19:03, Magnus Hagander wrote:
> On Sep 23, 2011 5:59 PM, "Alvaro Herrera" <alvherre@commandprompt.com>
> wrote:
>>
>> Excerpts from Linas Virbalas's message of vie sep 23 09:47:20 -0300 2011:
>>> On 9/23/11 12:05 PM, "Heikki Linnakangas"
>>> <heikki.linnakangas@enterprisedb.com> wrote:
>>
>>> But on the standby its size is the old one (thus, it seems, that the size
>>> changed after the rsync transfer and before the pg_stop_backup() was
>>> called):
>>>
>>> ls -l pg_clog/
>>> total 8
>>> -rw------- 1 postgres postgres 8192 Sep 23 14:31 0000
>>
>> Sounds like rsync is caching the file size at the start of the run, and
>> then copying that many bytes, ignoring the growth that occurred after it
>> started.
>
> That pretty much matches what Daniel saw when he got the same failure case.
>
> Is this not allowed? Shouldn't WAL replay fix this?

That's OK. The effect is the same as if rsync had copied the file at the
start. What's not OK is to store the file as an empty or truncated file
in the backup, when the file is deleted from pg_clog while rsync is
running. The file must have length >= the length the file had when the
backup started, or the file must be omitted from the backup altogether.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, Sep 23, 2011 at 11:43 AM, Aidan Van Dyk <aidan@highrise.ca> wrote:
> On Fri, Sep 23, 2011 at 4:41 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>
>>> Unfortunately, it's impossible, because the error message "Could not read
>>> from file "pg_clog/0001" at offset 32768: Success" is shown (and startup
>>> aborted) before the turn for "redo starts at" message arrives.
>>
>> It looks to me that pg_clog/0001 exists, but is shorter than recovery
>> expects. Which shouldn't happen, of course, because the start-backup
>> checkpoint should flush all the clog that's needed by recovery to disk
>> before the backup procedure begins to copy them.
>
> I think the point here is that recovery *never starts*. Something in
> the standby startup is looking for a value in a clog block that
> recovery hadn't had a chance to replay (produce) yet.

Ah. I think you are right - Heikki made the same point. Maybe some of
the stuff that happens just after this comment:

/*
 * Initialize for Hot Standby, if enabled. We won't let backends in
 * yet, not until we've reached the min recovery point specified in
 * control file and we've established a recovery snapshot from a
 * running-xacts WAL record.
 */

...actually needs to be postponed until after we've reached consistency?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sep 23, 2011, at 18:03, Magnus Hagander wrote:
> On Sep 23, 2011 5:59 PM, "Alvaro Herrera" <alvherre@commandprompt.com> wrote:
>> Sounds like rsync is caching the file size at the start of the run, and
>> then copying that many bytes, ignoring the growth that occurred after it
>> started.
>
> That pretty much matches what Daniel saw when he got the same failure case.
>
> Is this not allowed? Shouldn't WAL replay fix this?

I don't see how it could be forbidden. ISTM that we absolutely *have* to
be able to deal with each byte of a file's data, including its metadata,
being in any state it was in between the time pg_start_backup() returned
and pg_stop_backup() was issued, with the individual states being in no
way synchronized. (OK, in reality restricting this to individual
512-byte blocks is probably enough, but still...)

This, I think, is also the reason we need to force full_page_writes to
on during a hot backup. If a page was modified at all between
pg_start_backup() and pg_stop_backup(), we essentially have to assume
it's totally garbled.

best regards,
Florian Pflug
On Sep 23, 2011, at 18:45, Robert Haas wrote:
> Ah. I think you are right - Heikki made the same point. Maybe some
> of the stuff that happens just after this comment:
>
> /*
>  * Initialize for Hot Standby, if enabled. We won't let backends in
>  * yet, not until we've reached the min recovery point specified in
>  * control file and we've established a recovery snapshot from a
>  * running-xacts WAL record.
>  */
>
> ...actually needs to be postponed until after we've reached consistency?

I came to the same conclusion. It seems that before we've reached
consistency, we shouldn't attempt to read anything, because the data can
be pretty arbitrarily garbled.

best regards,
Florian Pflug
On 23.09.2011 19:49, Florian Pflug wrote:
> On Sep 23, 2011, at 18:45, Robert Haas wrote:
>> Ah. I think you are right - Heikki made the same point. Maybe some
>> of the stuff that happens just after this comment:
>>
>> /*
>>  * Initialize for Hot Standby, if enabled. We won't let backends in
>>  * yet, not until we've reached the min recovery point specified in
>>  * control file and we've established a recovery snapshot from a
>>  * running-xacts WAL record.
>>  */
>>
>> ...actually needs to be postponed until after we've reached consistency?
>
> I came to the same conclusion. It seems that before we've reached
> consistency, we shouldn't attempt to read anything, because the data can
> be pretty arbitrarily garbled.

There are pretty clear rules on what state clog can be in when you
launch the postmaster in a standby:

* Any clog preceding the nextXid from the checkpoint record we start
  recovery from must either be valid, or the clog file must be missing
  altogether (which can happen when it was vacuumed away while the
  backup was in progress - if the clog is still needed at the end of the
  backup it must not be missing, of course).
* Any clog following nextXid can be garbled or missing.

Recovery will overwrite any clog after nextXid from the WAL, but not
the clog before it.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
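For anyone trying to map the failing xid onto those rules, the clog geometry makes the arithmetic easy to sketch. The constants below mirror clog's layout (2 status bits per transaction on 8 kB pages, 32 pages per SLRU segment); the helper function is mine, not PostgreSQL code:

```python
# Back-of-the-envelope check of where an xid lives in pg_clog.
# 2 status bits per xact -> 4 xacts per byte; 8192-byte pages;
# 32 pages per SLRU segment (segment names are printed in hex).
CLOG_XACTS_PER_BYTE = 4
BLCKSZ = 8192
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE   # 32768
SLRU_PAGES_PER_SEGMENT = 32

def clog_location(xid):
    """Return (segment file name, byte offset within it) for an xid."""
    pageno = xid // CLOG_XACTS_PER_PAGE
    segment = pageno // SLRU_PAGES_PER_SEGMENT
    offset = (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ
    return "%04X" % segment, offset

# The failing xid from the original report lands exactly where the
# error message says:
print(clog_location(1188673))   # -> ('0001', 32768)
```

Which matches "Could not read from file "pg_clog/0001" at offset 32768" from the original report, so the failed read really was for the failing xid's own clog page.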
On Fri, Sep 23, 2011 at 12:58 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> There are pretty clear rules on what state clog can be in. When you launch
> postmaster in a standby:
>
> * Any clog preceding the nextXid from the checkpoint record we start
> recovery from, must either be valid, or the clog file must be missing
> altogether (which can happen when it was vacuumed away while the backup
> was in progress - if the clog is still needed at the end of backup it must
> not be missing, of course).
> * Any clog following nextXid can be garbled or missing.
>
> Recovery will overwrite any clog after nextXid from the WAL, but not the
> clog before it.

So the actual error message in the last test was:

2011-09-21 13:41:05 CEST FATAL: could not access status of transaction 1188673

...but we can't tell if that was before or after nextXid, which seems
like it would be useful to know. If Linas can rerun his experiment, but
also capture the output of pg_controldata before firing up the standby
for the first time, then we'd be able to see that information.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 23, 2011 at 9:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 23, 2011 at 11:43 AM, Aidan Van Dyk <aidan@highrise.ca> wrote:
>> On Fri, Sep 23, 2011 at 4:41 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>
>>>> Unfortunately, it's impossible, because the error message "Could not read
>>>> from file "pg_clog/0001" at offset 32768: Success" is shown (and startup
>>>> aborted) before the turn for "redo starts at" message arrives.
>>>
>>> It looks to me that pg_clog/0001 exists, but is shorter than recovery
>>> expects. Which shouldn't happen, of course, because the start-backup
>>> checkpoint should flush all the clog that's needed by recovery to disk
>>> before the backup procedure begins to copy them.
>>
>> I think the point here is that recovery *never starts*. Something in
>> the standby startup is looking for a value in a clog block that
>> recovery hadn't had a chance to replay (produce) yet.
>
> Ah. I think you are right - Heikki made the same point. Maybe some
> of the stuff that happens just after this comment:
>
> /*
>  * Initialize for Hot Standby, if enabled. We won't let backends in
>  * yet, not until we've reached the min recovery point specified in
>  * control file and we've established a recovery snapshot from a
>  * running-xacts WAL record.
>  */
>
> ...actually needs to be postponed until after we've reached consistency?

We have a number of backups that are like this, and the problem is
entirely reproducible for those. We always get around it by disabling
hot standby for a while (until consistency is reached).

I poked at xlog.c a bit, and to me it seems entirely likely that
StartupCLOG is being called early - way too early, or at least parts of
it. Presumably(?) it is being called so early in the hot standby path so
that the status of transactions can be known for the purposes of
querying, but it's happening before consistency is reached, ergo not
many invariants (outside of checkpointed things like pg_controldata) are
likely to hold - such as clog being the right size to locate the
transaction status of a page.

Anyway, sorry for dropping the ball on pushing that one; we've been
using this workaround for a while after taking a look at the mechanism
and deciding it was probably not a problem (except for a sound night's
sleep). We've now seen this dozens of times.

--
fdr
On Sep 23, 2011, at 21:10, Robert Haas wrote:
> So the actual error message in the last test was:
>
> 2011-09-21 13:41:05 CEST FATAL: could not access status of transaction 1188673
>
> ...but we can't tell if that was before or after nextXid, which seems
> like it would be useful to know.
>
> If Linas can rerun his experiment, but also capture the output of
> pg_controldata before firing up the standby for the first time, then
> we'd be able to see that information.

Hm, wouldn't pg_controldata quite certainly show a nextXid beyond the
clog if it was copied after pg_clog/*?

Linas, could you capture the output of pg_controldata *and* increase the
log level to DEBUG1 on the standby? We should then see the nextXid value
of the checkpoint the recovery is starting from.

FWIW, I've had a few more theories about what's going on, but none
survived after looking at the code.

My first guess was that there are maybe circumstances under which the
nextXid from the control file, instead of the one from the pre-backup
checkpoint, ends up becoming the standby's nextXid. But there doesn't
seem to be a way for that to happen.

My next theory was that something increments nextXid before
StartupCLOG(). The only possible candidate seems to be
PrescanPreparedTransactions(), which does increment nextXid if it's
smaller than some sub-xid of a prepared xact. But we only call that
before StartupCLOG() if we're starting from a shutdown checkpoint, which
shouldn't be the case for the OP.

I also checked what rsync does when a file vanishes after rsync computed
the file list, but before it is sent. rsync 3.0.7 on OS X, at least,
complains loudly, and doesn't sync the file. BTW, it also exits
non-zero, with a special exit code for precisely that failure case.

best regards,
Florian Pflug
> Linas, could you capture the output of pg_controldata *and* increase the > log level to DEBUG1 on the standby? We should then see nextXid value of > the checkpoint the recovery is starting from. I'll try to do that whenever I'm in that territory again... Incidentally, recently there was a lot of unrelated-to-this-post work to polish things up for a talk being given at PGWest 2011 Today :) > I also checked what rsync does when a file vanishes after rsync computed the > file list, but before it is sent. rsync 3.0.7 on OSX, at least, complains > loudly, and doesn't sync the file. It BTW also exits non-zero, with a special > exit code for precisely that failure case. To be precise, my script has logic to accept the exit code 24, just as stated in PG manual: Docs> For example, some versions of rsync return a separate exit code for Docs> "vanished source files", and you can write a driver script to accept Docs> this exit code as a non-error case. -- Sincerely, Linas Virbalas http://flyingclusters.blogspot.com/
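A minimal sketch of that driver-script acceptance logic (the helper name is mine; 24 is rsync's documented exit code for "partial transfer due to vanished source files"):

```python
# Sketch of the "accept exit code 24" logic from the driver script.
# 24 is rsync's exit code for "partial transfer due to vanished
# source files" - expected and harmless during a hot backup, since
# the server may legitimately remove files while rsync runs.
RSYNC_VANISHED_FILES = 24

def rsync_status_ok(returncode):
    """Success, or files vanished mid-run (a non-error for hot backups)."""
    return returncode in (0, RSYNC_VANISHED_FILES)

# Usage sketch (hypothetical paths):
#   import subprocess
#   rc = subprocess.call(["rsync", "-a", "--exclude=pg_xlog",
#                         "data/", "standby:data/"])
#   if not rsync_status_ok(rc):
#       raise SystemExit(rc)
```

Any other nonzero exit code (e.g. 23, a generic partial-transfer error) should still abort the backup.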
On Sep 29, 2011, at 17:44, Linas Virbalas wrote:
>> I also checked what rsync does when a file vanishes after rsync computed the
>> file list, but before it is sent. rsync 3.0.7 on OSX, at least, complains
>> loudly, and doesn't sync the file. It BTW also exits non-zero, with a special
>> exit code for precisely that failure case.
>
> To be precise, my script has logic to accept the exit code 24, just as
> stated in PG manual:
>
> Docs> For example, some versions of rsync return a separate exit code for
> Docs> "vanished source files", and you can write a driver script to accept
> Docs> this exit code as a non-error case.

Oh, cool. I was about to suggest that we add something along these lines
to the docs - guess I should RTFM from time to time ;-)

best regards,
Florian Pflug
> > Linas, could you capture the output of pg_controldata *and* increase the
> > log level to DEBUG1 on the standby? We should then see nextXid value of
> > the checkpoint the recovery is starting from.
>
> I'll try to do that whenever I'm in that territory again... Incidentally,
> recently there was a lot of unrelated-to-this-post work to polish things up
> for a talk being given at PGWest 2011 Today :)
>
> > I also checked what rsync does when a file vanishes after rsync computed the
> > file list, but before it is sent. rsync 3.0.7 on OSX, at least, complains
> > loudly, and doesn't sync the file. It BTW also exits non-zero, with a special
> > exit code for precisely that failure case.
>
> To be precise, my script has logic to accept the exit code 24, just as
> stated in PG manual:
>
> Docs> For example, some versions of rsync return a separate exit code for
> Docs> "vanished source files", and you can write a driver script to accept
> Docs> this exit code as a non-error case.
I also am running into this issue and can reproduce it very reliably. For me, however, it happens even when doing the "fast backup", like so: pg_start_backup('whatever', true). My traffic is more write-heavy than Linas's, though, so that might have something to do with it. Yesterday it reliably errored out on pg_clog every time, but today it is failing sporadically on pg_subtrans (which seems to be past where the pg_clog error was). The only thing that has changed is that I've changed the log level to debug1; I wouldn't think that could be related, though. I've linked the requested pg_controldata and debug1 logs for both errors. Both links contain the output from pg_start_backup, rsync, pg_stop_backup, pg_controldata, and then the postgres debug1 log produced from a subsequent startup attempt.
pg_clog: http://pastebin.com/mTfdcjwH
pg_subtrans: http://pastebin.com/qAXEHAQt
Any workarounds would be very much appreciated. Would copying pg_clog and pg_subtrans before or after the rest of the data directory (or something like that) make any difference?
Thanks!
I can confirm that both the pg_clog and pg_subtrans errors do occur when using pg_basebackup instead of rsync. The data itself seems to be fine, because using the exact same data I can start up a warm standby no problem; it is just the hot standby that will not start up.
Well, on the other hand, maybe there is something wrong with the data. Here are the test steps I just did:
1. I do the pg_basebackup when the master is under load, hot slave now will not start up but warm slave will.
2. I start a warm slave and let it catch up to current
3. On the slave I change 'hot_standby=on' and do a 'service postgresql restart'
4. The postgres fails to restart with the same error.
5. I turn hot_standby back off and postgres starts back up fine as a warm slave
6. I then turn off the load, the slave is all caught up, master and slave are both sitting idle
7. I, again, change 'hot_standby=on' and do a service restart
8. Again it fails, with the same error, even though there is no longer any load.
9. I repeat this warm-start/hot-start cycle a couple more times until, to my surprise, instead of failing, it successfully starts up as a hot standby (this is after maybe 5 minutes or so of sitting idle).
So, given that it continued to fail even after the load had been turned off, that makes me believe that the data which was copied over was invalid in some way, and that when a checkpoint/log rotation/something else occurred while not under load, it cleared itself up. I'm shooting in the dark here...
Anyone have any suggestions/ideas/things to try?
On Mon, Oct 17, 2011 at 11:30 PM, Chris Redekop <chris@replicon.com> wrote: > Well, on the other hand maybe there is something wrong with the data. > Here's the test/steps I just did - > 1. I do the pg_basebackup when the master is under load, hot slave now will > not start up but warm slave will. > 2. I start a warm slave and let it catch up to current > 3. On the slave I change 'hot_standby=on' and do a 'service postgresql > restart' > 4. The postgres fails to restart with the same error. > 5. I turn hot_standby back off and postgres starts back up fine as a warm > slave > 6. I then turn off the load, the slave is all caught up, master and slave > are both sitting idle > 7. I, again, change 'hot_standby=on' and do a service restart > 8. Again it fails, with the same error, even though there is no longer any > load. > 9. I repeat this warmstart/hotstart cycle a couple more times until to my > surprise, instead of failing, it successfully starts up as a hot standby > (this is after maybe 5 minutes or so of sitting idle) > So...given that it continued to fail even after the load had been turned of, > that makes me believe that the data which was copied over was invalid in > some way. And when a checkpoint/logrotation/somethingelse occurred when not > under load it cleared itself up....I'm shooting in the dark here > Anyone have any suggestions/ideas/things to try? Having digged at this a little -- but not too much -- the problem seems to be that postgres is reading the commit logs way, way too early, that is to say, before it has played enough WAL to be 'consistent' (the WAL between pg_start and pg_stop backup). I have not been able to reproduce this problem (I think) after the message from postgres suggesting it has reached a consistent state; at that time I am able to go into hot-standby mode. The message is like: "consistent recovery state reached at %X/%X". 
(this is the errmsg) It doesn't seem meaningful for StartupCLOG (or, indeed, any of the hot-standby path functionality) to be called before that code is executed, but it is anyway right now. I'm not sure if this oversight is simply an oversight, or indicative of a misplaced assumption somewhere. Basically, my thoughts for a fix are to suppress hot_standby = on (in spirit) before the consistent recovery state is reached. -- fdr
On Sun, Oct 23, 2011 at 9:48 PM, Daniel Farina <daniel@heroku.com> wrote: > On Mon, Oct 17, 2011 at 11:30 PM, Chris Redekop <chris@replicon.com> wrote: >> Well, on the other hand maybe there is something wrong with the data. >> Here's the test/steps I just did - >> 1. I do the pg_basebackup when the master is under load, hot slave now will >> not start up but warm slave will. >> 2. I start a warm slave and let it catch up to current >> 3. On the slave I change 'hot_standby=on' and do a 'service postgresql >> restart' >> 4. The postgres fails to restart with the same error. >> 5. I turn hot_standby back off and postgres starts back up fine as a warm >> slave >> 6. I then turn off the load, the slave is all caught up, master and slave >> are both sitting idle >> 7. I, again, change 'hot_standby=on' and do a service restart >> 8. Again it fails, with the same error, even though there is no longer any >> load. >> 9. I repeat this warmstart/hotstart cycle a couple more times until to my >> surprise, instead of failing, it successfully starts up as a hot standby >> (this is after maybe 5 minutes or so of sitting idle) >> So...given that it continued to fail even after the load had been turned of, >> that makes me believe that the data which was copied over was invalid in >> some way. And when a checkpoint/logrotation/somethingelse occurred when not >> under load it cleared itself up....I'm shooting in the dark here >> Anyone have any suggestions/ideas/things to try? > > Having digged at this a little -- but not too much -- the problem > seems to be that postgres is reading the commit logs way, way too > early, that is to say, before it has played enough WAL to be > 'consistent' (the WAL between pg_start and pg_stop backup). I have > not been able to reproduce this problem (I think) after the message > from postgres suggesting it has reached a consistent state; at that > time I am able to go into hot-standby mode. 
> > The message is like: "consistent recovery state reached at %X/%X". > (this is the errmsg) > > It doesn't seem meaningful for StartupCLOG (or, indeed, any of the > hot-standby path functionality) to be called before that code is > executed, but it is anyway right now. I'm not sure if this oversight > is simply an oversight, or indicative of a misplaced assumption > somewhere. Basically, my thoughts for a fix are to suppress > hot_standby = on (in spirit) before the consistent recovery state is > reached. Not sure about that, but I'll look at where this comes from. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Oct 16, 2011 at 2:33 AM, Chris Redekop <chris@replicon.com> wrote: > pg_subtrans: http://pastebin.com/qAXEHAQt I confirm this as a HS issue and will investigate from here. FATAL: could not access status of transaction 21110784 which, in pg_subtrans, is the first xid on a new subtrans page. So we have missed zeroing a page. pg_control shows ... Latest checkpoint's oldestActiveXID: 21116666 which shows quite clearly that the pg_control file is later than it should be. Chris, can you rearrange the backup so you copy the pg_control file as the first act after the pg_start_backup? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Oct 23, 2011 at 9:48 PM, Daniel Farina <daniel@heroku.com> wrote: > Having digged at this a little -- but not too much -- the problem > seems to be that postgres is reading the commit logs way, way too > early, that is to say, before it has played enough WAL to be > 'consistent' (the WAL between pg_start and pg_stop backup). I have > not been able to reproduce this problem (I think) after the message > from postgres suggesting it has reached a consistent state; at that > time I am able to go into hot-standby mode. The WAL appears too early because the other control info is later than it should be. So this is approx backwards and nothing related to consistent state, but thanks for drawing my attention to this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Oct24, 2011, at 01:27 , Simon Riggs wrote: > FATAL: could not access status of transaction 21110784 > which, in pg_subtrans, is the first xid on a new subtrans page. So we > have missed zeroing a page. > > pg_control shows ... Latest checkpoint's oldestActiveXID: 21116666 > which shows quite clearly that the pg_control file is later than it should be. But shouldn't pg_control be largely irrelevant in a hot backup scenario? Most (all?) of the information contained therein should be overwritten with the contents of the checkpoint referenced by the backup label, shouldn't it? best regards, Florian Pflug
On Oct 23, 2011, at 22:48, Daniel Farina wrote:
> It doesn't seem meaningful for StartupCLOG (or, indeed, any of the
> hot-standby path functionality) to be called before that code is
> executed.

I think the idea is to check that the CLOG part which recovery *won't*
overwrite is consistent (or rather, given the simplicity of the check,
at least accessible).

Heikki said the following somewhere else in this thread when I suggested
something similar to your proposal:

>> There are pretty clear rules on what state clog can be in. When you
>> launch postmaster in a standby:
>>
>> * Any clog preceding the nextXid from the checkpoint record we start
>> recovery from, must either be valid, or the clog file must be missing
>> altogether (which can happen when it was vacuumed away while the backup
>> was in progress - if the clog is still needed at the end of the backup
>> it must not be missing, of course).
>> * Any clog following nextXid can be garbled or missing.
>>
>> Recovery will overwrite any clog after nextXid from the WAL, but not
>> the clog before it.

I think Simon's theory that we're starting recovery from the wrong
place, i.e. should start with an earlier WAL location, is probably
correct. The question is, why?

best regards,
Florian Pflug
On Mon, Oct 24, 2011 at 7:13 AM, Florian Pflug <fgp@phlo.org> wrote:
> I think Simon's theory that we're starting recovery from the wrong place,
> i.e. should start with an earlier WAL location, is probably correct. The
> question is, why?

Err, that's not what I said and I don't mean that. Having said that,
what I said about pg_control being invalid would imply that, so is
wrong also.

We are starting recovery at the right place but we are initialising
the clog and subtrans incorrectly. Precisely, the oldestActiveXid is
being derived later than it should be, which can cause problems if this
then means that whole pages are uninitialised in subtrans. The bug only
shows up if you do enough transactions (2048 is always enough) to move
to the next subtrans page between the redo pointer and the checkpoint
record, while at the same time we do not have a long-running transaction
that spans those two points. That's just enough to happen reasonably
frequently on busy systems and yet just enough to have slipped through
testing.

We must either

1. During CreateCheckpoint(), derive oldestActiveXid before we derive
the redo location, or

2. Change the way subtrans pages are initialized during recovery so we
don't rely on oldestActiveXid.

I need to think some more before a decision on this in my own mind, but
I lean towards doing (1) as a longer term fix and doing (2) as a short
term fix for existing releases. I expect to have a fix later today.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
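The 2048 figure can be sanity-checked against Chris's numbers from upthread (pg_subtrans stores one 4-byte parent xid per transaction on an 8 kB page; the helper below is a sketch of the arithmetic, not PostgreSQL code):

```python
# pg_subtrans keeps one 4-byte parent xid per transaction, so an
# 8 kB page covers 8192 / 4 = 2048 xids - hence "2048 is always
# enough" transactions to cross onto a new subtrans page.
BLCKSZ = 8192
SUBTRANS_XACTS_PER_PAGE = BLCKSZ // 4   # 2048

def subtrans_page(xid):
    return xid // SUBTRANS_XACTS_PER_PAGE

# Chris's failing xid (21110784) really is the first xid on its page:
print(21110784 % SUBTRANS_XACTS_PER_PAGE)                  # -> 0
# ...while oldestActiveXID from his pg_control (21116666) lies two
# pages later, so startup never zeroed the failing xid's page:
print(subtrans_page(21116666) - subtrans_page(21110784))   # -> 2
```

That is consistent with Simon's reading of the pg_control output: pages between the too-late oldestActiveXid and the redo pointer were left unzeroed.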
On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > We are starting recovery at the right place but we are initialising > the clog and subtrans incorrectly. Precisely, the oldestActiveXid is > being derived later than it should be, which can cause problems if > this then means that whole pages are unitialised in subtrans. The bug > only shows up if you do enough transactions (2048 is always enough) to > move to the next subtrans page between the redo pointer and the > checkpoint record while at the same time we do not have a long running > transaction that spans those two points. That's just enough to happen > reasonably frequently on busy systems and yet just enough to have > slipped through testing. > > We must either > > 1. During CreateCheckpoint() we should derive oldestActiveXid before > we derive the redo location > > 2. Change the way subtrans pages are initialized during recovery so we > don't rely on oldestActiveXid > > I need to think some more before a decision on this in my own mind, > but I lean towards doing (1) as a longer term fix and doing (2) as a > short term fix for existing releases. I expect to have a fix later > today. (1) looks the best way forwards in all cases. Patch attached. Will be backpatched to 9.0 I think it is possible to avoid taking XidGenLock during GetRunningTransactions() now, but I haven't included that change in this patch. Any other comments before commit? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On Oct25, 2011, at 11:13 , Simon Riggs wrote: > On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> We are starting recovery at the right place but we are initialising >> the clog and subtrans incorrectly. Precisely, the oldestActiveXid is >> being derived later than it should be, which can cause problems if >> this then means that whole pages are uninitialised in subtrans. The bug >> only shows up if you do enough transactions (2048 is always enough) to >> move to the next subtrans page between the redo pointer and the >> checkpoint record while at the same time we do not have a long running >> transaction that spans those two points. That's just enough to happen >> reasonably frequently on busy systems and yet just enough to have >> slipped through testing. >> >> We must either >> >> 1. During CreateCheckpoint() we should derive oldestActiveXid before >> we derive the redo location > (1) looks the best way forwards in all cases. Let me see if I understand this. The problem seems to be that we currently derive oldestActiveXid at the end of the checkpoint, just before writing the checkpoint record. Since we use oldestActiveXid to initialize SUBTRANS, this is wrong. Records written before that checkpoint record (but after the REDO location, of course) may very well contain XIDs earlier than that wrongly derived oldestActiveXID, and if we attempt to touch these XIDs' SUBTRANS state, we error out. Your patch seems sensible, because the checkpoint "logically" occurs at the REDO location, not the checkpoint record's location, so we ought to log an oldestActiveXID corresponding to that location. What I don't understand is how this affects the CLOG. How does oldestActiveXID factor into CLOG initialization? best regards, Florian Pflug
On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug <fgp@phlo.org> wrote: > What I don't understand is how this affects the CLOG. How does oldestActiveXID > factor into CLOG initialization? It is an entirely different error. Chris' clog error was caused by a file read error. The file was opened, we did a seek within the file and then the call to read() failed to return a complete page from the file. The xid shown is 22811359, which is the nextxid in the control file. So StartupClog() must have failed trying to read the clog page from disk. That isn't a Hot Standby problem, a recovery problem nor is it certain it's a PostgreSQL problem. OTOH SlruPhysicalReadPage() does cope gracefully with missing clog files during recovery, so maybe we can think of a way to make recovery cope with a SLRU_READ_FAILED error gracefully also. Any ideas? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> Chris, can you rearrange the backup so you copy the pg_control file as
> the first act after the pg_start_backup?
I tried this and it doesn't seem to make any difference. I also tried the patch and I can no longer reproduce the subtrans error; however, it now starts up but never gets to the point where it'll accept connections. It starts up but if I try to do anything I always get "FATAL: the database system is starting up"...even if the load is removed from the primary, the standby still never finishes "starting up". Attached below is a log of one of these startup attempts. In my testing with the patch applied approx 3 in 10 attempts start up successfully, 7 in 10 attempts go into the "db is starting up" state....the pg_clog error is still there, but seems much harder to reproduce now....I've seen it only once since applying the patch (out of probably 50 or 60 under-load startup attempts). It does seem to be "moody" like that tho....it will be very difficult to reproduce for a while, and then it will happen damn-near every time for a while...weirdness
On a bit of a side note, I've been thinking of changing my scripts so that they perform an initial rsync prior to doing the startbackup-rsync-stopbackup just so that the second rsync will be faster....so that the backup is in progress for a shorter period of time, as while it is running it will stop other standbys from starting up....this shouldn't cause any issues eh?
2011-10-25 13:43:24.035 MDT [15072]: [1-1] LOG: database system was interrupted; last known up at 2011-10-25 13:43:11 MDT
2011-10-25 13:43:24.035 MDT [15072]: [2-1] LOG: creating missing WAL directory "pg_xlog/archive_status"
2011-10-25 13:43:24.037 MDT [15072]: [3-1] LOG: entering standby mode
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: START_REPLICATION 2/CF000000
2011-10-25 13:43:24.041 MDT [15073]: [1-1] LOG: streaming replication successfully connected to primary
2011-10-25 13:43:24.177 MDT [15092]: [1-1] FATAL: the database system is starting up
2011-10-25 13:43:24.781 MDT [15072]: [4-1] DEBUG: checkpoint record is at 2/CF81A478
2011-10-25 13:43:24.781 MDT [15072]: [5-1] DEBUG: redo record is at 2/CF000020; shutdown FALSE
2011-10-25 13:43:24.781 MDT [15072]: [6-1] DEBUG: next transaction ID: 0/4634700; next OID: 1188228
2011-10-25 13:43:24.781 MDT [15072]: [7-1] DEBUG: next MultiXactId: 839; next MultiXactOffset: 1686
2011-10-25 13:43:24.781 MDT [15072]: [8-1] DEBUG: oldest unfrozen transaction ID: 1669, in database 1
2011-10-25 13:43:24.781 MDT [15072]: [9-1] DEBUG: transaction ID wrap limit is 2147485316, limited by database with OID 1
2011-10-25 13:43:24.783 MDT [15072]: [10-1] DEBUG: resetting unlogged relations: cleanup 1 init 0
2011-10-25 13:43:24.791 MDT [15072]: [11-1] DEBUG: initializing for hot standby
2011-10-25 13:43:24.791 MDT [15072]: [12-1] LOG: consistent recovery state reached at 2/CF81A4D0
2011-10-25 13:43:24.791 MDT [15072]: [13-1] LOG: redo starts at 2/CF000020
2011-10-25 13:43:25.019 MDT [15072]: [14-1] LOG: consistent state delayed because recovery snapshot incomplete
2011-10-25 13:43:25.019 MDT [15072]: [15-1] CONTEXT: xlog redo running xacts:
nextXid 4634700 latestCompletedXid 4634698 oldestRunningXid 4634336; 130 xacts:
4634336 4634337 4634338 4634339 4634340 4634341 4634342 4634343 4634344 4634345
4634346 4634347 4634348 4634349 4634350 4634351 4634352 4634353 4634354 4634355
4634356 4634357 4634358 4634359 4634360 4634361 4634362 4634363 4634364 4634365
4634366 4634367 4634368 4634369 4634370 4634371 4634515 4634516 4634517 4634518
4634519 4634520 4634521 4634522 4634523 4634524 4634525 4634526 4634527 4634528
4634529 4634530 4634531 4634532 4634533 4634534 4634535 4634536 4634537 4634538
4634539 4634540 4634541 4634542 4634543 4634385 4634386 4634387 4634388 4634389
4634390 4634391 4634392 4634393 4634394 4634395 4634396 4634397 4634398 4634399
4634400 4634401 4634402 4634403 4634404 4634405 4634406 4634407 4634408 4634409
4634410 4634411 4634412 4634413 4634414 4634415 4634416 4634417 4634418 4634419
4634420 4634579 4634580 4634581 4634582 4634583 4634584 4634585 4634586 4634587
4634588 4634589 4634590 4634591 4634592 4634593 4634594 4634595 4634596 4634597
4634598 4634599 4634600 4634601 4634602 4634603 4634604 4634605 4634606 4634607;
subxid ovf
2011-10-25 13:43:25.240 MDT [15130]: [1-1] FATAL: the database system is starting up
DEBUG: standby "sync_rep_test" has now caught up with primary
2011-10-25 13:43:26.304 MDT [15167]: [1-1] FATAL: the database system is starting up
2011-10-25 13:43:27.366 MDT [15204]: [1-1] FATAL: the database system is starting up
2011-10-25 13:43:28.426 MDT [15241]: [1-1] FATAL: the database system is starting up
2011-10-25 13:43:29.461 MDT [15275]: [1-1] FATAL: the database system is starting up
and so on...
On Tue, Oct 25, 2011 at 6:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug <fgp@phlo.org> wrote:
>> What I don't understand is how this affects the CLOG. How does oldestActiveXID
>> factor into CLOG initialization?
> It is an entirely different error.
> Chris' clog error was caused by a file read error. The file was
> opened, we did a seek within the file and then the call to read()
> failed to return a complete page from the file.
> The xid shown is 22811359, which is the nextxid in the control file.
> So StartupClog() must have failed trying to read the clog page from disk.
> That isn't a Hot Standby problem, a recovery problem nor is it certain
> it's a PostgreSQL problem.
> OTOH SlruPhysicalReadPage() does cope gracefully with missing clog
> files during recovery, so maybe we can think of a way to make recovery
> cope with a SLRU_READ_FAILED error gracefully also. Any ideas?
> --
> Simon Riggs http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

> That isn't a Hot Standby problem, a recovery problem nor is it certain
> it's a PostgreSQL problem.
Do you have any theories on this that I could help investigate? It happens even when using pg_basebackup and it persists until another sync is performed, so the files must be in some state that it can't recover from....without understanding the internals, just viewing from an outside perspective, I don't really see how this could not be a PostgreSQL problem....
On Tue, Oct 25, 2011 at 10:06 PM, Chris Redekop <chris@replicon.com> wrote: >> Chris, can you rearrange the backup so you copy the pg_control file as >> the first act after the pg_start_backup? > I tried this and it doesn't seem to make any difference. It won't, that was a poor initial diagnosis on my part. > I also tried the > patch and I can no longer reproduce the subtrans error Good > , however it now > starts up, but never gets to the point where it'll accept > connections. It starts up but if I try to do anything I always get "FATAL: > the database system is starting up"...even if the load is removed from the > primary, the standby still never finishes "starting up". ... > 2011-10-25 13:43:25.019 MDT [15072]: [14-1] LOG: consistent state delayed > because recovery snapshot incomplete ... This is a different problem and has already been reported by one of your colleagues in a separate thread, and answered in detail by me there. There is no bug related to this error message. From here, it looks like the published patch fixes the originally reported problem. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Oct25, 2011, at 14:51 , Simon Riggs wrote: > On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug <fgp@phlo.org> wrote: > >> What I don't understand is how this affects the CLOG. How does oldestActiveXID >> factor into CLOG initialization? > > It is an entirely different error. Ah, OK. I assumed that you believed fixing the wrong oldestActiveXID computation solved both the SUBTRANS-related *and* the CLOG-related errors, since you said "We are starting recovery at the right place but we are initialising the clog and subtrans incorrectly" at the start of the mail. > Chris' clog error was caused by a file read error. The file was > opened, we did a seek within the file and then the call to read() > failed to return a complete page from the file. > > The xid shown is 22811359, which is the nextxid in the control file. > > So StartupClog() must have failed trying to read the clog page from disk. Yep. > That isn't a Hot Standby problem, a recovery problem nor is it certain > it's a PostgreSQL problem. It's very likely that it's a PostgreSQL problem, though. It's probably not a pilot error since it happens even for backups taken with pg_basebackup(), so the only explanation other than a PostgreSQL bug is broken hardware or a pretty serious kernel/filesystem bug. > OTOH SlruPhysicalReadPage() does cope gracefully with missing clog > files during recovery, so maybe we can think of a way to make recovery > cope with a SLRU_READ_FAILED error gracefully also. Any ideas? As long as we don't understand how the CLOG-related errors happen in the first place, I think it's a bad idea to silence them. best regards, Florian Pflug
On Oct25, 2011, at 13:39 , Florian Pflug wrote: > On Oct25, 2011, at 11:13 , Simon Riggs wrote: >> On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> We are starting recovery at the right place but we are initialising >>> the clog and subtrans incorrectly. Precisely, the oldestActiveXid is >>> being derived later than it should be, which can cause problems if >>> this then means that whole pages are uninitialised in subtrans. The bug >>> only shows up if you do enough transactions (2048 is always enough) to >>> move to the next subtrans page between the redo pointer and the >>> checkpoint record while at the same time we do not have a long running >>> transaction that spans those two points. That's just enough to happen >>> reasonably frequently on busy systems and yet just enough to have >>> slipped through testing. >>> >>> We must either >>> >>> 1. During CreateCheckpoint() we should derive oldestActiveXid before >>> we derive the redo location > >> (1) looks the best way forwards in all cases. > > Let me see if I understand this. > > The problem seems to be that we currently derive oldestActiveXid at the end of > the checkpoint, just before writing the checkpoint record. Since we use > oldestActiveXid to initialize SUBTRANS, this is wrong. Records written before > that checkpoint record (but after the REDO location, of course) may very well > contain XIDs earlier than that wrongly derived oldestActiveXID, and if we attempt > to touch these XIDs' SUBTRANS state, we error out. > > Your patch seems sensible, because the checkpoint "logically" occurs at the > REDO location, not the checkpoint record's location, so we ought to log an oldestActiveXID > corresponding to that location. Thinking about this some more (and tracing through the code), I realized that things are a bit more complicated. What we actually need to ensure, I think, is that the XID we pass to StartupSUBTRANS() is earlier than any top-level XID in XLOG_XACT_ASSIGNMENT records.
Which, at first glance, implies that we ought to use the nextId at the *beginning* of the checkpoint for SUBTRANS initialization. At second glance, however, that'd be wrong, because backends emit XLOG_XACT_ASSIGNMENT only every PGPROC_MAX_CACHED_SUBXIDS sub-XID assignments. Thus, an XLOG_XACT_ASSIGNMENT written *after* the checkpoint has started may contain sub-XIDs which were assigned *before* the checkpoint has started. Using oldestActiveXID works around that because we guarantee that sub-XIDs are always larger than their parent XIDs and because only active transactions can produce XLOG_XACT_ASSIGNMENT records. So your patch is fine, but I think the reasoning about why oldestActiveXID is the correct value for StartupSUBTRANS deserves an explanation somewhere. best regards, Florian Pflug
On Wed, Oct 26, 2011 at 12:16 PM, Florian Pflug <fgp@phlo.org> wrote: >> Chris' clog error was caused by a file read error. The file was >> opened, we did a seek within the file and then the call to read() >> failed to return a complete page from the file. >> >> The xid shown is 22811359, which is the nextxid in the control file. >> >> So StartupClog() must have failed trying to read the clog page from disk. > > Yep. > >> That isn't a Hot Standby problem, a recovery problem nor is it certain >> it's a PostgreSQL problem. > > It's very likely that it's a PostgreSQL problem, though. It's probably > not a pilot error since it happens even for backups taken with pg_basebackup(), > so the only explanation other than a PostgreSQL bug is broken hardware or > a pretty serious kernel/filesystem bug. The way forwards here is for someone to show the clog file that causes the error and find out why the call to read() fails. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 26, 2011 at 7:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> It's very likely that it's a PostgreSQL problem, though. It's probably >> not a pilot error since it happens even for backups taken with pg_basebackup(), >> so the only explanation other than a PostgreSQL bug is broken hardware or >> a pretty serious kernel/filesystem bug. > > The way forwards here is for someone to show the clog file that causes > the error and find out why the call to read() fails. Sorry, I thought the problem was obvious. Either that, or I've completely missed something in these threads... I'll admit to not following this one very closely anymore... When the backup started, the clog was small. So on the "recovering instance", the clog is small. PostgreSQL is supposed to be able to deal with any file as it was when the backup starts. When the backup is stopped, the clog is big. But that file was copied after the backup was started, not after the backup finished. So its size is only guaranteed to be as big as it was when the backup started. Recovery is responsible for extending it as it was extended during the backup period on the master. The read fails because there is no data at the location it's trying to read from, because the clog hasn't been extended yet by recovery. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Wed, Oct 26, 2011 at 12:54 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: > The read fails because there is no data at the location it's trying to > read from, because clog hasn't been extended yet by recovery. You don't actually know that, though I agree it seems a reasonable guess and was my first thought also. The error is very specifically referring to 22811359, which is the nextxid from pg_control and updated by checkpoint. 22811359 is mid-way through a clog page, so prior xids will already have been allocated, pages extended and then those pages fsynced before the end of pg_start_backup(). So it shouldn't be possible for that page to be absent from the base backup, unless the base backup was taken without a preceding checkpoint, which it seems is not the case from the script output. Note that if you are correct, then the solution is to extend the clog, which Florian disagrees with as a solution. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
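The file/offset pairs reported in this thread are easy to check against the clog's layout (assumptions, matching PostgreSQL's defaults: 2 status bits per transaction, 8 kB pages, 32 pages per SLRU segment file; the helper is mine, not PostgreSQL code):

```python
# Map an XID to the clog segment file and byte offset that hold its
# 2-bit commit status (assumed PostgreSQL defaults; names are made up).
BLCKSZ = 8192
CLOG_XACTS_PER_PAGE = BLCKSZ * 4        # 4 xacts per byte -> 32768 per page
SLRU_PAGES_PER_SEGMENT = 32

def clog_location(xid):
    page = xid // CLOG_XACTS_PER_PAGE
    segment = "%04X" % (page // SLRU_PAGES_PER_SEGMENT)
    offset = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ
    return segment, offset

# Both failures reported in this thread line up exactly with the math:
assert clog_location(1188673) == ("0001", 32768)     # Linas' report
assert clog_location(10846655) == ("000A", 90112)    # Chris' report
# And 22811359 does sit mid-page, as stated above:
assert 22811359 % CLOG_XACTS_PER_PAGE == 4831
```

In both cases the failing offset is the page holding the checkpoint's nextxid, consistent with a read just past the end of a too-short file.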
On Oct26, 2011, at 15:12 , Simon Riggs wrote: > On Wed, Oct 26, 2011 at 12:54 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: > >> The read fails because there is no data at the location it's trying to >> read from, because clog hasn't been extended yet by recovery. > > You don't actually know that, though I agree it seems a reasonable > guess and was my first thought also. The actual error message also supports that theory. Here's the relevant snippet from the OP's log (found in CA9FD2FE.1D8D2%linas.virbalas@continuent.com) 2011-09-21 13:41:05 CEST FATAL: could not access status of transaction 1188673 2011-09-21 13:41:05 CEST DETAIL: Could not read from file "pg_clog/0001" at offset 32768: Success. Note that it says "Success" at the end of the second log entry. That can only happen, I think, if we're trying to read the page adjacent to the last page in the file. The seek would be successful, and the subsequent read() would indicate EOF by returning zero bytes. None of the calls would set errno. If there was a real IO error, read() would set errno, and if the page wasn't adjacent to the last page in the file, seek() would set errno. In both cases we'd see the corresponding error message, not "Success". > The error is very specifically referring to 22811359, which is the > nextxid from pg_control and updated by checkpoint. Where does that XID come from? The only reference to that XID that I can find in the archives is in your message CA+U5nMKUUoA8kRG=itfsO5nZuE3x_KDJz78EaUN3_FKmq-uMJA@mail.gmail.com > 22811359 is mid-way through a clog page, so prior xids will already > have been allocated, pages extended and then those pages fsynced before > the end of pg_start_backup(). So it shouldn't be possible for that > page to be absent from the base backup, unless the base backup was > taken without a preceding checkpoint, which seems is not the case from > the script output. Or unless the nextId we store in the checkpoint is for some reason higher than it should be.
Or unless nextId somehow gets mangled during recovery. Or unless there's some interaction between VACUUM and CHECKPOINTs that we're overlooking... > Note that if you are correct, then the solution is to extend clog, > which Florian disagrees with as a solution. That's not what I said. As you said, the CLOG page corresponding to nextId *should* always be accessible at the start of recovery (unless the whole file has been removed by VACUUM, that is). So we shouldn't need to extend the CLOG. Yet the error suggests that the CLOG is, in fact, too short. What I said is that we shouldn't apply any fix (for the CLOG problem) before we understand the reason for that apparent contradiction. Doing it nevertheless just to get rid of the error seems dangerous. What happens, for example, to the CLOG state of transactions earlier than the checkpoint's nextId? Their COMMIT record may very well lie before the checkpoint's REDO pointer, so the CLOG we copied had better contain their correct state. Yet if it does, then why isn't the nextId's CLOG page accessible? best regards, Florian Pflug
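The ": Success" suffix is easy to demonstrate: a read at or past end-of-file is not an error at the kernel level, it returns zero bytes and leaves errno untouched, which is what %m then formats as "Success". A small sketch of the same behaviour using Python's os.pread (POSIX-only; the file here is just a stand-in for a too-short clog segment):

```python
# Reading the page after the last one in a file succeeds with 0 bytes
# and raises no error -- mirroring read() returning 0 with errno unset.
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\x00" * 8192)       # one 8 kB page, like a short clog file
    data = os.pread(fd, 8192, 8192)    # request the page after the last one
    assert data == b""                 # EOF: zero bytes, no exception
finally:
    os.close(fd)
    os.unlink(path)
```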
> And I think they also reported that if they didn't run hot standby,
> but just normal recovery into a new master, it didn't have the problem
> either, i.e. without hotstandby, recovery ran, properly extended the
> clog, and then ran as a new master fine.
Yes, this is correct...attempting to start as a hot standby will produce the pg_clog error repeatedly, and then, without changing anything else, just turning hot standby off will let it start up successfully.
> This fits the OP's observation of the
> problem vanishing when pg_start_backup() does an immediate checkpoint.
Note that this is *not* the behaviour I'm seeing....it's possible it happens more frequently without the immediate checkpoint, but I am seeing it happen even with the immediate checkpoint.
> This is a different problem and has already been reported by one of
> your colleagues in a separate thread, and answered in detail by me
> there. There is no bug related to this error message.
Excellent...I will continue this discussion in that thread.
On Oct26, 2011, at 15:57 , Florian Pflug wrote: > As you said, the CLOG page corresponding to nextId > *should* always be accessible at the start of recovery (unless the whole file > has been removed by VACUUM, that is). So we shouldn't need to extend the CLOG. > Yet the error suggests that the CLOG is, in fact, too short. What I said > is that we shouldn't apply any fix (for the CLOG problem) before we understand > the reason for that apparent contradiction. Ha! I think I've got a working theory. In CreateCheckPoint(), we determine the nextId that'll go into the checkpoint record, and then call CheckPointGuts() which does the actual writing and fsyncing. So far, that's fine. If a transaction ID is assigned before we compute the checkpoint's nextXid, we'll extend the CLOG accordingly, and CheckPointGuts() will make sure the new CLOG page goes to disk. But, if wal_level = hot_standby, we also call LogStandbySnapshot() in CreateCheckPoint(), and we do that *after* CheckPointGuts(). Which would be fine too, except that LogStandbySnapshot() re-assigns the *current* value of ShmemVariableCache->nextXid to the checkpoint's nextXid field. Thus, if the CLOG is extended after (or in the middle of) CheckPointGuts(), but before LogStandbySnapshot(), then we end up with a nextXid in the checkpoint whose CLOG page hasn't necessarily made it to the disk yet. The longer CheckPointGuts() takes to finish its work, the more likely it becomes (assuming that CLOG writing and syncing doesn't happen at the very end). This fits the OP's observation of the problem vanishing when pg_start_backup() does an immediate checkpoint. I don't know how to fix this, though, since I don't really understand why LogStandbySnapshot() needs to modify the checkpoint's nextXid. Simon, is there some documentation on what assumptions the hot standby code makes about the various XID fields included in a checkpoint? best regards, Florian Pflug
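This ordering can be written down as a toy, purely sequential model (every name below is invented for illustration; it captures only the event ordering, not real PostgreSQL code):

```python
CLOG_XACTS_PER_PAGE = 32768            # XIDs per clog page (assumed default)

next_xid = CLOG_XACTS_PER_PAGE - 1     # one XID left on clog page 0
clog_pages_durable = 1                 # page 0 has been written and fsynced

# 1. CheckPointGuts(): flushes and fsyncs the clog *as it exists now*.
pages_covered_by_checkpoint = clog_pages_durable

# 2. A backend assigns an XID, extending the clog onto page 1; the new
#    page is dirty in memory only, not yet on disk.
next_xid += 1

# 3. LogStandbySnapshot() stores the *current* nextXid into the
#    checkpoint -- after the fsync in step 1 already happened.
checkpoint_next_xid = next_xid

# StartupCLOG() on the standby reads the page for checkpoint_next_xid,
# but the base backup only guarantees the pages covered by the checkpoint:
needed_page = checkpoint_next_xid // CLOG_XACTS_PER_PAGE
assert needed_page >= pages_covered_by_checkpoint   # short read -> "Success"
```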
On Wed, Oct 26, 2011 at 9:57 AM, Florian Pflug <fgp@phlo.org> wrote: > On Oct26, 2011, at 15:12 , Simon Riggs wrote: >> On Wed, Oct 26, 2011 at 12:54 PM, Aidan Van Dyk <aidan@highrise.ca> wrote: >> >>> The read fails because there is no data at the location it's trying to >>> read from, because clog hasn't been extended yet by recovery. >> >> You don't actually know that, though I agree it seems a reasonable >> guess and was my first thought also. > > The actual error message also supports that theory. Here's the relevant > snippet from the OP's log (found in CA9FD2FE.1D8D2%linas.virbalas@continuent.com) > > 2011-09-21 13:41:05 CEST FATAL: could not access status of transaction 1188673 > 2011-09-21 13:41:05 CEST DETAIL: Could not read from file "pg_clog/0001" at offset 32768: Success. > > Note that it says "Success" at the end of the second log entry. That > can only happen, I think, if we're trying to read the page adjacent to > the last page in the file. The seek would be successful, and the subsequent > read() would indicate EOF by returning zero bytes. None of the calls would > set errno. If there was a real IO error, read() would set errno, and if the > page wasn't adjacent to the last page in the file, seek() would set errno. > In both cases we'd see the corresponding error message, not "Success".
And even more pointedly, in the original go-around on this: http://article.gmane.org/gmane.comp.db.postgresql.devel.general/174056 He reported that clog/0000 looked like this after the pg_start_backup call: -rw------- 1 postgres postgres 8192 Sep 23 14:31 0000 Changed during the rsync phase to this: -rw------- 1 postgres postgres 16384 Sep 23 14:33 0000 But on the slave, of course, it was copied before it was extended, so it was the original size (that's ok, that's the point of recovery after the backup): -rw------- 1 postgres postgres 8192 Sep 23 14:31 0000 With the error: 2011-09-23 14:33:46 CEST FATAL: could not access status of transaction 37206 2011-09-23 14:33:46 CEST DETAIL: Could not read from file "pg_clog/0000" at offset 8192: Success. And that error happens *before* recovery can even be attempted. And if he copied the "recent" clog/0000 from the master, it did start up. And I think they also reported that if they didn't run hot standby, but just normal recovery into a new master, it didn't have the problem either, i.e. without hot standby, recovery ran, properly extended the clog, and then ran as a new master fine. a. -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
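Those file sizes line up with the failing XID (a quick check, assuming the default clog geometry of 2 status bits per transaction and 8 kB pages):

```python
# XID 37206 lives on clog page 1, so reading its status needs the file
# to be at least two 8 kB pages long (assumed PostgreSQL defaults).
BLCKSZ = 8192
CLOG_XACTS_PER_PAGE = BLCKSZ * 4       # 32768 XIDs per page

xid = 37206
needed_size = (xid // CLOG_XACTS_PER_PAGE + 1) * BLCKSZ

assert needed_size == 16384    # the master's pg_clog/0000 after the rsync
assert needed_size > 8192      # the standby's copy, still one page short
```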
The way forwards here is for someone to show the clog file that causes
the error and find out why the call to read() fails.
Well let me hook ya up :) Here's the log, controldata, and erroring clog attached
2011-10-26 08:44:15.419 MDT [1544]: [1-1] LOG: database system was interrupted; last known up at 2011-10-26 08:43:54 MDT
2011-10-26 08:44:15.419 MDT [1544]: [2-1] LOG: creating missing WAL directory "pg_xlog/archive_status"
2011-10-26 08:44:15.421 MDT [1544]: [3-1] LOG: entering standby mode
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: START_REPLICATION 6/CE000000
2011-10-26 08:44:15.425 MDT [1545]: [1-1] LOG: streaming replication successfully connected to primary
2011-10-26 08:44:16.246 MDT [1547]: [1-1] FATAL: the database system is starting up
2011-10-26 08:44:17.501 MDT [1544]: [4-1] DEBUG: checkpoint record is at 6/D0BC36A0
2011-10-26 08:44:17.501 MDT [1544]: [5-1] DEBUG: redo record is at 6/CE0010E0; shutdown FALSE
2011-10-26 08:44:17.501 MDT [1544]: [6-1] DEBUG: next transaction ID: 0/10846655; next OID: 2826628
2011-10-26 08:44:17.501 MDT [1544]: [7-1] DEBUG: next MultiXactId: 1741; next MultiXactOffset: 3498
2011-10-26 08:44:17.501 MDT [1544]: [8-1] DEBUG: oldest unfrozen transaction ID: 1669, in database 1
2011-10-26 08:44:17.501 MDT [1544]: [9-1] DEBUG: transaction ID wrap limit is 2147485316, limited by database with OID 1
2011-10-26 08:44:17.742 MDT [1544]: [10-1] DEBUG: resetting unlogged relations: cleanup 1 init 0
2011-10-26 08:44:17.751 MDT [1544]: [11-1] DEBUG: initializing for hot standby
2011-10-26 08:44:17.751 MDT [1544]: [12-1] FATAL: could not access status of transaction 10846655
2011-10-26 08:44:17.751 MDT [1544]: [13-1] DETAIL: Could not read from file "pg_clog/000A" at offset 90112: Success.
2011-10-26 08:44:17.751 MDT [1537]: [2-1] LOG: startup process (PID 1544) exited with exit code 1
2011-10-26 08:44:17.751 MDT [1537]: [3-1] LOG: aborting startup due to startup process failure
2011-10-26 08:44:17.843 MDT [1543]: [1-1] DEBUG: logger shutting down
pg_control version number: 903
Catalog version number: 201105231
Database system identifier: 5667259861501982685
Database cluster state: in production
pg_control last modified: Wed 26 Oct 2011 08:43:54 AM MDT
Latest checkpoint location: 6/D0BC36A0
Prior checkpoint location: 6/877B9950
Latest checkpoint's REDO location: 6/CE0010E0
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/10846655
Latest checkpoint's NextOID: 2826628
Latest checkpoint's NextMultiXactId: 1741
Latest checkpoint's NextMultiOffset: 3498
Latest checkpoint's oldestXID: 1669
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 10845042
Time of latest checkpoint: Wed 26 Oct 2011 08:43:44 AM MDT
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: hot_standby
Current max_connections setting: 600
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 256
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Attachments
On Oct26, 2011, at 17:36 , Chris Redekop wrote: > > And I think they also reported that if they didn't run hot standby, > > but just normal recovery into a new master, it didn't have the problem > > either, i.e. without hotstandby, recovery ran, properly extended the > > clog, and then ran as a new master fine. > > Yes this is correct...attempting to start as hotstandby will produce the > pg_clog error repeatedly and then without changing anything else, just > turning hot standby off it will start up successfully. Yup, because with hot standby disabled (on the client side), StartupCLOG() happens after recovery has completed. That, at the very least, makes the problem very unlikely to occur in the non-hot-standby case. I'm not sure it's completely impossible, though. Per my theory about the cause of the problem in my other mail, I think you might see StartupCLOG failures even during crash recovery, provided that wal_level was set to hot_standby when the primary crashed. Here's how: 1) We start a checkpoint, and get as far as LogStandbySnapshot(). 2) A backend does AssignTransactionId, and gets as far as GetNewTransactionId(). The assigned XID requires CLOG extension. 3) The checkpoint continues, and LogStandbySnapshot() advances the checkpoint's nextXid to the XID assigned in (2). 4) We crash after writing the checkpoint record, but before the CLOG extension makes it to the disk, and before any trace of the XID assigned in (2) makes it to the xlog. Then StartupCLOG() would fail at the end of recovery, because we'd end up with a nextXid whose corresponding CLOG page doesn't exist. > > This fits the OP's observation of the > > problem vanishing when pg_start_backup() does an immediate checkpoint. > > Note that this is *not* the behaviour I'm seeing....it's possible it happens > more frequently without the immediate checkpoint, but I am seeing it happen > even with the immediate checkpoint.
Yeah, I should have said "of the problem's likelihood decreasing" instead of "vanishing". The point is, the longer the checkpoint takes, the higher the chance that nextXid is advanced far enough to require a CLOG extension. That alone isn't enough to trigger the error - the CLOG extension must also *not* make it to the disk before the checkpoint completes - but it's a required precondition for the error to occur.

best regards,
Florian Pflug
On Wed, Oct 26, 2011 at 3:47 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Oct26, 2011, at 15:57 , Florian Pflug wrote:
>> As you said, the CLOG page corresponding to nextXid
>> *should* always be accessible at the start of recovery (unless the whole file
>> has been removed by VACUUM, that is). So we shouldn't need to extend the CLOG.
>> Yet the error suggests that the CLOG is, in fact, too short. What I said
>> is that we shouldn't apply any fix (for the CLOG problem) before we understand
>> the reason for that apparent contradiction.
>
> Ha! I think I've got a working theory.
>
> In CreateCheckPoint(), we determine the nextXid that'll go into the checkpoint
> record, and then call CheckPointGuts() which does the actual writing and fsyncing.
> So far, that's fine. If a transaction ID is assigned before we compute the
> checkpoint's nextXid, we'll extend the CLOG accordingly, and CheckPointGuts() will
> make sure the new CLOG page goes to disk.
>
> But, if wal_level = hot_standby, we also call LogStandbySnapshot() in
> CreateCheckPoint(), and we do that *after* CheckPointGuts(). Which would be
> fine too, except that LogStandbySnapshot() re-assigns the *current* value of
> ShmemVariableCache->nextXid to the checkpoint's nextXid field.
>
> Thus, if the CLOG is extended after (or in the middle of) CheckPointGuts(), but
> before LogStandbySnapshot(), then we end up with a nextXid in the checkpoint
> whose CLOG page hasn't necessarily made it to the disk yet. The longer CheckPointGuts()
> takes to finish its work, the more likely it becomes (assuming that CLOG writing
> and syncing doesn't happen at the very end). This fits the OP's observation of the
> problem vanishing when pg_start_backup() does an immediate checkpoint.

This is the correct explanation. I've just come back into Wifi range, so I was just writing to you with this explanation, but your original point that nextXid must be wrong deserves credit.
OTOH, I was just waiting to find out what the reason for the physical read was, rather than guessing. Notice that the nextXid value isn't wrong, it's just not the correct value to use for starting the clog.

As it turns out, the correct fix is actually just to skip StartupCLOG() until the end of recovery, because it does nothing useful when executed at that time. When I wrote the original code I remember thinking that StartupCLOG() was superfluous at that point.

Brewing a patch now.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 26, 2011 at 5:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Brewing a patch now. Latest thinking... confirmations or other error reports please. This fixes both the subtrans and clog bugs in one patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachments
On Wed, Oct 26, 2011 at 5:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, Oct 26, 2011 at 5:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> Brewing a patch now. > > Latest thinking... confirmations or other error reports please. > > This fixes both the subtrans and clog bugs in one patch. I'll be looking to commit that tomorrow afternoon as two separate patches with appropriate credits. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
FYI I have given this patch a good test and can now no longer reproduce either the subtrans nor the clog error. Thanks guys!
On Wed, Oct 26, 2011 at 11:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Oct 26, 2011 at 5:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Wed, Oct 26, 2011 at 5:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>>> Brewing a patch now.
>>
>> Latest thinking... confirmations or other error reports please.
>>
>> This fixes both the subtrans and clog bugs in one patch.
>
> I'll be looking to commit that tomorrow afternoon as two separate
> patches with appropriate credits.
>
> --
> Simon Riggs http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
On Oct26, 2011, at 18:08 , Simon Riggs wrote:
> On Wed, Oct 26, 2011 at 3:47 PM, Florian Pflug <fgp@phlo.org> wrote:
>> Thus, if the CLOG is extended after (or in the middle of) CheckPointGuts(), but
>> before LogStandbySnapshot(), then we end up with a nextXid in the checkpoint
>> whose CLOG page hasn't necessarily made it to the disk yet. The longer CheckPointGuts()
>> takes to finish its work, the more likely it becomes (assuming that CLOG writing
>> and syncing doesn't happen at the very end). This fits the OP's observation of the
>> problem vanishing when pg_start_backup() does an immediate checkpoint.
>
> As it turns out the correct fix is actually just to skip StartupCLOG()
> until the end of recovery because it does nothing useful when executed
> at that time. When I wrote the original code I remember thinking that
> StartupCLOG() is superfluous at that point.

Hm, that fixes the problem in the hot standby case, but as I said in my reply to Chris Redekop, normal crash recovery is also at risk (although the probability of hitting the bug is much smaller there). Here's my reasoning from that other mail:

Per my theory about the cause of the problem, I think you might see StartupCLOG failures even during crash recovery, provided that wal_level was set to hot_standby when the primary crashed. Here's how:

1) We start a checkpoint, and get as far as LogStandbySnapshot().
2) A backend does AssignTransactionId(), and gets as far as GetNewTransactionId(). The assigned XID requires CLOG extension.
3) The checkpoint continues, and LogStandbySnapshot() advances the checkpoint's nextXid to the XID assigned in (2).
4) We crash after writing the checkpoint record, but before the CLOG extension makes it to the disk, and before any trace of the XID assigned in (2) makes it to the xlog.

Then StartupCLOG() would fail at the end of recovery, because we'd end up with a nextXid whose corresponding CLOG page doesn't exist.

Quite aside from that concern, I think it's probably not a good idea for the nextXid value of a checkpoint to depend on whether wal_level was set to hot_standby or not. Our recovery code is already quite complex and hard to test, and this just adds one more combination that has to be thought about while coding and that needs to be tested.

My suggestion is to fix the CLOG problem in the same way that you fixed the SUBTRANS problem, i.e. by moving LogStandbySnapshot() to before CheckPointGuts(). Here's what I imagine CreateCheckPoint() should look like:

1) LogStandbySnapshot() and fill out oldestActiveXid
2) Fill out REDO
3) Wait for concurrent commits
4) Fill out nextXid and the other fields
5) CheckPointGuts()
6) Rest

It's then no longer necessary for LogStandbySnapshot() to modify the nextXid, since we fill out nextXid after LogStandbySnapshot() and will thus derive a higher value than LogStandbySnapshot() would have. We could then also fold GetOldestActiveTransactionId() back into your proposed LogStandbySnapshot() and thus don't need two ProcArray traversals.

best regards,
Florian Pflug
On Wed, Oct 26, 2011 at 12:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, Oct 26, 2011 at 5:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> Brewing a patch now. > > Latest thinking... confirmations or other error reports please. > > This fixes both the subtrans and clog bugs in one patch. I don't see the point of changing StartupCLOG() to be an empty function and adding a new function TrimCLOG() that does everything StartupCLOG() used to do. Seems simpler to just move the calls to StartupCLOG() wherever they need to be - i.e. remove the one that happens before WAL replay, and extricate the one at end-of-recovery from the if block which currently contains it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Oct 26, 2011 at 12:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> This fixes both the subtrans and clog bugs in one patch. > I don't see the point of changing StartupCLOG() to be an empty > function and adding a new function TrimCLOG() that does everything > StartupCLOG() used to do. +1 ... I found that overly cute also. regards, tom lane
On 27.10.2011 02:29, Florian Pflug wrote:
> Per my theory about the cause of the problem in my other mail, I think you
> might see StartupCLOG failures even during crash recovery, provided that
> wal_level was set to hot_standby when the primary crashed. Here's how:
>
> 1) We start a checkpoint, and get as far as LogStandbySnapshot()
> 2) A backend does AssignTransactionId, and gets as far as GetNewTransactionId().
> The assigned XID requires CLOG extension.
> 3) The checkpoint continues, and LogStandbySnapshot() advances the
> checkpoint's nextXid to the XID assigned in (2).
> 4) We crash after writing the checkpoint record, but before the CLOG
> extension makes it to the disk, and before any trace of the XID assigned
> in (2) makes it to the xlog.
>
> Then StartupCLOG() would fail at the end of recovery, because we'd end up
> with a nextXid whose corresponding CLOG page doesn't exist.

No, clog extension is WAL-logged while holding the XidGenLock. At step 3, LogStandbySnapshot() would block until the clog-extension record is written to WAL, so crash recovery would see and replay that record before calling StartupCLOG(). That can happen during hot standby, though, because StartupCLOG() is called earlier.

> My suggestion is to fix the CLOG problem in that same way that you fixed
> the SUBTRANS problem, i.e. by moving LogStandbySnapshot() to before
> CheckPointGuts().
>
> Here's what I imagine CreateCheckPoint() should look like:
>
> 1) LogStandbySnapshot() and fill out oldestActiveXid
> 2) Fill out REDO
> 3) Wait for concurrent commits
> 4) Fill out nextXid and the other fields
> 5) CheckPointGuts()
> 6) Rest
>
> It's then no longer necessary for LogStandbySnapshot() to modify
> the nextXid, since we fill out nextXid after LogStandbySnapshot() and
> will thus derive a higher value than LogStandbySnapshot() would have.

Hmm, I don't think that fully fixes the problem. Even if you're certain that CheckPointGuts() has fsync'd the clog page to disk, VACUUM might decide to truncate it away again while the checkpoint is running.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 27.10.2011 09:57, Heikki Linnakangas wrote:
>> My suggestion is to fix the CLOG problem in that same way that you fixed
>> the SUBTRANS problem, i.e. by moving LogStandbySnapshot() to before
>> CheckPointGuts().
>>
>> Here's what I imagine CreateCheckPoint() should look like:
>>
>> 1) LogStandbySnapshot() and fill out oldestActiveXid
>> 2) Fill out REDO
>> 3) Wait for concurrent commits
>> 4) Fill out nextXid and the other fields
>> 5) CheckPointGuts()
>> 6) Rest
>>
>> It's then no longer necessary for LogStandbySnapshot() to modify
>> the nextXid, since we fill out nextXid after LogStandbySnapshot() and
>> will thus derive a higher value than LogStandbySnapshot() would have.
>
> Hmm, I don't think that fully fixes the problem. Even if you're certain
> that CheckPointGuts() has fsync'd the clog page to disk, VACUUM might
> decide to truncate it away again while the checkpoint is running.

Oh, scratch that. During recovery, we merrily treat missing slru files as if they were filled with zeros.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Oct 27, 2011 at 4:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Oct 26, 2011 at 12:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> This fixes both the subtrans and clog bugs in one patch.
>>
>> I don't see the point of changing StartupCLOG() to be an empty
>> function and adding a new function TrimCLOG() that does everything
>> StartupCLOG() used to do.
>
> +1 ... I found that overly cute also.

It would have been even easier to move StartupCLOG() later, but then we'd need a big comment explaining why CLOG starts up at one point and subtrans starts up at another, since that is a very confusing way of doing things. I wrote it that way first and it definitely looks strange.

It's much easier to understand that StartupCLOG() is actually a no-op and that we need to trim the clog at the end of recovery in all cases. The patch isn't meant to be cute, just a better way of expressing what needs to be done, so I think the patch should stay that way.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 27, 2011 at 5:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, Oct 27, 2011 at 4:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Wed, Oct 26, 2011 at 12:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>> This fixes both the subtrans and clog bugs in one patch. >> >>> I don't see the point of changing StartupCLOG() to be an empty >>> function and adding a new function TrimCLOG() that does everything >>> StartupCLOG() used to do. >> >> +1 ... I found that overly cute also. > > It would have been even easier to move StartupCLOG() later, but then > we'd need a big comment explaining why CLOG starts up at one point and > subtrans starts up at another point, since that is very confusing way > of doing things. I wrote it that way first and it definitely looks > strange. > > It's much easier to understand that StartupCLOG() is actually a no-op > and that we need to trim the clog at the end of recovery in all cases. If it's a no-op, why have it at all? I know we have a bunch of places in the code where we have empty stubs where there used to be initialization or cleanup code, but I've never found that particularly good style. If something no longer requires initialization in a certain place, I think we should nuke the whole function. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Oct27, 2011, at 08:57 , Heikki Linnakangas wrote:
> On 27.10.2011 02:29, Florian Pflug wrote:
>> Per my theory about the cause of the problem in my other mail, I think you
>> might see StartupCLOG failures even during crash recovery, provided that
>> wal_level was set to hot_standby when the primary crashed. Here's how:
>>
>> 1) We start a checkpoint, and get as far as LogStandbySnapshot()
>> 2) A backend does AssignTransactionId, and gets as far as GetNewTransactionId().
>> The assigned XID requires CLOG extension.
>> 3) The checkpoint continues, and LogStandbySnapshot() advances the
>> checkpoint's nextXid to the XID assigned in (2).
>> 4) We crash after writing the checkpoint record, but before the CLOG
>> extension makes it to the disk, and before any trace of the XID assigned
>> in (2) makes it to the xlog.
>>
>> Then StartupCLOG() would fail at the end of recovery, because we'd end up
>> with a nextXid whose corresponding CLOG page doesn't exist.
>
> No, clog extension is WAL-logged while holding the XidGenLock. At step 3,
> LogStandbySnapshot() would block until the clog-extension record is written
> to WAL, so crash recovery would see and replay that record before calling
> StartupCLOG().

Hm, true. But it still seems wrong for LogStandbySnapshot() to modify the checkpoint's nextXid, and even more wrong to do that only if wal_level = hot_standby. Plus, I think it's a smart idea to verify that the required parts of the CLOG are available at the start of recovery, because if they're missing, the data on the standby *will* be corrupted.

Is there any argument against doing LogStandbySnapshot() earlier (i.e., at the time oldestActiveXid is computed)?

best regards,
Florian Pflug
On Thu, Oct 27, 2011 at 12:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 27, 2011 at 5:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> It's much easier to understand that StartupCLOG() is actually a no-op
>> and that we need to trim the clog at the end of recovery in all cases.
>
> If it's a no-op, why have it at all? I know we have a bunch of places
> in the code where we have empty stubs where there used to be
> initialization or cleanup code, but I've never found that particularly
> good style. If something no longer requires initialization in a
> certain place, I think we should nuke the whole function.

It is a no-op for exactly the same reason other similar functions are no-ops - it used to do something but now does not. Anyone seeing StartupSubtrans and StartupMultiXact but no StartupClog will immediately ask "why?". IMHO it's easier to have an obviously named function than a comment - it's less invasive for a backpatch as well.

I'm following current code style. If you wish to change that, feel free to change this and all other locations that do this. Until then, doing this makes the most sense and follows current coding style.
If I had done it the way you suggest, I don't doubt someone would say in about 6 months "Which idiot removed StartupClog()?". -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 27, 2011 at 12:29 AM, Florian Pflug <fgp@phlo.org> wrote:
> Per my theory about the cause of the problem in my other mail, I think you
> might see StartupCLOG failures even during crash recovery, provided that
> wal_level was set to hot_standby when the primary crashed. Here's how:
>
> 1) We start a checkpoint, and get as far as LogStandbySnapshot()
> 2) A backend does AssignTransactionId, and gets as far as GetNewTransactionId().
> The assigned XID requires CLOG extension.
> 3) The checkpoint continues, and LogStandbySnapshot() advances the
> checkpoint's nextXid to the XID assigned in (2).
> 4) We crash after writing the checkpoint record, but before the CLOG
> extension makes it to the disk, and before any trace of the XID assigned
> in (2) makes it to the xlog.
>
> Then StartupCLOG() would fail at the end of recovery, because we'd end up
> with a nextXid whose corresponding CLOG page doesn't exist.

Clog extension holds XidGenLock, as does LogStandbySnapshot(), which specifically excludes the above scenario.

> Quite aside from that concern, I think it's probably not a good idea
> for the nextXid value of a checkpoint to depend on whether wal_level
> was set to hot_standby or not. Our recovery code is already quite complex
> and hard to test, and this just adds one more combination that has to
> be thought about while coding and that needs to be tested.
>
> My suggestion is to fix the CLOG problem in that same way that you fixed
> the SUBTRANS problem, i.e. by moving LogStandbySnapshot() to before
> CheckPointGuts().
>
> Here's what I imagine CreateCheckPoint() should look like:
>
> 1) LogStandbySnapshot() and fill out oldestActiveXid
> 2) Fill out REDO
> 3) Wait for concurrent commits
> 4) Fill out nextXid and the other fields
> 5) CheckPointGuts()
> 6) Rest
>
> It's then no longer necessary for LogStandbySnapshot() to modify
> the nextXid, since we fill out nextXid after LogStandbySnapshot() and
> will thus derive a higher value than LogStandbySnapshot() would have.
>
> We could then also fold GetOldestActiveTransactionId() back into
> your proposed LogStandbySnapshot() and thus don't need two ProcArray
> traversals.

I think you make a good case for doing this.

However, I'm concerned that moving LogStandbySnapshot() in a backpatch seems more risky than it's worth. We could easily introduce a new bug into what we would all agree is a complex piece of code. Minimal change seems best in this case. And also, 2 ProcArray traversals are not a problem there.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Oct27, 2011, at 15:51 , Simon Riggs wrote:
> On Thu, Oct 27, 2011 at 12:29 AM, Florian Pflug <fgp@phlo.org> wrote:
>> It's then no longer necessary for LogStandbySnapshot() to modify
>> the nextXid, since we fill out nextXid after LogStandbySnapshot() and
>> will thus derive a higher value than LogStandbySnapshot() would have.
>>
>> We could then also fold GetOldestActiveTransactionId() back into
>> your proposed LogStandbySnapshot() and thus don't need two ProcArray
>> traversals.
>
> I think you make a good case for doing this.
>
> However, I'm concerned that moving LogStandbySnapshot() in a backpatch
> seems more risky than it's worth. We could easily introduce a new bug
> into what we would all agree is a complex piece of code. Minimal
> change seems best in this case.

OTOH, we currently compute oldestActiveXid within LogStandbySnapshot(). Your proposed patch changes that, which also carries a risk, since something could depend on these values being in sync - especially since both the logged snapshot and oldestActiveXid influence the snapshot tracking on the slave.

But since you wrote most of that code, your judgement about the relative risks of these two approaches obviously outweighs mine.

best regards,
Florian Pflug
Simon Riggs <simon@2ndQuadrant.com> writes: > On Thu, Oct 27, 2011 at 12:36 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Oct 27, 2011 at 5:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> It's much easier to understand that StartupCLOG() is actually a no-op >>> and that we need to trim the clog at the end of recovery in all cases. >> If it's a no-op, why have it at all? > It is a no-op for exactly the same reason other similar functions are > no-ops - it used to do something but now does not. > Anyone seeing StartupSubtrans and StartupMultiXact but no StartupClog > will immediately ask "why?". I think it's a good point that StartupCLog doesn't exist in a vacuum but should be parallel to the init functions for the other SLRU modules. So at this point I think I agree with Simon's approach. However, the obvious next question is whether those other modules don't need to be changed also, and if not why not. Another issue is that if StartupCLog is left as a no-op, what will happen if someone mistakenly tries to access clog before the trim function is called? It would be a good idea to make sure that such a thing results in an easily-identifiable failure. regards, tom lane
On Thu, Oct 27, 2011 at 3:03 PM, Florian Pflug <fgp@phlo.org> wrote: >> I think you make a good case for doing this. >> >> However, I'm concerned that moving LogStandbySnapshot() in a backpatch >> seems more risky than it's worth. We could easily introduce a new bug >> into what we would all agree is a complex piece of code. Minimal >> change seems best in this case. > > OTOH, we currently compute oldestActiveXid within LogStandbySnapshot(). > Your proposed patch changes that, which also carries a risk since something > could depend on these values being in sync. Especially since both the logged > snapshot and oldestActiveXid influence the snapshot tracking on the slave. > > But since you wrote most of that code, your judgement about the relative > risks of these two approaches obviously out-weights mine. We must move oldestActiveXid since that is the source of a bug. There is no need to move LogStandbySnapshot(), so I am suggesting we don't do that for the backpatch. I was going to implement it the way you suggest in HEAD, since I agree that is a cleaner way. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 27, 2011 at 3:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, the
> obvious next question is whether those other modules don't need to be
> changed also, and if not why not.

Good point. StartupSubtrans() is also changed by this patch, since it will be supplied with an earlier initialisation value. StartupMultiXact() didn't need changing, I thought, but I will review further.

> Another issue is that if StartupCLog is left as a no-op, what will
> happen if someone mistakenly tries to access clog before the trim
> function is called? It would be a good idea to make sure that such
> a thing results in an easily-identifiable failure.

The old StartupCLOG() didn't do anything that was essential to using the clog, which is why it's a no-op. You can still use the clog, just with zero startup. Maybe setting the current page should go in at startup - will think.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Oct27, 2011, at 16:30 , Simon Riggs wrote:
> On Thu, Oct 27, 2011 at 3:03 PM, Florian Pflug <fgp@phlo.org> wrote:
>> OTOH, we currently compute oldestActiveXid within LogStandbySnapshot().
>> Your proposed patch changes that, which also carries a risk since something
>> could depend on these values being in sync. Especially since both the logged
>> snapshot and oldestActiveXid influence the snapshot tracking on the slave.
>>
>> But since you wrote most of that code, your judgement about the relative
>> risks of these two approaches obviously outweighs mine.
>
> We must move oldestActiveXid since that is the source of a bug. There
> is no need to move LogStandbySnapshot(), so I am suggesting we don't
> do that for the backpatch. I was going to implement it the way you
> suggest in HEAD, since I agree that is a cleaner way.

Sounds good.

best regards,
Florian Pflug
On Thu, Oct 27, 2011 at 4:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> StartupMultiXact() didn't need changing, I thought, but I will review further.

Good suggestion. On review, StartupMultiXact() could also suffer a similar error to the clog failure. This was caused *because* MultiXact is not maintained by recovery, which I had thought meant it was protected from such failure.

Revised patch attached.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments
looks like the v3 patch re-introduces the pg_subtrans issue...
On Tue, Nov 1, 2011 at 9:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, Oct 27, 2011 at 4:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> StartupMultiXact() didn't need changing, I thought, but I will review further.
>
> Good suggestion.
>
> On review, StartupMultiXact() could also suffer a similar error to the
> clog failure. This was caused *because* MultiXact is not maintained by
> recovery, which I had thought meant it was protected from such
> failure.
>
> Revised patch attached.
>
> --
> Simon Riggs http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 2, 2011 at 2:40 AM, Chris Redekop <chris@replicon.com> wrote: > looks like the v3 patch re-introduces the pg_subtrans issue... No, I just separated the patches to be clearer about the individual changes. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 2, 2011 at 7:34 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, Nov 2, 2011 at 2:40 AM, Chris Redekop <chris@replicon.com> wrote: > >> looks like the v3 patch re-introduces the pg_subtrans issue... > > No, I just separated the patches to be clearer about the individual changes. 3 bug fixes committed and back patched. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
okay, sorry I'm a little confused then. Should I be able to apply both the v2 patch as well as the v3 patch? or is it expected that I'd have to manually do the merge?
On Wed, Nov 2, 2011 at 1:34 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Nov 2, 2011 at 2:40 AM, Chris Redekop <chris@replicon.com> wrote:
>> looks like the v3 patch re-introduces the pg_subtrans issue...
>
> No, I just separated the patches to be clearer about the individual changes.
>
> --
> Simon Riggs http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
I was curious if anyone running into these problems has gotten a chance to test the 3 fixes committed here. It sounded like Linas even had a repeatable test case? For easier reference, the commits are:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2f55c535e1f026929cf20855b3790d3632062d42
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ff8451aa14c8513e429cbef09ddc72e79da366a5
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=656bba95af3950b26e9e97d86d29787d89e2b423

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us