Обсуждение: BUG #4565: nextval not updated during wal replication, leading to pk violations
BUG #4565: nextval not updated during wal replication, leading to pk violations
От
"Marc Schablewski"
Дата:
The following bug has been logged online: Bug reference: 4565 Logged by: Marc Schablewski Email address: ms@clickware.de PostgreSQL version: 8.3.5 Operating system: Debian, 2.6.24-19 kernel Description: nextval not updated during wal replication, leading to pk violations Details: Hi! We are using "Continuous Archiving" of WAL to keep a warm standby database server. We recently switch over to that backup database for testing purpose. We first took an online backup of the master database, let the WAL shipping run for some days and finally started the backup for normal use. A job tried to insert some records into a few tables and complained about a PK violation. PK are of type bigint. The column is populated by a sequence and a "default nextval()" on the PK column. We found that the sequence's currval was lower than the maximum value in the table and that this value was already present. Further investigation showed us that the last records in the table were inserted on the former master server while taking the initial online backup for the replication. It seems that the records got replicated but not the currval/nextval of the sequence. When running "select nextval()" on the backup database it returned the PK value of the first record inserted during that last run on the former master server. I couldn't reproduce the problem with two different servers and a simple database containing one table. Right now, I don't have any idea how to narrow down the problem or what to check next. So any hint would be helpful. Marc
Re: BUG #4565: nextval not updated during wal replication, leading to pk violations
От
Marc Schablewski
Дата:
FYI: We found another anomaly in that database (duplicated record with a unique index on a column (non PK)) and so we did some more research. As before, the record got replicated, but this time, the index entry was not. Later, a job tried to check if that record was already there by querying the column (thus, using the index) and couldn't find it so it was inserted again. We first thought it could be a WAL that wasn't restored. If I understand pg_backup_start() and _stop() right, the first WAL that should have been requested after the online backup had finished should have been the one mentioned in the "START WAL LOCATION" line of the backup history file, but the first file that actually gets requested is that in the "STOP WAL LOCATION" line. 000000020000004B00000027.00396450.backup: START WAL LOCATION: 4B/27396450 (file 000000020000004B00000027) STOP WAL LOCATION: 4B/28DDF968 (file 000000020000004B00000028) CHECKPOINT LOCATION: 4B/27396450 START TIME: 2008-11-28 16:29:17 CET LABEL: backup-20081128-162916 STOP TIME: 2008-11-28 17:49:09 CET Both WAL files were copied to the backup system. The log switch occurred before 17:06, because at that time the file was copied to the backup server by the archive command. But sadly, the records were all part of a single transaction which ran between 17:32 and 17:33, so obviously this is not a part of the problem. Though it seems a bit strange to me that the ...27 WAL didn't get recovered, but maybe I just don't fully understand that part of the docs (Chapter 24.3.2., fifth paragraph after the ordered list). Right now, I run a fresh online backup to check if we can reproduce the problem. I made some minor corrections on my scripts which prepare the backup database for replication, which might have caused the problem. I now omit the pg_xlog directory when taking the initial backup, after rereading chapter 24.3.2. saying there is a "risk of mistakes" if one copies those files, whatever that means. Maybe someone could clarify what the risks and mistakes could be? Speaking of chapter 24.3.2.: The last sentence in second paragraph after the ordered list ("You must ensure that these steps are carried out in sequence without any possible overlap, or you will invalidate the backup.") seems a bit misplaced. As far as I understand, it makes a reference to the ordered list ("steps"), but I had to read it several times to get it. I think, a better place would be right after the list. ...But then, maybe it's just me, not sleeping well last night. ;) Marc Marc Schablewski wrote: > The following bug has been logged online: > > Bug reference: 4565 > Logged by: Marc Schablewski > Email address: ms@clickware.de > PostgreSQL version: 8.3.5 > Operating system: Debian, 2.6.24-19 kernel > Description: nextval not updated during wal replication, leading to > pk violations > Details: > > Hi! > > We are using "Continuous Archiving" of WAL to keep a warm standby database > server. We recently switch over to that backup database for testing purpose. > We first took an online backup of the master database, let the WAL shipping > run for some days and finally started the backup for normal use. > > A job tried to insert some records into a few tables and complained about a > PK violation. PK are of type bigint. The column is populated by a sequence > and a "default nextval()" on the PK column. We found that the sequence's > currval was lower than the maximum value in the table and that this value > was already present. Further investigation showed us that the last records > in the table were inserted on the former master server while taking the > initial online backup for the replication. It seems that the records got > replicated but not the currval/nextval of the sequence. When running "select > nextval()" on the backup database it returned the PK value of the first > record inserted during that last run on the former master server. > > I couldn't reproduce the problem with two different servers and a simple > database containing one table. Right now, I don't have any idea how to > narrow down the problem or what to check next. So any hint would be > helpful. > > Marc > >
"Marc Schablewski" <ms@clickware.de> writes: > A job tried to insert some records into a few tables and complained about a > PK violation. PK are of type bigint. The column is populated by a sequence > and a "default nextval()" on the PK column. We found that the sequence's > currval was lower than the maximum value in the table and that this value > was already present. Further investigation showed us that the last records > in the table were inserted on the former master server while taking the > initial online backup for the replication. It seems that the records got > replicated but not the currval/nextval of the sequence. When running "select > nextval()" on the backup database it returned the PK value of the first > record inserted during that last run on the former master server. How many sequence values "overlapped" here, exactly? If more than one, are you by any chance running that sequence with a CACHE setting larger than one? (See the sequence's cache_value column if you're unsure.) I can see what might be a race condition between nextval()'s WAL logging and checkpoint start, but it doesn't seem like it could account for more than the CACHE setting worth of overlap. regards, tom lane
Re: BUG #4565: nextval not updated during wal replication, leading to pk violations
От
Marc Schablewski
Дата:
Tom Lane wrote: > "Marc Schablewski" <ms@clickware.de> writes: > >> A job tried to insert some records into a few tables and complained about a >> PK violation. PK are of type bigint. The column is populated by a sequence >> and a "default nextval()" on the PK column. We found that the sequence's >> currval was lower than the maximum value in the table and that this value >> was already present. Further investigation showed us that the last records >> in the table were inserted on the former master server while taking the >> initial online backup for the replication. It seems that the records got >> replicated but not the currval/nextval of the sequence. When running "select >> nextval()" on the backup database it returned the PK value of the first >> record inserted during that last run on the former master server. >> > > How many sequence values "overlapped" here, exactly? If more than one, > are you by any chance running that sequence with a CACHE setting larger > than one? (See the sequence's cache_value column if you're unsure.) > > I can see what might be a race condition between nextval()'s WAL logging > and checkpoint start, but it doesn't seem like it could account for more > than the CACHE setting worth of overlap. > > regards, tom lane > > Thanks for your reply. cache_value is set to one for all sequences. As far as I can tell, they were all created by a plain CREATE SEQUENCE seq_name and no other settings changed. And as we found out later this "loss of information" hit some indexes as well. We took a fresh backup yesterday and again we saw that only the last WAL generated during the backup run was restored on our backup system. I'm sure that this is the real problem. START WAL LOCATION: 4E/36F2C280 (file 000000020000004E00000036) STOP WAL LOCATION: 4E/3944C988 (file 000000020000004E00000039) CHECKPOINT LOCATION: 4E/36F2C280 START TIME: 2008-12-08 17:17:34 CET LABEL: backup-20081208-171733 STOP TIME: 2008-12-08 18:22:25 CET File 000000020000004E00000039 and all WAL that were generated afterwards were restored. Files ...36 to ...38 weren't even requested. After starting up the database, we had a table with records that were not present it's PK index. But the were in the index on the former master system. I ran an explicit pg_switch_xlog() on the master database before switching to the backup to make sure all changes got transferred. I'll check our scripts that do the backup/archiving/recovery again, there still might be some kind of bug. But beside that, I'd be thankful for any hints what else I could check. Marc
Marc Schablewski <ms@clickware.de> writes: > cache_value is set to one for all sequences. As far as I can tell, they > were all created by a plain CREATE SEQUENCE seq_name and no other > settings changed. And as we found out later this "loss of information" > hit some indexes as well. > We took a fresh backup yesterday and again we saw that only the last WAL > generated during the backup run was restored on our backup system. I'm > sure that this is the real problem. This is beginning to sound like an error in your backup/restore procedures. Please describe exactly what you're doing. regards, tom lane
Re: BUG #4565: nextval not updated during wal replication, leading to pk violations
От
Marc Schablewski
Дата:
Tom Lane wrote: > Marc Schablewski <ms@clickware.de> writes: > >> cache_value is set to one for all sequences. As far as I can tell, they >> were all created by a plain CREATE SEQUENCE seq_name and no other >> settings changed. And as we found out later this "loss of information" >> hit some indexes as well. >> > > >> We took a fresh backup yesterday and again we saw that only the last WAL >> generated during the backup run was restored on our backup system. I'm >> sure that this is the real problem. >> > > This is beginning to sound like an error in your backup/restore > procedures. Please describe exactly what you're doing. > > regards, tom lane > > I'm sorry. It was all my fault. If been a bit to picky about the files I copy, and I didn't copy the files directly under the cluster directory. I thought, there are just the config files, so what's the point. But this way of course, I didn't copy the backup_label either. It was never there when I looked, because it was either before or after running a backup. Reading and especially understanding the manual sometimes helps. doh! Now everything seems to work fine, but we are still testing. At least what we see in the logs is more reasonable. Now the .backup file is requested first, then the WALs. There is one strange thing left, though. The server first requests the second WAL, then the first one, then the second again and then it processes them in order (second, third, fourth, ...). Is this normal? I hope I didn't distract you from your regular work too much. Thanks again. Marc
Marc Schablewski <ms@clickware.de> writes: > Now everything seems to work fine, but we are still testing. At least > what we see in the logs is more reasonable. Now the .backup file is > requested first, then the WALs. There is one strange thing left, though. > The server first requests the second WAL, then the first one, then the > second again and then it processes them in order (second, third, fourth, > ...). Is this normal? It doesn't surprise me if something like that happens during PITR startup. regards, tom lane