Discussion: Re: Bug #613: Sequence values fall back to previously checkpointed value
I don't fully understand the xlog files or WAL records but... Why isn't the writing of the WAL record based on the CACHE value of the sequence? If a request to nextval() can't be satisfied by the cache, the sequence on disk should be updated, resulting in a WAL record being written.

If two sessions are accessing the sequence, they will likely end up not writing sequential values, as they have both taken a chunk of values by calling nextval(), but the effect of this could be controlled by the user by selecting an acceptable value for CACHE. If I don't mind having session 1 write records 1-10 while session 2 interleaves those with records 11-20, I should set my cache to 10. If I want my ids to map to insertion time as closely as possible, I should set the cache lower (or NOCACHE, is that an option?).

I'm concerned that the discussion here has been of the opinion that since no records were written to the database using the value retrieved from the sequence, no damage has been done. We are using database sequences to get unique identifiers for things outside the database. If a sequence could ever, under any circumstances, reissue a value, this could be damaging to the integrity of our software.
On Thu, 14 Mar 2002, Tom Pfau wrote:

> I don't fully understand the xlog files or WAL records but...
>
> Why isn't the writing of the WAL record based on the CACHE value of the
> sequence? If a request to nextval() can't be satisfied by the cache,
> the sequence on disk should be updated resulting in a WAL record being
> written.
>
> If two sessions are accessing the sequence, they will likely end up not
> writing sequential values as they have both taken a chunk of values by
> calling nextval() but the effect of this could be controlled by the user
> by selecting an acceptable value for CACHE.

I was thinking that too: just leave it up to the end user to decide the level of performance gain they want. But the advantage of writing ahead to the WAL, compared to CACHE, is that the on-disk copy is advanced ahead of the cached value, so that even with a cache value of 1 you still get time-ordered sequences from multiple backends AND you're only writing to the WAL once every 32 nextval's -- though the write-ahead should really be based on a multiple of CACHE, since if your cache value is 32 you're not getting any benefit from the WAL savings.

> If I don't mind having
> session 1 write records 1-10 while session 2 interleaves those with
> records 11-20, I should set my cache to 10. If I want my ids to map to
> insertion time as closely as possible I should set the cache lower (or
> NOCACHE, is that an option?).

The CACHE value defaults to 1, which means no cache (each time nextval is called it only grabs 1 value).

> I'm concerned that the discussion here has been of the opinion that
> since no records were written to the database using the value retrieved
> from the sequence that no damage has been done. We are using database
> sequences to get unique identifiers for things outside the database. If
> a sequence could ever under any circumstances reissue a value, this
> could be damaging to the integrity of our software.
Absolutely, we use sequences the same way. And the problem exhibits itself regardless of whether data is being inserted or not, and independently of the CACHE value. So this has to be fixed for both scenarios.

-- Ben
"Tom Pfau" <T.Pfau@emCrit.com> writes: > I'm concerned that the discussion here has been of the opinion that > since no records were written to the database using the value retrieved > from the sequence that no damage has been done. Um, you certainly didn't hear me saying that ;-) There are two different bugs involved here. One is the no-WAL-flush- if-transaction-is-only-nextval problem. AFAIK everyone agrees we must fix that. The other issue is this business about "logging ahead" (to reduce the number of WAL records written) not interacting correctly with checkpoints. What we're arguing about is exactly how to fix that part. > We are using database > sequences to get unique identifiers for things outside the database. If > a sequence could ever under any circumstances reissue a value, this > could be damaging to the integrity of our software. If you do a SELECT nextval() and then use the returned value externally *without waiting for a commit acknowledgement*, then I think you are risking trouble; there's no guarantee that the WAL record (if one is needed) has hit disk yet, and so a crash could roll back the sequence. This isn't an issue for a SELECT nextval() standing on its own --- AFAIK the result will not be transmitted to the client until after the commit happens. But it would be an issue for a select executed inside a transaction block (begin/commit). regards, tom lane
On Thu, 14 Mar 2002, Tom Lane wrote:

> If you do a SELECT nextval() and then use the returned value externally
> *without waiting for a commit acknowledgement*, then I think you are
> risking trouble; there's no guarantee that the WAL record (if one is
> needed) has hit disk yet, and so a crash could roll back the sequence.
>
> This isn't an issue for a SELECT nextval() standing on its own ---
> AFAIK the result will not be transmitted to the client until after the
> commit happens. But it would be an issue for a select executed inside
> a transaction block (begin/commit).

The behavior of SELECT nextval() should not be conditional on being in or out of a transaction block. What you're implying is that it would be possible to roll back an uncommitted call to nextval(). Am I missing some terminology?

I think I finally realized why my old patch that forces a log right off the bat works to fix at least part of the problem. When the database is shut down properly, all of the sequences that are in memory are written back to disk in their state at that time. But the problem is that their state at that time can have log_cnt > 0. This is why, after startup, the sequence in memory is 'behind' the one on disk: the code sees log > fetch and doesn't log.

When you really think about it, log_cnt should not be part of the sequence record at all, since there is never a valid case for storing a log_cnt on disk with a value other than 0. Maybe the purpose of the on-disk value of log_cnt should be changed? It could be the value used in place of the static SEQ_LOG_VALS, which could then be definable on a per-sequence basis. And then log_cnt could be moved into elm->log_cnt. Anyway, that's just a thought.

Here's my latest patch to work around the problem. If there is a way to prevent log_cnt from being written out with a value greater than zero, that would be better than this.
With this behavior log_cnt is reset to 0 each time a backend accesses a sequence for the first time. That's probably overkill... But I still believe that the XLogFlush after XLogInsert is necessary to ensure that the WAL value is written to disk immediately. In my testing this patch works fine, YMMV.

-- Ben

*** src/backend/commands/sequence.c.orig	Tue Mar 12 18:58:55 2002
--- src/backend/commands/sequence.c	Thu Mar 14 17:34:25 2002
***************
*** 62,67 ****
--- 62,68 ----
  	int64		cached;
  	int64		last;
  	int64		increment;
+ 	bool		reset_logcnt;
  	struct SeqTableData *next;
  } SeqTableData;
***************
*** 270,275 ****
--- 271,277 ----
  		PageSetLSN(page, recptr);
  		PageSetSUI(page, ThisStartUpID);
+ 		XLogFlush(recptr);
  	}
  	END_CRIT_SECTION();
***************
*** 314,321 ****
  		PG_RETURN_INT64(elm->last);
  	}
! 	seq = read_info("nextval", elm, &buf);	/* lock page' buffer and
! 											 * read tuple */
  	last = next = result = seq->last_value;
  	incby = seq->increment_by;
--- 316,322 ----
  		PG_RETURN_INT64(elm->last);
  	}
! 	seq = read_info("nextval", elm, &buf);	/* lock page' buffer and read tuple */
  	last = next = result = seq->last_value;
  	incby = seq->increment_by;
***************
*** 331,339 ****
  			log--;
  		}
! 	if (log < fetch)
  	{
! 		fetch = log = fetch - log + SEQ_LOG_VALS;
  		logit = true;
  	}
--- 332,340 ----
  			log--;
  		}
! 	if (log < fetch)
  	{
! 		fetch = log = fetch - log + SEQ_LOG_VALS * cache;
  		logit = true;
  	}
***************
*** 403,409 ****
  		rdata[0].data = (char *) &xlrec;
  		rdata[0].len = sizeof(xl_seq_rec);
  		rdata[0].next = &(rdata[1]);
! 
  		seq->last_value = next;
  		seq->is_called = true;
  		seq->log_cnt = 0;
--- 404,410 ----
  		rdata[0].data = (char *) &xlrec;
  		rdata[0].len = sizeof(xl_seq_rec);
  		rdata[0].next = &(rdata[1]);
! 
  		seq->last_value = next;
  		seq->is_called = true;
  		seq->log_cnt = 0;
***************
*** 417,423 ****
  		PageSetLSN(page, recptr);
  		PageSetSUI(page, ThisStartUpID);
! 
  		if (fetch)				/* not all numbers were fetched */
  			log -= fetch;
  	}
--- 418,425 ----
  		PageSetLSN(page, recptr);
  		PageSetSUI(page, ThisStartUpID);
! 
  		XLogFlush(recptr);
! 
  		if (fetch)				/* not all numbers were fetched */
  			log -= fetch;
  	}
***************
*** 507,513 ****
  	XLogRecPtr	recptr;
  	XLogRecData rdata[2];
  	Page		page = BufferGetPage(buf);
! 	xlrec.node = elm->rel->rd_node;
  	rdata[0].buffer = InvalidBuffer;
  	rdata[0].data = (char *) &xlrec;
--- 509,516 ----
  	XLogRecPtr	recptr;
  	XLogRecData rdata[2];
  	Page		page = BufferGetPage(buf);
! 
! 	xlrec.node = elm->rel->rd_node;
  	rdata[0].buffer = InvalidBuffer;
  	rdata[0].data = (char *) &xlrec;
***************
*** 527,532 ****
--- 530,536 ----
  		PageSetLSN(page, recptr);
  		PageSetSUI(page, ThisStartUpID);
+ 		XLogFlush(recptr);
  	}
  	/* save info in sequence relation */
  	seq->last_value = next;		/* last fetched number */
***************
*** 660,665 ****
--- 664,674 ----
  	seq = (Form_pg_sequence) GETSTRUCT(&tuple);
+ 	if (elm->reset_logcnt)
+ 	{
+ 		seq->log_cnt = 0;
+ 		elm->reset_logcnt = false;
+ 	}
  	elm->increment = seq->increment_by;
  	return seq;
***************
*** 703,708 ****
--- 712,718 ----
  			 name, caller);
  		elm->relid = RelationGetRelid(seqrel);
  		elm->cached = elm->last = elm->increment = 0;
+ 		elm->reset_logcnt = true;
  	}
  }
  else
***************
*** 721,726 ****
--- 731,737 ----
  		elm->relid = RelationGetRelid(seqrel);
  		elm->cached = elm->last = elm->increment = 0;
  		elm->next = (SeqTable) NULL;
+ 		elm->reset_logcnt = true;
  		if (seqtab == (SeqTable) NULL)
  			seqtab = elm;
Ben Grimm <bgrimm@zaeon.com> writes:

> The behavior of SELECT nextval() should not be conditional on being in or
> out of a transaction block.

Nonsense. The behavior of INSERT or UPDATE is "conditional" in exactly the same way: you should not rely on the reported result until it's committed.

Given Vadim's performance concerns, I doubt he'll hold still for forcing an XLogFlush immediately every time a sequence XLOG record is written -- but AFAICS that'd be the only way to guarantee durability of a nextval result in advance of commit. Since I don't think that's an appropriate goal for the system to have, I don't care for it either.

I'm planning to try coding up Vadim's approach (pay attention to the page's old LSN to see if a WAL record must be generated) tonight or tomorrow and see if it seems reasonable.

			regards, tom lane
I noticed a message asking if this scenario was consistent with the other reports, and yes it is. We have seen this occurring on our system with versions as old as 7.0. Glad to see someone has finally nailed this one.

Dave