Discussion: Sequence's value can be rolled back after crash recovery
Hi,

Should we guarantee that a sequence's nextval is never rolled back, even in
the crash-recovery case?  I can reproduce the rollback as follows:

Session 1:

    CREATE SEQUENCE s;
    BEGIN;
    SELECT nextval('s'); \watch 0.01

Session 2:

    kill -9 {sess1.pid}

After the restart, nextval('s') may be rolled back (return less than the
last value seen in session 1).  The reason is that we never flush the xlog
for nextval_internal in the above case, so if the system crashes there is
nothing to redo from.  It can be fixed with the following one-line change:

@@ -810,6 +810,8 @@ nextval_internal(Oid relid, bool check_permissions)
        recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);

        PageSetLSN(page, recptr);
+
+       XLogFlush(recptr);
 }

If a user uses sequence values for some external system, the rolled-back
value may surprise them.  [I didn't run into this issue in any real case;
I just studied the xlog / sequence code today and found this case.]

--
Best Regards
Andy Fan
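The failure mode described above can be sketched with a toy model (illustrative Python, not PostgreSQL code; all names here are invented). The point is only that a WAL record sitting in the in-memory WAL buffers does nothing for crash recovery until it is flushed:

```python
# Toy model: a sequence whose WAL record is inserted but not flushed.
# On a crash, only durable (flushed) WAL can be replayed, so the
# sequence's value appears to "roll back".

class MiniSeq:
    def __init__(self):
        self.value = 0        # in-memory (shared buffer) state
        self.on_disk = 0      # last on-disk state
        self.wal_buffer = []  # WAL records not yet fsync'ed
        self.wal_disk = []    # durable WAL

    def nextval(self, flush=False):
        self.value += 1
        self.wal_buffer.append(self.value)  # XLogInsert: goes to WAL buffers
        if flush:
            self.flush_wal()                # the proposed XLogFlush(recptr)
        return self.value

    def flush_wal(self):
        self.wal_disk.extend(self.wal_buffer)
        self.wal_buffer.clear()

    def crash_and_recover(self):
        self.wal_buffer.clear()             # kill -9: unflushed WAL is gone
        # REDO: replay durable WAL on top of the on-disk state
        self.value = max([self.on_disk] + self.wal_disk)

seq = MiniSeq()
for _ in range(5):
    seq.nextval(flush=False)   # current behavior: no per-call XLogFlush
seq.crash_and_recover()
print(seq.value)               # 0 -- the sequence "rolled back"
```

With `flush=True` (the proposed change), the same scenario recovers to 5; the cost is the extra I/O discussed below.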
On Sunday, November 21, 2021, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Should we guarantee the sequence's nextval should never be rolled back
> even in a crashed recovery case?
> I can produce the rollback in the following case:
This seems to be the same observation that was made a little over a year ago.
I don’t think the suggested documentation ever got written but haven’t looked for it either.
David J.
On Mon, 2021-11-22 at 14:57 +0800, Andy Fan wrote:
> Should we guarantee that a sequence's nextval is never rolled back, even
> in the crash-recovery case?  I can reproduce the rollback as follows:
>
> Session 1:
> CREATE SEQUENCE s;
> BEGIN;
> SELECT nextval('s'); \watch 0.01
>
> Session 2:
> kill -9 {sess1.pid}
>
> After the restart, nextval('s') may be rolled back (return less than the
> last value seen in session 1).  The reason is that we never flush the
> xlog for nextval_internal in the above case, so if the system crashes
> there is nothing to redo from.  It can be fixed with the following
> one-line change:
>
> @@ -810,6 +810,8 @@ nextval_internal(Oid relid, bool check_permissions)
>         recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
>
>         PageSetLSN(page, recptr);
> +
> +       XLogFlush(recptr);
>  }
>
> If a user uses sequence values for some external system, the rolled-back
> value may surprise them.

I think that is a bad idea.  It will have an intolerable performance
impact on OLTP queries, doubling the number of I/O requests in many
cases.

Perhaps it would make sense to document that you should never rely on
sequence values from an uncommitted transaction.

Yours,
Laurenz Albe
> > The reason is that we never flush the xlog for nextval_internal in
> > the above case, so if the system crashes there is nothing to redo
> > from.  It can be fixed with the following one-line change:
> >
> > @@ -810,6 +810,8 @@ nextval_internal(Oid relid, bool check_permissions)
> >         recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG);
> >
> >         PageSetLSN(page, recptr);
> > +
> > +       XLogFlush(recptr);
> >  }
>
> I think that is a bad idea.
> It will have an intolerable performance impact on OLTP queries, doubling
> the number of I/O requests in many cases.

The performance argument was expected.  If we look at nextval_internal
more carefully, we can see that it would not flush the xlog on every
call, even when the sequence's cache size is 1.  Because of SEQ_LOG_VALS,
the WAL record (and hence the added flush) would happen at most once per
32 calls to nextval_internal in the worst case.

> Perhaps it would make sense to document that you should never rely on
> sequence values from an uncommitted transaction.

I am OK with this if more people think this is the solution.

--
Best Regards
Andy Fan
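The batching being described can be sketched like this (a simplified Python model; the constant 32 is SEQ_LOG_VALS from the source, but the real nextval_internal prefetches 32 values' worth in a single WAL record rather than counting as literally as shown here):

```python
# Toy model of SEQ_LOG_VALS batching: a WAL record (and, under the
# proposed patch, a flush) is emitted only when the logged-ahead budget
# is exhausted, i.e. at most once per 32 nextval calls.
SEQ_LOG_VALS = 32

def simulate(ncalls):
    log_cnt = 0       # values still covered by the last WAL record
    wal_records = 0
    for _ in range(ncalls):
        if log_cnt == 0:
            wal_records += 1        # XLogInsert (+ the proposed XLogFlush)
            log_cnt = SEQ_LOG_VALS  # next 32 values need no new record
        log_cnt -= 1
    return wal_records

print(simulate(1000))   # 32 -- one WAL record per 32 nextval calls
```

So even with the patch, the added flush cost would be amortized over 32 values, which is the point of the reply above.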
On Mon, 2021-11-22 at 15:43 +0800, Andy Fan wrote:
> The performance argument was expected.  If we look at nextval_internal
> more carefully, we can see that it would not flush the xlog on every
> call, even when the sequence's cache size is 1.  Because of
> SEQ_LOG_VALS, the WAL record (and hence the added flush) would happen
> at most once per 32 calls to nextval_internal in the worst case.

Right, I didn't think of that.  Still, I'm -1 on this performance
regression.

Yours,
Laurenz Albe
On 11/22/21, 5:10 AM, "Laurenz Albe" <laurenz.albe@cybertec.at> wrote:
> On Mon, 2021-11-22 at 15:43 +0800, Andy Fan wrote:
>> The performance argument was expected.  If we look at nextval_internal
>> more carefully, we can see that it would not flush the xlog on every
>> call, even when the sequence's cache size is 1.  Currently it happens
>> once per 32 calls to nextval_internal in the worst case.
>
> Right, I didn't think of that.  Still, I'm -1 on this performance
> regression.

I periodically hear rumblings about this behavior as well.  At the very
least, it certainly ought to be documented if it isn't yet.  I wouldn't
mind trying my hand at that.  Perhaps we could also add a new
configuration parameter if users really want to take the performance hit.

Nathan
"Bossart, Nathan" <bossartn@amazon.com> writes:
> I periodically hear rumblings about this behavior as well.  At the
> very least, it certainly ought to be documented if it isn't yet.  I
> wouldn't mind trying my hand at that.  Perhaps we could also add a new
> configuration parameter if users really want to take the performance
> hit.

A sequence's cache length is already configurable, no?

			regards, tom lane
On 11/22/21 12:31, Tom Lane wrote:
> "Bossart, Nathan" <bossartn@amazon.com> writes:
>> I periodically hear rumblings about this behavior as well.  At the
>> very least, it certainly ought to be documented if it isn't yet.  I
>> wouldn't mind trying my hand at that.  Perhaps we could also add a new
>> configuration parameter if users really want to take the performance
>> hit.
>
> A sequence's cache length is already configurable, no?

Cache length isn't related to the problem here.

The problem is that PostgreSQL sequences are entirely unsafe to use from
a durability perspective, unless there's DML in the same transaction.

Users might normally think that "commit" makes things durable.
Unfortunately, IIUC, that's not true for sequences in PostgreSQL.

-Jeremy

PS. My bad on the documentation thing... I just noticed that I said a
year ago I'd take a swing at a doc update, and I never did!!  Between
Nate and me we'll get something proposed.

--
http://about.me/jeremy_schneider
On Tue, Nov 23, 2021 at 4:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Bossart, Nathan" <bossartn@amazon.com> writes:
>> I periodically hear rumblings about this behavior as well. At the
>> very least, it certainly ought to be documented if it isn't yet. I
>> wouldn't mind trying my hand at that. Perhaps we could also add a new
>> configuration parameter if users really want to take the performance
>> hit.
>
> A sequence's cache length is already configurable, no?
We can hit this issue even with cache=1.  And even if we added the
XLogFlush, with cache size 1 the xlog is still written and flushed only
once every 32 values.

I know your opinion about this at [1]; IIUC you may have missed the
SEQ_LOG_VALS design, which was already introduced for performance
reasons to avoid frequent xlog inserts.  But even with it, XLogFlush is
still not called, which is what causes this issue.
Best Regards
Andy Fan
On 11/22/21 9:42 PM, Jeremy Schneider wrote:
> On 11/22/21 12:31, Tom Lane wrote:
>> A sequence's cache length is already configurable, no?
>
> Cache length isn't related to the problem here.
>
> The problem is that PostgreSQL sequences are entirely unsafe to use from
> a durability perspective, unless there's DML in the same transaction.
>
> Users might normally think that "commit" makes things durable.
> Unfortunately, IIUC, that's not true for sequences in PostgreSQL.

That's not what the example in this thread demonstrates, though.  There's
no COMMIT in that example, so it shows that we may discard the nextval()
in uncommitted transactions.  I fail to see how that's less durable than
any other DML (e.g. we don't complain about INSERT not being durable if
you don't commit the change).

If you can show that the sequence goes back after a commit, that'd be an
actual durability issue.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 23, 2021 at 9:30 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> That's not what the example in this thread demonstrates, though.  There's
> no COMMIT in that example, so it shows that we may discard the nextval()
> in uncommitted transactions.  I fail to see how that's less durable than
> any other DML (e.g. we don't complain about INSERT not being durable if
> you don't commit the change).
>
> If you can show that the sequence goes back after a commit, that'd be an
> actual durability issue.

I think the thread at [1] claimed to see this issue even after a commit;
I haven't tried it myself, though.

[1] https://www.postgresql.org/message-id/flat/ea6485e3-98d0-24a7-094c-87f9d5f9b18f%40amazon.com#4cfe7217c829419b769339465e8c2915

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 11/23/21 5:22 AM, Dilip Kumar wrote:
> I think the thread at [1] claimed to see this issue even after a commit;
> I haven't tried it myself, though.
>
> [1] https://www.postgresql.org/message-id/flat/ea6485e3-98d0-24a7-094c-87f9d5f9b18f%40amazon.com#4cfe7217c829419b769339465e8c2915

I did try, and I haven't been able to reproduce that behavior (on master,
at least).

I see Tom speculated we may not flush WAL if a transaction only does
nextval() in that other thread, but I don't think that's true.  AFAICS if
the nextval() call writes stuff to WAL, RecordTransactionCommit will have
wrote_xlog=true and a valid XID, and so it'll do the usual XLogFlush()
etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
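The point above, that commit itself forces the flush, can be sketched as a toy model (illustrative Python, not the real RecordTransactionCommit; all names are invented):

```python
# Toy model: a transaction that wrote any WAL flushes it at commit,
# so a *committed* nextval() survives a crash even though nothing was
# flushed at the time of the nextval() call itself.

class MiniTxn:
    def __init__(self):
        self.wal_buffer = []   # WAL records not yet fsync'ed
        self.wal_disk = []     # durable WAL
        self.wrote_xlog = False

    def nextval_record(self, value):
        self.wal_buffer.append(("SEQ_LOG", value))  # XLogInsert, no flush
        self.wrote_xlog = True

    def commit(self):
        # commit forces a flush when the txn wrote WAL
        if self.wrote_xlog:
            self.wal_disk.extend(self.wal_buffer)
            self.wal_buffer.clear()

    def last_value_after_crash(self):
        self.wal_buffer.clear()  # crash: unflushed WAL is lost
        vals = [v for (kind, v) in self.wal_disk if kind == "SEQ_LOG"]
        return max(vals) if vals else 0

txn = MiniTxn()
txn.nextval_record(42)
txn.commit()
print(txn.last_value_after_crash())   # 42 -- committed value is durable
```

Without the `commit()` call, the same model recovers to 0, which is exactly the uncommitted-transaction case from the start of the thread.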
> I think the thread at [1] claimed to see this issue even after a commit;
> I haven't tried it myself, though.
>
> [1] https://www.postgresql.org/message-id/flat/ea6485e3-98d0-24a7-094c-87f9d5f9b18f%40amazon.com#4cfe7217c829419b769339465e8c2915
> I did try, and I haven't been able to reproduce that behavior (on
> master, at least).
I agree with this; the commit would flush the xlog and persist the change.
> I see Tom speculated we may not flush WAL if a transaction only does
> nextval() in that other thread, but I don't think that's true.  AFAICS
> if the nextval() call writes stuff to WAL, RecordTransactionCommit will
> have wrote_xlog=true and a valid XID, and so it'll do the usual
> XLogFlush() etc.
I agree with this as well; otherwise, how could the change be replicated
to a standby when the user runs only SELECT nextval('s') in a
transaction?
> I fail to see how that's less durable than any other DML (e.g. we don't
> complain about INSERT not being durable if you don't commit the change).
> If you can show that the sequence goes back after a commit, that'd be an
> actual durability issue.
This can't be called a transaction durability issue, but people usually
assume that a sequence's value will not go backwards, so it may surprise
them if that happens.
Best Regards
Andy Fan
On 11/23/21 05:49, Andy Fan wrote:
>> I did try, and I haven't been able to reproduce that behavior (on
>> master, at least).
>
> I agree with this; the commit would flush the xlog and persist the
> change.

On that older thread, there were exact reproductions in the first email
from Vini - two of them - available here:
https://gist.github.com/vinnix/2fe148e3c42e11269bac5fcc5c78a8d1

Nathan helped me realize a mistake I made here.  The second reproduction
involved having psql run nextval() inside of an explicit transaction.  I
had assumed that the transaction would be committed when psql closed the
session without error.  This is because in Oracle SQL*Plus (my original
RDBMS background), the "exitcommit" setting has a default value giving
this behavior.

This was a silly mistake on my part.  When psql closes the connection
with an open transaction, it turns out that the PostgreSQL server will
abort the transaction rather than committing it.  (Oracle DBAs, beware!)

Nonetheless, Vini's first reproduction did not make this same mistake.
It involved 10 psql sessions in parallel using implicit transactions,
suspending I/O (via the Linux device mapper), and killing PG while the
I/O was suspended.

Given my mistake on the second repro, I want to look a little closer at
the first reproduction and revisit whether it actually demonstrates a
corner case where durability isn't being handled correctly - that is,
"COMMIT" returning successfully to the application, and yet the sequence
numbers being repeated.  Maybe there's something involving the Linux I/O
path coming into play here.

-Jeremy

--
http://about.me/jeremy_schneider
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> I see Tom speculated we may not flush WAL if a transaction only does
> nextval() in that other thread, but I don't think that's true.  AFAICS
> if the nextval() call writes stuff to WAL, RecordTransactionCommit will
> have wrote_xlog=true and a valid XID, and so it'll do the usual
> XLogFlush() etc.

Yeah.  I didn't look at the code during last year's discussion, but now
I have, and I see that if nextval_internal() decides it needs to make a
WAL entry, it is careful to ensure the xact has an XID:

    /*
     * If something needs to be WAL logged, acquire an xid, so this
     * transaction's commit will trigger a WAL flush and wait for syncrep.
     * It's sufficient to ensure the toplevel transaction has an xid, no need
     * to assign xids subxacts, that'll already trigger an appropriate wait.
     * (Have to do that here, so we're outside the critical section)
     */
    if (logit && RelationNeedsWAL(seqrel))
        GetTopTransactionId();

So that eliminates my worry that RecordTransactionCommit would decide it
doesn't need to do anything.  If there was a WAL record, we will flush it
at commit (and not before).  As you say, this is exactly as durable as
any other DML operation.  So I don't feel a need to change the code
behavior.

The problematic situation seems to be where an application gets a
nextval() result and uses it for some persistent outside-the-DB state,
without having made sure that the nextval() was committed.  You could say
that that's the same rookie error as relying on the persistence of any
other uncommitted DML ... except that at [1] we say

    To avoid blocking concurrent transactions that obtain numbers from
    the same sequence, a nextval operation is never rolled back; that is,
    once a value has been fetched it is considered used and will not be
    returned again.  This is true even if the surrounding transaction
    later aborts, or if the calling query ends up not using the value.

It's not so unreasonable to read that as promising persistence over
crashes as well as xact aborts.  So I think we need to improve the docs
here.  A minimal fix would be to leave the existing text alone and add a
separate para to the <caution> block, along the lines of

    However, the above statements do not apply if the database cluster
    crashes before committing the transaction containing the nextval
    operation.  In that case the sequence advance might not have made
    its way to persistent storage, so that it is uncertain whether the
    same value can be returned again after the cluster restarts.  If you
    wish to use a nextval result for persistent outside-the-database
    purposes, make sure that the nextval has been committed before doing
    so.

I wonder though if we shouldn't try to improve the existing text.  The
phrasing "never rolled back" seems like it's too easily misinterpreted.
Maybe rewrite the <caution> block like

    To avoid blocking concurrent transactions that obtain numbers from
    the same sequence, the value obtained by nextval is not reclaimed
    for re-use if the calling transaction later aborts.  This means that
    transaction aborts or database crashes can result in gaps in the
    sequence of assigned values.  That can happen without a transaction
    abort, too.
    -- this text is unchanged: --
    For example an INSERT with an ON CONFLICT clause will compute the
    to-be-inserted tuple, including doing any required nextval calls,
    before detecting any conflict that would cause it to follow the ON
    CONFLICT rule instead.  Such cases will leave unused "holes" in the
    sequence of assigned values.  Thus, PostgreSQL sequence objects
    cannot be used to obtain "gapless" sequences.

    Likewise, any sequence state changes made by setval are not undone
    if the transaction rolls back.
    -- end unchanged text --
    If the database cluster crashes before committing the transaction
    containing a nextval operation, the sequence advance might not yet
    have made its way to persistent storage, so that it is uncertain
    whether the same value can be returned again after the cluster
    restarts.  This is harmless for usage of the nextval value within
    that transaction, since its other effects will not be visible
    either.  However, if you wish to use a nextval result for persistent
    outside-the-database purposes, make sure that the nextval operation
    has been committed before doing so.

Thoughts?

			regards, tom lane

[1] https://www.postgresql.org/docs/current/functions-sequence.html
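The guarantee quoted from nextval_internal() can be reduced to a toy model (illustrative Python; the function names echo the real ones, but the bodies and the XID value are invented):

```python
# Toy model: nextval_internal acquires a top-level XID whenever it writes
# WAL, so commit processing cannot conclude there is nothing to flush.

def nextval_internal(txn, logit=True, needs_wal=True):
    if logit and needs_wal:
        txn["xid"] = 1001            # GetTopTransactionId()
        txn["wrote_xlog"] = True     # XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG)

def record_transaction_commit(txn):
    # Commit flushes WAL (and waits for syncrep) only for transactions
    # that have an XID or wrote WAL; pure read-only transactions skip it.
    return bool(txn.get("xid")) or txn.get("wrote_xlog", False)

txn = {}
nextval_internal(txn)
print(record_transaction_commit(txn))   # True -- WAL is flushed at commit

readonly = {}
print(record_transaction_commit(readonly))  # False -- nothing to flush
```

The design choice this models: the flush happens once at commit rather than on every nextval(), which is what makes the sequence "exactly as durable as any other DML".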
I wrote:
> I wonder though if we shouldn't try to improve the existing text.
> The phrasing "never rolled back" seems like it's too easily
> misinterpreted.  Maybe rewrite the <caution> block like
> ...

A bit of polishing later, maybe like the attached.

			regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 74d3087a72..9e5ce3163a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17645,11 +17645,11 @@ SELECT setval('myseq', 42, false); <lineannotation>Next <function>nextval</fu
    <caution>
     <para>
      To avoid blocking concurrent transactions that obtain numbers from
-     the same sequence, a <function>nextval</function> operation is never
-     rolled back; that is, once a value has been fetched it is considered
-     used and will not be returned again.  This is true even if the
-     surrounding transaction later aborts, or if the calling query ends
-     up not using the value.  For example an <command>INSERT</command> with
+     the same sequence, the value obtained by <function>nextval</function>
+     is not reclaimed for re-use if the calling transaction later aborts.
+     This means that transaction aborts or database crashes can result in
+     gaps in the sequence of assigned values.  That can happen without a
+     transaction abort, too.  For example an <command>INSERT</command> with
      an <literal>ON CONFLICT</literal> clause will compute the to-be-inserted
      tuple, including doing any required <function>nextval</function>
      calls, before detecting any conflict that would cause it to follow
@@ -17661,8 +17661,22 @@ SELECT setval('myseq', 42, false); <lineannotation>Next <function>nextval</fu
     </para>

     <para>
-     Likewise, any sequence state changes made by <function>setval</function>
-     are not undone if the transaction rolls back.
+     Likewise, sequence state changes made by <function>setval</function>
+     are immediately visible to other transactions, and are not undone if
+     the calling transaction rolls back.
+    </para>
+
+    <para>
+     If the database cluster crashes before committing a transaction
+     containing a <function>nextval</function>
+     or <function>setval</function> call, the sequence state change might
+     not have made its way to persistent storage, so that it is uncertain
+     whether the sequence will have its original or updated state after the
+     cluster restarts.  This is harmless for usage of the sequence within
+     the database, since other effects of uncommitted transactions will not
+     be visible either.  However, if you wish to use a sequence value for
+     persistent outside-the-database purposes, make sure that the
+     <function>nextval</function> call has been committed before doing so.
     </para>
    </caution>
On 11/23/21, 1:41 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> I wonder though if we shouldn't try to improve the existing text.
>> The phrasing "never rolled back" seems like it's too easily
>> misinterpreted.  Maybe rewrite the <caution> block like
>> ...
>
> A bit of polishing later, maybe like the attached.

The doc updates look good to me.  Yesterday I suggested possibly adding
a way to ensure that nextval() called in an uncommitted transaction was
persistent, but I think we'd have to also ensure that synchronous
replication waits for those records, too.  Anyway, I don't think it is
unreasonable to require the transaction to be committed to avoid
duplicates from nextval().

Nathan
> You could say that that's the same rookie error as relying on the
> persistence of any other uncommitted DML ...
IIUC, this is not exactly the same as uncommitted DML.  Uncommitted DML
is always rolled back.  But for a sequence, by design the xmax is not
changed when the sequence's value is updated, and we don't maintain
multiple versions for sequences, so a sequence change can't be cleanly
rolled back.  In fact, a dirty data page flush can persist the change
whether the transaction commits or aborts.  The example below shows the
difference:
SELECT nextval('s');   -- returns 1

BEGIN;
SELECT nextval('s'); \watch 0.1
-- let this run for a while, so that checkpoints or dirty-page flushes
-- happen, then crash the server

If we run nextval('s') on the recovered system, we will probably *not*
get 2 again (assuming cache size 1), unlike uncommitted DML, whose
effects would have been rolled back.
Best Regards
Andy Fan
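The asymmetry between sequences and ordinary DML described above can be sketched as a toy model (illustrative Python; the idea is that a sequence page is updated in place with no MVCC versioning, while table rows carry an XID that recovery and visibility checks can treat as aborted):

```python
# Toy model: a background dirty-page flush persists a sequence advance
# made inside a transaction that later aborts, while the transaction's
# ordinary row stays invisible because its XID is marked aborted.

class Store:
    def __init__(self):
        self.seq_page = 0     # in-memory sequence page (updated in place)
        self.seq_disk = 0
        self.rows_page = []   # ordinary table: each row tagged with its xid
        self.rows_disk = []

    def in_txn(self, xid):
        self.seq_page += 1                   # nextval: in-place update
        self.rows_page.append((xid, "row"))  # INSERT: version tagged xid

    def checkpoint(self):
        # dirty pages reach disk regardless of transaction state
        self.seq_disk = self.seq_page
        self.rows_disk = list(self.rows_page)

    def crash_then_recover(self, aborted_xids):
        # Aborted rows are filtered out by visibility rules; the sequence
        # page carries no xid, so its flushed advance simply stays.
        visible = [r for r in self.rows_disk if r[0] not in aborted_xids]
        return self.seq_disk, visible

s = Store()
s.in_txn(xid=7)        # BEGIN; SELECT nextval('s'); INSERT ...
s.checkpoint()         # background flush while the txn is still open
seq, rows = s.crash_then_recover(aborted_xids={7})
print(seq, rows)       # 1 [] : the nextval persisted, the row did not
```

So "never rolled back" and "not durable until commit" can both be true at once: whether the uncommitted advance survives depends on whether the page (or WAL) happened to reach disk before the crash.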
On Tue, 2021-11-23 at 16:41 -0500, Tom Lane wrote:
> A bit of polishing later, maybe like the attached.

That looks good to me.

Yours,
Laurenz Albe
Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Tue, 2021-11-23 at 16:41 -0500, Tom Lane wrote:
>> A bit of polishing later, maybe like the attached.
>
> That looks good to me.

Pushed, thanks.

			regards, tom lane