Обсуждение: BUG #15967: Sequence generation using NEXTVAL() fails on 64bit systems
BUG #15967: Sequence generation using NEXTVAL() fails on 64bit systems
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 15967 Logged by: Christoph Ziegenberg Email address: ziegenberg@web.de PostgreSQL version: 11.5 Operating system: Windows / Linux Description: Problem: Parallel API requests to an application which result in inserts into the database cause an duplicate key error. The column is the primary key, which value is generated by a sequence.The number of requests required to cause this varies in dependence of the database server/load, in my tests it took between 5 and 20 parallel requests. Details: For the inserts the next sequence value is selected using NEXTVAL(), done in a separate query (not directly in the INSERT). I did many tests: - with/without transactions (which shouldn't affect the sequence), - with/without a DEFAULT of NEXTVAL() for the primary key column - several Postgres versions... and could reduce the error to 64bit Postgres 10 / 11 installations on Windows and Ubuntu (didn't test older versions). Test results with different Postgres versions: - Postgres 10.10, 64bit (Ubuntu): failed - Postgres 10.10, 64bit (Windows): failed - Postgres 11.5, 64bit (Windows): failed - Postgres 10.10 32bit (Windows): works
I did many tests:
Can you include the code which implements those tests?
I can't reproduce with something like this:
perl -e 'use DBI; fork; fork; fork; my $dbh=DBI->connect("dbi:Pg:host=/tmp"); for (1..1e6) {my $seq = $dbh->selectrow_array("select nextval('\''foo_seq'\'')"); $dbh->do("insert into foo values (?)",undef,$seq);}'
Cheers,
Jeff
On Tue, Aug 20, 2019 at 9:14 AM Jeff Janes <jeff.janes@gmail.com> wrote: >> >> >> I did many tests: > > > Can you include the code which implements those tests? > > I can't reproduce with something like this: +1 this. This would be a very embarrassing and serious problem should it be proven to be valid. We would however need to rule out issues in the test code itself. The best possible approach in terms of verification would be a .sql only test with reproduction via pgbench -f (IMO). merlin
Re: BUG #15967: Sequence generation using NEXTVAL() fails on 64bit systems
От
Christoph Ziegenberg
Дата:
New info:
We could also reproduce it with Postgres 10.10 32bit on Windows Server 2016 and 9.6.8 64bit on Ubuntu (18.x). So also the 32bit version is affected, as well as 9.x versions.
Current test state:
We (2 of my colleagues and me) are currently trying to create an independent test, but at the moment it only occurrs within the specific application.
Of course we search for an error in the application first, checked the connection settings, tracked all statements... there is nothing special, only a handful of SELECTS and UPDATES, then the sequence generation and the (sometimes failing) INSERT, and another INSERT (no sequence generate here) and UPDATE following. In all cases we tested with the Postgres default configuration.
We could see that wrapping the selection of NEXTVAL() in a transaction (which shouldn't have any effect?) reduced the problem in one case, but it was still possible to reproduce the error with an increased number of requests.
At the moment we don't have a real idea how to proceed. Next step planned is to simulate all of the aforementioned requests done by the application, because they seem to influence the behavior.
Background:
The application is written in PHP (different versions used in the tests, 7.2 - 7.4 I guess), runs on Apache/IIS (FastCGI) and uses the PDO extension for the communication with Postgres.
Christoph
We could also reproduce it with Postgres 10.10 32bit on Windows Server 2016 and 9.6.8 64bit on Ubuntu (18.x). So also the 32bit version is affected, as well as 9.x versions.
Current test state:
We (2 of my colleagues and me) are currently trying to create an independent test, but at the moment it only occurrs within the specific application.
Of course we search for an error in the application first, checked the connection settings, tracked all statements... there is nothing special, only a handful of SELECTS and UPDATES, then the sequence generation and the (sometimes failing) INSERT, and another INSERT (no sequence generate here) and UPDATE following. In all cases we tested with the Postgres default configuration.
We could see that wrapping the selection of NEXTVAL() in a transaction (which shouldn't have any effect?) reduced the problem in one case, but it was still possible to reproduce the error with an increased number of requests.
At the moment we don't have a real idea how to proceed. Next step planned is to simulate all of the aforementioned requests done by the application, because they seem to influence the behavior.
Background:
The application is written in PHP (different versions used in the tests, 7.2 - 7.4 I guess), runs on Apache/IIS (FastCGI) and uses the PDO extension for the communication with Postgres.
Christoph
Am 20. August 2019 16:21:08 MESZ schrieb Merlin Moncure <mmoncure@gmail.com>:
On Tue, Aug 20, 2019 at 9:14 AM Jeff Janes <jeff.janes@gmail.com> wrote:
I did many tests:
Can you include the code which implements those tests?
I can't reproduce with something like this:
+1 this. This would be a very embarrassing and serious problem should
it be proven to be valid. We would however need to rule out issues
in the test code itself. The best possible approach in terms of
verification would be a .sql only test with reproduction via pgbench
-f (IMO).
merlin
Re: BUG #15967: Sequence generation using NEXTVAL() fails on 64bit systems
От
Christoph Ziegenberg
Дата:
We also had problems to reproduce it outside of the application, but are still working on an independent test script.
One special case is, that the application uses prepared statements for all queries, also for the NEXTVAL() selection.
Christoph
One special case is, that the application uses prepared statements for all queries, also for the NEXTVAL() selection.
Christoph
Am 20. August 2019 15:44:55 MESZ schrieb Jeff Janes <jeff.janes@gmail.com>:
I did many tests:Can you include the code which implements those tests?I can't reproduce with something like this:perl -e 'use DBI; fork; fork; fork; my $dbh=DBI->connect("dbi:Pg:host=/tmp"); for (1..1e6) {my $seq = $dbh->selectrow_array("select nextval('\''foo_seq'\'')"); $dbh->do("insert into foo values (?)",undef,$seq);}'Cheers,Jeff
On Tue, Aug 20, 2019 at 9:56 AM Christoph Ziegenberg <ziegenberg@web.de> wrote: > New info: > We could also reproduce it with Postgres 10.10 32bit on Windows Server 2016 and 9.6.8 64bit on Ubuntu (18.x). So also the32bit version is affected, as well as 9.x versions. Ok, that makes sense; if we don't have arch specific reproduction I'd say that lowers the likelihood of having a serious problem within postgres itself (just a hunch). > Current test state: > We (2 of my colleagues and me) are currently trying to create an independent test, but at the moment it only occurrs withinthe specific application. This of course is key. Most people reading this list are going to be suspicious of your application rather than postgres internals causing the issue. Having said that, let's see if we can isolate the problem. > Of course we search for an error in the application first, checked the connection settings, tracked all statements... thereis nothing special, only a handful of SELECTS and UPDATES, then the sequence generation and the (sometimes failing)INSERT, and another INSERT (no sequence generate here) and UPDATE following. In all cases we tested with the Postgresdefault configuration. > We could see that wrapping the selection of NEXTVAL() in a transaction (which shouldn't have any effect?) reduced the problemin one case, but it was still possible to reproduce the error with an increased number of requests. Being in a transaction makes shouldn't make any difference with regards to behavior. > At the moment we don't have a real idea how to proceed. Next step planned is to simulate all of the aforementioned requestsdone by the application, because they seem to influence the behavior. Isolating in a separate test would be proof. Sans that, we need to be suspicious of your test environment. Hm, how about this: perhaps we can tease the problem out with logging. One possible trick is to do this: CREATE OR REPLACE FUNCTION log_nextval(_Sequence TEXT, s OUT) RETURNS BIGINT AS $$ BEGIN s := nextval(_Sequence); RAISE WARNING 'Got value % from %', s, _Sequence; RETURN s; END; $$ LANGUAGE PLGSQL; Then, as an experiment, you can move all your nextval() generation to this function (note: there will be significant performance hit) and attempt to reproduce the problem in your application. If when you did, we could then interrogate the database log to see if the same value >1 times, this would be a smoking gun. If you did not see the value returned >1 times, maybe turn on full statement logging (also a big performance hit) and see where your application might be jacking things up. > Background: > The application is written in PHP (different versions used in the tests, 7.2 - 7.4 I guess), runs on Apache/IIS (FastCGI)and uses the PDO extension for the communication with Postgres. merlin
Re: BUG #15967: Sequence generation using NEXTVAL() fails on 64bit systems
От
Christoph Ziegenberg
Дата:
Good idea with the replacement of the nextval function. This confirms, that the problem is not an application or PHP problem.
Here an extract from the logs, showing the error:
2019-08-21 06:57:38.812 CEST [5824] WARNUNG: Got value 2799 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.812 CEST [5824] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.872 CEST [16236] WARNUNG: Got value 2800 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.872 CEST [16236] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.873 CEST [1436] WARNUNG: Got value 2801 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.873 CEST [1436] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.910 CEST [12480] WARNUNG: Got value 2801 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.910 CEST [12480] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.915 CEST [12480] FEHLER: doppelter Schlüsselwert verletzt Unique-Constraint »p_e2_ws_log_login«
2019-08-21 06:57:38.915 CEST [12480] DETAIL: Schlüssel »(log_login_id)=(2801)« existiert bereits.
2019-08-21 06:57:38.915 CEST [12480] ANWEISUNG: INSERT INTO e2_ws_log_login (log_login_id, client_id, log_action, log_datetime, log_ip_address, log_data) VALUES ($1, $2, $3, $4, $5, $6)
2019-08-21 06:57:39.031 CEST [10672] WARNUNG: Got value 2802 from public.e2_ws_log_login_seq
2019-08-21 06:57:39.031 CEST [10672] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
BTW: I had to change the function a little bit:
CREATE OR REPLACE FUNCTION log_nextval(_Sequence TEXT) RETURNS BIGINT AS
$$
DECLARE
s BIGINT;
BEGIN
s := nextval(_Sequence);
RAISE WARNING 'Got value % from %', s, _Sequence;
RETURN s;
END;
$$ LANGUAGE 'plpgsql';
Christoph
Here an extract from the logs, showing the error:
2019-08-21 06:57:38.812 CEST [5824] WARNUNG: Got value 2799 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.812 CEST [5824] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.872 CEST [16236] WARNUNG: Got value 2800 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.872 CEST [16236] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.873 CEST [1436] WARNUNG: Got value 2801 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.873 CEST [1436] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.910 CEST [12480] WARNUNG: Got value 2801 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.910 CEST [12480] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.915 CEST [12480] FEHLER: doppelter Schlüsselwert verletzt Unique-Constraint »p_e2_ws_log_login«
2019-08-21 06:57:38.915 CEST [12480] DETAIL: Schlüssel »(log_login_id)=(2801)« existiert bereits.
2019-08-21 06:57:38.915 CEST [12480] ANWEISUNG: INSERT INTO e2_ws_log_login (log_login_id, client_id, log_action, log_datetime, log_ip_address, log_data) VALUES ($1, $2, $3, $4, $5, $6)
2019-08-21 06:57:39.031 CEST [10672] WARNUNG: Got value 2802 from public.e2_ws_log_login_seq
2019-08-21 06:57:39.031 CEST [10672] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
BTW: I had to change the function a little bit:
CREATE OR REPLACE FUNCTION log_nextval(_Sequence TEXT) RETURNS BIGINT AS
$$
DECLARE
s BIGINT;
BEGIN
s := nextval(_Sequence);
RAISE WARNING 'Got value % from %', s, _Sequence;
RETURN s;
END;
$$ LANGUAGE 'plpgsql';
Christoph
Am 20. August 2019 18:06:42 MESZ schrieb Merlin Moncure <mmoncure@gmail.com>:
On Tue, Aug 20, 2019 at 9:56 AM Christoph Ziegenberg <ziegenberg@web.de> wrote:New info:
We could also reproduce it with Postgres 10.10 32bit on Windows Server 2016 and 9.6.8 64bit on Ubuntu (18.x). So also the 32bit version is affected, as well as 9.x versions.
Ok, that makes sense; if we don't have arch specific reproduction I'd
say that lowers the likelihood of having a serious problem within
postgres itself (just a hunch).Current test state:
We (2 of my colleagues and me) are currently trying to create an independent test, but at the moment it only occurrs within the specific application.
This of course is key. Most people reading this list are going to be
suspicious of your application rather than postgres internals causing
the issue. Having said that, let's see if we can isolate the problem.Of course we search for an error in the application first, checked the connection settings, tracked all statements... there is nothing special, only a handful of SELECTS and UPDATES, then the sequence generation and the (sometimes failing) INSERT, and another INSERT (no sequence generate here) and UPDATE following. In all cases we tested with the Postgres default configuration.We could see that wrapping the selection of NEXTVAL() in a transaction (which shouldn't have any effect?) reduced the problem in one case, but it was still possible to reproduce the error with an increased number of requests.
Being in a transaction makes shouldn't make any difference with
regards to behavior.At the moment we don't have a real idea how to proceed. Next step planned is to simulate all of the aforementioned requests done by the application, because they seem to influence the behavior.
Isolating in a separate test would be proof. Sans that, we need to be
suspicious of your test environment. Hm, how about this: perhaps we
can tease the problem out with logging. One possible trick is to do
this:
CREATE OR REPLACE FUNCTION log_nextval(_Sequence TEXT, s OUT) RETURNS BIGINT AS
$$
BEGIN
s := nextval(_Sequence);
RAISE WARNING 'Got value % from %', s, _Sequence;
RETURN s;
END;
$$ LANGUAGE PLGSQL;
Then, as an experiment, you can move all your nextval() generation to
this function (note: there will be significant performance hit) and
attempt to reproduce the problem in your application. If when you
did, we could then interrogate the database log to see if the same
value >1 times, this would be a smoking gun. If you did not see the
value returned >1 times, maybe turn on full statement logging (also a
big performance hit) and see where your application might be jacking
things up.Background:
The application is written in PHP (different versions used in the tests, 7.2 - 7.4 I guess), runs on Apache/IIS (FastCGI) and uses the PDO extension for the communication with Postgres.
merlin
Re: BUG #15967: Sequence generation using NEXTVAL() fails on 64bit systems
От
Christoph Ziegenberg
Дата:
Good news: We finally found the error.
It was a trigger on one table that updated the sequence using setval()... Of course this cannot work and I don't know why this was added (perhaps for a migration script?).
This use of setval() via the trigger was missing in the Postgres statement log (perhaps because it was a PERFORM statement?), so we didn't find it earlier.
Sorry and thanks for your help to all.
Christoph
It was a trigger on one table that updated the sequence using setval()... Of course this cannot work and I don't know why this was added (perhaps for a migration script?).
This use of setval() via the trigger was missing in the Postgres statement log (perhaps because it was a PERFORM statement?), so we didn't find it earlier.
Sorry and thanks for your help to all.
Christoph
Am 21. August 2019 07:11:16 MESZ schrieb Christoph Ziegenberg <ziegenberg@web.de>:
Good idea with the replacement of the nextval function. This confirms, that the problem is not an application or PHP problem.
Here an extract from the logs, showing the error:
2019-08-21 06:57:38.812 CEST [5824] WARNUNG: Got value 2799 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.812 CEST [5824] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.872 CEST [16236] WARNUNG: Got value 2800 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.872 CEST [16236] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.873 CEST [1436] WARNUNG: Got value 2801 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.873 CEST [1436] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.910 CEST [12480] WARNUNG: Got value 2801 from public.e2_ws_log_login_seq
2019-08-21 06:57:38.910 CEST [12480] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
2019-08-21 06:57:38.915 CEST [12480] FEHLER: doppelter Schlüsselwert verletzt Unique-Constraint »p_e2_ws_log_login«
2019-08-21 06:57:38.915 CEST [12480] DETAIL: Schlüssel »(log_login_id)=(2801)« existiert bereits.
2019-08-21 06:57:38.915 CEST [12480] ANWEISUNG: INSERT INTO e2_ws_log_login (log_login_id, client_id, log_action, log_datetime, log_ip_address, log_data) VALUES ($1, $2, $3, $4, $5, $6)
2019-08-21 06:57:39.031 CEST [10672] WARNUNG: Got value 2802 from public.e2_ws_log_login_seq
2019-08-21 06:57:39.031 CEST [10672] ZUSAMMENHANG: PL/pgSQL-Funktion log_nextval(text) Zeile 6 bei RAISE
BTW: I had to change the function a little bit:
CREATE OR REPLACE FUNCTION log_nextval(_Sequence TEXT) RETURNS BIGINT AS
$$
DECLARE
s BIGINT;
BEGIN
s := nextval(_Sequence);
RAISE WARNING 'Got value % from %', s, _Sequence;
RETURN s;
END;
$$ LANGUAGE 'plpgsql';
ChristophAm 20. August 2019 18:06:42 MESZ schrieb Merlin Moncure <mmoncure@gmail.com>:On Tue, Aug 20, 2019 at 9:56 AM Christoph Ziegenberg <ziegenberg@web.de> wrote:New info:
We could also reproduce it with Postgres 10.10 32bit on Windows Server 2016 and 9.6.8 64bit on Ubuntu (18.x). So also the 32bit version is affected, as well as 9.x versions.
Ok, that makes sense; if we don't have arch specific reproduction I'd
say that lowers the likelihood of having a serious problem within
postgres itself (just a hunch).Current test state:
We (2 of my colleagues and me) are currently trying to create an independent test, but at the moment it only occurrs within the specific application.
This of course is key. Most people reading this list are going to be
suspicious of your application rather than postgres internals causing
the issue. Having said that, let's see if we can isolate the problem.Of course we search for an error in the application first, checked the connection settings, tracked all statements... there is nothing special, only a handful of SELECTS and UPDATES, then the sequence generation and the (sometimes failing) INSERT, and another INSERT (no sequence generate here) and UPDATE following. In all cases we tested with the Postgres default configuration.We could see that wrapping the selection of NEXTVAL() in a transaction (which shouldn't have any effect?) reduced the problem in one case, but it was still possible to reproduce the error with an increased number of requests.
Being in a transaction makes shouldn't make any difference with
regards to behavior.At the moment we don't have a real idea how to proceed. Next step planned is to simulate all of the aforementioned requests done by the application, because they seem to influence the behavior.
Isolating in a separate test would be proof. Sans that, we need to be
suspicious of your test environment. Hm, how about this: perhaps we
can tease the problem out with logging. One possible trick is to do
this:
CREATE OR REPLACE FUNCTION log_nextval(_Sequence TEXT, s OUT) RETURNS BIGINT AS
$$
BEGIN
s := nextval(_Sequence);
RAISE WARNING 'Got value % from %', s, _Sequence;
RETURN s;
END;
$$ LANGUAGE PLGSQL;
Then, as an experiment, you can move all your nextval() generation to
this function (note: there will be significant performance hit) and
attempt to reproduce the problem in your application. If when you
did, we could then interrogate the database log to see if the same
value >1 times, this would be a smoking gun. If you did not see the
value returned >1 times, maybe turn on full statement logging (also a
big performance hit) and see where your application might be jacking
things up.Background:
The application is written in PHP (different versions used in the tests, 7.2 - 7.4 I guess), runs on Apache/IIS (FastCGI) and uses the PDO extension for the communication with Postgres.
merlin
On Wed, Aug 21, 2019 at 8:27 AM Christoph Ziegenberg <ziegenberg@web.de> wrote: > > Good news: We finally found the error. > > It was a trigger on one table that updated the sequence using setval()... Of course this cannot work and I don't know whythis was added (perhaps for a migration script?). > > This use of setval() via the trigger was missing in the Postgres statement log (perhaps because it was a PERFORM statement?),so we didn't find it earlier. > > Sorry and thanks for your help to all. No worries, glad you found the issue. I mean, nextval() is probably the most commonly trafficked function in the database and is dead reliable for millions of users, so there had to be another explanation, right? :-). setval() is an extremely dangerous function outside of cases related to restoring data; it modifies state outside of transactional guarantees. if you saw it in code not guarded by some kind of lock you ought to be suspicious. merlin