Обсуждение: Re: Curious buildfarm failures (fwd)
Hi, > Date: Tue, 15 Jan 2013 11:57:07 -0500 > From: Tom Lane <tgl@sss.pgh.pa.us> > To: Andres Freund <andres@2ndquadrant.com> > Cc: math@sai.msu.ru, pgsql-hackers@postgreSQL.org, > Andrew Dunstan <andrew@dunslane.net> > Subject: Re: Curious buildfarm failures > > Well, it could be quite reproducible, if for example what's happening is > that the DROP is failing to wait for the checkpointer at all. > > Is there a way to enable log_checkpoints during a buildfarm run? > It'd be good to get timestamps added to the postmaster log entries, too. Here is the log output from the failing pg_regress after enabling checkpoints and timestamps: 2013-01-15 21:20:19 MSK [50f58fd3.589e:1] LOG: database system was shut down at 2013-01-15 21:20:19 MS K 2013-01-15 21:20:19 MSK [50f58fd3.58a2:1] LOG: autovacuum launcher started 2013-01-15 21:20:19 MSK [50f58fd3.5899:1] LOG: database system is ready to accept connections 2013-01-15 21:20:20 MSK [50f58fd3.589f:1] LOG: checkpoint starting: immediate force wait 2013-01-15 21:20:21 MSK [50f58fd3.589f:2] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transact ion log file(s) added, 0 removed, 0 recycled; write=0.604 s, sync=0.000 s, total=0.605 s; sync files=0, longest=0.000 s,average=0.000 s 2013-01-15 21:20:21 MSK [50f58fd3.589f:3] LOG: checkpoint starting: immediate force wait 2013-01-15 21:20:21 MSK [50f58fd3.589f:4] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transact ion log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0, longest=0.000 s,average=0.000 s 2013-01-15 21:20:21 MSK [50f58fd5.58ac:1] ERROR: unrecognized parameter "some_nonexistent_parameter" 2013-01-15 21:20:21 MSK [50f58fd5.58ac:2] STATEMENT: ALTER TABLESPACE testspace SET (some_nonexistent_ parameter = true); 2013-01-15 21:20:21 MSK [50f58fd5.58ac:3] ERROR: RESET must not include values for parameters 2013-01-15 21:20:21 MSK [50f58fd5.58ac:4] STATEMENT: ALTER TABLESPACE testspace RESET (random_page_cos t = 2.0); 2013-01-15 21:20:21 MSK [50f58fd5.58ac:5] ERROR: duplicate key value violates unique constraint "anind ex" 2013-01-15 21:20:21 MSK [50f58fd5.58ac:6] DETAIL: Key (column1)=(1) already exists. 2013-01-15 21:20:21 MSK [50f58fd5.58ac:7] STATEMENT: INSERT INTO testschema.atable VALUES(1); 2013-01-15 21:20:21 MSK [50f58fd5.58ac:8] ERROR: directory "/no/such/location" does not exist 2013-01-15 21:20:21 MSK [50f58fd5.58ac:9] STATEMENT: CREATE TABLESPACE badspace LOCATION '/no/such/loc ation'; 2013-01-15 21:20:21 MSK [50f58fd5.58ac:10] ERROR: tablespace "nosuchspace" does not exist 2013-01-15 21:20:21 MSK [50f58fd5.58ac:11] STATEMENT: CREATE TABLE bar (i int) TABLESPACE nosuchspace; 2013-01-15 21:20:21 MSK [50f58fd3.589f:5] LOG: checkpoint starting: immediate force wait 2013-01-15 21:20:21 MSK [50f58fd3.589f:6] LOG: checkpoint complete: wrote 37 buffers (0.2%); 0 transac tion log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0 , longest=0.000 s, average=0.000 s 2013-01-15 21:20:21 MSK [50f58fd5.58ac:12] ERROR: tablespace "testspace" is not empty 2013-01-15 21:20:21 MSK [50f58fd5.58ac:13] STATEMENT: DROP TABLESPACE testspace; 2013-01-15 21:20:21 MSK [50f58fd3.589f:7] LOG: checkpoint starting: immediate force wait 2013-01-15 21:20:21 MSK [50f58fd3.589f:8] LOG: checkpoint complete: wrote 9 buffers (0.1%); 0 transact ion log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s,average=0.000 s 2013-01-15 21:20:21 MSK [50f58fd5.58ac:14] ERROR: tablespace "testspace" is not empty 2013-01-15 21:20:21 MSK [50f58fd5.58ac:15] STATEMENT: DROP TABLESPACE testspace; And I do see the tblspc file left after the finish of "make check": tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387 Cheers, S PS I wouldn't be surprised that it is a compiler bug though. But I did see the failure with newer icc as well. ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
On 2013-01-15 17:27:50 +0000, Sergey Koposov wrote: > Hi, > > >Date: Tue, 15 Jan 2013 11:57:07 -0500 > >From: Tom Lane <tgl@sss.pgh.pa.us> > >To: Andres Freund <andres@2ndquadrant.com> > >Cc: math@sai.msu.ru, pgsql-hackers@postgreSQL.org, > > Andrew Dunstan <andrew@dunslane.net> > >Subject: Re: Curious buildfarm failures > > > >Well, it could be quite reproducible, if for example what's happening is > >that the DROP is failing to wait for the checkpointer at all. > > > >Is there a way to enable log_checkpoints during a buildfarm run? > >It'd be good to get timestamps added to the postmaster log entries, too. > > Here is the log output from the failing pg_regress after enabling checkpoints and timestamps: > > 2013-01-15 21:20:19 MSK [50f58fd3.589e:1] LOG: database system was shut down at 2013-01-15 21:20:19 MS > K > 2013-01-15 21:20:19 MSK [50f58fd3.58a2:1] LOG: autovacuum launcher started > 2013-01-15 21:20:19 MSK [50f58fd3.5899:1] LOG: database system is ready to accept connections > 2013-01-15 21:20:20 MSK [50f58fd3.589f:1] LOG: checkpoint starting: immediate force wait > 2013-01-15 21:20:21 MSK [50f58fd3.589f:2] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transact > ion log file(s) added, 0 removed, 0 recycled; write=0.604 s, sync=0.000 s, total=0.605 s; sync files=0, > longest=0.000 s, average=0.000 s > 2013-01-15 21:20:21 MSK [50f58fd3.589f:3] LOG: checkpoint starting: immediate force wait > 2013-01-15 21:20:21 MSK [50f58fd3.589f:4] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transact > ion log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0, > longest=0.000 s, average=0.000 s > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:1] ERROR: unrecognized parameter "some_nonexistent_parameter" > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:2] STATEMENT: ALTER TABLESPACE testspace SET (some_nonexistent_ > parameter = true); > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:3] ERROR: RESET must not include values for parameters > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:4] STATEMENT: ALTER TABLESPACE testspace RESET (random_page_cos > t = 2.0); > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:5] ERROR: duplicate key value violates unique constraint "anind > ex" > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:6] DETAIL: Key (column1)=(1) already exists. > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:7] STATEMENT: INSERT INTO testschema.atable VALUES(1); > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:8] ERROR: directory "/no/such/location" does not exist > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:9] STATEMENT: CREATE TABLESPACE badspace LOCATION '/no/such/loc > ation'; > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:10] ERROR: tablespace "nosuchspace" does not exist > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:11] STATEMENT: CREATE TABLE bar (i int) TABLESPACE nosuchspace; > 2013-01-15 21:20:21 MSK [50f58fd3.589f:5] LOG: checkpoint starting: immediate force wait > 2013-01-15 21:20:21 MSK [50f58fd3.589f:6] LOG: checkpoint complete: wrote 37 buffers (0.2%); 0 transac > tion log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0 > , longest=0.000 s, average=0.000 s > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:12] ERROR: tablespace "testspace" is not empty > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:13] STATEMENT: DROP TABLESPACE testspace; > 2013-01-15 21:20:21 MSK [50f58fd3.589f:7] LOG: checkpoint starting: immediate force wait > 2013-01-15 21:20:21 MSK [50f58fd3.589f:8] LOG: checkpoint complete: wrote 9 buffers (0.1%); 0 transact > ion log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, > longest=0.000 s, average=0.000 s > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:14] ERROR: tablespace "testspace" is not empty > 2013-01-15 21:20:21 MSK [50f58fd5.58ac:15] STATEMENT: DROP TABLESPACE testspace; > > > And I do see the tblspc file left after the finish of "make check": > tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387 > > Cheers, > S > > PS I wouldn't be surprised that it is a compiler bug though. But I did see > the failure with newer icc as well. Any chance you could run make check again but with log_statement=all and log_min_messages=debug2? That might tell us a bit more, whether the dependency code doesn't work right or whether the checkpoint is doing strange things. Thannks, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, 15 Jan 2013, Andres Freund wrote: > Any chance you could run make check again but with log_statement=all and > log_min_messages=debug2? That might tell us a bit more, whether the > dependency code doesn't work right or whether the checkpoint is doing > strange things. Here it is : 2013-01-15 23:06:18 MSK [50f5a8aa.1162:1] DEBUG: SlruScanDirectory invoking callback on pg_notify/0000 2013-01-15 23:06:18 MSK [50f5a8aa.1162:2] DEBUG: removing file "pg_notify/0000" 2013-01-15 23:06:18 MSK [50f5a8aa.1162:3] DEBUG: max_safe_fds = 985, usable_fds = 1000, already_open = 5 2013-01-15 23:06:18 MSK [50f5a8aa.1167:1] LOG: database system was shut down at 2013-01-15 23:06:18 MSK 2013-01-15 23:06:18 MSK [50f5a8aa.1167:2] DEBUG: checkpoint record is at 0/17700E0 2013-01-15 23:06:18 MSK [50f5a8aa.1167:3] DEBUG: redo record is at 0/17700E0; shutdown TRUE 2013-01-15 23:06:18 MSK [50f5a8aa.1167:4] DEBUG: next transaction ID: 0/686; next OID: 12031 2013-01-15 23:06:18 MSK [50f5a8aa.1167:5] DEBUG: next MultiXactId: 1; next MultiXactOffset: 0 2013-01-15 23:06:18 MSK [50f5a8aa.1167:6] DEBUG: oldest unfrozen transaction ID: 676, in database 1 2013-01-15 23:06:18 MSK [50f5a8aa.1167:7] DEBUG: transaction ID wrap limit is 2147484323, limited by database with OID 1 2013-01-15 23:06:18 MSK [50f5a8aa.1168:1] DEBUG: checkpointer updated shared memory configuration values 2013-01-15 23:06:18 MSK [50f5a8aa.116b:1] LOG: autovacuum launcher started 2013-01-15 23:06:18 MSK [50f5a8aa.1162:4] LOG: database system is ready to accept connections 2013-01-15 23:06:19 MSK [50f5a8aa.1162:5] DEBUG: forked new backend, pid=4463 socket=8 2013-01-15 23:06:19 MSK [50f5a8aa.1162:6] DEBUG: server process (PID 4463) exited with exit code 0 2013-01-15 23:06:19 MSK [50f5a8aa.1162:7] DEBUG: forked new backend, pid=4465 socket=8 2013-01-15 23:06:19 MSK [50f5a8ab.1171:1] LOG: statement: CREATE DATABASE "regression" TEMPLATE=template0 2013-01-15 23:06:19 MSK [50f5a8aa.1168:2] LOG: checkpoint starting: immediate force wait 2013-01-15 23:06:19 MSK [50f5a8aa.1168:3] DEBUG: SlruScanDirectory invoking callback on pg_multixact/offsets/0000 2013-01-15 23:06:19 MSK [50f5a8aa.1168:4] DEBUG: SlruScanDirectory invoking callback on pg_multixact/members/0000 2013-01-15 23:06:19 MSK [50f5a8aa.1168:5] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000 2013-01-15 23:06:19 MSK [50f5a8aa.1168:6] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000 2013-01-15 23:06:19 MSK [50f5a8aa.1168:7] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log file(s) added,0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s 2013-01-15 23:06:19 MSK [50f5a8aa.1168:8] LOG: checkpoint starting: immediate force wait 2013-01-15 23:06:19 MSK [50f5a8aa.1168:9] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000 2013-01-15 23:06:19 MSK [50f5a8aa.1168:10] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000 2013-01-15 23:06:19 MSK [50f5a8aa.1168:11] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added,0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s 2013-01-15 23:06:19 MSK [50f5a8aa.1162:8] DEBUG: server process (PID 4465) exited with exit code 0 2013-01-15 23:06:19 MSK [50f5a8aa.1162:9] DEBUG: forked new backend, pid=4467 socket=8 2013-01-15 23:06:19 MSK [50f5a8ab.1173:1] LOG: statement: ALTER DATABASE "regression" SET lc_messages TO 'C';ALTER DATABASE"regression" SET lc_monetary TO 'C';ALTER DATABASE "regression" SET lc_numeric TO 'C';ALTER DATABASE "regression"SET lc_time TO 'C';ALTER DATABASE "regression" SET timezone_abbreviations TO 'Default'; 2013-01-15 23:06:19 MSK [50f5a8aa.1162:10] DEBUG: server process (PID 4467) exited with exit code 0 2013-01-15 23:06:19 MSK [50f5a8aa.1162:11] DEBUG: forked new backend, pid=4469 socket=8 2013-01-15 23:06:19 MSK [50f5a8ab.1175:1] LOG: statement: CREATE TABLESPACE testspace LOCATION '/home/math/pg_git/src/test/regress/testtablespace'; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:2] LOG: statement: ALTER TABLESPACE testspace SET (random_page_cost = 1.0); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:3] LOG: statement: ALTER TABLESPACE testspace SET (some_nonexistent_parameter = true); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:4] ERROR: unrecognized parameter "some_nonexistent_parameter" 2013-01-15 23:06:19 MSK [50f5a8ab.1175:5] STATEMENT: ALTER TABLESPACE testspace SET (some_nonexistent_parameter = true); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:6] LOG: statement: ALTER TABLESPACE testspace RESET (random_page_cost = 2.0); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:7] ERROR: RESET must not include values for parameters 2013-01-15 23:06:19 MSK [50f5a8ab.1175:8] STATEMENT: ALTER TABLESPACE testspace RESET (random_page_cost = 2.0); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:9] LOG: statement: ALTER TABLESPACE testspace RESET (random_page_cost, seq_page_cost); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:10] LOG: statement: CREATE SCHEMA testschema; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:11] LOG: statement: CREATE TABLE testschema.foo (i int) TABLESPACE testspace; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:12] LOG: statement: SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_classc where c.reltablespace = t.oid AND c.relname = 'foo'; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:13] LOG: statement: INSERT INTO testschema.foo VALUES(1); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:14] LOG: statement: INSERT INTO testschema.foo VALUES(2); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:15] LOG: statement: CREATE TABLE testschema.asselect TABLESPACE testspace AS SELECT1; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:16] LOG: statement: SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_classc where c.reltablespace = t.oid AND c.relname = 'asselect'; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:17] LOG: statement: PREPARE selectsource(int) AS SELECT $1; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:18] LOG: statement: CREATE TABLE testschema.asexecute TABLESPACE testspace AS EXECUTE selectsource(2); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:19] LOG: statement: SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_classc where c.reltablespace = t.oid AND c.relname = 'asexecute'; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:20] LOG: statement: CREATE INDEX foo_idx on testschema.foo(i) TABLESPACE testspace; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:21] DEBUG: building index "foo_idx" on table "foo" 2013-01-15 23:06:19 MSK [50f5a8ab.1175:22] LOG: statement: SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_classc where c.reltablespace = t.oid AND c.relname = 'foo_idx'; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:23] LOG: statement: CREATE TABLE testschema.atable AS VALUES (1), (2); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:24] LOG: statement: CREATE UNIQUE INDEX anindex ON testschema.atable(column1); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:25] DEBUG: building index "anindex" on table "atable" 2013-01-15 23:06:19 MSK [50f5a8ab.1175:26] LOG: statement: ALTER TABLE testschema.atable SET TABLESPACE testspace; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:27] LOG: statement: ALTER INDEX testschema.anindex SET TABLESPACE testspace; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:28] LOG: statement: INSERT INTO testschema.atable VALUES(3); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:29] LOG: statement: INSERT INTO testschema.atable VALUES(1); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:30] ERROR: duplicate key value violates unique constraint "anindex" 2013-01-15 23:06:19 MSK [50f5a8ab.1175:31] DETAIL: Key (column1)=(1) already exists. 2013-01-15 23:06:19 MSK [50f5a8ab.1175:32] STATEMENT: INSERT INTO testschema.atable VALUES(1); 2013-01-15 23:06:19 MSK [50f5a8ab.1175:33] LOG: statement: SELECT COUNT(*) FROM testschema.atable; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:34] LOG: statement: CREATE TABLESPACE badspace LOCATION '/no/such/location'; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:35] ERROR: directory "/no/such/location" does not exist 2013-01-15 23:06:19 MSK [50f5a8ab.1175:36] STATEMENT: CREATE TABLESPACE badspace LOCATION '/no/such/location'; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:37] LOG: statement: CREATE TABLE bar (i int) TABLESPACE nosuchspace; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:38] ERROR: tablespace "nosuchspace" does not exist 2013-01-15 23:06:19 MSK [50f5a8ab.1175:39] STATEMENT: CREATE TABLE bar (i int) TABLESPACE nosuchspace; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:40] LOG: statement: DROP TABLESPACE testspace; 2013-01-15 23:06:19 MSK [50f5a8aa.1168:12] LOG: checkpoint starting: immediate force wait 2013-01-15 23:06:19 MSK [50f5a8aa.1168:13] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000 2013-01-15 23:06:19 MSK [50f5a8aa.1168:14] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000 2013-01-15 23:06:19 MSK [50f5a8aa.1168:15] LOG: checkpoint complete: wrote 37 buffers (0.2%); 0 transaction log file(s)added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000s 2013-01-15 23:06:19 MSK [50f5a8ab.1175:41] ERROR: tablespace "testspace" is not empty 2013-01-15 23:06:19 MSK [50f5a8ab.1175:42] STATEMENT: DROP TABLESPACE testspace; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:43] LOG: statement: DROP SCHEMA testschema CASCADE; 2013-01-15 23:06:19 MSK [50f5a8ab.1175:44] DEBUG: drop auto-cascades to index testschema.foo_idx 2013-01-15 23:06:19 MSK [50f5a8ab.1175:45] DEBUG: drop auto-cascades to type testschema.foo 2013-01-15 23:06:19 MSK [50f5a8ab.1175:46] DEBUG: drop auto-cascades to type testschema.foo[] 2013-01-15 23:06:19 MSK [50f5a8ab.1175:47] DEBUG: drop auto-cascades to type testschema.asselect 2013-01-15 23:06:19 MSK [50f5a8ab.1175:48] DEBUG: drop auto-cascades to type testschema.asselect[] 2013-01-15 23:06:19 MSK [50f5a8ab.1175:49] DEBUG: drop auto-cascades to type testschema.asexecute 2013-01-15 23:06:19 MSK [50f5a8ab.1175:50] DEBUG: drop auto-cascades to type testschema.asexecute[] 2013-01-15 23:06:19 MSK [50f5a8ab.1175:51] DEBUG: drop auto-cascades to index testschema.anindex 2013-01-15 23:06:19 MSK [50f5a8ab.1175:52] DEBUG: drop auto-cascades to type testschema.atable 2013-01-15 23:06:19 MSK [50f5a8ab.1175:53] DEBUG: drop auto-cascades to type testschema.atable[] 2013-01-15 23:06:19 MSK [50f5a8ab.1175:54] NOTICE: drop cascades to 4 other objects 2013-01-15 23:06:19 MSK [50f5a8ab.1175:55] DETAIL: drop cascades to table testschema.foo drop cascades to table testschema.asselect drop cascades to table testschema.asexecute drop cascades to table testschema.atable 2013-01-15 23:06:19 MSK [50f5a8ab.1175:56] LOG: statement: DROP TABLESPACE testspace; 2013-01-15 23:06:19 MSK [50f5a8aa.1168:16] LOG: checkpoint starting: immediate force wait 2013-01-15 23:06:19 MSK [50f5a8aa.1168:17] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000 2013-01-15 23:06:19 MSK [50f5a8aa.1168:18] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000 2013-01-15 23:06:19 MSK [50f5a8aa.1168:19] LOG: checkpoint complete: wrote 9 buffers (0.1%); 0 transaction log file(s) added,0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0, longest=0.000 s, average=0.000 s 2013-01-15 23:06:19 MSK [50f5a8ab.1175:57] ERROR: tablespace "testspace" is not empty 2013-01-15 23:06:19 MSK [50f5a8ab.1175:58] STATEMENT: DROP TABLESPACE testspace; 2013-01-15 23:06:19 MSK [50f5a8aa.1162:12] DEBUG: server process (PID 4469) exited with exit code 0 Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Sergey Koposov <koposov@ast.cam.ac.uk> writes: > And I do see the tblspc file left after the finish of "make check": > tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387 Interesting. If the tests are run immediately after initdb, 16387 is the relfilenode assigned to table "foo" in the tablespace regression test. But why would only that table be left behind? There are half a dozen relations in that tablespace at the point of the DROP CASCADE. BTW, I just finished trying to reproduce this on an IA64 machine belonging to Red Hat, without success. So that seems to eliminate any possibility of the machine architecture being the trigger issue. The compiler's still a likely cause though. Anybody have a similar ICC version (dugong's says it is 10.0 20070809) to try? Also, Sergey, can you find a non-dot-zero release to try? regards, tom lane
On Tue, 15 Jan 2013, Tom Lane wrote: > BTW, I just finished trying to reproduce this on an IA64 machine > belonging to Red Hat, without success. So that seems to eliminate > any possibility of the machine architecture being the trigger issue. > The compiler's still a likely cause though. > Anybody have a similar ICC version (dugong's says it is 10.0 20070809) > to try? Also, Sergey, can you find a non-dot-zero release to try? I think it is indeed the main issue. I've tried 10.1 ( 10.1.011 ) and it doesn't fail. When I tried 11.1 (icc (ICC) 11.1 20100401 ) it failed in a quite strange way (I don't quite remember it happenning before) test tablespace ... ok parallel group (18 tests): txid int2 text name oid varchar int4 char money float8 uuid float4 int8 boolean bit enum numeric rangetypes boolean ... ok char ... ok name ... ok varchar ... ok text ... ok int2 ... ok int4 ... ok int8 ... ok oid ... ok float4 ... ok float8 ... ok bit ... ok numeric ... ok txid ... ok uuid ... ok enum ... ok money ... ok rangetypes ... FAILED (test processexited with exit code 2) test strings ... FAILED (test process exited with exit code 2) test numerology ... FAILED (test process exited with exit code 2) parallel group (19 tests): path interval time inet circle macaddr comments timestamp timestamptz reltime date tstypes tinterval abstime timetz lseg box polygon point point ... FAILED (test process exited with exit code 2) lseg ... FAILED (test process exited with exit code 2) box ... FAILED (test process exited with exit code 2) path ... FAILED (test process exited with exit code 2) polygon ... FAILED (test process exited with exit code 2) circle ... FAILED (test process exited with exit code 2) date ... FAILED (test process exited with exit code 2) time ... FAILED (test process exited with exit code 2) timetz ... FAILED (test process exited with exit code 2) timestamp ... FAILED (test process exited with exit code 2) timestamptz ... FAILED (test process exited with exit code 2) interval ... FAILED (test process exited with exit code 2) abstime ... FAILED (test process exited with exit code 2) reltime ... FAILED (test process exited with exit code 2) tinterval ... FAILED (test process exited with exit code 2) inet ... FAILED (test process exited with exit code 2) macaddr ... FAILED (test process exited with exit code 2) tstypes ... FAILED (test process exited with exit code 2) comments ... FAILED (test process exited with exit code 2) parallel group (6 tests): geometry regex horology type_sanity oidjoins opr_sanity geometry ... FAILED horology ... FAILED regex ...ok oidjoins ... ok type_sanity ... ok opr_sanity ... ok test insert ... ok test create_function_1 ... ok test create_type ... ok test create_table ... ok test create_function_2 ... ok parallel group (2 tests): copyselect copy copy ... ok copyselect ... ok parallel group (2 tests): create_operator create_misc create_misc ... ok create_operator ...ok parallel group (2 tests): create_view create_index create_index ... FAILED (test process exited with exitcode 2) create_view ... ok parallel group (11 tests): constraints triggers create_cast create_function_3 updatable_views inherit drop_if_exists create_aggregate create_table_like typed_table vacuum create_aggregate ... FAILED (test process exited with exit code 2) create_function_3 ... FAILED (test process exited with exit code 2) create_cast ... FAILED (test process exited with exit code 2) constraints ... FAILED (test process exited with exit code 2) triggers ... FAILED (test process exited with exit code 2) inherit ... FAILED (test process exited with exit code 2) create_table_like ... FAILED (test process exited with exit code 2) typed_table ... FAILED (test process exited with exit code 2) vacuum ... FAILED (test process exited with exit code 2) drop_if_exists ... FAILED (test process exited with exit code 2) updatable_views ... FAILED (test process exited with exit code 2) test sanity_check ... FAILED (test process exited with exit code 2) test errors ... FAILED (test process exited with exit code 2) test select ... FAILED (test process exited with exit code 2) parallel group (20 tests): join transactions subselect portals select_having case select_distinct btree_index union hash_index namespace random delete select_implicit select_into arrays update prepared_xacts aggregates select_distinct_on select_into ... FAILED (test process exited with exit code 2) select_distinct ... FAILED (test process exited with exit code 2) select_distinct_on ... FAILED (test process exited with exit code 2) select_implicit ... FAILED (test process exited with exit code 2) select_having ... FAILED (test process exited with exit code 2) subselect ... FAILED (test process exited with exit code 2) union ... FAILED (test process exited with exit code 2) case ... FAILED (test process exited with exit code 2) join ... FAILED (test process exited with exit code 2) aggregates ... FAILED (test process exited with exit code 2) transactions ... FAILED (test process exited with exit code 2) random ... failed (ignored) (test process exited with exit code 2) portals ... FAILED (test process exited with exit code 2) arrays ... FAILED (test process exited with exit code 2) btree_index ... FAILED (test process exited with exit code 2) hash_index ... FAILED (test process exited with exit code 2) update ... FAILED (test process exited with exit code 2) namespace ... FAILED (test process exited with exit code 2) prepared_xacts ... FAILED (test process exited with exit code 2) delete ... FAILED (test process exited with exit code 2) parallel group (3 tests): security_label collate privileges privileges ... ok security_label ... ok collate ... ok parallel group (2 tests): alter_generic misc misc ... FAILED alter_generic ... ok test rules ... ok test event_trigger ... ok parallel group (16 tests): combocid portals_p2 advisory_lock xmlmap json tsdicts guc functional_deps dependency tsearch cluster select_views window foreign_data foreign_key bitmapops select_views ... ok portals_p2 ... ok foreign_key ... ok cluster ... ok dependency ... ok guc ... ok bitmapops ... ok combocid ... ok tsearch ...ok tsdicts ... ok foreign_data ... ok window ... ok xmlmap ... ok functional_deps ... ok advisory_lock ... ok json ... ok parallel group (19 tests): limit conversion prepare temp returning copy2 plancache sequence xml largeobject rowtypes polymorphism without_oid domain rangefuncs with truncate alter_table plpgsql plancache ... ok limit ... ok plpgsql ... ok copy2 ... ok temp ... ok domain ... ok rangefuncs ... ok prepare ... ok without_oid ...ok conversion ... ok truncate ... ok alter_table ... ok sequence ... ok polymorphism ... ok rowtypes ... ok returning ... ok largeobject ... ok with ... FAILED xml ... ok test stats ... ok ######### Digging into the log showed the following trap: TRAP: FailedAssertion("!(( ((void) ((bool) ((! assert_enabled) || ! (!(((bool) (((const void*)(&nodes[n ]->t_tid) != ((void *)0)) && ((&nodes[n]->t_tid)->ip_posid != 0))))) || (ExceptionalCondition("!(((bool ) (((const void*)(&nodes[n]->t_tid) != ((void *)0)) && ((&nodes[n]->t_tid)->ip_posid != 0))))", ("Faile dAssertion"), "spgdoinsert.c", 1222), 0)))), ( ((void) ((bool) ((! assert_enabled) || ! (!(((bool) ((co nst void*)(&(&nodes[n]->t_tid)->ip_blkid) != ((void *)0))))) || (ExceptionalCondition("!(((bool) ((cons t void*)(&(&nodes[n]->t_tid)->ip_blkid) != ((void *)0))))", ("FailedAssertion"), "spgdoinsert.c", 1222) , 0)))), (BlockNumber) (((&(&nodes[n]->t_tid)->ip_blkid)->bi_hi << 16) | ((uint16) (&(&nodes[n]->t_tid) ->ip_blkid)->bi_lo)) ) ) == leafBlock)", File: "spgdoinsert.c", Line: 1222) 2013-01-16 00:23:39 MSK [50f5bac9.3d6e:48] DEBUG: server process (PID 15774) was terminated by signal 6: Aborted 2013-01-16 00:23:39 MSK [50f5bac9.3d6e:49] DETAIL: Failed process was running: insert into test_range_ spgist select 'empty'::int4range from generate_series(1,500) g; 2013-01-16 00:23:39 MSK [50f5bac9.3d6e:50] LOG: server process (PID 15774) was terminated by signal 6: Aborted Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
On 2013-01-15 14:40:11 -0500, Tom Lane wrote: > Sergey Koposov <koposov@ast.cam.ac.uk> writes: > > And I do see the tblspc file left after the finish of "make check": > > tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387 > > Interesting. If the tests are run immediately after initdb, 16387 > is the relfilenode assigned to table "foo" in the tablespace regression > test. But why would only that table be left behind? There are half > a dozen relations in that tablespace at the point of the DROP CASCADE. > > BTW, I just finished trying to reproduce this on an IA64 machine > belonging to Red Hat, without success. So that seems to eliminate > any possibility of the machine architecture being the trigger issue. > The compiler's still a likely cause though. > > Anybody have a similar ICC version (dugong's says it is 10.0 20070809) > to try? Also, Sergey, can you find a non-dot-zero release to try? I played a bit arround (thanks Sergey!) and it seems to be some rather strange optimization issue around the fsync request queue. Namely changing /* OK, insert request */request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];request->rnode = rnode;request->forknum= forknum;request->segno = segno; into/* OK, insert request */request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];request->rnode.spcNode= rnode.spcNode;request->rnode.dbNode= rnode.dbNode;request->rnode.relNode = rnode.relNode;request->forknum = forknum;request->segno= segno; makes it pass reliably. Displaying the values of request after the assignment, but without the change shows bogus values showing up which explains the problems. How the hell thats correlating with the elog changes I don't yet know. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > I played a bit arround (thanks Sergey!) and it seems to be some rather > strange optimization issue around the fsync request queue. > Namely changing > request->rnode = rnode; > into > request->rnode.spcNode = rnode.spcNode; > request->rnode.dbNode = rnode.dbNode; > request->rnode.relNode = rnode.relNode; > makes it pass reliably. Jeez. That's my candidate for weird compiler bug of the month. > How the hell thats correlating with the elog changes I don't yet know. There is an elog(ERROR) further up in the same function, but it's sure not clear how that could cause the compiler to misimplement a struct assignment. Maybe the problem is not in those lines alone, but the fact that rnode is a pass-by-value struct? (That is, maybe it's the value of the rnode local variable that's getting munged, somewhere up near the elog call?) We tend to not use pass-by-value struct params much, so we might not have noticed a compiler bug associated with that. Or IOW, does changing ForwardFsyncRequest to use a "const RelFileNode *rnode" parameter make it go away? regards, tom lane
On 2013-01-15 17:56:40 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > I played a bit arround (thanks Sergey!) and it seems to be some rather > > strange optimization issue around the fsync request queue. > > > Namely changing > > request->rnode = rnode; > > into > > request->rnode.spcNode = rnode.spcNode; > > request->rnode.dbNode = rnode.dbNode; > > request->rnode.relNode = rnode.relNode; > > makes it pass reliably. > > Jeez. That's my candidate for weird compiler bug of the month. > > > How the hell thats correlating with the elog changes I don't yet know. > > There is an elog(ERROR) further up in the same function, but it's sure > not clear how that could cause the compiler to misimplement a struct > assignment. Indeed, replacing the elog() there with a plain abort() or the old-style elog definition makes it work. Just using a do-while with the old definition inside makes it fail. My IA64 knowledge is pretty basic, but I would guess this is stack or code alignment related I seem to remember quite some strange requirements there. > Maybe the problem is not in those lines alone, but the fact > that rnode is a pass-by-value struct? (That is, maybe it's the value of > the rnode local variable that's getting munged, somewhere up near the > elog call?) No, I found this because I printed the values before enquing the values into shmem and after dequeing. After noticing that they didn't match I added more... > We tend to not use pass-by-value struct params much, so we > might not have noticed a compiler bug associated with that. Or IOW, > does changing ForwardFsyncRequest to use a "const RelFileNode *rnode" > parameter make it go away? Nope, same thing. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-16 00:26:01 +0100, Andres Freund wrote: > On 2013-01-15 17:56:40 -0500, Tom Lane wrote: > > Andres Freund <andres@2ndquadrant.com> writes: > > > I played a bit arround (thanks Sergey!) and it seems to be some rather > > > strange optimization issue around the fsync request queue. > > > > > Namely changing > > > request->rnode = rnode; > > > into > > > request->rnode.spcNode = rnode.spcNode; > > > request->rnode.dbNode = rnode.dbNode; > > > request->rnode.relNode = rnode.relNode; > > > makes it pass reliably. > > > > Jeez. That's my candidate for weird compiler bug of the month. > > > > > How the hell thats correlating with the elog changes I don't yet know. > > > > There is an elog(ERROR) further up in the same function, but it's sure > > not clear how that could cause the compiler to misimplement a struct > > assignment. > > Indeed, replacing the elog() there with a plain abort() or the old-style > elog definition makes it work. Just using a do-while with the old > definition inside makes it fail. > > My IA64 knowledge is pretty basic, but I would guess this is stack or > code alignment related I seem to remember quite some strange > requirements there. FWIW its also triggerable if two other function calls are places inside the above if() (I tried fprintf(stderr, "argh") and kill(0, 0)). It seems the change just made an existing issue visible. No idea what to do about it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > FWIW its also triggerable if two other function calls are places inside > the above if() (I tried fprintf(stderr, "argh") and kill(0, 0)). [ confused... ] You mean replacing the abort() in the elog macro with one of these functions? Or something else? > It seems the change just made an existing issue visible. > No idea what to do about it. Pretty clearly a compiler bug at this point. Since there doesn't seem to be a clean workaround (no, I don't want to expand the struct assignment manually), and anyway we can't be sure that the bug doesn't also manifest in other places, recommending Sergey update his compiler seems like the thing to do. At this point I'm more interested in his report in <alpine.LRH.2.03.1301152012220.773@ast.cam.ac.uk> about the Assert at spgdoinsert.c:1222 failing. That's pretty new code, so more likely to have a genuine bug, and I wonder if it's related to the spgist issue in <50EBF992.2000704@qunar.com> ... regards, tom lane
On 2013-01-15 19:56:52 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > FWIW its also triggerable if two other function calls are places inside > > the above if() (I tried fprintf(stderr, "argh") and kill(0, 0)). > > [ confused... ] You mean replacing the abort() in the elog macro with > one of these functions? Or something else? I mean replacing the elog(ERROR, "ForwardFsyncRequest must...") with any two function calls inside a do/while(0). I just tried to place some random functions there instead of the elog to make sure its unrelated, and it still triggers the problem even before the elog commit. The assembler output of that function changes wildly with tiny changes and I don't understand IA-64 at all (does anybody?), so I don't see anything we can do there. > > It seems the change just made an existing issue visible. > > No idea what to do about it. > > Pretty clearly a compiler bug at this point. Since there doesn't seem > to be a clean workaround (no, I don't want to expand the struct > assignment manually), and anyway we can't be sure that the bug doesn't > also manifest in other places, recommending Sergey update his compiler > seems like the thing to do. Yea. Don't have a better suggestion. > At this point I'm more interested in his report in > <alpine.LRH.2.03.1301152012220.773@ast.cam.ac.uk> about > the Assert at spgdoinsert.c:1222 failing. That's pretty new code, so > more likely to have a genuine bug, and I wonder if it's related to > the spgist issue in <50EBF992.2000704@qunar.com> ... Yes, it looks more like it could be something real. There are suspicously many other failing tests though (misc, with) that don't seem to be related to the spgist crash. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2013-01-15 19:56:52 -0500, Tom Lane wrote: >> At this point I'm more interested in his report in >> <alpine.LRH.2.03.1301152012220.773@ast.cam.ac.uk> about >> the Assert at spgdoinsert.c:1222 failing. That's pretty new code, so >> more likely to have a genuine bug, and I wonder if it's related to >> the spgist issue in <50EBF992.2000704@qunar.com> ... > Yes, it looks more like it could be something real. There are > suspicously many other failing tests though (misc, with) that don't seem > to be related to the spgist crash. Looking again, the pg_regress output appears to indicate two separate crashes (one during rangetypes, the other during create_index). The reported Assert trap was in the rangetypes test, but the other one could very easily have been from spgist code as well. I'd tend to write off all the other reported diffs as followon damage from the crashes, at least without clearer evidence that they weren't. There are very many instances in our regression tests where failure to complete one test results in bogus diffs in later ones, because DB objects don't exist or don't have the expected contents. regards, tom lane
On 2013-01-16 02:13:26 +0100, Andres Freund wrote: > On 2013-01-15 19:56:52 -0500, Tom Lane wrote: > > Andres Freund <andres@2ndquadrant.com> writes: > > > FWIW its also triggerable if two other function calls are places inside > > > the above if() (I tried fprintf(stderr, "argh") and kill(0, 0)). > > > > [ confused... ] You mean replacing the abort() in the elog macro with > > one of these functions? Or something else? > > I mean replacing the elog(ERROR, "ForwardFsyncRequest must...") with any > two function calls inside a do/while(0). I just tried to place some > random functions there instead of the elog to make sure its unrelated, > and it still triggers the problem even before the elog commit. The > assembler output of that function changes wildly with tiny changes and I > don't understand IA-64 at all (does anybody?), so I don't see anything > we can do there. > > > > It seems the change just made an existing issue visible. > > > No idea what to do about it. > > > > Pretty clearly a compiler bug at this point. Since there doesn't seem > > to be a clean workaround (no, I don't want to expand the struct > > assignment manually), and anyway we can't be sure that the bug doesn't > > also manifest in other places, recommending Sergey update his compiler > > seems like the thing to do. > > Yea. Don't have a better suggestion. > > > At this point I'm more interested in his report in > > <alpine.LRH.2.03.1301152012220.773@ast.cam.ac.uk> about > > the Assert at spgdoinsert.c:1222 failing. That's pretty new code, so > > more likely to have a genuine bug, and I wonder if it's related to > > the spgist issue in <50EBF992.2000704@qunar.com> ... > > Yes, it looks more like it could be something real. There are > suspicously many other failing tests though (misc, with) that don't seem > to be related to the spgist crash. #3 0x4000000000b5c710 in ExceptionalCondition ( conditionName=0x4000000000c76d50 "!(( ((void) ((bool) ((! assert_enabled)|| ! (!(((bool) (((const void*)(&nodes[n]->t_tid) != ((void *)0)) && ((&nodes[n]->t_tid)->ip_posid != 0)))))|| (ExceptionalCondition(\"!(((bool) (((const void*)"..., errorType=0x4000000000c4c5a0 "FailedAssertion", fileName=0x4000000000c75d30"spgdoinsert.c", lineNumber=1222) at assert.c:54 #4 0x40000000001a6320 in doPickSplit (index=0x600000000007ff48, state=0x3, current=0x60000ffffff7a700, parent=0x4, newLeafTuple=0x6, level=512360, isNulls=64 '@', isNew=12 '\f') at spgdoinsert.c:1222 #5 0x40000000001a12d0 in spgdoinsert (index=0x2000000009856028, state=0x60000ffffff7a9d0, heapPtr=0x60000000001e6e7c, datum=6917546619826579712, isnull=0 '\0') at spgdoinsert.c:1996 #6 0x4000000000195870 in spginsert (fcinfo=0x60000ffffff7a9d0) at spginsert.c:222 #7 0x4000000000b77dd0 in FunctionCall6Coll (flinfo=0x6000000000102018, collation=0, arg1=2305843009373429800, arg2=6917546619826580944, arg3=6917546619826581200, arg4=6917529027643076220, arg5=2305843009373166576, arg6=0) at fmgr.c:1439 #8 0x4000000000148b70 in index_insert (indexRelation=0x2000000009856028, values=0x60000ffffff7add0, isnull=0x60000ffffff7aed0"", heap_t_ctid=0x60000000001e6e7c, heapRelation=0x2000000009815bf0, checkUnique=UNIQUE_CHECK_NO)at indexam.c:216 #9 0x40000000004e99f0 in ExecInsertIndexTuples (slot=0x60000000001e55c0, tupleid=0x60000000001e6e7c, estate=0x60000000001e4f18) at execUtils.c:1088 #10 0x4000000000516710 in ExecModifyTable (node=0x0) at nodeModifyTable.c:249 #11 0x40000000004c6350 in $$1$3_0$TAG$0ca$0$3 () at execProcnode.c:377 #12 0x40000000004bba00 in ExecutorRun (queryDesc=0x60000000001e4fb0, direction=NoMovementScanDirection, count=0) at execMain.c:1400 #13 0x40000000008493f0 in PortalRunMulti (portal=0x60000000000ff7f8, isTopLevel=-26 '�', dest=0x60000000001ef658, altdest=0x60000000001ef658,completionTag=0x60000ffffff7b2d0 "") at pquery.c:185 #14 0x4000000000848d20 in _setjmp_lpad_PortalRun_1$0$13 () at pquery.c:814 #15 0x4000000000840c60 in exec_simple_query ( query_string=0x600000000018d4f8 "insert into test_range_spgist select 'empty'::int4rangefrom generate_series(1,500) g;") at postgres.c:1048 #16 0x40000000008370a0 in _setjmp_lpad_PostgresMain_0$0$51 () at postgres.c:3969 ---Type <return> to continue, or q <return> to quit--- #17 0x4000000000720240 in BackendStartup (port=0x60000000000fc950) at postmaster.c:3989 #18 0x400000000071dc80 in ServerLoop () at postmaster.c:1575 #19 0x400000000071a700 in PostmasterMain (argc=9, argv=0x60000000000dc300) at postmaster.c:1244 #20 0x40000000005796d0 in main (argc=9, argv=0x60000000000dc010) at main.c:197 #4 0x40000000001a6320 in doPickSplit (index=0x600000000007ff48, state=0x3, current=0x60000ffffff7a700, parent=0x4, newLeafTuple=0x6, level=512360, isNulls=64 '@', isNew=12 '\f') at spgdoinsert.c:1222 1222 Assert(ItemPointerGetBlockNumber(&nodes[n]->t_tid) == leafBlock); (gdb) info locals in = {nTuples = 227, datums = 0x6000000000205060, level = 1} out = {hasPrefix = 0 '\0', prefixDatum = 0, nNodes = 8, nodeLabels = 0x0, mapTuplesToNodes = 0x6000000000209018, leafTupleDatums= 0x6000000000209430} includeNew = 1 '\001' startOffsets = {4, 0} rdata = {{data = 0x60000ffffff7a860 "\177\006", len = 56, buffer = 0, buffer_std = 1 '\001', next = 0x60000ffffff7a720},{ data = 0x6000000000206f08 "D", len = 72, buffer = 0, buffer_std = 1 '\001', next = 0x60000ffffff7a740},{data = 0x6000000000206090 "3", len = 456, buffer = 0, buffer_std = 1 '\001', next = 0x60000ffffff7a760},{data = 0x0, len = 0, buffer = 395, buffer_std = 1 '\001', next = 0x0}, {data = 0x2dffe98 <Address0x2dffe98 out of bounds>, len = 0, buffer = 0, buffer_std = 88 'X', next = 0xa020}, { data = 0x60000000001a6f80"A", len = 32, buffer = 16842752, buffer_std = 2 '\002', next = 0x60000ffffff7a7f0}, { data = 0xffffffff00000100<Address 0xffffffff00000100 out of bounds>, len = 65598, buffer = 0, buffer_std = -119 '\211', next = 0x164}, {data = 0x60000ffffff7a7f0 "p\200�\002", len = 0, buffer = 0, buffer_std = 112 'p', next = 0x30}, { data = 0x1<Address 0x1 out of bounds>, len = 4294420496, buffer = 1610616831, buffer_std = -128 '\200', next = 0x20}, { data =0x1 <Address 0x1 out of bounds>, len = 4294420576, buffer = 1610616831, buffer_std = 0 '\0', next = 0x68}} xlrec = {node = {spcNode = 1663, dbNode = 12030, relNode = 40992}, blknoSrc = 32, blknoDest = 4294967295, nDelete = 226,nInsert = 0, initSrc = 0 '\0', initDest = 1 '\001', blknoInner = 1610616831, offnumInner = 20520, initInner = 0 '\0',storesNulls = 0 '\0', blknoParent = 1610612736, offnumParent = 32, nodeI = 0, stateSrc = {myXid = 1516, isBuild = 0'\0'}} saveCurrent = {blkno = 395, buffer = 0, page = 0x0, offnum = 43104, node = 1610616831} (gdb) p parent $5 = (SPPageDesc *) 0x4 #5 0x40000000001a12d0 in spgdoinsert (index=0x2000000009856028, state=0x60000ffffff7a9d0, heapPtr=0x60000000001e6e7c, datum=6917546619826579712, isnull=0 '\0') at spgdoinsert.c:1996 1996 if (doPickSplit(index, state, ¤t, &parent, (gdb) p parent $4 = {blkno = 1, buffer = 356, page = 0x200000000148eea0 "", offnum = 1, node = 4} (gdb) p &parent $7 = (SPPageDesc *) 0x60000ffffff7a900 Looks like some out of bounds access? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-15 20:32:00 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2013-01-15 19:56:52 -0500, Tom Lane wrote: > >> At this point I'm more interested in his report in > >> <alpine.LRH.2.03.1301152012220.773@ast.cam.ac.uk> about > >> the Assert at spgdoinsert.c:1222 failing. That's pretty new code, so > >> more likely to have a genuine bug, and I wonder if it's related to > >> the spgist issue in <50EBF992.2000704@qunar.com> ... > > > Yes, it looks more like it could be something real. There are > > suspicously many other failing tests though (misc, with) that don't seem > > to be related to the spgist crash. > > Looking again, the pg_regress output appears to indicate two separate > crashes (one during rangetypes, the other during create_index). The > reported Assert trap was in the rangetypes test, but the other one > could very easily have been from spgist code as well. I'd tend to > write off all the other reported diffs as followon damage from the > crashes, at least without clearer evidence that they weren't. There are > very many instances in our regression tests where failure to complete > one test results in bogus diffs in later ones, because DB objects don't > exist or don't have the expected contents. I just checked and its just followup damage. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-16 02:34:52 +0100, Andres Freund wrote: > On 2013-01-16 02:13:26 +0100, Andres Freund wrote: > > On 2013-01-15 19:56:52 -0500, Tom Lane wrote: > > > Andres Freund <andres@2ndquadrant.com> writes: > > > > FWIW its also triggerable if two other function calls are places inside > > > > the above if() (I tried fprintf(stderr, "argh") and kill(0, 0)). > > > > > > [ confused... ] You mean replacing the abort() in the elog macro with > > > one of these functions? Or something else? > > > > I mean replacing the elog(ERROR, "ForwardFsyncRequest must...") with any > > two function calls inside a do/while(0). I just tried to place some > > random functions there instead of the elog to make sure its unrelated, > > and it still triggers the problem even before the elog commit. The > > assembler output of that function changes wildly with tiny changes and I > > don't understand IA-64 at all (does anybody?), so I don't see anything > > we can do there. > > > > > > It seems the change just made an existing issue visible. > > > > No idea what to do about it. > > > > > > Pretty clearly a compiler bug at this point. Since there doesn't seem > > > to be a clean workaround (no, I don't want to expand the struct > > > assignment manually), and anyway we can't be sure that the bug doesn't > > > also manifest in other places, recommending Sergey update his compiler > > > seems like the thing to do. > > > > Yea. Don't have a better suggestion. > > > > > At this point I'm more interested in his report in > > > <alpine.LRH.2.03.1301152012220.773@ast.cam.ac.uk> about > > > the Assert at spgdoinsert.c:1222 failing. That's pretty new code, so > > > more likely to have a genuine bug, and I wonder if it's related to > > > the spgist issue in <50EBF992.2000704@qunar.com> ... > > > > Yes, it looks more like it could be something real. There are > > suspicously many other failing tests though (misc, with) that don't seem > > to be related to the spgist crash. > > #4 0x40000000001a6320 in doPickSplit (index=0x600000000007ff48, state=0x3, current=0x60000ffffff7a700, parent=0x4, newLeafTuple=0x6, > level=512360, isNulls=64 '@', isNew=12 '\f') at spgdoinsert.c:1222 > (gdb) p parent > $4 = {blkno = 1, buffer = 356, page = 0x200000000148eea0 "", offnum = 1, node = 4} > > (gdb) p &parent > $7 = (SPPageDesc *) 0x60000ffffff7a900 -O0 passes Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > -O0 passes Grumble... suspect we're chasing another compiler bug now, but ... You might try -O1; if that shows the bug it'll probably be a tad easier to debug in. regards, tom lane
It's a compiler bug. icc 11.1 apparently thinks that this loop in doPickSplit: /* * Update nodes[] array to point into the newly formed innerTuple, so that * we can adjust their downlinks below. */ SGITITERATE(innerTuple, i, node) { nodes[i] = node; } is going to be iterated hundreds of times, because it expends a ridiculous amount of effort on it, starting with a loop prolog that prefetches about a thousand bytes starting at the nodes pointer :-( (Why does it think it needs to prefetch an array it's only going to write into? Is IA64's cache hardware really that stupid?) And it makes use of IA64's bizarre scheme for software-unrolling loops, which I am going to do my darnedest to forget now that I've learned it; but for the purposes of this bit you only need to know that the br.wtop.dptk instruction "renames" registers 32 and up, so that whatever is in r32 when the bottom of the loop is reached will be in r33 on the next iteration, and r33's contents move to r34, etc. In this particular example, this ridiculous complication saves a grand total of no instructions, but nevermind that. Before starting the loop, the code has computed r28 = innerTuple r29 = nodes r26 = r29 + 1200 (this is where it will continue the prefetching...) r33 = 0 r35 = innerTuple + innerTuple->prefixSize + 8 (ie, the initial value of "nt") r27 = innerTuple + innerTuple->prefixSize + 8 + 6 And the body of the SGITERATE loop looks like .b4_110: at top of loop, r35 contains "nt" pointer, r33 contains "i"(p17) st8 [r29]=r35,8 //0: {940:3} 4456 0 store nt at *r29, increment r29 by 8 bytes (thus, assign to nodes[i])(p17) add r32=1,r33 //0: {938:2}4453 0 compute i+1, will be next value of i due to register rename(p17) ld2 r36=[r28] //1: {938:2} 4462 0 fetch first 2 bytes of innerTuple(p17) ld2 r34=[r27],r33 //1: {938:2} 4459 0 fetch last 2 bytes of node tuple, on first iteration anyway ... and then add the value of r33 to r27, which is all wrong(p17) extr.u r37=r36,3,13 //2: {938:2} 44630 extract nNodes from fetched 2 bytes of innerTuple(p17) extr.u r33=r34,0,13 ;; //2: {938:2} 4460 0 extract size field of node tuple, or so it hopes(p17) lfetch.nt1 [r26],8 //3: {938:2} 4454 0 useless prefetch more than a thousand bytes away from the action(p17) cmp4.lt p16,p0=r32,r37 //3: {938:2}4464 0 compare whether r32 (next value of i) < nNodes(p17) add r34=r35,r33 //3: {938:2} 4461 0 set r34 (next value of r35) to r35 + size field, or so it hopes(p16) br.wtop.dptk .b4_110 ;; //3: {938:2}4465 0 rename the registers and do it again, if the cmp4 returned true The problem with this code is that r27, which ought to be always equal to r35 + 6, is incremented by the wrong amount in the second ld2 instruction, namely by the "i" counter. The value that *should* get added to it is the node size field, ie the same value that's loaded into r33 below that and then added to r35 in the last add instruction (and then stored into r34, which is about to become r35). So I think the compiler has outsmarted itself as to which rotating register contains which value when. The result of this breakage is that the set of node pointers computed by the loop is totally wrong for all values after the first. This means the later loop that's trying to insert the now-known downlink TIDs into the innerTuple's nodes is storing those TIDs into random locations, and thus tromping all over memory. The case where we get a reproducible crash is where the Asserts in that loop notice that what's at the pointed-to addresses isn't what's expected, before we manage to clobber anything critical. Diagnosis: icc 11.1 is not ready for prime time. I shall now retire with a glass of wine and attempt to forget everything I just learned about IA64. What a bizarre architecture ... regards, tom lane
Hi, On 2013-01-16 01:28:09 -0500, Tom Lane wrote: > It's a compiler bug. Gah. Not again. Not that I am surprised, but still. > icc 11.1 apparently thinks that this loop in doPickSplit: > (Why does it think it needs to prefetch an array it's only going to > write into? Is IA64's cache hardware really that stupid?) I think it is. They had that strange model of putting all the intelligence into the compiler and make the hardware relatively dumb. Worked well I'd say. > And it makes use of IA64's bizarre scheme for software-unrolling > loops, which I am going to do my darnedest to forget now that I've > learned it; > ... > Diagnosis: icc 11.1 is not ready for prime time. Consider me impressed. I tried to see what went wrong from a code generation POV with the original bug but I have to admit I gave up with only an inkling of an idea what it could be (I think it miscalculate the starting offset when copying the whole 'result' memory at once). That instruction set is just plain too crazy for my brain. The consequence seems to be icc on IA-64 in general is not ready for prime time... I doubt they are still investigating significant resources into it anyway. > I shall now retire with a glass of wine and attempt to forget everything > I just learned about IA64. What a bizarre architecture ... The explanation/problem alone leaves me in want of something stronger, only its 1pm here... Nice work, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On Wed, 16 Jan 2013, Andres Freund wrote: > On 2013-01-16 01:28:09 -0500, Tom Lane wrote: >> It's a compiler bug. Thanks for investigating. But I'm not sure there is any way other way for me other than switching to gcc, because intel stopped providing their IA64 version of compilers free of charge even for non-commercial/educational people: http://software.intel.com/en-us/intel-software-development-tools-for-intel-itanium-processors (their website is also a bit of maze, so I don't see anywhere the updated versions of the 10.0, 10.1, 11.1 compilers that I have) So unless somebody suggest otherwise, i'm going to switch to gcc on this buildfarm. Cheers, Sergey ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
On 2013-01-16 14:41:47 +0000, Sergey Koposov wrote: > Hi, > > On Wed, 16 Jan 2013, Andres Freund wrote: > >On 2013-01-16 01:28:09 -0500, Tom Lane wrote: > >>It's a compiler bug. > > Thanks for investigating. But I'm not sure there is any way other way for me > other than switching to gcc, because intel stopped providing their IA64 > version of compilers free of charge even for non-commercial/educational > people: > http://software.intel.com/en-us/intel-software-development-tools-for-intel-itanium-processors > (their website is also a bit of maze, so I don't see anywhere the updated > versions of the 10.0, 10.1, 11.1 compilers that I have) > > So unless somebody suggest otherwise, i'm going to switch to gcc on this > buildfarm. What about switching to -O1 of 11.1? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 01/16/2013 09:41 AM, Sergey Koposov wrote: > > So unless somebody suggest otherwise, i'm going to switch to gcc on > this buildfarm. If you switch compiler it should be a new buildfarm animal. (That just means re-registering so you get a new name/secret pair.) We have provision for upgrading the OS version and the compiler version, but changing which OS or which compiler is used requires a new animal - it's too great a discontinuity. cheers andrew
On Wed, 16 Jan 2013, Andres Freund wrote: >> So unless somebody suggest otherwise, i'm going to switch to gcc on this >> buildfarm. > > What about switching to -O1 of 11.1? I don't know. It is up to -hackers to decide. I think that icc on ia64 have shown bugginess time after time. But if you think that buildfarm with icc 11.1 -O1 carry more information than say running gcc, i can still run icc. S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Sergey Koposov <koposov@ast.cam.ac.uk> writes: > On Wed, 16 Jan 2013, Andres Freund wrote: >> What about switching to -O1 of 11.1? > I don't know. It is up to -hackers to decide. I think that icc on ia64 > have shown bugginess time after time. But if you think that buildfarm > with icc 11.1 -O1 carry more information than say running gcc, i can > still run icc. I think the reason that this bug doesn't manifest at -O1 is that then icc doesn't attempt to do any loop unrolling/vectorizing. So that's a big chunk of potential optimization bugs we'd be dodging. It's hard to say whether that renders the test worthless in comparison with what people would try to do in production. Should we recommend that people not try to use -O2 or higher with icc on IA64? IMO it's important that we have some icc members in the buildfarm, just because it's useful to see a different compiler's take on warnings. We do have some icc-on-mainstream-Intel members, but not many. Perhaps Sergey should use 10.1, which so far appears to not have so many bugs. regards, tom lane