Обсуждение: Postgres dies in the rules regression test (64-bit problem)
Hi, all. I have found what seems to be a 64/32-bit problem while testing the latest cvsup'ed source on an alpha running Digital Unix 4.0d. After fixing a bug in the 'money' type that prevented the rules regression test from working (I sent a patch to pgsql-patches but was told by Bruce that it will have to wait after the release), the test ran up to a point where the backend died dumping core. Here is what I've seen using gdb: pgbeta:nodes> gdb /usr/local/pgsql.beta/bin/postgres /usr/local/pgsql.beta/data/base/regression/core GDB is free software and you are welcome to distribute copies of itunder certain conditions; type "show copying" to see theconditions. There is absolutely no warranty for GDB; type "show warranty" for details. GDB 4.16 (alpha-dec-osf3.2), Copyright 1996 Free Software Foundation, Inc... Core was generated by `postgres'. Program terminated with signal 11, Segmentation fault. Reading symbols from /usr/shlib/libm.so...done. Reading symbols from /usr/shlib/libcurses.so...done. Reading symbols from /usr/shlib/libc.so...done. Reading symbols from /usr/lib/nls/loc//es_ES.ISO8859-1...done. #0 replace_opid (oper=0x4015aad0) at nodeFuncs.c:95 95 oper->opid = get_opcode(oper->opno); (gdb) where #0 replace_opid (oper=0x4015aad0) at nodeFuncs.c:95 #1 0x1201208b0 in fix_opid (clause=0x14015aaa0) at clauses.c:554 #2 0x12011e214 in preprocess_targetlist (tlist=0x14015a4c0, command_type=2, result_relation=5, range_table=0x14015b1b0)at preptlist.c:84 #3 0x120118808 in union_planner (parse=0x14015a100) at planner.c:162 #4 0x1201185d4 in planner (parse=0x14015a100) at planner.c:83 #5 0x120159d90 in pg_parse_and_plan ( query_string=0x11fffa918 "update rtest_v1 set a = rtest_t3.a + 20 where b = rtest_t3.b;", typev=0x0, nargs=0, queryListP=0x11fffa868, dest=Remote, aclOverride=0 '\000') at postgres.c:590 #6 0x12015a034 in pg_exec_query_dest ( query_string=0x11fffa918 "update rtest_v1 set a = rtest_t3.a + 20 where b = rtest_t3.b;", dest=Remote, aclOverride=0 '\000') at postgres.c:678 #7 0x120159f80 in pg_exec_query ( query_string=0x11fffa918 "update rtest_v1 set a = rtest_t3.a + 20 where b = rtest_t3.b;") at postgres.c:656 #8 0x12015baa0 in PostgresMain (argc=10, argv=0x11fffee90, real_argc=9, real_argv=0x11ffffc28) at postgres.c:1658 #9 0x12012d02c in DoBackend (port=0x1400d9a00) at postmaster.c:1628 #10 0x12012c778 in BackendStartup (port=0x1400d9a00) at postmaster.c:1373 #11 0x12012b5d8 in ServerLoop () at postmaster.c:823 #12 0x12012ae00 in PostmasterMain (argc=9, argv=0x11ffffc28) at postmaster.c:616 #13 0x1200e0b30 in main (argc=9, argv=0x11ffffc28) at main.c:93 (gdb) Although my knowledge of postgres internals is null, I have done a bit of investigation. 'replace_opid' is called from 'fix_opid', line 554 of file backend/optimizer/utils/clauses.c: replace_opid((Oper *) ((Expr *) clause)->oper); 'clause' is a pointer to Node. The actual value is tagged as T_Expr, so it seems to be used right. If you look at the contents of 'clause': (gdb) p *((Expr *) clause) $3 = {type = T_Expr, typeOid = 23, opType = OP_EXPR, oper = 0x4015aad0, args = 0x14015ab30} Here is the problem. ((Expr*) clause)->oper is a pointer to Node, which (from the name of the field) I think that should be tagged as T_Oper. But, if you look carefully, it has the same value as 'clause' but *truncated to 32 bits*. This is a problem that I've seen many times when you store a pointer in an int (which still is 32 bits long in an alpha) and later you use it again as a pointer. I don't know if ((Expr*) clause)->oper should point to itself as it seems to do, but certainly its value is passed though an int variable and is truncated. If someone points me to the right place to look, I can play a bit more with gdb and try to find the cause. You can find the query that crashes the backend at the stack trace above. Cheers, Pedro. -- ------------------------------------------------------------------- Pedro José Lobo Perea Tel: +34 91 336 78 19 Centro de Cálculo Fax: +34 91 331 92 29 E.U.I.T. Telecomunicación e-mail: pjlobo@euitt.upm.es Universidad Politécnica de Madrid Ctra. de Valencia, Km. 7 E-28031 Madrid - España / Spain
"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes: > #0 replace_opid (oper=0x4015aad0) at nodeFuncs.c:95 > #1 0x1201208b0 in fix_opid (clause=0x14015aaa0) at clauses.c:554 > (gdb) p *((Expr *) clause) > $3 = {type = T_Expr, typeOid = 23, opType = OP_EXPR, oper = 0x4015aad0, > args = 0x14015ab30} > I don't know if ((Expr*) clause)->oper should point to itself as it seems > to do, It shouldn't ever point to itself, but it looks to me like it's not --- the low order bits of clause are ...aaa0 and oper is ...aad0. > but certainly its value is passed though an int variable and is > truncated. Looks that way doesn't it :-(. I did some quick scratching around in the sources and couldn't find any obvious mistakes of that ilk. Most of the code that touches Oper nodes would have been exercised heavily long before we get to the rules regression test, so I'm not sure what to think. > If someone points me to the right place to look, I can play a bit more > with gdb and try to find the cause. The Expr node was presumably made by make_op() in backend/parser/parse_oper.c, although the tree might have been copied at least once using the functions in backend/nodes/copyfuncs.c. Good luck! regards, tom lane
On Thu, 10 Jun 1999, Tom Lane wrote: >"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes: >> #0 replace_opid (oper=0x4015aad0) at nodeFuncs.c:95 >> #1 0x1201208b0 in fix_opid (clause=0x14015aaa0) at clauses.c:554 > >> (gdb) p *((Expr *) clause) >> $3 = {type = T_Expr, typeOid = 23, opType = OP_EXPR, oper = 0x4015aad0, >> args = 0x14015ab30} > >> I don't know if ((Expr*) clause)->oper should point to itself as it seems >> to do, > >It shouldn't ever point to itself, but it looks to me like it's not --- >the low order bits of clause are ...aaa0 and oper is ...aad0. Ooops, you are right! Now I am using a bigger font :-) >> but certainly its value is passed though an int variable and is >> truncated. > >Looks that way doesn't it :-(. I did some quick scratching around in >the sources and couldn't find any obvious mistakes of that ilk. Most of >the code that touches Oper nodes would have been exercised heavily long >before we get to the rules regression test, so I'm not sure what to think. I have also looked at the warnings that arise in the compilation process, and haven't found any that could be related to that. >> If someone points me to the right place to look, I can play a bit more >> with gdb and try to find the cause. > >The Expr node was presumably made by make_op() in >backend/parser/parse_oper.c, although the tree might have been copied at >least once using the functions in backend/nodes/copyfuncs.c. Good luck! Ok, I'll let you know if/when I find something (or, more probably, when I have more questions ;-) Regards, Pedro. -- ------------------------------------------------------------------- Pedro José Lobo Perea Tel: +34 91 336 78 19 Centro de Cálculo Fax: +34 91 331 92 29 E.U.I.T. Telecomunicación e-mail: pjlobo@euitt.upm.es Universidad Politécnica de Madrid Ctra. de Valencia, Km. 7 E-28031 Madrid - España / Spain
On Fri, 11 Jun 1999, Pedro J. Lobo wrote: >On Thu, 10 Jun 1999, Tom Lane wrote: > >>"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes: >>> >>> If someone points me to the right place to look, I can play a bit more >>> with gdb and try to find the cause. >> >>The Expr node was presumably made by make_op() in >>backend/parser/parse_oper.c, although the tree might have been copied at >>least once using the functions in backend/nodes/copyfuncs.c. Good luck! > >Ok, I'll let you know if/when I find something (or, more probably, when I >have more questions ;-) Well, I have found something. As you said, the node is made by make_op() (which, btw, is in parse_node.c ;-) and later copied using copyObject(). But those functions are correct and work as expected. I have located the point where the pointer is overwritten. Here is the stack trace: (gdb) where #0 ResolveNew (info=0x140169ac0, targetlist=0x140169f70, nodePtr=0x14016aac8, sublevels_up=0) at rewriteManip.c:719 #1 0x12013ea38 in ResolveNew (info=0x140169ac0, targetlist=0x140169f70, nodePtr=0x14016ab38, sublevels_up=0) at rewriteManip.c:670 #2 0x12013ea58 in ResolveNew (info=0x140169ac0, targetlist=0x140169f70, nodePtr=0x140169b80, sublevels_up=0) at rewriteManip.c:730 #3 0x12013ead0 in FixNew (info=0x140169ac0, parsetree=0x1401692d0) at rewriteManip.c:753 #4 0x12013cc5c in fireRules (parsetree=0x1401692d0, rt_index=1, event=CMD_UPDATE, instead_flag=0x11fffa6c0 "\001", locks=0x14016a8c0, qual_products=0x11fffa6b8) at rewriteHandler.c:2612 #5 0x12013ce90 in RewriteQuery (parsetree=0x1401692d0, instead_flag=0x11fffa6c0 "\001", qual_products=0x11fffa6b8) at rewriteHandler.c:2697 #6 0x12013cf38 in deepRewriteQuery (parsetree=0x1401692d0) at rewriteHandler.c:2742 #7 0x12013d008 in QueryRewriteOne (parsetree=0x1401692d0) at rewriteHandler.c:2791 #8 0x12013d128 in BasicQueryRewrite (parsetree=0x1401692d0) at rewriteHandler.c:2862 #9 0x12013d230 in QueryRewrite (parsetree=0x1401692d0) at rewriteHandler.c:2916 #10 0x1200b3ce4 in ExplainQuery (query=0x1401692d0, verbose=0 '\000', dest=Remote) at explain.c:68 #11 0x12015dce0 in ProcessUtility (parsetree=0x140169110, dest=Remote) at utility.c:651 #12 0x12015a138 in pg_exec_query_dest ( query_string=0x11fffa918 "explain update rtest_v1 set a = rtest_t3.a + 20 whereb = rtest_t3.b;", dest=Remote, aclOverride=0 '\000') at postgres.c:727 #13 0x120159f80 in pg_exec_query ( query_string=0x11fffa918 "explain update rtest_v1 set a = rtest_t3.a + 20 where b= rtest_t3.b;") at postgres.c:656 #14 0x12015baa0 in PostgresMain (argc=10, argv=0x11fffee90, real_argc=9, real_argv=0x11ffffc28) at postgres.c:1658 #15 0x12012d02c in DoBackend (port=0x1400d9a00) at postmaster.c:1628 #16 0x12012c778 in BackendStartup (port=0x1400d9a00) at postmaster.c:1373 #17 0x12012b5d8 in ServerLoop () at postmaster.c:823 #18 0x12012ae00 in PostmasterMain (argc=9, argv=0x11ffffc28) at postmaster.c:616 #19 0x1200e0b30 in main (argc=9, argv=0x11ffffc28) at main.c:93 (gdb) ResolveNew() is a recursive function in backend/rewrite/rewriteManip.c. The problem appears one time that it is called with nodePtr pointing to a Var node. This is the node: (gdb) p *((Var *) *nodePtr) $4 = {type = T_Var, varno = 4, varattno = 1, vartype = 23, vartypmod = -1, varlevelsup = 0, varnoold = 4, varoattno = 1} In the beginning of the function, there is a 'switch(nodeTag(node))' ('node' is assigned the value of '*nodePtr'), that jumps to the corresponding 'case T_Var:' at line 695. There you see a call to FindMatchingNew() (a static function of the same c file), that returns a pointer to a node that, from what I have seen in that function, is expected to be of type Expr, and indeed it is. This pointer is stored in a variable named 'n'. Right after that, you can see this code: if (n == NULL) { ... some code that isn't executed because n != NULL } else { *nodePtr = copyObject(n); ((Var *) *nodePtr)->varlevelsup = this_varlevelsup; } Well, this *can't* be correct. 'n' points to a node of type Expr, and of course copyObject returns a node of that type. But that node is treated in the following line as if it were of type Var! The value of 'this_varlevelsup' is 0, and the offset of the field 'varlevelsup' of node Var is the same as that of the field 'oper' of node Expr, and the pointer is overwritten. I'd bet that the copy of the Expr node is not supposed to be stored in *nodePtr (which points to a node of type Var), but I don't know what would be the right action. I've found the bug, now I'll let the wise ones fix it ;-) This problem has been hidden until now (for me, at least) because of a bug in the 'money' data type that broke the rules test before it reached that point. That bug sowed up only when you had locale enabled and your currency didn't use cents (like spanish pesetas). On 32-bit systems it is likely that the pointer isn't overwritten, so the bug didn't showed up, either. Cheers, Pedro. -- ------------------------------------------------------------------- Pedro José Lobo Perea Tel: +34 91 336 78 19 Centro de Cálculo Fax: +34 91 331 92 29 E.U.I.T. Telecomunicación e-mail: pjlobo@euitt.upm.es Universidad Politécnica de Madrid Ctra. de Valencia, Km. 7 E-28031 Madrid - España / Spain
"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes: > I have located the point where the pointer is overwritten. > *nodePtr = copyObject(n); > ((Var *) *nodePtr)->varlevelsup = this_varlevelsup; > Well, this *can't* be correct. 'n' points to a node of type Expr, Good sleuthing! It looks like on a 32-bit machine, the overwritten area will be just past the last field that's part of Expr, and because of the memory manager's habit of rounding up object sizes that area just happens to be available space rather than the start of the next object. So that's why we hadn't found it before. I believe the copied TLE entry could be any of several other node types besides Var and Expr, so it's probably possible to see related failures even on a 32-bit machine. Jan, what should really be happening here? regards, tom lane