Обсуждение: Postgres dies in the rules regression test (64-bit problem)

Поиск
Список
Период
Сортировка

Postgres dies in the rules regression test (64-bit problem)

От
"Pedro J. Lobo"
Дата:
Hi, all.

I have found what seems to be a 64/32-bit problem while testing the latest
cvsup'ed source on an alpha running Digital Unix 4.0d. After fixing a bug
in the 'money' type that prevented the rules regression test from working
(I sent a patch to pgsql-patches but was told by Bruce that it will have
to wait after the release), the test ran up to a point where the backend
died dumping core. Here is what I've seen using gdb:

pgbeta:nodes> gdb /usr/local/pgsql.beta/bin/postgres
/usr/local/pgsql.beta/data/base/regression/core 
GDB is free software and you are welcome to distribute copies of itunder certain conditions; type "show copying" to see
theconditions.
 
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (alpha-dec-osf3.2), Copyright 1996 Free Software Foundation, Inc...
Core was generated by `postgres'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/shlib/libm.so...done.
Reading symbols from /usr/shlib/libcurses.so...done.
Reading symbols from /usr/shlib/libc.so...done.
Reading symbols from /usr/lib/nls/loc//es_ES.ISO8859-1...done.
#0  replace_opid (oper=0x4015aad0) at nodeFuncs.c:95
95              oper->opid = get_opcode(oper->opno);
(gdb) where
#0  replace_opid (oper=0x4015aad0) at nodeFuncs.c:95
#1  0x1201208b0 in fix_opid (clause=0x14015aaa0) at clauses.c:554
#2  0x12011e214 in preprocess_targetlist (tlist=0x14015a4c0, command_type=2,    result_relation=5,
range_table=0x14015b1b0)at preptlist.c:84
 
#3  0x120118808 in union_planner (parse=0x14015a100) at planner.c:162
#4  0x1201185d4 in planner (parse=0x14015a100) at planner.c:83
#5  0x120159d90 in pg_parse_and_plan (   query_string=0x11fffa918 "update rtest_v1 set a = rtest_t3.a + 20
where b = rtest_t3.b;", typev=0x0, nargs=0, queryListP=0x11fffa868,
dest=Remote,    aclOverride=0 '\000') at postgres.c:590
#6  0x12015a034 in pg_exec_query_dest (   query_string=0x11fffa918 "update rtest_v1 set a = rtest_t3.a + 20
where b = rtest_t3.b;", dest=Remote, aclOverride=0 '\000') at
postgres.c:678
#7  0x120159f80 in pg_exec_query (   query_string=0x11fffa918 "update rtest_v1 set a = rtest_t3.a + 20
where b = rtest_t3.b;") at postgres.c:656
#8  0x12015baa0 in PostgresMain (argc=10, argv=0x11fffee90, real_argc=9,    real_argv=0x11ffffc28) at postgres.c:1658
#9  0x12012d02c in DoBackend (port=0x1400d9a00) at postmaster.c:1628
#10 0x12012c778 in BackendStartup (port=0x1400d9a00) at postmaster.c:1373
#11 0x12012b5d8 in ServerLoop () at postmaster.c:823
#12 0x12012ae00 in PostmasterMain (argc=9, argv=0x11ffffc28)   at postmaster.c:616
#13 0x1200e0b30 in main (argc=9, argv=0x11ffffc28) at main.c:93
(gdb) 

Although my knowledge of postgres internals is null, I have done a bit of
investigation. 'replace_opid' is called from 'fix_opid', line 554 of file 
backend/optimizer/utils/clauses.c:

replace_opid((Oper *) ((Expr *) clause)->oper);

'clause' is a pointer to Node. The actual value is tagged as T_Expr, so it
seems to be used right. If you look at the contents of 'clause':

(gdb) p *((Expr *) clause)
$3 = {type = T_Expr, typeOid = 23, opType = OP_EXPR, oper = 0x4015aad0,  args = 0x14015ab30}

Here is the problem. ((Expr*) clause)->oper is a pointer to Node, which
(from the name of the field) I think that should be tagged as T_Oper. But,
if you look carefully, it has the same value as 'clause' but *truncated to
32 bits*. This is a problem that I've seen many times when you store a
pointer in an int (which still is 32 bits long in an alpha) and later you
use it again as a pointer.

I don't know if ((Expr*) clause)->oper should point to itself as it seems
to do, but certainly its value is passed though an int variable and is
truncated.

If someone points me to the right place to look, I can play a bit more
with gdb and try to find the cause. You can find the query that crashes
the backend at the stack trace above.

Cheers,
Pedro.

-- 
-------------------------------------------------------------------
Pedro José Lobo Perea                   Tel:    +34 91 336 78 19
Centro de Cálculo                       Fax:    +34 91 331 92 29
E.U.I.T. Telecomunicación               e-mail: pjlobo@euitt.upm.es
Universidad Politécnica de Madrid
Ctra. de Valencia, Km. 7                E-28031 Madrid - España / Spain



Re: [HACKERS] Postgres dies in the rules regression test (64-bit problem)

От
Tom Lane
Дата:
"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes:
> #0  replace_opid (oper=0x4015aad0) at nodeFuncs.c:95
> #1  0x1201208b0 in fix_opid (clause=0x14015aaa0) at clauses.c:554

> (gdb) p *((Expr *) clause)
> $3 = {type = T_Expr, typeOid = 23, opType = OP_EXPR, oper = 0x4015aad0, 
>   args = 0x14015ab30}

> I don't know if ((Expr*) clause)->oper should point to itself as it seems
> to do,

It shouldn't ever point to itself, but it looks to me like it's not ---
the low order bits of clause are ...aaa0 and oper is ...aad0.

> but certainly its value is passed though an int variable and is
> truncated.

Looks that way doesn't it :-(.  I did some quick scratching around in
the sources and couldn't find any obvious mistakes of that ilk.  Most of
the code that touches Oper nodes would have been exercised heavily long
before we get to the rules regression test, so I'm not sure what to think.

> If someone points me to the right place to look, I can play a bit more
> with gdb and try to find the cause.

The Expr node was presumably made by make_op() in
backend/parser/parse_oper.c, although the tree might have been copied at
least once using the functions in backend/nodes/copyfuncs.c.  Good luck!
        regards, tom lane


Re: [HACKERS] Postgres dies in the rules regression test (64-bit problem)

От
"Pedro J. Lobo"
Дата:
On Thu, 10 Jun 1999, Tom Lane wrote:

>"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes:
>> #0  replace_opid (oper=0x4015aad0) at nodeFuncs.c:95
>> #1  0x1201208b0 in fix_opid (clause=0x14015aaa0) at clauses.c:554
>
>> (gdb) p *((Expr *) clause)
>> $3 = {type = T_Expr, typeOid = 23, opType = OP_EXPR, oper = 0x4015aad0, 
>>   args = 0x14015ab30}
>
>> I don't know if ((Expr*) clause)->oper should point to itself as it seems
>> to do,
>
>It shouldn't ever point to itself, but it looks to me like it's not ---
>the low order bits of clause are ...aaa0 and oper is ...aad0.

Ooops, you are right! Now I am using a bigger font :-)

>> but certainly its value is passed though an int variable and is
>> truncated.
>
>Looks that way doesn't it :-(.  I did some quick scratching around in
>the sources and couldn't find any obvious mistakes of that ilk.  Most of
>the code that touches Oper nodes would have been exercised heavily long
>before we get to the rules regression test, so I'm not sure what to think.

I have also looked at the warnings that arise in the compilation process,
and haven't found any that could be related to that.

>> If someone points me to the right place to look, I can play a bit more
>> with gdb and try to find the cause.
>
>The Expr node was presumably made by make_op() in
>backend/parser/parse_oper.c, although the tree might have been copied at
>least once using the functions in backend/nodes/copyfuncs.c.  Good luck!

Ok, I'll let you know if/when I find something (or, more probably, when I
have more questions ;-)

Regards,
Pedro.

-- 
-------------------------------------------------------------------
Pedro José Lobo Perea                   Tel:    +34 91 336 78 19
Centro de Cálculo                       Fax:    +34 91 331 92 29
E.U.I.T. Telecomunicación               e-mail: pjlobo@euitt.upm.es
Universidad Politécnica de Madrid
Ctra. de Valencia, Km. 7                E-28031 Madrid - España / Spain



Re: [HACKERS] Postgres dies in the rules regression test (64-bit problem)

От
"Pedro J. Lobo"
Дата:
On Fri, 11 Jun 1999, Pedro J. Lobo wrote:

>On Thu, 10 Jun 1999, Tom Lane wrote:
>
>>"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes:
>>>
>>> If someone points me to the right place to look, I can play a bit more
>>> with gdb and try to find the cause.
>>
>>The Expr node was presumably made by make_op() in
>>backend/parser/parse_oper.c, although the tree might have been copied at
>>least once using the functions in backend/nodes/copyfuncs.c.  Good luck!
>
>Ok, I'll let you know if/when I find something (or, more probably, when I
>have more questions ;-)

Well, I have found something. As you said, the node is made by make_op()
(which, btw, is in parse_node.c ;-) and later copied using copyObject().
But those functions are correct and work as expected.

I have located the point where the pointer is overwritten. Here is the
stack trace:

(gdb) where
#0  ResolveNew (info=0x140169ac0, targetlist=0x140169f70, nodePtr=0x14016aac8,    sublevels_up=0) at
rewriteManip.c:719
#1  0x12013ea38 in ResolveNew (info=0x140169ac0, targetlist=0x140169f70,    nodePtr=0x14016ab38, sublevels_up=0) at
rewriteManip.c:670
#2  0x12013ea58 in ResolveNew (info=0x140169ac0, targetlist=0x140169f70,    nodePtr=0x140169b80, sublevels_up=0) at
rewriteManip.c:730
#3  0x12013ead0 in FixNew (info=0x140169ac0, parsetree=0x1401692d0)   at rewriteManip.c:753
#4  0x12013cc5c in fireRules (parsetree=0x1401692d0, rt_index=1,    event=CMD_UPDATE, instead_flag=0x11fffa6c0 "\001",
locks=0x14016a8c0,   qual_products=0x11fffa6b8) at rewriteHandler.c:2612
 
#5  0x12013ce90 in RewriteQuery (parsetree=0x1401692d0,    instead_flag=0x11fffa6c0 "\001", qual_products=0x11fffa6b8)
at rewriteHandler.c:2697
 
#6  0x12013cf38 in deepRewriteQuery (parsetree=0x1401692d0)   at rewriteHandler.c:2742
#7  0x12013d008 in QueryRewriteOne (parsetree=0x1401692d0)   at rewriteHandler.c:2791
#8  0x12013d128 in BasicQueryRewrite (parsetree=0x1401692d0)   at rewriteHandler.c:2862
#9  0x12013d230 in QueryRewrite (parsetree=0x1401692d0)   at rewriteHandler.c:2916
#10 0x1200b3ce4 in ExplainQuery (query=0x1401692d0, verbose=0 '\000',    dest=Remote) at explain.c:68
#11 0x12015dce0 in ProcessUtility (parsetree=0x140169110, dest=Remote)   at utility.c:651
#12 0x12015a138 in pg_exec_query_dest (   query_string=0x11fffa918 "explain update rtest_v1 set a = rtest_t3.a +   20
whereb = rtest_t3.b;", dest=Remote, aclOverride=0 '\000') at   postgres.c:727
 
#13 0x120159f80 in pg_exec_query (   query_string=0x11fffa918 "explain update rtest_v1 set a = rtest_t3.a +   20 where
b= rtest_t3.b;") at postgres.c:656
 
#14 0x12015baa0 in PostgresMain (argc=10, argv=0x11fffee90, real_argc=9,    real_argv=0x11ffffc28) at postgres.c:1658
#15 0x12012d02c in DoBackend (port=0x1400d9a00) at postmaster.c:1628
#16 0x12012c778 in BackendStartup (port=0x1400d9a00) at postmaster.c:1373
#17 0x12012b5d8 in ServerLoop () at postmaster.c:823
#18 0x12012ae00 in PostmasterMain (argc=9, argv=0x11ffffc28)   at postmaster.c:616
#19 0x1200e0b30 in main (argc=9, argv=0x11ffffc28) at main.c:93
(gdb)

ResolveNew() is a recursive function in backend/rewrite/rewriteManip.c.
The problem appears one time that it is called with nodePtr pointing to a
Var node. This is the node:

(gdb) p *((Var *) *nodePtr)
$4 = {type = T_Var, varno = 4, varattno = 1, vartype = 23, vartypmod = -1,  varlevelsup = 0, varnoold = 4, varoattno =
1}

In the beginning of the function, there is a 'switch(nodeTag(node))'
('node' is assigned the value of '*nodePtr'), that jumps to the
corresponding 'case T_Var:' at line 695. There you see a call to
FindMatchingNew() (a static function of the same c file), that returns a
pointer to a node that, from what I have seen in that function, is
expected to be of type Expr, and indeed it is. This pointer is stored in
a variable named 'n'. Right after that, you can see this code:

if (n == NULL)
{   ... some code that isn't executed because n != NULL
}
else
{       *nodePtr = copyObject(n);       ((Var *) *nodePtr)->varlevelsup = this_varlevelsup;
}

Well, this *can't* be correct. 'n' points to a node of type Expr, and of
course copyObject returns a node of that type. But that node is treated in
the following line as if it were of type Var! The value of
'this_varlevelsup' is 0, and the offset of the field 'varlevelsup' of
node Var is the same as that of the field 'oper' of node Expr, and the 
pointer is overwritten.

I'd bet that the copy of the Expr node is not supposed to be stored in
*nodePtr (which points to a node of type Var), but I don't know what would
be the right action. I've found the bug, now I'll let the wise ones fix it
;-)

This problem has been hidden until now (for me, at least) because of a bug
in the 'money' data type that broke the rules test before it reached that
point. That bug sowed up only when you had locale enabled and your
currency didn't use cents (like spanish pesetas). On 32-bit systems it
is likely that the pointer isn't overwritten, so the bug didn't showed up,
either.

Cheers,
Pedro.

-- 
-------------------------------------------------------------------
Pedro José Lobo Perea                   Tel:    +34 91 336 78 19
Centro de Cálculo                       Fax:    +34 91 331 92 29
E.U.I.T. Telecomunicación               e-mail: pjlobo@euitt.upm.es
Universidad Politécnica de Madrid
Ctra. de Valencia, Km. 7                E-28031 Madrid - España / Spain



Re: [HACKERS] Postgres dies in the rules regression test (64-bit problem)

От
Tom Lane
Дата:
"Pedro J. Lobo" <pjlobo@euitt.upm.es> writes:
> I have located the point where the pointer is overwritten.

>         *nodePtr = copyObject(n);
>         ((Var *) *nodePtr)->varlevelsup = this_varlevelsup;

> Well, this *can't* be correct. 'n' points to a node of type Expr,

Good sleuthing!  It looks like on a 32-bit machine, the overwritten
area will be just past the last field that's part of Expr, and because
of the memory manager's habit of rounding up object sizes that area
just happens to be available space rather than the start of the next
object.  So that's why we hadn't found it before.

I believe the copied TLE entry could be any of several other node types
besides Var and Expr, so it's probably possible to see related failures
even on a 32-bit machine.

Jan, what should really be happening here?
        regards, tom lane