Обсуждение: 7.0.2 crash, backtrace with debug available

Поиск
Список
Период
Сортировка

7.0.2 crash, backtrace with debug available

От
Alfred Perlstein
Дата:
There's two tracebacks of crashed 7.0.2 backends at the end of this
email.

I posted earlier this week about a table of ours getting corrupted
after some time.  The table looks like this:

details:id         | integer     | attr_type  | varchar(32) | attr_name  | varchar(32) | attr_vers  | varchar(32) |
attr_hits | bigint      | default 0
 

We're using a perl script to update this table from a "raw" table:

details_raw:id         | integer     | stat_date  | timestamp   | attr_type  | varchar(32) | attr_name  | varchar(32) |
attr_vers | varchar(32) | attr_hits  | bigint      | default 0
 


SELECT id, attr_type, attr_name, attr_vers, sum(attr_hits) as attr_hits
FROM details_raw
WHERE stat_date < ? 
GROUP BY counter_id, attr_type, attr_name, attr_vers 
;

then we loop over this UPDATE query calling an INSERT query if UPDATE
returns 0 rows.  We are vacuuming the table after every update (which
can be several hundred rows) which is why i need the vacuum.

INSERT INTO details   (id, attr_type, attr_name, attr_vers, attr_hits)   VALUES ( ?, ?, ?, ?, ? )
;

UPDATE details
SET attr_hits = attr_hits + ? 
WHERE id = ? AND attr_type = ? AND attr_name = ? AND attr_vers = ?
;

After a while we get this crash apparently, followed by crashes of any
backend that scans this table, not updating this specific table makes
the crashes go away, the problem seems pretty isolated to this data.

% gdb /usr/local/pgsql/bin/postgres postgres.54738.core 
#0  0x4829f77a in memmove () from /usr/lib/libc.so.4
(gdb) bt
#0  0x4829f77a in memmove () from /usr/lib/libc.so.4
#1  0x53260f14 in ?? ()
#2  0x8093c93 in vc_attrstats (onerel=0x84a2788, vacrelstats=0x8496290,    tuple=0xbfbfe93c) at vacuum.c:2354
#3  0x8091609 in vc_scanheap (vacrelstats=0x8496290, onerel=0x84a2788,    vacuum_pages=0xbfbfe9d0,
fraged_pages=0xbfbfe9c0)at vacuum.c:980
 
#4  0x8090ccb in vc_vacone (relid=2238241037, analyze=1, va_cols=0x0)   at vacuum.c:599
#5  0x8090454 in vc_vacuum (VacRelP=0xbfbfea60, analyze=1 '\001', va_cols=0x0)   at vacuum.c:299
#6  0x80903dc in vacuum (vacrel=0x84960e8 "\230`I\b ", verbose=1,    analyze=1 '\001', va_spec=0x0) at vacuum.c:223
#7  0x80fa444 in ProcessUtility (parsetree=0x8496110, dest=Remote)   at utility.c:694
#8  0x80f7e5e in pg_exec_query_dest (   query_string=0x81a9370 "VACUUM verbose analyze webhit_details_formatted;",
dest=Remote,aclOverride=0) at postgres.c:617
 
#9  0x80f7db9 in pg_exec_query (   query_string=0x81a9370 "VACUUM verbose analyze webhit_details_formatted;")   at
postgres.c:562
#10 0x80f8d1a in PostgresMain (argc=9, argv=0xbfbff0e0, real_argc=10,    real_argv=0xbfbffb40) at postgres.c:1590
#11 0x80e1d06 in DoBackend (port=0x843f000) at postmaster.c:2009
#12 0x80e1899 in BackendStartup (port=0x843f000) at postmaster.c:1776
#13 0x80e0abd in ServerLoop () at postmaster.c:1037
#14 0x80e04be in PostmasterMain (argc=10, argv=0xbfbffb40) at postmaster.c:725
#15 0x80aee43 in main (argc=10, argv=0xbfbffb40) at main.c:93
#16 0x80633c5 in _start ()
(gdb) list
2349                                    stats->guess1_hits = 1;
2350                                    stats->guess2_hits = 1;
2351                            }
2352                            if (!value_hit)
2353                            {
2354                                    vc_bucketcpy(stats->attr, value, &stats->guess2, &stats->guess2_len);
2355                                    stats->guess1_hits = 1;
2356                                    stats->guess2_hits = 1;
2357                            }
2358                    }
(gdb) print stats
No symbol "stats" in current context.
(gdb) print stats->attr
No symbol "stats" in current context.
(gdb) print value_hit
$1 = 0 '\000'
(gdb) print value
$2 = 1395003156
(gdb) print stats->guess2_len
No symbol "stats" in current context.
(gdb) print i
$1 = 3
(gdb) print attr_cnt
$2 = 167920861
(gdb) print *vacattrstats
Cannot access memory at address 0xb5a9e104.
(gdb) print tupDesc
$3 = 0x84a6368
(gdb) print *tupDesc
$4 = {natts = 5, attrs = 0x8492748, constr = 0x84a6380}
(gdb) print *onerel 
$5 = {rd_fd = 138985496, rd_nblocks = 892691556, rd_refcnt = 25698,  rd_myxactonly = 53 '5', rd_isnailed = 53 '5',
rd_isnoname= 100 'd',  rd_unlinked = 102 'f', rd_am = 0xac70000, rd_rel = 0x40580044, rd_id = 77,  rd_lockInfo =
{lockRelId= {relId = 1937204590, dbId = 1848586042}},  rd_att = 0x31737765, rd_rules = 0x2e63712e, rd_istrat =
0x706d7973, rd_support = 0x63697461, trigdesc = 0x61632e6f}
 
(gdb) print value;
Invalid character ';' in expression.
(gdb) print value 
$6 = 1395003156
(gdb) print *value
$7 = 892691554
(gdb) print isnull
$8 = 0 '\000'
(gdb) 
* IDENTIFICATION*        $Header: /home/pgcvs/pgsql/src/backend/commands/vacuum.c,v 1.148 2000/
05/19 03:22:29 tgl Exp $


here's another that happened to occur during what seems to be a 'COPY OUT':
#0  0x482a7d95 in ?? ()
(gdb) bt
#0  0x482a7d95 in ?? ()
#1  0x808c393 in CopyTo (rel=0x8777890, binary=0 '\000', oids=0 '\000',    fp=0x0, delim=0x8159fa9 "\t",
null_print=0x8159fab"\\N") at copy.c:508
 
#2  0x808bf99 in DoCopy (relname=0x87230e8 "~+", binary=0 '\000',    oids=0 '\000', from=0 '\000', pipe=1 '\001',
filename=0x0,   delim=0x8159fa9 "\t", null_print=0x8159fab "\\N") at copy.c:374
 
#3  0x80f98a3 in ProcessUtility (parsetree=0x8723110, dest=Remote)   at utility.c:262
#4  0x80f7e5e in pg_exec_query_dest (query_string=0x81a9388 "", dest=Remote,    aclOverride=0) at postgres.c:617
#5  0x80f7db9 in pg_exec_query (query_string=0x81a9388 "") at postgres.c:562
#6  0x80f8d1a in PostgresMain (argc=9, argv=0xbfbff0e0, real_argc=10,    real_argv=0xbfbffb40) at postgres.c:1590
#7  0x80e1d06 in DoBackend (port=0x843f000) at postmaster.c:2009
#8  0x80e1899 in BackendStartup (port=0x843f000) at postmaster.c:1776
#9  0x80e0abd in ServerLoop () at postmaster.c:1037
#10 0x80e04be in PostmasterMain (argc=10, argv=0xbfbffb40) at postmaster.c:725
#11 0x80aee43 in main (argc=10, argv=0xbfbffb40) at main.c:93
#12 0x80633c5 in _start ()
(gdb) up
#1  0x808c393 in CopyTo (rel=0x8777890, binary=0 '\000', oids=0 '\000',    fp=0x0, delim=0x8159fa9 "\t",
null_print=0x8159fab"\\N") at copy.c:508
 
508                                             string = (char *) (*fmgr_faddr(&out_functions[i]))
(gdb) list
503                                             continue;
504                                     }
505     #endif   /* _DROP_COLUMN_HACK__ */
506                                     if (!isnull)
507                                     {
508                                             string = (char *) (*fmgr_faddr(&out_functions[i]))
509                                                     (value, elements[i], typmod[i]);
510                                             CopyAttributeOut(fp, string, delim);
511                                             pfree(string);
512                                     }
(gdb) print isnull
$1 = 0 '\000'
(gdb) print string
$2 = 0xfffffffc <Address 0xfffffffc out of bounds>
(gdb) print value
$3 = 1072255572
(gdb) print elements[i]
$4 = 11134
(gdb) print typmod[i]  
$5 = 1742544
(gdb) print out_functions[i]
$6 = {fn_addr = 0x4005a, fn_plhandler = 0x208900, fn_oid = 11134, fn_nargs = 0}
(gdb) 
* IDENTIFICATION*    $Header: /home/pgcvs/pgsql/src/backend/commands/copy.c,v 1.106.2.2 2000/06/28 06:13:01 tgl Exp $

I know there's been a couple of updates to the source since this date
(compiled on Aug 3), any idea if:
a) an upgrade is a good idea
b) an upgrade is a safe idea

If there's anything that I can do to provide clearer information?

thanks very much for your time,
-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."


Re: 7.0.2 crash, backtrace with debug available

От
Alfred Perlstein
Дата:
* Alfred Perlstein <bright@wintelcom.net> [000920 06:19] wrote:
> There's two tracebacks of crashed 7.0.2 backends at the end of this
> email.

Sorry, a couple more things I should have included:

I'm running with a ~256 shared memory segment:

DAEMON=/usr/local/pgsql/bin/postmaster
DATA=/vol/amrd0/database/data
LOGFILE=/vol/amrd0/database/data/postgres.log
$DAEMON -N 32 -i -B 32768 -D$DATA -S -o "-F -S 65534" > $LOGFILE

-rw-------  1 pgsql  pgsql  1179570176 Sep 20 05:35 postgres.54738.core vacuuming process, why is it so huge?
-rw-------  1 pgsql  pgsql   280502272 Sep 20 05:48 postgres.54896.core COPY OUT process.

-Alfred



Re: 7.0.2 crash, backtrace with debug available

От
Tom Lane
Дата:
Alfred Perlstein <bright@wintelcom.net> writes:
> I posted earlier this week about a table of ours getting corrupted
> after some time.  The table looks like this:
> ...
> After a while we get this crash apparently, followed by crashes of any
> backend that scans this table, not updating this specific table makes
> the crashes go away, the problem seems pretty isolated to this data.

Hmm.  It looks like one of the variable-length fields is getting
clobbered, but I have no idea why.  It will probably take some digging
in the corefile and clobbered table to learn much.  The good thing is
that you seem to have a fairly easily reproducible bug, so tracking it
down should be possible.

Is the data+program needed to trigger the bug self-contained enough that
you could wrap it up and send it to me?  The most convenient thing from
my end would be to reproduce and study it here, if possible.

> I know there's been a couple of updates to the source since this date
> (compiled on Aug 3), any idea if:
> a) an upgrade is a good idea
> b) an upgrade is a safe idea

I see no reason that you shouldn't update to the tip of the
REL7_0_PATCHES branch, but I don't have a lot of hope that it will cure
your problem.  I doubt this is related to anything we've fixed.
        regards, tom lane


Re: 7.0.2 crash, backtrace with debug available

От
Alfred Perlstein
Дата:
* Alfred Perlstein <bright@wintelcom.net> [000920 06:19] wrote:
> There's two tracebacks of crashed 7.0.2 backends at the end of this
> email.

Also, after this the database is extremely unstable, queries lock up
and the postgresql server (not OS) requires a complete restart and
vacuum analyze before it's stable.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."