Thread: vacuum analyze corrupts database


vacuum analyze corrupts database

From
Michael Brusser
Date:
We're running into a problem with vacuum analyze. We first noticed it
with Postgres v.7.2.1. Now with v.7.3.2 it appears again.
After nightly maintenance the application isn't functioning. It looks like
corruption occurs on (at least) one table.

I was able to narrow down the problem by dropping all but one
column from the table in question. What I have now is a table
with a single varchar column and just over three hundred records in it.
Running vacuum analyze on this table results in some sort of corruption.
I'm running Pg 7.3.2 on Solaris. There are no explicit settings for vacuum
in postgresql.conf.

Attached is the dump of this table. The problem can be easily reproduced.
Load the dump into your database; it'll create the table "nla".

syncdb=# select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;
 count
-------
   121
(1 row)

syncdb=# vacuum analyze nla;
VACUUM

-- --- now repeat the same query ---
syncdb=# select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: WARNING:  Message
from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
Failed.

Now, if I reconnect and delete the nla record from pg_statistic,
I can run the query again.
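
For reference, that workaround amounts to something like this (a sketch:
the rows in pg_statistic are keyed by the table's OID in starelid, and
deleting them needs superuser rights):

  -- remove the statistics row(s) that vacuum analyze created for nla
  DELETE FROM pg_statistic
   WHERE starelid = (SELECT oid FROM pg_class WHERE relname = 'nla');

After that the LIKE query works again, until the next vacuum analyze
puts the row back.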

The problem could be data-related: I tried poking around and updating
all records beginning with "sync:///FUNCTREE" to something different,
and then the problem seems to go away.

Could someone from the Postgres team comment on this, please.
Mike.


Attachments

Re: vacuum analyze corrupts database

From
Tom Lane
Date:
Michael Brusser <michael@synchronicity.com> writes:
> Attached is the dump of this table. The problem can be easily reproduced.

Not as easily as all that --- your example works fine for me, under
7.2.4, 7.3.3, or CVS tip.  I speculate that there may be some necessary
factor in build options or runtime environment.  How exactly did you
configure Postgres, what compiler did you use, are you using a non-C
locale or non-default multibyte encoding, etc?

It would help to know where the crash is occurring; could you get a
debugger backtrace please?
        regards, tom lane


Re: vacuum analyze corrupts database

From
Manfred Koizar
Date:
On Fri, 23 May 2003 10:33:17 -0400, Michael Brusser
<michael@synchronicity.com> wrote:
>syncdb=# vacuum analyze nla;
>VACUUM
>
>-- --- now repeat the same query ---
>syncdb=# select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;
>server closed the connection unexpectedly
>        This probably means the server terminated abnormally
>        before or while processing the request.

Works for me:

regression=# VACUUM VERBOSE ANALYSE nla;
INFO:  --Relation public.nla--
INFO:  Pages 3: Changed 0, Empty 0; Tup 323: Vac 0, Keep 0, UnUsed 0.
        Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  Analyzing public.nla
VACUUM
regression=# select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;
 count
-------
   121
(1 row)

regression=# SELECT version();
                           version
----------------------------------------------------------------
 PostgreSQL 7.3.2 on i686-pc-linux-gnu, compiled by GCC 2.7.2.1
(1 row)

I did the VACUUM - SELECT cycle almost 20 times on my development
system and then several times on a second machine, all without any
problem:

                          version
-------------------------------------------------------------
 PostgreSQL 7.3 on i586-pc-linux-gnu, compiled by GCC 2.95.2

Are you able to reproduce the problem on any other computer?

Servus
 Manfred


Re: vacuum analyze corrupts database

From
Michael Brusser
Date:
> Not as easily as all that --- your example works fine for me, under
> 7.2.4, 7.3.3, or CVS tip.  I speculate that there may be some necessary
> factor in build options or runtime environment.  How exactly did you
> configure Postgres, what compiler did you use, are you using a non-C
> locale or non-default multibyte encoding, etc?
>
> It would help to know where the crash is occurring; could you get a
> debugger backtrace please?
>
>             regards, tom lane

This is good news. What you said here made me do something
I should've done already - I loaded this table into another
instance of the database and I could not reproduce the problem!
Trying to analyze the situation, I think this is what it looks like.
The two databases were built a few days apart with no changes to the code,
makefiles, build options, etc.
vacuum analyze on the same table very consistently causes the problem
on one db, and works fine on the other.
I use the same instance of postmaster against each db, so I guess
whatever makes the difference must be somewhere under the $PGDATA directory.
The database that has the problem underwent some testing. Some data
was created and deleted while the QA guys tested our application.
The testing was manual, which makes it difficult to reproduce the case exactly.
I've attached the config options we use and the database log.
If you're willing to look further I could tar/zip the db as well,
but let me know if there's something else I could try first.

What I did this time was:
select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;
vacuum analyze nla;
\encoding
select count (*) from nla where note_url LIKE 'sync:///FUNCTREE%' ;

I also tried to drop and recreate the database - same result.

Thank you.
Mike.


------------------------------------------------------
tomkins%  pg_config  --configure
'--srcdir=/home/synch/build/relipg21/src/vendor/postgresql/732'
'--prefix=/home/synch/build/relipg21/obj/SunOS_58_CC_60/postgresql-732/install'
'--enable-locale'
'--enable-multibyte=UNICODE'
'--with-tcl'
'--without-tk'
'--with-tclconfig=/home/synch/build/relipg21/sbin/SunOS_58_CC_60'
'--with-tkconfig=/home/synch/build/relipg21/sbin/SunOS_58_CC_60'
'--with-includes=/home/synch/build/relipg21/obj/SunOS_58_CC_60/postgresql-732/../tcl/include /home/synch/build/relipg21/obj/SunOS_58_CC_60/postgresql-732/../tk/include'
'--with-libraries=/home/synch/build/relipg21/sbin/SunOS_58_CC_60'
'--without-readline'
'--without-zlib'
'--bindir=/home/synch/build/relipg21/sbin/SunOS_58_CC_60'
'--libdir=/home/synch/build/relipg21/sbin/SunOS_58_CC_60'
'CC=gcc -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
------------------------------------------------------
tomkins% uname -a
SunOS tomkins 5.8 Generic_108528-19 sun4u sparc SUNW,Ultra-80
tomkins% locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
------------------------------------------------------


Attachments

Re: vacuum analyze corrupts database

From
Tom Lane
Date:
Michael Brusser <michael@synchronicity.com> writes:
> I should've done already - I loaded this table into another
> instance of the database and I could not reproduce the problem!

Hm.  The log looks like the crash is occurring during planning (since
you get a "parse tree" message but no "ProcessQuery").  I'm guessing
that some selectivity-estimation operation that tries to use the
pg_statistic entries generated by VACUUM ANALYZE is failing.  Can't
narrow it down further than that though.  Again, how about a debugger
stack trace?  Also, the pg_stats rows for the table would be interesting.
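
Something along these lines would pull those rows (a sketch; pg_stats is
the human-readable view over pg_statistic):

  SELECT attname, null_frac, avg_width, n_distinct,
         most_common_vals, histogram_bounds
    FROM pg_stats
   WHERE tablename = 'nla';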

> If you're willing to look farther I could tar/zip the db as well.

Waste of time, likely, since I don't have a Solaris system to install it
on.
        regards, tom lane


Re: vacuum analyze corrupts database

From
Michael Brusser
Date:
>... Again, how about a debugger stack trace?
> Also, the pg_stats rows for the table would be interesting.
>

---------
Tom,
thanks for the feedback.

This time I started postgres directly and ran 'analyze nla;'
followed by the 'select ... where ... like ...' query, with a debugger
attached - I got this trace:

=================================================================
(dbx)
signal SEGV (no mapping at the fault address) in pfree at 0x2535e4
pfree+0x1c:     ld      [%o1 + 0x4], %o0
(dbx) where 30
=>[1] pfree(0x489420, 0xffbee890, 0x489420, 0xffbee880, 0x489628,
0xffbee888), at 0x2535e4
  [2] convert_to_scalar(0x489078, 0x19, 0xffbee890, 0x489008, 0x488fc0,
0x413), at 0x1fc6b4
  [3] scalarineqsel(0x484608, 0x42a, 0x0, 0x488a88, 0x489078, 0x19), at
0x1f94e4
  [4] scalarltsel(0xffbeea30, 0x1f9864, 0x413, 0x413, 0xb, 0xff0000), at
0x1f9a00
  [5] DirectFunctionCall4(0x1f9864, 0x484608, 0x42a, 0x4892d0, 0x0,
0x488a08), at 0x242de8
  [6] prefix_selectivity(0x484608, 0x488a88, 0x488e28, 0xffbeec08,
0xffbeec23, 0xffbeec90), at 0x1fe8fc
  [7] patternsel(0xffbeed38, 0x0, 0x42e9b0, 0x0, 0x0, 0xffbeed10), at
0x1f9e3c
  [8] likesel(0xffbeed38, 0xffbeed38, 0x1f9ff4, 0x20, 0x0, 0x28), at
0x1fa004
  [9] OidFunctionCall4(0x71b, 0x484608, 0x4bd, 0x488b70, 0x1, 0x492c18), at
0x2447c0
  [10] restriction_selectivity(0x484608, 0x4bd, 0x488b70, 0x1, 0x0,
0xffbeee80), at 0x169720
  [11] clauselist_selectivity(0x484608, 0x488d98, 0x1, 0x0, 0x0,
0xff141c4c), at 0x14d990
  [12] restrictlist_selectivity(0x484608, 0x488d50, 0x1, 0x20, 0x0,
0xff18e9d4), at 0x14d7f4
  [13] set_baserel_size_estimates(0x484608, 0x488be0, 0x1000000, 0x10, 0xc0,
0x488a08), at 0x1503d4
  [14] set_plain_rel_pathlist(0x484608, 0x488be0, 0x484a10, 0x0, 0x0, 0x0),
at 0x14cb94
  [15] set_base_rel_pathlists(0x484608, 0x0, 0x1, 0x0, 0x0, 0x488cb0), at
0x14cb58
  [16] make_one_rel(0x484608, 0x484608, 0xffbef268, 0xffbef260, 0x4885d8,
0x28), at 0x14ca00
  [17] subplanner(0x484608, 0x0, 0x0, 0x0, 0x29, 0x492c18), at 0x15c600
  [18] query_planner(0x484608, 0x0, 0x0, 0x0, 0x4, 0x0), at 0x15c4f0
  [19] grouping_planner(0x484608, 0xbff00000, 0x0, 0x5, 0x0, 0xff141c4c), at
0x15e4c8
  [20] subquery_planner(0x484608, 0xbff00000, 0x0, 0x2b14c0, 0x225e0,
0xff18e9d4), at 0x15cc28
  [21] planner(0x484608, 0x0, 0x0, 0xa, 0x4354, 0x0), at 0x15c8d0
  [22] pg_plan_query(0x484608, 0x1, 0x0, 0xffbef590, 0x0, 0x0), at 0x1a5afc
  [23] pg_exec_query_string(0x4844e8, 0x1, 0x42e8a0, 0x2b8f6c, 0x73,
0x484500), at 0x1a618c
  [24] PostgresMain(0x2, 0x323428, 0x323440, 0x400, 0x3251c4, 0x175), at
0x1a8038
  [25] main(0x2, 0xffbef7b4, 0xffbef7c0, 0x2ea310, 0x0, 0x0), at 0x1282e0

===================================================================

At some point I also ran this case under truss; here's the tail of the truss
log.
... ... ...
read(0, "   s e l e c t   c o u n".., 1024)     = 68
write(2, " D E B U G :     S t a r".., 32)      = 32
write(2, " L O G :     q u e r y :".., 82)      = 82
write(2, " L O G :     p a r s e  ".., 1246)    = 1246
write(2, " L O G :     r e w r i t".., 1258)    = 1258
llseek(24, 81920, SEEK_SET)                     = 81920
read(24, "\0\0\0\0\0\0\010\0\0\001".., 8192)    = 8192
llseek(25, 49152, SEEK_SET)                     = 49152
read(25, "\0\0\0\0\0\0\010\0\0\001".., 8192)    = 8192
brk(0x00499AE8)                                 = 0
brk(0x004D9AE8)                                 = 0
open64("/home/tmichael/sean/postgresql/base/100423/16645", O_RDWR) = 32
read(32, "\0\0\0\0\0 C 6E0\0\0\0\t".., 8192)    = 8192
read(32, "\0\0\0\001 Y 7C8\0\0\0 '".., 8192)    = 8192
open64("/home/tmichael/sean/postgresql/base/100423/16408", O_RDWR) = 33
llseek(33, 24576, SEEK_SET)                     = 24576
read(33, "\0\0\0\001 Y17 h\0\0\0 '".., 8192)    = 8192
llseek(24, 57344, SEEK_SET)                     = 57344
read(24, "\0\0\0\0\0\0\010\0\0\001".., 8192)    = 8192
    Incurred fault #6, FLTBOUNDS  %pc = 0x002541D4
      siginfo: SIGSEGV SEGV_MAPERR addr=0x0161F3EC
    Received signal #11, SIGSEGV [default]
      siginfo: SIGSEGV SEGV_MAPERR addr=0x0161F3EC
        *** process killed ***

===========================================

I am attaching the dump of the pg_statistic table;
I believe the record for the nla table is the last one in the file.

Thank you,
please let me know what else I can do.
Mike.

Attachments

Re: vacuum analyze corrupts database

From
Tom Lane
Date:
Michael Brusser <michael@synchronicity.com> writes:
> (dbx) where 30
> =>[1] pfree(0x489420, 0xffbee890, 0x489420, 0xffbee880, 0x489628,
> 0xffbee888), at 0x2535e4
>   [2] convert_to_scalar(0x489078, 0x19, 0xffbee890, 0x489008, 0x488fc0,
> 0x413), at 0x1fc6b4
>   [3] scalarineqsel(0x484608, 0x42a, 0x0, 0x488a88, 0x489078, 0x19), at
> 0x1f94e4

Hmm.  Are you running the database in a non-C locale?  (pg_controldata
would tell you.)  If so, this pfree is trying to pfree one of three
strings that were filled with strxfrm().

I am wondering if strxfrm() on your platform sometimes writes more bytes
than it is supposed to.  I have seen vsnprintf() overrun its output
buffer on some flavors of Solaris (according to FAQ_Solaris, the 64-bit
libc in Solaris 7 had such a problem).  Could there be a similar bug in
their strxfrm?  An overrun there would scribble just past the palloc'd
buffer and corrupt the adjacent memory chunk's header, which is exactly
the sort of damage that later surfaces as a SEGV inside pfree.
        regards, tom lane


Re: vacuum analyze corrupts database

From
Michael Brusser
Date:
> Hmm.  Are you running the database in a non-C locale?
> (pg_controldata would tell you.)
-----------

Here's the output of pg_controldata:

pg_control version number:            72
Catalog version number:               200211021
Database cluster state:               in production
pg_control last modified:             Sun May 25 18:38:06 2003
Current log file ID:                  0
Next log file segment:                2
Latest checkpoint location:           0/15EF7A8
Prior checkpoint location:            0/15ED2D8
Latest checkpoint's REDO location:    0/15EF7A8
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        47
Latest checkpoint's NextXID:          3563
Latest checkpoint's NextOID:          118086
Time of latest checkpoint:            Sun May 25 18:38:02 2003
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               Floating point
Maximum length of locale name:        128
LC_COLLATE:                           en_US.ISO8859-1
LC_CTYPE:                             en_US.ISO8859-1
-----------------------------
locale settings on the host:
tomkins% locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
-----------------------------
> I am wondering if strxfrm() on your platform sometimes writes more bytes
> than it is supposed to.  I have seen vsnprintf() overrun its output
> buffer on some flavors of Solaris (according to FAQ_Solaris, the 64-bit
> libc in Solaris 7 had such a problem).  Could there be a similar bug in
> their strxfrm?

We're on Solaris 8. I'll try to find information on strxfrm bugs,
but do you rule out any problems in the Postgres code?
Is there a good explanation for why the same table, loaded into another
Postgres installation on the same host, can be analyzed without any problems?
Also, in my database I can drop/create the database, load the table and
reproduce the error.
Not sure what to make of this.

Mike.


Re: vacuum analyze corrupts database

From
Tom Lane
Date:
Michael Brusser <michael@synchronicity.com> writes:
> but do you rule out any problems in Postgres code?

Never ... but when you're the only one reporting a problem, a local
issue is something to consider.

> Is there a good explanation for why the same table, loaded into another
> Postgres installation on the same host, can be analyzed without any problems?

Well, first thing I'd ask is whether the other installation is using
the same locale settings.
        regards, tom lane


Re: vacuum analyze corrupts database

From
Michael Brusser
Date:
> Well, first thing I'd ask is whether the other installation is using
> the same locale settings.
Oh, yes, I am very sure of that.
Mike




Re: vacuum analyze corrupts database

From
Alvaro Herrera Munoz
Date:
On Wed, May 28, 2003 at 09:04:18AM -0400, Michael Brusser wrote:
> > Well, first thing I'd ask is whether the other installation is using
> > the same locale settings.
> Oh, yes, I am very sure of that.

Note that while the machine reports the C locale, pg_controldata
actually said the database was en_US.iso8859-1 or something like that.
Make sure pg_controldata reports the same on the other machine...

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"The eagle never lost so much time as
when he submitted to learn from the crow." (William Blake, quoted by Nobody)