Re: [HACKERS] Vacuum analyze bug CAUGHT

From: Tom Lane
Subject: Re: [HACKERS] Vacuum analyze bug CAUGHT
Msg-id: 7113.936888336@sss.pgh.pa.us
In response to: Vacuum analyze bug CAUGHT  (Michael Simms <grim@argh.demon.co.uk>)
Responses: Re: [HACKERS] Vacuum analyze bug CAUGHT
           Re: [HACKERS] Vacuum analyze bug CAUGHT
List: pgsql-hackers
Michael Simms <grim@argh.demon.co.uk> writes:
> Okee, I have caught the vacuum analyse crash that was giving me a load
> of grief

I spent a good while last night trying to duplicate your report, but
couldn't on either current sources or latest REL6_5 branch.  Maybe that
means we fixed the problem already --- but I think it's more likely that
there's a platform dependency involved, or some additional enabling
condition that you forgot to mention.  So you'll have to keep poking
at it.  However, I do have some comments and suggestions.

Starting at the end, the final crash:

> Program received signal SIGSEGV, Segmentation fault.
> AllocSetReset (set=0x0) at aset.c:159
> aset.c:159: No such file or directory.
> (gdb) where
> #0  AllocSetReset (set=0x0) at aset.c:159
> #1  0x810ac12 in EndPortalAllocMode () at portalmem.c:938
> #2  0x8078833 in AtAbort_Memory () at xact.c:800
> #3  0x80789ff in AbortTransaction () at xact.c:1026
> #4  0x8078aef in AbortCurrentTransaction () at xact.c:1243
> #5  0x80deed6 in PostgresMain (argc=-1073742288, argv=0xbffff7a0, real_argc=7, 
>     real_argv=0xbffffd04) at postgres.c:1550

appears to be the same bug I alluded to before: memory allocation isn't
cleaned up properly if elog(ERROR) is executed outside a transaction.
I know how to fix this, and will do so before 6.5.2, but fixing it will
just prevent a coredump after an error has already occurred.  What we
need to be looking for in your setup is the reason why an error is being
reported at all.
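
For what it's worth, the crash itself is nothing more exotic than
resetting an allocation set that was never set up.  A toy sketch of that
failure mode (invented names, not the actual aset.c code, and not
necessarily what the real fix will look like):

    /* Toy sketch only.  The backtrace shows AtAbort_Memory() ->
     * EndPortalAllocMode() -> AllocSetReset(set) being reached with
     * set == NULL, because the elog(ERROR) happened outside any
     * transaction and the portal allocation state was never set up.
     * A defensive reset simply refuses to touch an uninitialized set: */
    #include <stddef.h>

    typedef struct ToyAllocSet          /* stand-in for the real AllocSetData */
    {
        void  *blocks;                  /* chain of allocated blocks */
    } ToyAllocSet;

    static void
    toy_alloc_set_reset(ToyAllocSet *set)
    {
        if (set == NULL)                /* nothing was ever allocated */
            return;                     /* ... so there is nothing to release */
        /* otherwise walk the block chain and free it, as aset.c normally does */
        set->blocks = NULL;
    }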

The first thing you can do is look to see what the error message is ---
it should be in the postmaster's stderr logfile, even though 6.5.* libpq
omits to display it because of the crash.  Another useful thing would be
to set a breakpoint at elog() so that you can examine the context of the
error report.  (Actually, since vacuum generates lots of elog(NOTICE)
and elog(DEBUG), the breakpoint had better be further down in elog,
perhaps where it's about to longjmp back to PostgresMain.)
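
If it helps to picture what "about to longjmp back to PostgresMain"
means, here is a toy model of the error-recovery pattern (all names
invented; in the real sources the jump is made somewhere near the end of
elog() itself, so check your copy of src/backend/utils/error/elog.c for
the line worth breaking on):

    /* Toy model of the backend's error recovery: the main loop does a
     * sigsetjmp(), and elog(ERROR) eventually siglongjmp()s back to it.
     * The analogue of the siglongjmp() line below is the interesting
     * breakpoint target, since it fires only for genuine errors and not
     * for every NOTICE/DEBUG. */
    #include <setjmp.h>
    #include <stdio.h>

    static sigjmp_buf toy_restart;      /* stands in for the backend's jump buffer */

    static void
    toy_elog_error(const char *msg)
    {
        fprintf(stderr, "ERROR:  %s\n", msg);
        siglongjmp(toy_restart, 1);     /* <- break on the real analogue of this */
    }

    int
    main(void)
    {
        if (sigsetjmp(toy_restart, 1) != 0)
            printf("back in the main loop, aborting current transaction\n");
        else
            toy_elog_error("something went wrong outside a transaction");
        return 0;
    }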

BTW, VACUUM runs a separate transaction for each table it works on,
so although most of its work is done inside a transaction, there are
short intervals between tables where it's not in one.  The error must
be getting reported during one of these intervals.  That narrows things
down a lot.
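
Schematically, the loop looks about like this (placeholder names, not
the real vacuum.c entry points):

    /* Schematic only: each table gets its own transaction, so there is a
     * short window after one table's commit and before the next table's
     * start during which the backend is NOT inside a transaction.  An
     * elog(ERROR) raised in that window hits the broken cleanup path
     * described above. */
    #include <stdio.h>

    static void toy_start_transaction(void)  { puts("BEGIN"); }   /* StartTransactionCommand() */
    static void toy_commit_transaction(void) { puts("COMMIT"); }  /* CommitTransactionCommand() */
    static void toy_vacuum_one_rel(const char *rel) { printf("vacuuming %s\n", rel); }

    static void
    toy_vacuum_all(const char **tables, int ntables)
    {
        int i;

        for (i = 0; i < ntables; i++)
        {
            toy_start_transaction();
            toy_vacuum_one_rel(tables[i]);
            toy_commit_transaction();
            /* between here and the next iteration we are not in a
             * transaction -- the short interval in question */
        }
    }

    int
    main(void)
    {
        const char *tables[] = {"t1", "t2", "t3"};
        toy_vacuum_all(tables, 3);
        return 0;
    }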

Now, about the SIGUSR2.  That's mighty suggestive, but it's not
necessarily an indication of a bug.  There are two reasons why a
SIGUSR2 would be delivered to a backend.  One is LISTEN/NOTIFY, which
I assume we can discount, since you'd have to have explicitly done it.
The other is that the SI message buffer manager sends a broadcast
SIGUSR2 to all the backends if it thinks its message buffer is getting
dangerously full --- presumably that's what you saw happening.  So the
questions raised are (a) why is the buffer getting full, and (b) could
it have actually overflowed later, and if so did the overflow recovery
code work?

Background: the SI message buffer is a scheme for notifying backends
to discard cached copies of tuples from the system tables.  Whenever
a backend modifies a tuple from the system tables, it has to send out
an SI ("shared cache invalidation") message telling the other backends
it has changed tuple X in table Y, so that they discard their
no-longer-accurate cached copies, if any.  Messages in the buffer can be
discarded only after all active backends have read them.  Stuff like
VACUUM tends to produce a lot of SI messages, but table
creation/deletion causes some too.
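
In toy form, the shape of the mechanism is roughly this (these are not
the real sinval structures; names and sizes are invented):

    /* Toy illustration of the SI ("shared cache invalidation") idea. */
    typedef struct ToySIMessage
    {
        unsigned int cache_id;      /* which system table / syscache ("table Y") */
        unsigned int tuple_id;      /* which tuple was changed ("tuple X") */
    } ToySIMessage;

    typedef struct ToySIBuffer
    {
        ToySIMessage msgs[64];      /* finite ring buffer in shared memory */
        int          head;          /* next slot a writer will use */
        int          tails[32];     /* one read position per backend; a slot can
                                     * be reused only after every backend's tail
                                     * has passed it */
    } ToySIBuffer;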

The SI buffer is global across an installation (one postmaster and its
children), *not* per-database, so even if you only had these two
backends connected to your crashtest database, they could have been
affected by anything that was going on in other databases belonging
to the same postmaster.  Were there other backends alive at the time,
and if so what were they doing, or not doing?

If all the backends are busy, the SI buffer really shouldn't get
anywhere near full, although I suppose it could happen under extreme
conditions.  The case where the buffer tends to fill is when one or
more backends are sitting idle, waiting for a client command.  They'll
only read SI messages during transaction start (or, in 6.6, after
acquiring a lock), so an idle backend blocks the SI buffer from
emptying.  The point of the SIGUSR2 broadcast is to kick-start idle
backends so that they will execute a transaction and receive their
SI messages.
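
A toy calculation shows why one idle backend is enough to matter (again,
invented names and numbers, not backend code):

    /* Toy model: free space in the SI buffer is limited by the *slowest*
     * reader, and a backend sitting at "waiting for client command" does
     * not advance its read position until its next transaction start --
     * which is exactly what the SIGUSR2 broadcast asks it to do. */
    #include <stdio.h>

    #define SI_BUFFER_SIZE 64
    #define N_BACKENDS     3

    static int si_written = 100;                     /* messages written so far */
    static int si_read[N_BACKENDS] = {100, 99, 40};  /* backend 2 has been idle */

    static int
    si_free_slots(void)
    {
        int slowest = si_read[0];
        int i;

        for (i = 1; i < N_BACKENDS; i++)
            if (si_read[i] < slowest)
                slowest = si_read[i];
        return SI_BUFFER_SIZE - (si_written - slowest);
    }

    int
    main(void)
    {
        /* the idle backend is 60 messages behind, leaving only 4 free
         * slots -- few enough that the "getting dangerously full"
         * broadcast would plausibly have gone out already */
        printf("free SI slots: %d\n", si_free_slots());
        return 0;
    }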

I am guessing that you had an idle backend connected to some other
database of the same postmaster, so that the SI buffer was getting
full of the messages being generated by the VACUUM and table create/
delete processes.  If you had no idle backends then we ought to look
for the reason for the SI buffer filling far enough to cause SIGUSR2;
but if you did then it's a normal condition.

BTW, if you have a backend stopped due to a gdb breakpoint or trap,
it's certainly not consuming SI messages.  So when you left it sit after
observing the SIGUSR2, you altered the behavior.  The VACUUM process was
still generating SI messages, and even though the hypothetical idle
backend would have eaten its messages thanks to SIGUSR2, the one you had
blocked with gdb stopped doing so.  So eventually there would have been
an SI overflow condition.  I think that would take longer than "3-5
seconds", though, unless your machine is way faster than mine.

If the SI buffer does overflow because someone isn't eating his messages
promptly, it's not supposed to be fatal.  Rather, the SI manager
reports to all the backends "Sorry, I've lost track of what's going on,
so I suggest you discard *all* your cached tuples."  However, 6.5.* has
some bugs in the shared cache reset process.  (I think I have fixed these
for 6.6, but there are enough poorly-tested revisions in the affected
modules that we'd decided not to risk trying to back-patch 6.5.*.)
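
The contract, in toy form (invented names again), is simply this:

    /* Toy sketch of the overflow contract: a reader that has fallen more
     * than a buffer's length behind can no longer be given the individual
     * messages, so it is just told "reset" -- throw away every cached
     * tuple and rebuild from the system tables.  The 6.5 bugs are in what
     * happens after that "reset" answer, not in detecting the overflow. */
    enum toy_si_status { TOY_SI_OK, TOY_SI_EMPTY, TOY_SI_RESET };

    static enum toy_si_status
    toy_si_read(int written, int my_read_pos, int buffer_size)
    {
        if (written - my_read_pos > buffer_size)
            return TOY_SI_RESET;    /* too far behind: flush all cached tuples */
        if (written == my_read_pos)
            return TOY_SI_EMPTY;    /* nothing new to consume */
        return TOY_SI_OK;           /* consume the next message normally */
    }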

Anyway, back to the point: this is all very suggestive that maybe what
you are seeing is SI overflow and a consequent failure.  But I'm not
convinced of it, partly because of the timing issue and partly because
the vacuum process would have issued a "NOTICE: SIReadEntryData: cache
state reset" message before entering the buggy code.  I didn't see one
in your trace.  However, if you find instances of this message in
your postmaster logfile, then it's definitely a possible cause.
(I would still wonder why SI overflow is occurring in the normal case
where you're not using gdb to block a backend from responding to
SIGUSR2.)

I hope this gives you enough info to poke at the problem more
intelligently.

Lastly, did you build with --enable-cassert?  The assert checks slow things
down a little, but are often real helpful when looking for backend bugs.
        regards, tom lane

