Thread: BUG #6086: Segmentation fault


BUG #6086: Segmentation fault

From
"Dennis"
Date:
The following bug has been logged online:

Bug reference:      6086
Logged by:          Dennis
Email address:      dennis.noordsij@helsinki.fi
PostgreSQL version: 9.0.4
Operating system:   FreeBSD 8.2 (64 bit)
Description:        Segmentation fault
Details:

For some reason a 9.0.4 server which has always run completely stable, and
without any recent changes, crashed ("WARNING:  terminating connection
because of crash of another server process" ..)

Since it is not a debug build the backtrace is not that useful, but
hopefully may give some idea of where to look, or ring a bell with someone.

#0  0x0000000806a0d5dc in pthread_mutex_lock () from /lib/libthr.so.3
#1  0x0000000800c1d965 in xmlRMutexLock () from /usr/local/lib/libxml2.so.5
#2  0x0000000800c6c67d in xmlDictFree () from /usr/local/lib/libxml2.so.5
#3  0x0000000800bb1ac5 in xmlFreeParserCtxt () from
/usr/local/lib/libxml2.so.5
#4  0x0000000800bc3f19 in xmlParseBalancedChunkMemoryRecover () from
/usr/local/lib/libxml2.so.5
#5  0x0000000800bc4090 in xmlParseBalancedChunkMemory () from
/usr/local/lib/libxml2.so.5
#6  0x000000000068c6a0 in cursor_to_xml ()
#7  0x000000000068c7f7 in xml_in ()
#8  0x00000000006a3d0a in InputFunctionCall ()
#9  0x00000000006a494a in OidInputFunctionCall ()
#10 0x00000000004db6f7 in coerce_type ()
#11 0x00000000004dbcd7 in coerce_to_target_type ()
#12 0x00000000004dbd8a in coerce_to_specific_type ()
#13 0x00000000004de092 in transformWithClause ()
#14 0x00000000004de84b in transformExpr ()
#15 0x00000000004ddfad in transformWithClause ()
#16 0x00000000004de84b in transformExpr ()
#17 0x00000000004ddfad in transformWithClause ()
#18 0x00000000004de84b in transformExpr ()
#19 0x00000000004e9cb7 in transformTargetEntry ()
#20 0x00000000004ea459 in transformTargetList ()
#21 0x00000000004bff84 in transformStmt ()
#22 0x00000000005f0624 in pg_analyze_and_rewrite_params ()
#23 0x000000000055d63f in SPI_cursor_open_with_paramlist ()
#24 0x000000000055e2da in SPI_prepare_params ()
#25 0x0000000805f0d31c in exec_get_datum_type () from
/usr/local/lib/postgresql/plpgsql.so
#26 0x0000000805f0df0f in exec_get_datum_type () from
/usr/local/lib/postgresql/plpgsql.so
#27 0x0000000805f0f8b3 in exec_get_datum_type () from
/usr/local/lib/postgresql/plpgsql.so
#28 0x0000000805f117ed in exec_get_datum_type () from
/usr/local/lib/postgresql/plpgsql.so
#29 0x0000000805f1109b in exec_get_datum_type () from
/usr/local/lib/postgresql/plpgsql.so
#30 0x0000000805f136f2 in plpgsql_exec_function () from
/usr/local/lib/postgresql/plpgsql.so
#31 0x0000000805f093a3 in plpgsql_call_handler () from
/usr/local/lib/postgresql/plpgsql.so
#32 0x0000000000544f00 in GetAttributeByNum ()
#33 0x0000000000541dfa in ExecProject ()
#34 0x0000000000556293 in ExecResult ()
#35 0x000000000054196d in ExecProcNode ()
#36 0x00000000005406bf in standard_ExecutorRun ()
#37 0x00000000005f2d98 in PostgresMain ()
#38 0x00000000005f4281 in PortalRun ()
#39 0x00000000005f0ebb in pg_parse_query ()
#40 0x00000000005f18c7 in PostgresMain ()
#41 0x00000000005c1a9d in ClosePostmasterPorts ()
#42 0x00000000005c2767 in PostmasterMain ()
#43 0x000000000056cc7e in main ()

Most of what it usually does is, as can be seen in this backtrace, running
plpgsql procedures which return XML data (so I'm hesitant to blame bad
hardware at this point). It happened during a normal query that is run very
often, everything via psycopg2. The server uses WAL log shipping (the crash
did not coincide with a running archive command, and no base backup was in
progress).

I will switch to a debug build, but I haven't seen this crash before so it
might never produce a useful backtrace.

Thanks for any suggestions

Re: BUG #6086: Segmentation fault

From
Robert Haas
Date:
On Thu, Jun 30, 2011 at 3:32 AM, Dennis <dennis.noordsij@helsinki.fi> wrote:
> For some reason a 9.0.4 server which has always run completely stable, and
> without any recent changes, crashed ("WARNING:  terminating connection
> because of crash of another server process" ..)
>
> Since it is not a debug build the backtrace is not that useful, but
> hopefully may give some idea of where to look, or ring a bell with someone.
>
> #0  0x0000000806a0d5dc in pthread_mutex_lock () from /lib/libthr.so.3
> #1  0x0000000800c1d965 in xmlRMutexLock () from /usr/local/lib/libxml2.so.5
> #2  0x0000000800c6c67d in xmlDictFree () from /usr/local/lib/libxml2.so.5
> #3  0x0000000800bb1ac5 in xmlFreeParserCtxt () from
> /usr/local/lib/libxml2.so.5
> #4  0x0000000800bc3f19 in xmlParseBalancedChunkMemoryRecover () from
> /usr/local/lib/libxml2.so.5
> #5  0x0000000800bc4090 in xmlParseBalancedChunkMemory () from
> /usr/local/lib/libxml2.so.5

Any idea what query triggered this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6086: Segmentation fault

From
noordsij
Date:
> Any idea what query triggered this?

Only up to which stored procedure (which itself contains multiple
statements).


However, in a fresh FreeBSD environment (under VirtualBox) I can reliably
trigger a similar segmentation fault (my guess is it is the same issue).

Application is a Python web app, using psycopg2. It creates N worker
threads, where each worker thread creates its own connection to postgresql.
The threads then block on a global queue from which they read and process
requests. That is, connections are not shared between threads, and all
queries that are part of servicing a request are inside a single
transaction bound to a single connection. Connections are re-used.  (please
assume this is implemented correctly on the client; a faulty client should
not crash a server)

If N=1, everything is OK. If N=3 (confirmed, but I suppose anything > 1)
then the following happens for one particular testcase every single time:

gdb `which postgres` /usr/local/pgsql/data/postgres.core
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `postgres'.
Program terminated with signal 10, Bus error.
Reading symbols from /usr/local/lib/libintl.so.9...done.
Loaded symbols for /usr/local/lib/libintl.so.9
Reading symbols from /usr/local/lib/libxml2.so.5...done.
Loaded symbols for /usr/local/lib/libxml2.so.5
Reading symbols from /usr/lib/libssl.so.6...done.
Loaded symbols for /usr/lib/libssl.so.6
Reading symbols from /lib/libcrypto.so.6...done.
Loaded symbols for /lib/libcrypto.so.6
Reading symbols from /lib/libm.so.5...done.
Loaded symbols for /lib/libm.so.5
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /usr/local/lib/libiconv.so.3...done.
Loaded symbols for /usr/local/lib/libiconv.so.3
Reading symbols from /lib/libz.so.5...done.
Loaded symbols for /lib/libz.so.5
Reading symbols from /usr/local/lib/postgresql/plpgsql.so...done.
Loaded symbols for /usr/local/lib/postgresql/plpgsql.so
Reading symbols from /usr/local/lib/postgresql/citext.so...done.
Loaded symbols for /usr/local/lib/postgresql/citext.so
Reading symbols from /lib/libthr.so.3...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /usr/local/lib/postgresql/pg_trgm.so...done.
Loaded symbols for /usr/local/lib/postgresql/pg_trgm.so
Reading symbols from /usr/local/lib/postgresql/dict_snowball.so...done.
Loaded symbols for /usr/local/lib/postgresql/dict_snowball.so
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x0000000802a0d5dc in pthread_mutex_lock () from /lib/libthr.so.3
[New Thread 801a42540 (LWP 100325)]
(gdb) bt
#0  0x0000000802a0d5dc in pthread_mutex_lock () from /lib/libthr.so.3
#1  0x0000000800d619ae in xmlRMutexLock (tok=0x801a12360) at threads.c:369
#2  0x0000000800dd6864 in xmlDictReference (dict=0x801a05760) at dict.c:510
#3  0x0000000800dd98e2 in xmlSAX2StartDocument (ctx=0x801a6c600) at
SAX2.c:999
#4  0x0000000800cdf992 in xmlParseDocument (ctxt=0x801a6c600) at
parser.c:10285
#5  0x0000000800ce8bc7 in xmlDoRead (ctxt=0x801a6c600, URL=0x0,
encoding=0x0, options=0, reuse=1) at parser.c:14612
#6  0x0000000800ce915d in xmlCtxtReadMemory (ctxt=0x801a6c600,
    buffer=0x801b55440 "<anonymized xml"...,
    size=272, URL=0x0, encoding=0x0, options=0) at parser.c:14890
#7  0x000000000076fcd2 in xpath (fcinfo=0x7fffffffd310) at xml.c:3400
#8  0x000000000059e7bb in ExecMakeFunctionResult (fcache=0x80293fee0,
econtext=0x802833140, isNull=0x7fffffffd84b "",
    isDone=0x0) at execQual.c:1827
#9  0x000000000059f183 in ExecEvalFunc (fcache=0x80293fee0,
econtext=0x802833140, isNull=0x7fffffffd84b "", isDone=0x0)
    at execQual.c:2263
#10 0x00000008025c7514 in exec_eval_simple_expr (estate=0x7fffffffdb20,
expr=0x8028df8e0, result=0x7fffffffd7e8,
    isNull=0x7fffffffd84b "", rettype=0x7fffffffd84c) at pl_exec.c:4597
#11 0x00000008025c6c8c in exec_eval_expr (estate=0x7fffffffdb20,
expr=0x8028df8e0, isNull=0x7fffffffd84b "",
    rettype=0x7fffffffd84c) at pl_exec.c:4188
#12 0x00000008025c3755 in exec_stmt_raise (estate=0x7fffffffdb20,
stmt=0x8028df800) at pl_exec.c:2485
#13 0x00000008025c18cb in exec_stmt (estate=0x7fffffffdb20,
stmt=0x8028df800) at pl_exec.c:1326
#14 0x00000008025c168a in exec_stmts (estate=0x7fffffffdb20,
stmts=0x801b5b0e8) at pl_exec.c:1233
#15 0x00000008025c14d5 in exec_stmt_block (estate=0x7fffffffdb20,
block=0x8028e99a0) at pl_exec.c:1170
#16 0x00000008025bfa85 in plpgsql_exec_function (func=0x801bd8440,
fcinfo=0x7fffffffdda0) at pl_exec.c:316
#17 0x00000008025ba99e in plpgsql_call_handler (fcinfo=0x7fffffffdda0) at
pl_handler.c:122
#18 0x000000000059e7bb in ExecMakeFunctionResult (fcache=0x801bec440,
econtext=0x801bec250, isNull=0x801bece98 "",
    isDone=0x801becfb0) at execQual.c:1827
#19 0x000000000059f183 in ExecEvalFunc (fcache=0x801bec440,
econtext=0x801bec250, isNull=0x801bece98 "",
    isDone=0x801becfb0) at execQual.c:2263
#20 0x00000000005a5347 in ExecTargetList (targetlist=0x801becf80,
econtext=0x801bec250, values=0x801bece80,
    isnull=0x801bece98 "", itemIsDone=0x801becfb0, isDone=0x7fffffffe344)
at execQual.c:5089
#21 0x00000000005a5913 in ExecProject (projInfo=0x801beceb0,
isDone=0x7fffffffe344) at execQual.c:5304
#22 0x00000000005b9da3 in ExecResult (node=0x801bec140) at nodeResult.c:155
#23 0x000000000059b3bf in ExecProcNode (node=0x801bec140) at
execProcnode.c:355
#24 0x00000000005993dd in ExecutePlan (estate=0x801bec030,
planstate=0x801bec140, operation=CMD_SELECT,
    sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection,
dest=0x801b3fb30) at execMain.c:1188
#25 0x0000000000598063 in standard_ExecutorRun (queryDesc=0x801bbc830,
direction=ForwardScanDirection, count=0)
    at execMain.c:280
#26 0x0000000000597f55 in ExecutorRun (queryDesc=0x801bbc830,
direction=ForwardScanDirection, count=0) at execMain.c:229
#27 0x0000000000697ab8 in PortalRunSelect (portal=0x801a68030, forward=1
'\001', count=0, dest=0x801b3fb30) at pquery.c:952
#28 0x00000000006977ae in PortalRun (portal=0x801a68030,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x801b3fb30,
    altdest=0x801b3fb30, completionTag=0x7fffffffe640 "") at pquery.c:796
#29 0x0000000000691cb1 in exec_simple_query (
    query_string=0x801a2a030 "select some_proc(2, E'<anonymized xml
..."...)
    at postgres.c:1058
#30 0x0000000000695c63 in PostgresMain (argc=2, argv=0x801a1a6f8,
username=0x801a1a6c0 "pgsql") at postgres.c:3936
#31 0x0000000000657a9d in BackendRun (port=0x801a6c300) at
postmaster.c:3555
#32 0x00000000006571da in BackendStartup (port=0x801a6c300) at
postmaster.c:3242
#33 0x000000000065463c in ServerLoop () at postmaster.c:1431
#34 0x0000000000653e10 in PostmasterMain (argc=3, argv=0x7fffffffebb8) at
postmaster.c:1092
#35 0x00000000005d7bca in main (argc=3, argv=0x7fffffffebb8) at main.c:188


Is some XML (parser) context somehow being shared / carried over between
postgresql processes?  postmaster does not seem to want to run under
valgrind at all (perhaps low memory in the VM? should it work easily?) so I
don't have more details about what it thinks goes wrong. I'll keep looking,
but any suggestions are very welcome!

Best,
Dennis

Re: BUG #6086: Segmentation fault

From
Robert Haas
Date:
On Sun, Jul 24, 2011 at 5:06 PM, noordsij <noordsij@cs.helsinki.fi> wrote:
>> Any idea what query triggered this?
>
> Only up to which stored procedure (which itself contains multiple
> statements).

If you provide a reproducible test case, we can probably get this fixed.

Otherwise, we probably can't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #6086: Segmentation fault

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of Mon Jul 25 11:20:55 -0400 2011:
> On Sun, Jul 24, 2011 at 5:06 PM, noordsij <noordsij@cs.helsinki.fi> wrote:
> >> Any idea what query triggered this?
> >
> > Only up to which stored procedure (which itself contains multiple
> > statements).
>
> If you provide a reproducible test case, we can probably get this fixed.
>
> Otherwise, we probably can't.

But why is libxml calling pthread_mutex_lock and such?  Unless libxml is
making itself multithread inside a backend, that shouldn't matter --
much less cause a crash.

Note (to the OP) that it's totally unexpected to have a Postgres backend
become multithreaded, and has been shown to be dangerous when a library
made it so.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: BUG #6086: Segmentation fault

From
Stefan Kaltenbrunner
Date:
On 07/25/2011 05:57 PM, Alvaro Herrera wrote:
> Excerpts from Robert Haas's message of Mon Jul 25 11:20:55 -0400 2011:
>> On Sun, Jul 24, 2011 at 5:06 PM, noordsij <noordsij@cs.helsinki.fi> wrote:
>>>> Any idea what query triggered this?
>>>
>>> Only up to which stored procedure (which itself contains multiple
>>> statements).
>>
>> If you provide a reproducible test case, we can probably get this fixed.
>>
>> Otherwise, we probably can't.
>
> But why is libxml calling pthread_mutex_lock and such?  Unless libxml is
> making itself multithread inside a backend, that shouldn't matter --
> much less cause a crash.
>
> Note (to the OP) that it's totally unexpected to have a Postgres backend
> become multithreaded, and has been shown to be dangerous when a library
> made it so.

Well, we have seen issues on FreeBSD in the past with libraries there being
compiled with threading support, which caused them to get ridiculously low
stack limits.
Maybe it would be worth trying to recompile the libxml2 port without
threading support (if that is an option).


Stefan

Re: BUG #6086: Segmentation fault

From
noordsij
Date:
After a few hours of watching strange things happening I finally stumbled
on the cause.

Very short summary: the postgres binary needs to be linked to libpthread,
as this ensures the special fork() inside libthr (the FreeBSD libpthread
implementation/wrapper) is used, which correctly deals with any pthread
usage (I am guessing especially statically initialized pthread_* variables
in, for example, libxml2).

This can be accomplished by adding "-lpthread" to the LDFLAGS when building
either libxml2 or postgresql itself, or by preloading libpthread.so when
starting postgresql.
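For illustration, the two workarounds might look roughly like this on
FreeBSD (the paths and the exact make invocation are examples taken from
this report's environment, not verified recipes):

```shell
# Option 1: rebuild either libxml2 or PostgreSQL itself with libpthread
# linked in; a hypothetical invocation:
make LDFLAGS=-lpthread

# Option 2: preload the FreeBSD pthread library when starting the server,
# so the real implementation resolves the pthread_* symbols from the start
# (/lib/libthr.so.3 is the path that appears in the backtraces above):
LD_PRELOAD=/lib/libthr.so.3 pg_ctl -D /usr/local/pgsql/data start
```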

Before advocating any particular solution, I'd like to throw this out first
and see how strongly people feel about this.


Cheers, and thanks to everyone who replied so far
Dennis

Re: BUG #6086: Segmentation fault

From
Tom Lane
Date:
noordsij <noordsij@cs.helsinki.fi> writes:
> After a few hours of watching strange things happening I finally stumbled
> on the cause.

> Very short summary: the postgres binary needs to be linked to libpthread,
> as this will ensure a special fork() inside libthr (the FreeBSD libpthread
> implementation/wrapper) is used which correctly deals with any pthread
> usage (I am guessing especially statically initialized pthread_X variables
> in for example libxml2).

> This can be accomplished by adding "-lpthread" to the LDFLAGS when building
> either libxml2 or postgresql itself, or by preloading libpthread.so when
> starting postgresql.

Linking pthreads into the backend is likely to cause more problems than
it solves, especially if you're proposing that we do that everywhere.

            regards, tom lane

Re: BUG #6086: [SOLVED] Segmentation fault

From
noordsij
Date:
> Linking pthreads into the backend is likely to cause more problems than
> it solves, especially if you're proposing that we do that everywhere.
>
>             regards, tom lane


No, I am not proposing that. Not linking to pthread when it is not needed
should actually be an optimization, since the libc no-op stubs get used
instead.

I've found the issue now (yes, again :)

Short version:
The FreeBSD ports compile the main server without pthreads and the contrib
modules with pthreads. No bug in PostgreSQL. If you're in a hurry, no need
to read any further.


Longer version:
With FreeBSD ports, the contrib part is compiled separately. For some
reason, the contrib units are being made to link to pthread, but the server
itself isn't.

So: the backend forks to handle a request; some stored procedure touches
some XML; libxml2 thinks it is initializing some mutex by calling the
pthread_* stubs in libc; the stored procedure updates a column which has a
trigram index; PostgreSQL dynamically loads pg_trgm.so, which links to
libpthread; now the weak libc symbols are overridden, and a little bit
later we end up calling a real pthread_* function on an uninitialized
mutex and get a bus error.

Lots of fun to debug: watching pthread_self() suddenly change its return
value within a single thread/backend in the middle of processing a single
request...


With a custom (correct) compilation everything functions correctly with an
unmodified libxml2 and postgresql, and without any linking to or preloading
libpthread.

Question is, why are postgresql90-server and postgresql90-contrib compiled
differently?

It seems it may have been intended as a quick fix for pgbench: the
configure run for the -contrib port does not end up with "-pthread" in the
PTHREAD_CFLAGS in Makefile.global, which breaks compilation of pgbench (the
server's build does have this); I haven't found the reason yet. The side
effect is that all contrib modules link to pthread while the server
doesn't.


I will continue this discussion within FreeBSD, as this is not a PostgreSQL
bug (though if there is ever a loadable contrib module which does need
pthreads the above situation will occur again).

Thanks again to all who helped!

Cheers
Dennis