Discussion: Segmentation fault with core dump
Hi Group,
I'm using PG 9.1.9 with a client application using various versions of the pgsqlODBC driver on Windows. Cursors are used heavily, as well as some pretty heavy trigger queries on db writes which update several materialized views.
The server has 48GB RAM installed, PG is configured for 12GB shared buffers, 8MB max_stack_depth, 32MB temp_buffers, and 2MB work_mem. Most of the other settings are defaults.
The server seg faults anywhere from every few days to every two weeks. Each time one of the postgres backend processes seg faults, the whole server gets terminated by signal 11, spends up to 30 seconds in recovery, and then accepts connections as if nothing had happened. Unfortunately, all open cursors and connections are lost, so the client apps are left in a bad state.
Seg faults also occurred with PG 8.4. However, that server's Dell OMSA (hardware health monitoring system) began to report RAM parity errors, so I figured the seg faults were due to hardware issues and did not configure the system to save core files for debugging. I migrated the database to a server running PG 9.1 in the hope that the problem would disappear, but it has not. So now I'm starting to debug.
Below are the relevant details. I'm not terribly savvy with gdb, so please let me know what else I could/should examine from the core dump, as well as anything else about the system/configuration.
Kind Regards,
-Joshua
#NB: some info in square brackets has been [redacted]
# grep postmaster /var/log/messages
Apr 10 13:18:32 [hostname] kernel: postmaster[17356]: segfault at 40 ip 0000000000710e2e sp 00007fffd193ca70 error 4 in postgres[400000+4ea000]
gdb /usr/pgsql-9.1/bin/postmaster -c core.17356
[...loading/reading symbols...]
Core was generated by `postgres: [username] [databasename] [client_ipaddress](1500) SELECT '.
Program terminated with signal 11, Segmentation fault.
#0 ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:605
605 if (owner->ncatrefs < owner->maxcatrefs)
(gdb) q
# uname -a
Linux [hostname] 2.6.32-358.2.1.el6.x86_64 #1 SMP Tue Mar 12 14:18:09 CDT 2013 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release
Scientific Linux release 6.3 (Carbon)
# psql -U jberry
psql (9.1.9)
Type "help" for help.
jberry=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3), 64-bit
(1 row)
Joshua Berry wrote:
> gdb /usr/pgsql-9.1/bin/postmaster -c core.17356
> [...loading/reading symbols...]
> Core was generated by `postgres: [username] [databasename] [client_ipaddress](1500) SELECT '.
> Program terminated with signal 11, Segmentation fault.
> #0 ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:605
> 605 if (owner->ncatrefs < owner->maxcatrefs)
> (gdb) q

Run "bt" here please (maybe "bt full" is even more illuminating).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2013-04-10 16:34:40 -0500, Joshua Berry wrote:
> Below are the relevant details. I'm not terribly savvy with gdb, so please
> let me know what else I could/should examine from the core dump, as well as
> anything else about the system/configuration.
> [...]
> # grep postmaster /var/log/messages
> Apr 10 13:18:32 [hostname] kernel: postmaster[17356]: segfault at 40 ip
> 0000000000710e2e sp 00007fffd193ca70 error 4 in postgres[400000+4ea000]
> [...]
> #0 ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:605
> 605 if (owner->ncatrefs < owner->maxcatrefs)
> (gdb) q

Could you show the output of 'bt full'?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Andres!
On Wed, Apr 10, 2013 at 4:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Could you show the output of 'bt full'?
Program terminated with signal 11, Segmentation fault.
#0 ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:605
605 if (owner->ncatrefs < owner->maxcatrefs)
(gdb) bt
#0 ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:605
#1 0x00000000006e1382 in SearchCatCache (cache=0x2a1aad0,
v1=<value optimized out>, v2=<value optimized out>,
v3=<value optimized out>, v4=<value optimized out>) at catcache.c:1143
#2 0x00000000006ec69e in getTypeOutputInfo (type=20, typOutput=0x2a8ac30,
typIsVarlena=0x2a8ac38 "") at lsyscache.c:2438
#3 0x0000000000459027 in printtup_prepare_info (myState=0x2a50c40,
typeinfo=0x2e7fc50, numAttrs=42) at printtup.c:263
#4 0x00000000004593c4 in printtup (slot=0x2d14618, self=0x2a50c40)
at printtup.c:297
#5 0x00000000006376ca in RunFromStore (portal=0x2aa9360,
direction=<value optimized out>, count=10, dest=0x2a50c40) at pquery.c:1121
#6 0x00000000006377b2 in PortalRunSelect (portal=0x2aa9360,
forward=<value optimized out>, count=10, dest=0x2a50c40) at pquery.c:939
#7 0x0000000000638c78 in PortalRun (portal=0x2aa9360, count=10,
isTopLevel=1 '\001', dest=0x2a50c40, altdest=0x2a50c40,
completionTag=0x7fffd193d0f0 "") at pquery.c:787
#8 0x000000000063661e in exec_execute_message (argc=<value optimized out>,
argv=<value optimized out>, dbname=0x298a150 "[dbname]",
username=<value optimized out>) at postgres.c:1965
#9 PostgresMain (argc=<value optimized out>, argv=<value optimized out>,
dbname=0x298a150 "[dbname]", username=<value optimized out>)
at postgres.c:4026
#10 0x00000000005f6c61 in BackendRun () at postmaster.c:3612
#11 BackendStartup () at postmaster.c:3302
#12 ServerLoop () at postmaster.c:1466
#13 0x00000000005f9431 in PostmasterMain (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:1127
#14 0x000000000059a9b0 in main (argc=5, argv=0x2988480) at main.c:199
(gdb) bt full
#0 ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:605
newmax = <value optimized out>
#1 0x00000000006e1382 in SearchCatCache (cache=0x2a1aad0,
v1=<value optimized out>, v2=<value optimized out>,
v3=<value optimized out>, v4=<value optimized out>) at catcache.c:1143
res = 1 '\001'
cur_skey = {{sk_flags = 0, sk_attno = -2, sk_strategy = 3,
sk_subtype = 0, sk_collation = 0, sk_func = {
fn_addr = 0x686640 <oideq>, fn_oid = 184, fn_nargs = 2,
fn_strict = 1 '\001', fn_retset = 0 '\000', fn_stats = 2 '\002',
fn_extra = 0x0, fn_mcxt = 0x298b108, fn_expr = 0x0},
sk_argument = 20}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0,
sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0,
fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000',
fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0,
fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}, {sk_flags = 0,
sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0,
sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0,
fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000',
fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0},
{sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0,
sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0,
fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000',
fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0,
fn_expr = 0x0}, sk_argument = 0}}
hashValue = 2280326203
hashIndex = 59
elt = 0x7fc28648fb88
ct = 0x7fc28648fb78
relation = <value optimized out>
scandesc = <value optimized out>
ntp = <value optimized out>
#2 0x00000000006ec69e in getTypeOutputInfo (type=20, typOutput=0x2a8ac30,
typIsVarlena=0x2a8ac38 "") at lsyscache.c:2438
typeTuple = <value optimized out>
pt = <value optimized out>
__func__ = "getTypeOutputInfo"
#3 0x0000000000459027 in printtup_prepare_info (myState=0x2a50c40,
typeinfo=0x2e7fc50, numAttrs=42) at printtup.c:263
thisState = <value optimized out>
format = <value optimized out>
formats = 0x0
i = <value optimized out>
__func__ = "printtup_prepare_info"
#4 0x00000000004593c4 in printtup (slot=0x2d14618, self=0x2a50c40)
at printtup.c:297
typeinfo = <value optimized out>
myState = 0x2a50c40
buf = {data = 0x2aa9360 "x\376\243\002", len = 47269400, maxlen = 0,
cursor = 47269400}
natts = 42
i = <value optimized out>
#5 0x00000000006376ca in RunFromStore (portal=0x2aa9360,
direction=<value optimized out>, count=10, dest=0x2a50c40) at pquery.c:1121
oldcontext = 0x2aca6f0
ok = <value optimized out>
current_tuple_count = 0
slot = 0x2d14618
#6 0x00000000006377b2 in PortalRunSelect (portal=0x2aa9360,
forward=<value optimized out>, count=10, dest=0x2a50c40) at pquery.c:939
queryDesc = 0x0
direction = ForwardScanDirection
nprocessed = <value optimized out>
__func__ = "PortalRunSelect"
#7 0x0000000000638c78 in PortalRun (portal=0x2aa9360, count=10,
isTopLevel=1 '\001', dest=0x2a50c40, altdest=0x2a50c40,
completionTag=0x7fffd193d0f0 "") at pquery.c:787
save_exception_stack = 0x7fffd193cfe0
save_context_stack = 0x0
local_sigjmp_buf = {{__jmpbuf = {140736709513424, 8412518331319730861,
44733280, 0, 4294967295, 3, -8412469024494566739,
8412517527857939117}, __mask_was_saved = 0, __saved_mask = {
__val = {76, 44316880, 5, 0, 44804496, 767, 5, 0, 4294967295, 3,
5565893, 11495616, 4746857, 140736709513424, 140736709513424,
0}}}}
result = <value optimized out>
nprocessed = <value optimized out>
saveTopTransactionResourceOwner = 0x2a439e0
saveTopTransactionContext = 0x2a438d0
saveActivePortal = 0x0
saveResourceOwner = 0x2a439e0
savePortalContext = 0x0
saveMemoryContext = 0x2a438d0
__func__ = "PortalRun"
#8 0x000000000063661e in exec_execute_message (argc=<value optimized out>,
argv=<value optimized out>, dbname=0x298a150 "[dbname]",
username=<value optimized out>) at postgres.c:1965
completed = <value optimized out>
execute_is_fetch = 1 '\001'
completionTag = "\000ELECT 3\000\000\241\361=\000\000\000pÑ\321\377\177\000\000\300@\231\002", '\000' <repeats 12 times>"\200, \355\232\002\000\000\000\000PÑ\321\377\177\000\000U\000`\377\377\377\377\377"
save_log_statement_stats = 0 '\000'
was_logged = <value optimized out>
dest = DestRemoteExecute
receiver = 0x2a50c40
portal = <value optimized out>
sourceText = 0x2d141b8 "declare \"SQL_CUR017CB040\" cursor with hold for SELECT anl.LAB,anl.JOB,anl.COMPANYCODE,anl.SAMPLETYPE,anl.COMPANYLAB,anl.CYLNO,anl.CONTAINERCODE,anl.DATEGCANALYSIS,anl.DATEREPORT,anl.SAMPLENAME,anl.FIE"...
prepStmtName = 0x817c44 "<unnamed>"
portalParams = 0x0
is_xact_command = <value optimized out>
msec_str = '\000' <repeats 16 times>, "1500\000\000\000\000\330@\231\002\000\000\000"
#9 PostgresMain (argc=<value optimized out>, argv=<value optimized out>,
dbname=0x298a150 "[dbname]", username=<value optimized out>)
at postgres.c:4026
portal_name = 0x2a50420 "SQL_CUR017CB040"
max_rows = 10
firstchar = <value optimized out>
input_message = {data = 0x2a50420 "SQL_CUR017CB040", len = 20,
maxlen = 1024, cursor = 20}
local_sigjmp_buf = {{__jmpbuf = {140736709513392, 8412515096121983661,
1, 1, -9187201950435737471, 0, -8412469024567967059,
8412517530690536109}, __mask_was_saved = 1, __saved_mask = {
__val = {0, 0, 0, 18446462598732840960, 18446744073709551615, 0,
0, 0, 0, 0, 0, 266046903056, 140486762587344, 0, 4294967295,
0}}}}
send_ready_for_query = 0 '\000'
__func__ = "PostgresMain"
#10 0x00000000005f6c61 in BackendRun () at postmaster.c:3612
ac = 1
secs = 418739329
usecs = 43047
i = <value optimized out>
av = 0x298a398
maxac = <value optimized out>
#11 BackendStartup () at postmaster.c:3302
bn = 0x29940c0
pid = 0
#12 ServerLoop () at postmaster.c:1466
rmask = {fds_bits = {32, 0 <repeats 15 times>}}
selres = <value optimized out>
readmask = {fds_bits = {224, 0 <repeats 15 times>}}
nSockets = 8
now = <value optimized out>
last_touch_time = 1365422110
__func__ = "ServerLoop"
#13 0x00000000005f9431 in PostmasterMain (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:1127
opt = <value optimized out>
status = <value optimized out>
userDoption = <value optimized out>
listen_addr_saved = 1 '\001'
i = <value optimized out>
__func__ = "PostmasterMain"
#14 0x000000000059a9b0 in main (argc=5, argv=0x2988480) at main.c:199
No locals.
(gdb)
On 2013-04-10 16:53:12 -0500, Joshua Berry wrote:
> #7 0x0000000000638c78 in PortalRun (portal=0x2aa9360, count=10,
>     isTopLevel=1 '\001', dest=0x2a50c40, altdest=0x2a50c40,
>     completionTag=0x7fffd193d0f0 "") at pquery.c:787
> [...]
>         saveTopTransactionResourceOwner = 0x2a439e0
>         saveTopTransactionContext = 0x2a438d0
>         saveActivePortal = 0x0
>         saveResourceOwner = 0x2a439e0
>         savePortalContext = 0x0
>         saveMemoryContext = 0x2a438d0
>         __func__ = "PortalRun"

Ok, so while we have a valid resource owner up to here, portal->resowner
is NULL.

Could you 'up' until you're in this frame and then do 'print *portal'?

Thanks,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Ok, so while we have a valid resource owner up to here, portal->resowner
is NULL.
Could you 'up' until you're in this frame and then do 'print *portal'?
#7 0x0000000000638c78 in PortalRun (portal=0x2aa9360, count=10,
isTopLevel=1 '\001', dest=0x2a50c40, altdest=0x2a50c40,
completionTag=0x7fffd193d0f0 "") at pquery.c:787
787 nprocessed = PortalRunSelect(portal, true, count, dest);
(gdb) print *portal
$1 = {name = 0x2a3fe78 "SQL_CUR017CB040", prepStmtName = 0x0,
heap = 0x2aca6f0, resowner = 0x0, cleanup = 0x535d30 <PortalCleanup>,
createSubid = 0,
sourceText = 0x2d141b8 "declare \"SQL_CUR017CB040\" cursor with hold for SELECT anl.LAB,anl.JOB,anl.COMPANYCODE,anl.SAMPLETYPE,anl.COMPANYLAB,anl.CYLNO,anl.CONTAINERCODE,anl.DATEGCANALYSIS,anl.DATEREPORT,anl.SAMPLENAME,anl.FIE"...,
commandTag = 0x828f32 "SELECT", stmts = 0x2d145e8, cplan = 0x0,
portalParams = 0x0, strategy = PORTAL_ONE_SELECT, cursorOptions = 52,
status = PORTAL_ACTIVE, portalPinned = 0 '\000', queryDesc = 0x0,
tupDesc = 0x2e7fc50, formats = 0x0, holdStore = 0x2d0d430,
holdContext = 0x2f4e290, atStart = 0 '\000', atEnd = 0 '\000',
posOverflow = 0 '\000', portalPos = 10, creation_time = 418933112934458,
visible = 1 '\001'}
(gdb)
Andres Freund <andres@2ndquadrant.com> writes:
> Ok, so while we have a valid resource owner up to here, portal->resowner
> is NULL.

Yeah, that's what I'm guessing. Given the exposed reference to a cursor
WITH HOLD, it seems likely that the reason the portal has no resowner is
that PreCommit_Portals() got rid of it when the cursor was held.

However, if that were the explanation, it's not clear how come this isn't
falling over all the time for everybody who uses holdable cursors. Any
cache flush happening just before fetching from a holdable cursor ought
to do it ...

(Wanders away wondering just how much the regression tests exercise
holdable cursors.)

			regards, tom lane
I wrote:
> (Wanders away wondering just how much the regression tests exercise
> holdable cursors.)

And the answer is they're not testing this code path at all, because if
you do

	DECLARE c CURSOR WITH HOLD FOR ...
	FETCH ALL FROM c;

then the second query executes with a portal (and resource owner) created
to execute the FETCH command, not directly on the held portal.

After a little bit of thought I'm not sure it's even possible to
reproduce this problem with libpq, because it doesn't expose any way to
issue a bare protocol Execute command against a pre-existing portal.
(I had thought psqlODBC went through libpq, but maybe it's playing some
games here.)

Anyway, I'm thinking the appropriate fix might be like this

-	CurrentResourceOwner = portal->resowner;
+	if (portal->resowner)
+		CurrentResourceOwner = portal->resowner;

in several places in pquery.c; that is, keep using
TopTransactionResourceOwner if the portal doesn't have its own.

A more general but probably much more invasive solution would be to fake
up an intermediate portal when pulling data from a held portal, to more
closely approximate the explicit-FETCH case.

			regards, tom lane
On 2013-04-10 17:31:09 -0500, Joshua Berry wrote:
> (gdb) print *portal
> $1 = {name = 0x2a3fe78 "SQL_CUR017CB040", prepStmtName = 0x0,
>   heap = 0x2aca6f0, resowner = 0x0, cleanup = 0x535d30 <PortalCleanup>,
> [...]
>   posOverflow = 0 '\000', portalPos = 10, creation_time = 418933112934458,
>   visible = 1 '\001'}

Ok, I might be seeing what's going on here. Could you go to 'printtup'
and print *myState, *myState->attrinfo, *typeinfo?

Tom: It looks to me like printtup_prepare_info won't normally be called
in a held cursor. But if some concurrent DDL changed the number of
columns in typeinfo vs. those in the receiver, that could explain the
issue and why it's not seen all the time, right?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-04-10 19:06:12 -0400, Tom Lane wrote:
> And the answer is they're not testing this code path at all, because if
> you do
> 	DECLARE c CURSOR WITH HOLD FOR ...
> 	FETCH ALL FROM c;
> then the second query executes with a portal (and resource owner)
> created to execute the FETCH command, not directly on the held portal.

But in that path CurrentResourceOwner gets reset to portal->resowner as
well (see PortalRunFetch())?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> Tom: It looks to me like printtup_prepare_info won't normally be called
> in a held cursor. But if some concurrent DDL changed the number of
> columns in typeinfo vs. those in the receiver, that could explain the
> issue and why it's not seen all the time, right?

It looks to me like there are probably two triggering conditions:

1. Client is doing a direct protocol Execute on a held-cursor portal.

2. Cache flush occurs to drop the syscache entries needed by
getTypeOutputInfo. (Otherwise, they'd still be around from when the held
cursor was created.)

The first of these explains why we don't see it in the
CLOBBER_CACHE_ALWAYS buildfarm runs, and the second one explains why
Joshua is only seeing it intermittently and not every darn time his
application does that.

			regards, tom lane
Ok, I might be seeing what's going on here. Could you go to 'printtup'
and print *myState, *myState->attrinfo, *typeinfo?
#4 0x00000000004593c4 in printtup (slot=0x2d14618, self=0x2a50c40)
at printtup.c:297
297 printtup_prepare_info(myState, typeinfo, natts);
(gdb) print *myState, *myState->attrinfo, *typeinfo
value has been optimized out
(gdb)
On 2013-04-10 18:25:24 -0500, Joshua Berry wrote:
> > Ok, I might be seeing what's going on here. Could you go to 'printtup'
> > and print *myState, *myState->attrinfo, *typeinfo?
>
> #4 0x00000000004593c4 in printtup (slot=0x2d14618, self=0x2a50c40)
>     at printtup.c:297
> 297 printtup_prepare_info(myState, typeinfo, natts);
> (gdb) print *myState, *myState->attrinfo, *typeinfo
> value has been optimized out

Hm. Make that a

print *(DR_printtup *) self
print *((DR_printtup *) self)->attrinfo
print *slot->tts_tupleDescriptor

then.

Thanks,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-04-10 19:06:12 -0400, Tom Lane wrote:
>> And the answer is they're not testing this code path at all, because if
>> you do
>> 	DECLARE c CURSOR WITH HOLD FOR ...
>> 	FETCH ALL FROM c;
>> then the second query executes with a portal (and resource owner)
>> created to execute the FETCH command, not directly on the held portal.

> But in that path CurrentResourceOwner gets reset to portal->resowner as
> well (see PortalRunFetch())?

Right, but that's the FETCH's portal, which is a regular "live" portal
that has a ResOwner.

			regards, tom lane
On 2013-04-10 19:06:12 -0400, Tom Lane wrote:
> After a little bit of thought I'm not sure it's even possible to
> reproduce this problem with libpq, because it doesn't expose any way to
> issue a bare protocol Execute command against a pre-existing portal.
> (I had thought psqlODBC went through libpq, but maybe it's playing some
> games here.)

Hm. PQexecPrepared() looks like it can do that.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hm. Make that a

print *(DR_printtup *) self
print *((DR_printtup *) self)->attrinfo
print *slot->tts_tupleDescriptor
(gdb) print *(DR_printtup *) self
$2 = {pub = {receiveSlot = 0x459390 <printtup>,
rStartup = 0x459550 <printtup_startup>,
rShutdown = 0x458a20 <printtup_shutdown>,
rDestroy = 0x458a10 <printtup_destroy>, mydest = DestRemoteExecute},
portal = 0x2aa9360, sendDescrip = 0 '\000', attrinfo = 0x2e7fc50,
nattrs = 42, myinfo = 0x2a8ac30}
(gdb) print *((DR_printtup *) self)->attrinfo
$3 = {natts = 42, attrs = 0x2e7fc78, constr = 0x0, tdtypeid = 2249,
tdtypmod = -1, tdhasoid = 0 '\000', tdrefcount = -1}
(gdb) print *slot->tts_tupleDescriptor
$4 = {natts = 42, attrs = 0x2e7fc78, constr = 0x0, tdtypeid = 2249,
tdtypmod = -1, tdhasoid = 0 '\000', tdrefcount = -1}
(gdb)
On 2013-04-10 19:29:06 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > But in that path CurrentResourceOwner gets reset to portal->resowner as
> > well (see PortalRunFetch())?
>
> Right, but that's the FETCH's portal, which is a regular "live" portal
> that has a ResOwner.

I don't think so?

standard_ProcessUtility:
    PerformPortalFetch:
        /* get the portal from the portal name */
        portal = GetPortalByName(stmt->portalname);
        ...
        /* Do it */
        nprocessed = PortalRunFetch(portal,
                                    stmt->direction,
                                    stmt->howMany,
                                    dest);

PortalRunFetch:
    PG_TRY();
    {
        ActivePortal = portal;
        CurrentResourceOwner = portal->resowner;

So it seems to trigger a very similar codepath?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-04-10 19:06:12 -0400, Tom Lane wrote:
> Anyway, I'm thinking the appropriate fix might be like this
>
> -	CurrentResourceOwner = portal->resowner;
> +	if (portal->resowner)
> +		CurrentResourceOwner = portal->resowner;
>
> in several places in pquery.c; that is, keep using
> TopTransactionResourceOwner if the portal doesn't have its own.
>
> A more general but probably much more invasive solution would be to fake
> up an intermediate portal when pulling data from a held portal, to more
> closely approximate the explicit-FETCH case.

We could also allocate a new resowner for the duration of that
transaction. That would get reassigned to the transaction's resowner in
PreCommit_Portals (after a slight change there). That actually seems
simple enough?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
| I'm using PG 9.1.9 with a client application using various versions of the
| pgsqlODBC driver on Windows. Cursors are used heavily, as well as some pretty
| heavy trigger queries on db writes which update several materialized views.
|
| The server has 48GB RAM installed, PG is configured for 12GB shared buffers,
| 8MB max_stack_depth, 32MB temp_buffers, and 2MB work_mem. Most of the other
| settings are defaults.
|
| The server will seg fault from every few days to up to two weeks. Each time
| one of the postgres server processes seg faults, the server gets terminated by
| signal 11, restarts in recovery for up to 30 seconds, after which time it
| accepts connections as if nothing ever happened. Unfortunately all the open
| cursors and connections are lost, so the client apps are left in a bad state.
|
| Seg faults have also occurred with PG 8.4. ... I migrated the database to a
| server running PG9.1 with the hopes that the problem would disappear, but it
| has not. So now I'm starting to debug.
|
| # uname -a
| Linux [hostname] 2.6.32-358.2.1.el6.x86_64 #1 SMP Tue Mar 12 14:18:09 CDT 2013
| x86_64 x86_64 x86_64 GNU/Linux
| # cat /etc/redhat-release
| Scientific Linux release 6.3 (Carbon)
|
| # psql -U jberry
| psql (9.1.9)
| Type "help" for help.
|
| jberry=# select version();
| version
| -------------------------------------------------------------------------------
| PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7
| 20120313 (Red Hat 4.4.7-3), 64-bit
| (1 row)
I've had another postmaster segfault on my production server. It appears to be the same failure as the last one nearly a month ago, but I wanted to post the gdb bt details in case it helps shed light on the issue. Please let me know if anyone would like to drill into the dumped core with greater detail. Both the OS and PG versions remain unchanged.
Kind Regards,
-Joshua
From the PG log:
2013-05-07 08:48:35 CDT <%@> LOG: server process (PID 12367) was terminated by signal 11: Segmentation fault
From /var/log/messages
May 7 08:48:17 active kernel: postmaster[12367]: segfault at 40 ip 0000000000710e2e sp 00007fffdcaeedf0 error 4 in postgres[400000+4ea000]
(gdb) bt full
#0 ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:605
newmax = <value optimized out>
#1 0x00000000006e1382 in SearchCatCache (cache=0x25bf270, v1=<value optimized out>, v2=<value optimized out>, v3=<value optimized out>, v4=<value optimized out>) at catcache.c:1143
res = 1 '\001'
cur_skey = {{sk_flags = 0, sk_attno = -2, sk_strategy = 3, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0x686640 <oideq>, fn_oid = 184, fn_nargs = 2, fn_strict = 1 '\001',
fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x252f108, fn_expr = 0x0}, sk_argument = 20}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0,
sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0},
sk_argument = 0}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000',
fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0,
sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0},
sk_argument = 0}}
hashValue = 2280326203
hashIndex = 59
elt = 0x7fe0742cf228
ct = 0x7fe0742cf218
relation = <value optimized out>
scandesc = <value optimized out>
ntp = <value optimized out>
#2 0x00000000006ec69e in getTypeOutputInfo (type=20, typOutput=0x277dea0, typIsVarlena=0x277dea8 "") at lsyscache.c:2438
typeTuple = <value optimized out>
pt = <value optimized out>
__func__ = "getTypeOutputInfo"
#3 0x0000000000459027 in printtup_prepare_info (myState=0x25d53d0, typeinfo=0x2686640, numAttrs=84) at printtup.c:263
thisState = <value optimized out>
format = <value optimized out>
formats = 0x0
i = <value optimized out>
__func__ = "printtup_prepare_info"
#4 0x00000000004593c4 in printtup (slot=0x26fd960, self=0x25d53d0) at printtup.c:297
typeinfo = <value optimized out>
myState = 0x25d53d0
buf = {data = 0x262cf40 "\320I\\\002", len = 40884576, maxlen = 0, cursor = 40884576}
natts = 84
i = <value optimized out>
#5 0x00000000006376ca in RunFromStore (portal=0x262cf40, direction=<value optimized out>, count=10, dest=0x25d53d0) at pquery.c:1121
oldcontext = 0x295a470
ok = <value optimized out>
current_tuple_count = 0
slot = 0x26fd960
#6 0x00000000006377b2 in PortalRunSelect (portal=0x262cf40, forward=<value optimized out>, count=10, dest=0x25d53d0) at pquery.c:939
queryDesc = 0x0
direction = ForwardScanDirection
nprocessed = <value optimized out>
__func__ = "PortalRunSelect"
#7 0x0000000000638c78 in PortalRun (portal=0x262cf40, count=10, isTopLevel=1 '\001', dest=0x25d53d0, altdest=0x25d53d0, completionTag=0x7fffdcaef470 "") at pquery.c:787
save_exception_stack = 0x7fffdcaef360
save_context_stack = 0x0
local_sigjmp_buf = {{__jmpbuf = {140736895841360, 9079525024474526506, 40030016, 0, 4294967295, 3, -9079456159289380054, 9079524170016813866}, __mask_was_saved = 0, __saved_mask = {
__val = {39139104, 140736895840832, 64, 39668656, 0, 0, 80, 38984240, 8640448, 0, 140736895841360, 0, 40030016, 140736895841360, 140736895841360, 0}}}}
result = <value optimized out>
nprocessed = <value optimized out>
saveTopTransactionResourceOwner = 0x25c8170
saveTopTransactionContext = 0x25c8060
saveActivePortal = 0x0
saveResourceOwner = 0x2748af0
savePortalContext = 0x0
saveMemoryContext = 0x295b020
__func__ = "PortalRun"
#8 0x000000000063661e in exec_execute_message (argc=<value optimized out>, argv=<value optimized out>, dbname=0x252e150 "[dbname]", username=<value optimized out>) at postgres.c:1965
completed = <value optimized out>
execute_is_fetch = 1 '\001'
completionTag = "\000ELECT 3\000\000\241\361=\000\000\000\360\364\256\334\377\177\000\000\200/U\002", '\000' <repeats 12 times>, " 5U\002\000\000\000\000\320\364\256\334\377\177\000\000U\000`\377\377\377\377\377"
save_log_statement_stats = 0 '\000'
was_logged = <value optimized out>
dest = DestRemoteExecute
receiver = 0x25d53d0
portal = <value optimized out>
sourceText = 0x2644f10 "declare \"SQL_CUR02F082B0\" cursor with hold for SELECT job.JOB,job.COMPANYCODE,job.RECDBYCODE,job.PROJECT,job.REMARKS,job.JOBTYPE_ID,job.PRIORITY,job.LABLOC_ID,jobsch.JOB,jobsch.ISOFFLINE,jobsch.COMPAN"...
prepStmtName = 0x817c44 "<unnamed>"
portalParams = 0x0
is_xact_command = <value optimized out>
msec_str = '\000' <repeats 16 times>, "59423\000\000\000\230/U\002\000\000\000"
#9 PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x252e150 "[dbname]", username=<value optimized out>) at postgres.c:4026
portal_name = 0x25d4bb0 "SQL_CUR02F082B0"
max_rows = 10
firstchar = <value optimized out>
input_message = {data = 0x25d4bb0 "SQL_CUR02F082B0", len = 20, maxlen = 1024, cursor = 20}
local_sigjmp_buf = {{__jmpbuf = {140736895841328, 9079528712728408874, 1, 1, -9187201950435737471, 0, -9079456159413112022, 9079524172817429290}, __mask_was_saved = 1, __saved_mask = {
__val = {0, 0, 0, 18446462598732840960, 18446744073709551615, 0, 0, 0, 0, 0, 0, 266046903056, 140615307920592, 0, 4294967295, 0}}}}
send_ready_for_query = 0 '\000'
__func__ = "PostgresMain"
#10 0x00000000005f6c61 in BackendRun () at postmaster.c:3612
ac = 1
secs = 421182656
usecs = 495798
i = <value optimized out>
av = 0x252e398
maxac = <value optimized out>
#11 BackendStartup () at postmaster.c:3302
bn = 0x2552f80
pid = 0
#12 ServerLoop () at postmaster.c:1466
rmask = {fds_bits = {8, 0 <repeats 15 times>}}
selres = <value optimized out>
readmask = {fds_bits = {56, 0 <repeats 15 times>}}
nSockets = 6
now = <value optimized out>
last_touch_time = 1367867385
__func__ = "ServerLoop"
#13 0x00000000005f9431 in PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1127
opt = <value optimized out>
status = <value optimized out>
userDoption = <value optimized out>
listen_addr_saved = 1 '\001'
i = <value optimized out>
__func__ = "PostmasterMain"
#14 0x000000000059a9b0 in main (argc=5, argv=0x252c480) at main.c:199
No locals.
(gdb)
Kind Regards,
-Joshua
On Fri, Apr 12, 2013 at 6:12 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-04-10 19:06:12 -0400, Tom Lane wrote:
> I wrote:
> > (Wanders away wondering just how much the regression tests exercise
> > holdable cursors.)
>
> And the answer is they're not testing this code path at all, because if
> you do
> DECLARE c CURSOR WITH HOLD FOR ...
> FETCH ALL FROM c;
> then the second query executes with a portal (and resource owner)
> created to execute the FETCH command, not directly on the held portal.
>
> After a little bit of thought I'm not sure it's even possible to
> reproduce this problem with libpq, because it doesn't expose any way to
> issue a bare protocol Execute command against a pre-existing portal.
> (I had thought psqlODBC went through libpq, but maybe it's playing some
> games here.)
>
> Anyway, I'm thinking the appropriate fix might be like this
>
> - CurrentResourceOwner = portal->resowner;
> + if (portal->resowner)
> + CurrentResourceOwner = portal->resowner;
>
> in several places in pquery.c; that is, keep using
> TopTransactionResourceOwner if the portal doesn't have its own.
>
> A more general but probably much more invasive solution would be to fake
> up an intermediate portal when pulling data from a held portal, to
> more closely approximate the explicit-FETCH case.

We could also allocate a new resowner for the duration of that
transaction. That would get reassigned to the transaction's resowner in
PreCommit_Portals (after a slight change there).

That actually seems simple enough?
Greetings,
Andres Freund
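The failing frame (#0 ResourceOwnerEnlargeCatCacheRefs with owner=0x0) and the guard Tom proposes can be modeled in a standalone program. Everything below is a stand-in sketch, not PostgreSQL's actual types or code: a held portal's resowner is NULL, so installing it as CurrentResourceOwner and then doing catcache bookkeeping dereferences a null pointer, unless the code falls back to the transaction-level owner.

    #include <stdio.h>
    #include <stddef.h>

    /* Stand-ins for PostgreSQL's ResourceOwner and Portal; NOT the real
     * definitions, just enough structure to show the failure mode. */
    typedef struct ResourceOwnerData { int ncatrefs; } ResourceOwnerData;
    typedef ResourceOwnerData *ResourceOwner;

    typedef struct Portal {
        ResourceOwner resowner;   /* NULL once the portal is "held" */
    } Portal;

    static ResourceOwnerData topOwnerData = {0};
    static ResourceOwner TopTransactionResourceOwner = &topOwnerData;
    static ResourceOwner CurrentResourceOwner;   /* starts out NULL */

    /* Mirrors what ResourceOwnerEnlargeCatCacheRefs does to its argument:
     * it dereferences the owner, so a NULL owner segfaults (frame #0). */
    static void enlarge_catcache_refs(ResourceOwner owner)
    {
        owner->ncatrefs++;
    }

    /* The guarded assignment: only switch to the portal's owner when it
     * has one; a held portal (resowner == NULL) keeps using the
     * transaction-level owner instead. */
    static void run_portal(Portal *portal)
    {
        if (portal->resowner)
            CurrentResourceOwner = portal->resowner;
        else
            CurrentResourceOwner = TopTransactionResourceOwner;
        enlarge_catcache_refs(CurrentResourceOwner);
    }

    int main(void)
    {
        Portal held = { NULL };   /* held portal: resowner reset to NULL */
        run_portal(&held);        /* unguarded, this would dereference NULL */
        printf("catcache refs on top owner: %d\n", topOwnerData.ncatrefs);
        return 0;
    }

Removing the fallback and assigning portal->resowner unconditionally leaves CurrentResourceOwner NULL here and reproduces the null dereference; with the guard, the bookkeeping lands on the transaction-level owner and the program prints "catcache refs on top owner: 1".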
Hi,

(2013/05/09 1:39), Joshua Berry wrote:
> | I'm using PG 9.1.9 with a client application using various versions of the
> | pgsqlODBC driver on Windows. Cursors are used heavily, as well as some pretty
> | heavy trigger queries on db writes which update several materialized views.
> [...]
> | # uname -a
> | Linux [hostname] 2.6.32-358.2.1.el6.x86_64 #1 SMP Tue Mar 12 14:18:09 CDT 2013 x86_64 x86_64 x86_64 GNU/Linux
> | # cat /etc/redhat-release
> | Scientific Linux release 6.3 (Carbon)
> |
> | # psql -U jberry
> | psql (9.1.9)
> | Type "help" for help.
> |
> | jberry=# select version();
> | PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3), 64-bit
> | (1 row)
>
> I've had another postmaster segfault on my production server. It appears
> to be the same failure as the last one nearly a month ago, but I wanted
> to post the gdb bt details in case it helps shed light on the issue.
> Please let me know if anyone would like to drill into the dumped core
> with greater detail. Both the OS and PG versions remain unchanged.
>
> Kind Regards,
> -Joshua
>
> [...]

I made some changes to multi thread handling of the psqlodbc driver.

It's also better to fix the crash at the backend side. I made 2 patches.
The 1st one temporarily changes CurrentResourceOwner to
CurTransactionResourceOwner during catalog cache handling. The 2nd one
allocates a new resource owner for held portals. Both fix the crash in
my test case.

regards,
Hiroshi Inoue
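The second approach (also what Andres suggested) can likewise be modeled standalone. This is a hypothetical sketch with stand-in types, not the actual patch: instead of leaving a held portal's resowner NULL at commit, allocate a fresh owner for it, so a later bare Execute message against the held portal has a valid owner to account against.

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-ins, not PostgreSQL's real definitions. */
    typedef struct ResourceOwnerData { int ncatrefs; } ResourceOwnerData;
    typedef ResourceOwnerData *ResourceOwner;

    typedef struct Portal { ResourceOwner resowner; } Portal;

    static ResourceOwner CurrentResourceOwner;

    static ResourceOwner resource_owner_create(void)
    {
        return calloc(1, sizeof(ResourceOwnerData));
    }

    /* Model of holding a portal at commit: rather than portal->resowner
     * = NULL, give the portal a freshly allocated owner.  (In the real
     * fix this owner would later be reassigned to the next transaction's
     * owner in PreCommit_Portals.) */
    static void hold_portal(Portal *portal)
    {
        portal->resowner = resource_owner_create();
    }

    /* Would segfault if owner were NULL, as in the reported crash. */
    static void enlarge_catcache_refs(ResourceOwner owner)
    {
        owner->ncatrefs++;
    }

    int main(void)
    {
        Portal p = { NULL };
        hold_portal(&p);                    /* commit keeps the portal alive */
        CurrentResourceOwner = p.resowner;  /* bare Execute on held portal */
        enlarge_catcache_refs(CurrentResourceOwner);
        printf("refs: %d\n", p.resowner->ncatrefs);
        free(p.resowner);
        return 0;
    }

With a non-NULL owner always present, the unguarded assignment in pquery.c no longer matters: the catcache bookkeeping has somewhere to go in every case.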
Attachments
Hiroshi Inoue <inoue@tpf.co.jp> writes:
> It's also better to fix the crash at backend side.

Yeah, definitely. Do we have a self-contained test case for this?

regards, tom lane
(2013/06/12 0:03), Tom Lane wrote:
> Hiroshi Inoue <inoue@tpf.co.jp> writes:
>> It's also better to fix the crash at backend side.
>
> Yeah, definitely. Do we have a self-contained test case for this?

Unfortunately no. I'm testing with a modified psqlodbc driver.
The simplest way may be as follows.

1. Implement an API function like PQexecPortal(portal name, count),
   which sends only an execute message.

2. Call the APIs as follows.

    // Connect to a database.
    conn = PQxxxx(....);
    //
    // Don't begin a transaction, for simplicity.
    //
    // The following operation defines a holdable portal whose
    // portal name = cursor_name.
    PQexec(conn, "declare cursor_name cursor with hold for select ....");
    // At this point the holdable cursor becomes a held portal whose
    // resowner is reset to NULL. The following function call would
    // crash the backend.
    PQexecPortal(conn, cursor_name, 1);