Обсуждение: SPI_ERROR_CONNECT within pl/pgsql, PG 8.4
L.S. After an upgrade to v8.4 one of our clients is experiencing heaps of problems, they get errors like "ERROR: SPI_connect failed: SPI_ERROR_CONNECT". I refer to this earlier post, it looks a lot like it: http://archives.postgresql.org/pgsql-general/2009-07/msg00388.php We too had no problem when running v8.3 Tom, you stated: Really? Could we see a self-contained example? We're working on that, but its very difficult. We have a way to reproduce it in our application, but this works only in the production environment, not in our development situation (the more users, the more faster the problem appears). Some users are able to work 15 minutes without problems and then they get errors. Restarting the app and thus the connection silences it again for a while, etc. Though a reproducable 'psql' testcase is not yet available, I do have a bit of additional information that might serve as a clue to some: * when they happen, the code triggering the errors seems to be fairly limited * a small part are all triggered in before triggers and all are in a custom function session_userid() * the bulk part are triggered in both after as well as deferred triggers that in turn call a couple of custom functions _all_ of these custom function have the following structure in common: CREATE OR REPLACE FUNCTION xxxxxxxxx() RETURNS int LANGUAGE 'plpgsql' IMMUTABLE STRICT SECURITY INVOKER AS ' DECLARE result INT := 0; BEGIN BEGIN SELECT id INTO STRICT result FROM xxxxxxxxx WHERE xxxxxxxxx; EXCEPTION WHEN NO_DATA_FOUND THEN RAISE EXCEPTION ''Unknown (%)...!!'', xxxxxxxxx WHEN TOO_MANY_ROWS THEN RAISE EXCEPTION ''More than one found (%)...!!'', xxxxxxxxx; END; RETURN result; END;'; => mind the fact that it is marked 'immutable' => mind the fact that due to the exception usage, it contains an inner begin/end block Though the frequency / localisation of these errors seems too high to be a coincidence, I do have to point out that these are not the only functions with the same structure. I have a whole set of _status() and _type() functions that look exactly the same, but who never come up in an error. However, this might simply be due to the usage-pattern of my application. Any clues as to how to gather additional information that might bring us closer to a solution is appreciated also. I'd have no problem with applying some patch as long as it's safe enough for a production environment ;) -- Best, Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes: > Any clues as to how to gather additional information that might bring us > closer to a solution is appreciated also. A stack trace from the point of the error would probably tell us a great deal. Maybe you could attach to a backend with gdb, set a breakpoint at the failure return in SPI_connect(), and then provoke the error manually? regards, tom lane
Hi Tom, Op maandag 13 juli 2009, schreef Tom Lane: > Frank van Vugt <ftm.van.vugt@foxi.nl> writes: > > Any clues as to how to gather additional information that might bring us > > closer to a solution is appreciated also. > > A stack trace from the point of the error would probably tell us a great > deal. Maybe you could attach to a backend with gdb, set a breakpoint > at the failure return in SPI_connect(), and then provoke the error > manually? Just fyi, a breakpoint at SPI_connect with condition _SPI_curid != _SPI_connected produced the following backtrace: Program received signal SIGUSR2, User defined signal 2. 0x00002b539af2b5f5 in recv () from /lib64/libc.so.6 (gdb) bt #0 0x00002b539af2b5f5 in recv () from /lib64/libc.so.6 #1 0x000000000054d692 in secure_read () #2 0x0000000000552c74 in pq_recvbuf () #3 0x0000000000553077 in pq_getbyte () #4 0x00000000005ce5f6 in PostgresMain () #5 0x00000000005a50fb in ServerLoop () #6 0x00000000005a5c2a in PostmasterMain () #7 0x000000000055498e in main () However, after continuing this did NOT give the SPI_connect error message, so this probably is about something else completely? We cannot reproduce the error anymore due to end of working hours, will try again tomorrow morning (localtime). More to follow. -- Best, Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes: > Just fyi, a breakpoint at SPI_connect with condition > _SPI_curid != _SPI_connected Right, that's what to look for. > produced the following backtrace: > Program received signal SIGUSR2, User defined signal 2. > 0x00002b539af2b5f5 in recv () from /lib64/libc.so.6 > (gdb) bt > #0 0x00002b539af2b5f5 in recv () from /lib64/libc.so.6 > #1 0x000000000054d692 in secure_read () > #2 0x0000000000552c74 in pq_recvbuf () > #3 0x0000000000553077 in pq_getbyte () > #4 0x00000000005ce5f6 in PostgresMain () > #5 0x00000000005a50fb in ServerLoop () > #6 0x00000000005a5c2a in PostmasterMain () > #7 0x000000000055498e in main () This is a normal interbackend communication signal. You need to configure gdb to ignore SIGUSR2 (ie, pass it on and not stop execution). Probably SIGUSR1 too. regards, tom lane
Hi, > This is a normal interbackend communication signal. You need to > configure gdb to ignore SIGUSR2 (ie, pass it on and not stop execution). > Probably SIGUSR1 too. Done. Here's the result: Breakpoint 1, 0x000000000054aec0 in SPI_connect () (gdb) bt #0 0x000000000054aec0 in SPI_connect () #1 0x00002afdffd243c2 in plpgsql_call_handler () from /usr/local/pgsql/lib/plpgsql.so #2 0x00000000005331f7 in ExecMakeFunctionResult () #3 0x0000000000534f1a in ExecEvalExprSwitchContext () #4 0x00000000005919e4 in evaluate_expr () #5 0x0000000000592e69 in simplify_function () #6 0x000000000059354e in eval_const_expressions_mutator () #7 0x0000000000556d70 in expression_tree_mutator () #8 0x0000000000593414 in eval_const_expressions_mutator () #9 0x00000000005934e2 in eval_const_expressions_mutator () #10 0x00000000005566bd in expression_tree_mutator () #11 0x000000000059332d in eval_const_expressions_mutator () #12 0x0000000000556d70 in expression_tree_mutator () #13 0x000000000059332d in eval_const_expressions_mutator () #14 0x000000000059436f in eval_const_expressions () #15 0x00000000005862a1 in preprocess_expression () #16 0x00000000005889cf in subquery_planner () #17 0x0000000000588f63 in standard_planner () #18 0x00000000005cc6b1 in pg_plan_query () #19 0x00000000005cc76e in pg_plan_queries () #20 0x000000000066a522 in RevalidateCachedPlan () #21 0x00002afdffd27e12 in exec_eval_expr () from /usr/local/pgsql/lib/plpgsql.so #22 0x00002afdffd28122 in exec_eval_boolean () from /usr/local/pgsql/lib/plpgsql.so #23 0x00002afdffd2d927 in exec_stmt_if () from /usr/local/pgsql/lib/plpgsql.so #24 0x00002afdffd2c3b8 in exec_stmts () from /usr/local/pgsql/lib/plpgsql.so #25 0x00002afdffd2bb70 in exec_stmt_block () from /usr/local/pgsql/lib/plpgsql.so #26 0x00002afdffd2e26f in plpgsql_exec_function () from /usr/local/pgsql/lib/plpgsql.so #27 0x00002afdffd2447e in plpgsql_call_handler () from /usr/local/pgsql/lib/plpgsql.so #28 0x00000000005331f7 in ExecMakeFunctionResult () #29 0x0000000000530157 in ExecProject () #30 0x0000000000541dd3 in ExecResult () #31 0x000000000052fb86 in ExecProcNode () #32 0x000000000052d99f in standard_ExecutorRun () #33 0x00000000005cfff4 in PortalRunSelect () #34 0x00000000005d11e0 in PortalRun () #35 0x00000000005ccfcb in exec_simple_query () #36 0x00000000005cdce4 in PostgresMain () #37 0x00000000005a50fb in ServerLoop () #38 0x00000000005a5c2a in PostmasterMain () #39 0x000000000055498e in main () Looking forward to your reply ! -- Best, Frank.
Hi Marek, > Could you please send me gdb > commands - I could also provide the backtrace Basically the flow is as follows: - identify the backend that you expect to run in to this error by pid - start gdb with the option "-p <pid>" to attach to this backend - set a conditional breakpoint: break SPI_connect if _SPI_curid != _SPI_connected gdb should confirm this with something like Breakpoint 1 at 0x54af60 - tell gdb to ignore SIGUSR1: handle SIGUSR1 nostop gdb should confirm this with something like: Signal Stop Print Pass to program Description SIGUSR1 No Yes Yes User defined signal 1 - tell gdb to ignore SIGUSR2: handle SIGUSR2 nostop gdb should confirm this with something like: Signal Stop Print Pass to program Description SIGUSR2 No Yes Yes User defined signal 2 - type 'c' to allow the backend to continue - reproduce the problem in your application, the moment you do it will seem to 'freeze' (so you will _not_ see any error message you usually were seeing, at least not yet), because gdb will have catched the situation and should now show a prompt again - type 'bt' to get a backtrace, just copy&paste&post ;) You can quit gdb by a plain <ctrl>-c, just confirm when it asks you to detach. The backend will continue to run, your app should now show the error. -- Best, Frank.
2009/7/14 Frank van Vugt <ftm.van.vugt@foxi.nl>: > Hi, > >> This is a normal interbackend communication signal. =A0You need to >> configure gdb to ignore SIGUSR2 (ie, pass it on and not stop execution). >> Probably SIGUSR1 too. Frank, thanks for letting me know about your post - I didn't have time yet to read yesterday posts, but I'm glad that I'm not only one who is experiencing that problem :-). Thanks for the backtrace - I hope that Tom will find the reason of that error. Could you please send me gdb commands - I could also provide the backtrace, so Tom would have wider view of what is happening. Marek
2009/7/14 Frank van Vugt <ftm.van.vugt@foxi.nl>: > Basically the flow is as follows: > You can quit gdb by a plain <ctrl>-c, just confirm when it asks you to detach. > The backend will continue to run, your app should now show the error. Thanks.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes: > [ backtrace ] Hmph. So evidently the unexpected recursion into SPI is happening when a cached plan has to be recomputed. I suspected something of the sort, but now the question is why you didn't see the same problem in 8.3 ... regards, tom lane
Hi, > Hmph. So evidently the unexpected recursion into SPI is happening when > a cached plan has to be recomputed. I suspected something of the sort, > but now the question is why you didn't see the same problem in 8.3 ... Just as an additional confirmation: nothing else but the db changed when we migrated from 8.3.7 to 8.4.0 last weekend Anything else I can do to assist? FYI: - the production site today ran into this problem just over a thousand times resulting in the same number of rollbacks and reconnects (since that solves it for a little while again), so needless to say there's some additional motivation to help ;) - with relation to recomputation of cached plans: our usage patterns basically do not include DDL, with the exception of temp tables; the functions that bail are all using a single table each, the records of which are not changing in any way - we use before/after and deferred triggers, all in sql or pgsql; we use a single after-trigger written in C for keeping track of history, which does its own SPI_connect/finish calls, but as stated earlier, all of this was in place a long time when we were running 8.3.7, though obviously the C-trigger has been recompiled - the distribution over triggers / functions that run into this error seems uneven, basically they all boil down to a couple of custom functions, the specifics of which I described earlier (same structure, immutable, use exception thus an extra begin/end block, etc), but since we have loads of other simular functions that don't come up, it's hard to see what would be special here Depending on the other idea's you might have, we could fairly easily temporarily redefine one or more of the pgsql functions that suffer from this error as plain sql functions, just to see what would happen.......? (I'm reaching, I know ;) ) -- Best, Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes: >> Hmph. So evidently the unexpected recursion into SPI is happening when >> a cached plan has to be recomputed. I suspected something of the sort, >> but now the question is why you didn't see the same problem in 8.3 ... > Just as an additional confirmation: nothing else but the db changed when we > migrated from 8.3.7 to 8.4.0 last weekend I'm convinced that 8.3 has the same bug, in the sense that it could fail this way if it had to revalidate a cached plan. Probably the reason you didn't see it before is that 8.4 has more conditions in which it will invalidate cached plans. In particular, is it possible that this failure occurs when you try to call a plpgsql function that has just been replaced via CREATE OR REPLACE FUNCTION? [ looks at CVS logs ... ] Another case that will cause plan invalidation in 8.4 and not 8.3 is creating or dropping a schema. Do you do a lot of that? regards, tom lane
I think the attached patch will fix it for you. regards, tom lane Index: src/backend/utils/cache/plancache.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/cache/plancache.c,v retrieving revision 1.27 diff -c -r1.27 plancache.c *** src/backend/utils/cache/plancache.c 11 Jun 2009 14:49:05 -0000 1.27 --- src/backend/utils/cache/plancache.c 14 Jul 2009 15:04:38 -0000 *************** *** 45,50 **** --- 45,51 ---- #include "access/transam.h" #include "catalog/namespace.h" #include "executor/executor.h" + #include "executor/spi.h" #include "nodes/nodeFuncs.h" #include "optimizer/planmain.h" #include "storage/lmgr.h" *************** *** 502,509 **** --- 503,521 ---- { /* * Generate plans for queries. + * + * The planner may try to call SPI-using functions, which causes + * a problem if we're already inside one. Rather than expect + * all SPI-using code to do SPI_push whenever a replan could + * happen, it seems best to take care of the case here. */ + bool pushed; + + pushed = SPI_push_conditional(); + slist = pg_plan_queries(slist, plansource->cursor_options, NULL); + + SPI_pop_conditional(pushed); } /*
Hi, > I'm convinced that 8.3 has the same bug, in the sense that it could fail > this way if it had to revalidate a cached plan. Probably the reason you > didn't see it before is that 8.4 has more conditions in which it will > invalidate cached plans. Ok > In particular, is it possible that this > failure occurs when you try to call a plpgsql function that has just > been replaced via CREATE OR REPLACE FUNCTION? Ah, you might be on to something here, we have a few patterns that match that.... ----- When starting a certain wizard in our app, we need to keep track of complex dependancies; in order to do this, we call a custom function called 'tree_amounts_blockade_setup', which basically executes the following steps: - create temp table - create index on that table - create before trigger on that table based on an existing pgsql function - create after trigger on that table based on another existing pgsql function Now, if I look at the distribution of users running into this error today, it looks like this: 1 a 1 b 1 c 1 d 2 e 3 f 3 g 3 h 3 i 4 j 4 k 10 l 16 m 16 n 22 o 120 p 853 q User 'q' is precisely the single user that would indeed use this 'wizard' the most and therefor would call 'tree_amounts_blockade_setup' the most. It has to be said though that the winner's score is a bit skewed, I verified this and it turns out the specific user was happily ignoring all errors and continuing the work (or at least trying to) without reconnecting ;) ----- One other thing that comes to mind is the fact that we have functions that when called should include certain temp tables when they exist. We can and do check for existence of a temp table through something like (SELECT 1 FROM information_schema.tables WHERE table_schema ~''pg_temp'' AND table_name = ''t_creditor_address'') but for this to work the temp schema has to exists. For this we 'cheat', every user will create and drop a temp table once upon connecting to the database, the function itself doublechecks and bails out IF NOT pg_my_temp_schema() > 0 > [ looks at CVS logs ... ] Another case that will cause plan > invalidation in 8.4 and not 8.3 is creating or dropping a schema. > Do you do a lot of that? Apart from the temp schema described above, we don't do that at all. ========== I'd also like to mention still our C-trigger..... : - it starts of with three plan variables containing '0' and it uses SPI_saveplan() to fill these with a cached plan - the three underlying tables will never change structure during such a session, so I expected these variables / plans to never change - if these static SPIPlanPtr variables are non-zero, the cached plans will be reused in a SPI_execute_plan() call Now the docs of SPI_saveplan say the following: "If one of the objects (a table, function, etc.) referenced by the prepared plan is dropped or redefined, then future executions of SPI_execute_plan may fail or return different results than the plan initially indicates. " Given the fact that 8.4 invalidates plans much more often, this might become tricky..... the 'SPI_execute_plan may fail' part can be handled, but I'm a bit worried about the 'may return different results' part..... Is there a way to determine (efficiently) that such a save plan has been invalidated? Or maybe for the future: would a kind of guarded plan pointer be handy...? -- Best, Frank.
Hi, > I think the attached patch will fix it for you. Great, thanks! I'll apply this tonight, will get back with results tomorrow. -- Best, Frank.
Tom Lane wrote: > I'm convinced that 8.3 has the same bug, in the sense that it could fail > this way if it had to revalidate a cached plan. Probably the reason you > didn't see it before is that 8.4 has more conditions in which it will > invalidate cached plans. In particular, is it possible that this > failure occurs when you try to call a plpgsql function that has just > been replaced via CREATE OR REPLACE FUNCTION? Do we have the stuff where an ANALYZE could invalidate cached plans? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Tom Lane wrote: > [ looks at CVS logs ... ] Another case that will cause plan > invalidation in 8.4 and not 8.3 is creating or dropping a schema. > Do you do a lot of that? Hmm, he says he uses temporary tables. Does creating/dropping a temporary namespace trigger plan cache invalidation? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Frank van Vugt <ftm.van.vugt@foxi.nl> writes: > I'd also like to mention still our C-trigger..... : > - it starts of with three plan variables containing '0' and it uses > SPI_saveplan() to fill these with a cached plan > - the three underlying tables will never change structure during such a > session, so I expected these variables / plans to never change > - if these static SPIPlanPtr variables are non-zero, the cached plans will be > reused in a SPI_execute_plan() call > Now the docs of SPI_saveplan say the following: > "If one of the objects (a table, function, etc.) referenced by the prepared > plan is dropped or redefined, then future executions of SPI_execute_plan may > fail or return different results than the plan initially indicates. " > Given the fact that 8.4 invalidates plans much more often, this might become > tricky..... the 'SPI_execute_plan may fail' part can be handled, but I'm a bit > worried about the 'may return different results' part..... Is there a way to > determine (efficiently) that such a save plan has been invalidated? You don't need to worry about that as long as you're not changing the table's DDL in a way that would affect the query result. It's entirely possible for the plan to get invalidated and regenerated underneath you, but that should be invisible to a SPI caller. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> I'm convinced that 8.3 has the same bug, in the sense that it could fail >> this way if it had to revalidate a cached plan. Probably the reason you >> didn't see it before is that 8.4 has more conditions in which it will >> invalidate cached plans. In particular, is it possible that this >> failure occurs when you try to call a plpgsql function that has just >> been replaced via CREATE OR REPLACE FUNCTION? > Do we have the stuff where an ANALYZE could invalidate cached plans? That was true in 8.3, too. But it doesn't affect this particular code path because exec_eval_simple_expr won't be used for anything that references a table. In 8.4 you could have a plan inval on an expression that just uses a user-defined function, but it's a lot harder to trigger such a thing in 8.3. Which is why we didn't get reports before. regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> [ looks at CVS logs ... ] Another case that will cause plan >> invalidation in 8.4 and not 8.3 is creating or dropping a schema. >> Do you do a lot of that? > Hmm, he says he uses temporary tables. Does creating/dropping a > temporary namespace trigger plan cache invalidation? It could, but I suppose that most DBs would soon reach a steady state where they have all the pg_temp_ schemas they need. (Hadn't occurred to me before, but this is a real good reason for the implementation not discarding the actual schema when a temp namespace is recycled ...) regards, tom lane
2009/7/14 Tom Lane <tgl@sss.pgh.pa.us>: > Tom Lane wrote: I've applied the patch - I hope it will fix our problem. I will ask users of our application to do those things they did when the error was thrown. Thanks a lot Tom. ML
Hi Tom, Op dinsdag 14 juli 2009, schreef Tom Lane: > I think the attached patch will fix it for you. Yep, after applying it yesterday evening, we didn't see this problem at all today, so it definitely looks as if this patch nailed it. Thanks for the [great|quick] support ! -- Best, Frank.
2009/7/15 Frank van Vugt <ftm.van.vugt@foxi.nl>: > Hi Tom, > Yep, after applying it yesterday evening, we didn't see this problem at all > today, so it definitely looks as if this patch nailed it. > Thanks for the [great|quick] support ! I confirm that too - no errors after applying the patch. Great work Tom, thanks!!!! ML