Discussion: 8.3.0 backend segfaults
(Sorry if duplicates show up; this is the third time I've posted this in the past 10 hours. I'm assuming it got dropped because of the attachments.)

Problem: apparently random segfaults, apparently query-agnostic; they seem to be more frequent when a pg_dump is running.

The query it most frequently segfaults on is:

select w.worker_id, w.worker_id as printerid, w.worker, w.alias,
       coalesce(w.alias, w.worker) as name, w.active, w.last_active,
       w.last_deactive,
       round(extract(epoch from now()) - extract(epoch from w.last_deactive)) as time_off
from workers as w
left join worker_vis as wv on wv.worker_id = w.worker_id
    and wv.defunct = 0
    and ( ((wv.auth_id = ?) and (wv.auth_class = data_class('user_id')))
       or ((wv.auth_id = ?) and (wv.auth_class = data_class('clinic_id'))) )
where wv.worker_vis_id is not null
  and w.defunct = 0
order by coalesce(w.alias, w.worker);

(sample arguments; they do not seem to make a difference)
167214, 13
167340, 16

But I have seen it with other, simpler queries such as (sorry, no backtraces for these... yet):

insert into audit_transaction (action, open_user_id, page_load_id, user_id)
values (?, ?, ?, ?);
ARGS = suds, 509057, 15148365, 167217

If I "select * from workers;" and "select * from workers_active;" everything looks good, no segfaults. I can also take a pg_dump if virtually nothing is using the database. Otherwise pg_dump dies randomly with errors like:

Dumping the contents of table "clients_audit" failed: PQgetCopyData() failed.
Dumping the contents of table "file_data" failed: PQgetCopyData() failed.
Dumping the contents of table "workers_audit" failed: PQgetCopyData() failed.

(Note these are probably 3 of the larger tables in the database. Maybe it has something to do with statement_timeout? I do have it set quite low, 3 min.)

If I manually prepare and execute the above queries they work fine:
prepare worker (bigint, bigint) as
select w.worker_id, w.worker_id as printerid, w.worker, w.alias,
       coalesce(w.alias, w.worker) as name, w.active, w.last_active,
       w.last_deactive,
       round(extract(epoch from now()) - extract(epoch from w.last_deactive)) as time_off
from workers as w
left join worker_vis as wv on wv.worker_id = w.worker_id
    and wv.defunct = 0
    and ( ((wv.auth_id = ?) and (wv.auth_class = data_class('user_id')))
       or ((wv.auth_id = ?) and (wv.auth_class = data_class('clinic_id'))) )
where wv.worker_vis_id is not null
  and w.defunct = 0
order by coalesce(w.alias, w.worker);

execute worker (167214, 13);

Core dumps and binaries are available if needed. I have about 6 core dumps across 2 different servers, which leads me to believe it's either a postgres bug or a corrupt database. (The other server is a PITR slave of the master, which I switched over to, to see if I could reproduce the segfault seen on the master.)

Nothing in dmesg, nothing in mcelog. The RAID controller is a 3ware 9550SX-12 with 6 70GB WD Raptor hard drives in a RAID 10; SMART tests pass fine... everything looks good.

The workload is a web application where each page begins a transaction, creates a temp table, does a few selects, inserts and updates, and then commits.

PostgreSQL 8.2.5 was working without any problems just prior to the upgrade.
\d workers
                                         Table "public.workers"
    Column     |           Type           |                           Modifiers
---------------+--------------------------+-----------------------------------------------------------------
 worker_id     | bigint                   | not null default nextval('workers_worker_id_seq'::regclass)
 date_created  | timestamp with time zone | not null default now()
 clientid      | bigint                   | not null
 worker        | text                     | not null
 alias         | text                     |
 job_type      | smallint                 | not null
 active        | smallint                 | not null default 0
 last_active   | timestamp with time zone |
 last_deactive | timestamp with time zone |
 defunct       | smallint                 | not null default 0
 audit_class   | integer                  |
 audit_date    | timestamp with time zone | not null default now()
 audit_desc    | text                     |
 audit_id      | bigint                   |
 audit_seq     | integer                  | not null default nextval('audit_basic_audit_seq_seq'::regclass)
 audit_table   | character varying(64)    |
 audit_tid     | bigint                   |
 audit_type    | character varying(32)    |
 audit_orig_id | integer                  |
 path          | text                     |
Indexes:
    "workers_pkey" PRIMARY KEY, btree (worker_id) CLUSTER
Triggers:
    workers_audit_aud BEFORE INSERT OR DELETE OR UPDATE ON workers FOR EACH ROW EXECUTE PROCEDURE audit_table_go()

\d worker_vis
                                       Table "public.worker_vis"
    Column     |           Type           |                           Modifiers
---------------+--------------------------+-----------------------------------------------------------------
 worker_vis_id | bigint                   | not null default nextval('worker_vis_ruleid_seq'::regclass)
 date_created  | timestamp with time zone | not null default now()
 worker_id     | bigint                   |
 auth_id       | bigint                   |
 auth_class    | smallint                 |
 defunct       | smallint                 | not null default 0
 audit_class   | integer                  |
 audit_date    | timestamp with time zone | not null default now()
 audit_desc    | text                     |
 audit_id      | bigint                   |
 audit_seq     | integer                  | not null default nextval('audit_basic_audit_seq_seq'::regclass)
 audit_table   | character varying(64)    |
 audit_tid     | bigint                   |
 audit_type    | character varying(32)    |
 audit_orig_id | integer                  |
Indexes:
    "worker_vis_pkey" PRIMARY KEY, btree (worker_vis_id)
    "worker_vis_vis_idx" btree (worker_id, defunct, auth_id, auth_class) CLUSTER
Triggers:
    worker_vis_audit_aud BEFORE INSERT OR DELETE OR UPDATE ON worker_vis FOR EACH ROW EXECUTE PROCEDURE audit_table_go()

Version information: following REL8_3_STABLE. Two versions tried; both crash with a similar backtrace (both died in CopySnapshot):

(currently running)
Author: teodor <teodor>
Date: Sun Mar 9 10:42:48 2008 +0000
    Revert changes of CompareTSQ: it affects existing btree indexes.

(was running but also crashed)
Author: tgl <tgl>
Date: Sun Mar 2 00:10:28 2008 +0000
    Venezuela Time now means UTC-4:30, not UTC-4:00. Adjust our treatment
    of "VET" accordingly. Per bug #3997 from Aaron Mizrachi.

Memory: 8G ECC
Swap: 16GB (unused even at segfault time, i.e. it does not seem to be running out of memory)
OS: Linux 2.6.24.3

select version();
PostgreSQL 8.3.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

Built with: --disable-nls --with-openssl --without-python --without-tcl --without-krb5 --with-perl --enable-debug --enable-integer-datetimes --with-ossp-uuid

postgresql.conf:
listen_addresses = '*'
port = 5432
max_connections = 500
superuser_reserved_connections = 3
ssl = on
shared_buffers = 1000MB
work_mem = 32MB
maintenance_work_mem = 256MB
max_stack_depth = 7MB
max_fsm_pages = 304800
max_fsm_relations = 3000
vacuum_cost_delay = 30
bgwriter_lru_maxpages = 500
wal_sync_method = fdatasync
wal_buffers = 256kB
checkpoint_segments = 8
checkpoint_timeout = 10min
archive_command = 'rsync -az -e "ssh" %p psql-sync@10.10.10.1:standby/cur/wal/%f </dev/null'
random_page_cost = 2.6
effective_cache_size = 6GB
default_statistics_target = 101
constraint_exclusion = on
log_destination = 'syslog'
client_min_messages = warning
log_min_messages = warning
silent_mode = on
log_line_prefix = '[%m] %h:%d '
log_lock_waits = on
log_temp_files = 20MB
autovacuum_naptime = 3min
autovacuum_vacuum_threshold = 200
autovacuum_analyze_threshold = 200
statement_timeout = '3min'
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
backslash_quote = off
escape_string_warning = on
sql_inheritance = off
standard_conforming_strings = on

cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 246
stepping        : 10
cpu MHz         : 2000.000
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good
bogomips        : 3991.79
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 246
stepping        : 10
cpu MHz         : 2000.000
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good
bogomips        : 3991.79
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

Core was generated by `postgres: docsysweb CaduRx 192.168.134.1(49502) BIND '.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b6ef7457f74 in memcpy () from /lib/libc.so.6
(gdb) bt
#0 0x00002b6ef7457f74 in memcpy () from /lib/libc.so.6
#1 0x000000000065fc88 in CopySnapshot (snapshot=0xa4e6e0) at tqual.c:1311
#2 0x000000000051cf34 in fmgr_sql (fcinfo=0x7fffb3f51160) at functions.c:299
#3 0x00000000005164cd in ExecMakeFunctionResult (fcache=0x10b4710, econtext=0x10b4ba0, isNull=0x7fffb3f515af "", isDone=0x0) at execQual.c:1351
#4 0x00000000005191ca in ExecEvalExprSwitchContext (expression=0x2b6f2c5dc078, econtext=0x10, isNull=0x24c8500 <Address 0x24c8500 out of bounds>, isDone=0x24c84c0) at execQual.c:3726
#5 0x00000000005690bd in evaluate_expr (expr=<value optimized out>, result_type=23, result_typmod=-1) at clauses.c:3273
#6 0x0000000000569d21 in simplify_function (funcid=2214548, result_type=23, result_typmod=-1, args=0xf8db40, allow_inline=1 '\001', context=0x7fffb3f51700) at clauses.c:2887
#7 0x000000000056a53a in eval_const_expressions_mutator (node=0xf24d90, context=0x7fffb3f51700) at clauses.c:1795
#8 0x000000000056b185 in estimate_expression_value (root=<value optimized out>, node=0x2b6f2c5dc078) at clauses.c:1719
#9 0x00000000005f8b86 in get_restriction_variable (root=0xf22d80, args=<value optimized out>, varRelid=0, vardata=0x7fffb3f51810, other=0x7fffb3f51848, varonleft=0x7fffb3f5185f "\001\236") at selfuncs.c:3570
#10 0x00000000005fa562 in eqsel (fcinfo=<value optimized out>) at selfuncs.c:169
#11 0x000000000064540c in OidFunctionCall4 (functionId=<value optimized out>, arg1=15871360, arg2=532, arg3=15879312, arg4=0) at fmgr.c:1615
#12 0x000000000056d0c2 in restriction_selectivity (root=0xf22d80, operator=532, args=0xf24c90, varRelid=0) at plancat.c:805
#13 0x000000000054e797 in clause_selectivity (root=0xf22d80, clause=0xf24de0, varRelid=0, jointype=JOIN_INNER) at clausesel.c:639
#14 0x000000000054e0df in clauselist_selectivity (root=0xf22d80, clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at clausesel.c:123
#15 0x000000000054e448 in clause_selectivity (root=0xf22d80, clause=0xf8bb20, varRelid=0, jointype=JOIN_INNER) at clausesel.c:576
#16 0x000000000054e72a in clause_selectivity (root=0xf22d80, clause=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at clausesel.c:597
#17 0x000000000054e0df in clauselist_selectivity (root=0xf22d80, clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at clausesel.c:123
#18 0x000000000054f99f in set_baserel_size_estimates (root=0x2b6f2c5dc078, rel=0xf26748) at costsize.c:2262
#19 0x000000000054d67c in set_rel_pathlist (root=0xf22d80, rel=0xf26748, rti=2, rte=0xec7c30) at allpaths.c:215
#20 0x000000000054df02 in make_one_rel (root=0xf22d80, joinlist=0xf26668) at allpaths.c:150
#21 0x000000000055f710 in query_planner (root=0xf22d80, tlist=<value optimized out>, tuple_fraction=0, limit_tuples=-1, cheapest_path=0x7fffb3f52220, sorted_path=0x7fffb3f52218, num_groups=0x7fffb3f52228) at planmain.c:249
#22 0x0000000000560041 in grouping_planner (root=0xf22d80, tuple_fraction=<value optimized out>) at planner.c:897
#23 0x00000000005610b8 in subquery_planner (glob=0xee11a0, parse=0xee1230, level=0, tuple_fraction=0, subroot=0x7fffb3f523e8) at planner.c:431
#24 0x00000000005614b1 in standard_planner (parse=0xee1230, cursorOptions=0, boundParams=0x0) at planner.c:158
#25 0x00000000005a1c61 in pg_plan_query (querytree=0xee1230, cursorOptions=0, boundParams=0x0) at postgres.c:681
#26 0x00000000005a1d13 in pg_plan_queries (querytrees=<value optimized out>, cursorOptions=0, boundParams=0x0, needSnapshot=0 '\0') at postgres.c:752
#27 0x0000000000633f93 in do_planning (querytrees=0xf22d50, cursorOptions=0) at plancache.c:560
#28 0x000000000063436b in RevalidateCachedPlan (plansource=0xcd02f0, useResOwner=0 '\0') at plancache.c:484
#29 0x00000000005a3b05 in PostgresMain (argc=4, argv=<value optimized out>, username=0x8f3020 "docsysweb") at postgres.c:1605
#30 0x0000000000579d8f in ServerLoop () at postmaster.c:3207
#31 0x000000000057a85c in PostmasterMain (argc=3, argv=0x8ee2d0) at postmaster.c:1029
#32 0x00000000005345ce in main (argc=3, argv=<value optimized out>) at main.c:188

(gdb) frame 28
#28 0x000000000063436b in RevalidateCachedPlan (plansource=0xcd02f0, useResOwner=0 '\0') at plancache.c:484
484             slist = do_planning(slist, plansource->cursor_options);
(gdb) print plansource->query_string
$1 = 0xf08ea0 "select w.worker_id, w.worker_id as printerid, w.worker, w.alias, coalesce(w.alias, w.worker) as name, w.active, w.last_active, w.last_deactive, round(extract(epoch from now()) - extract(epoch from w.l"...
"Alex Hunsaker" <badalex@gmail.com> writes: > Problem: Apparently random segfaults apparently query agnostic, seem > to be more frequent when a pg_dump is running Hmm, seems from the backtrace that we're trying to do a replan with an invalid ActiveSnapshot. What sequence of operations is the connected client performing, exactly? (I'm not worried about the details of the SQL per se, but of how it's preparing/issuing queries.) regards, tom lane
On Tue, Mar 11, 2008 at 10:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm, seems from the backtrace that we're trying to do a replan with an
> invalid ActiveSnapshot. What sequence of operations is the connected
> client performing, exactly? (I'm not worried about the details of the
> SQL per se, but of how it's preparing/issuing queries.)

It's through perl-dbi, using ->prepare, ->execute... is that what you mean? Also, this is with mod_perl and Apache::DBI. So a typical transaction looks like:

connect to postgres;
begin;
do stuff here;
commit or rollback;
(next page load)
begin;
...

so potentially a database connection/session is reused on the next page load, if that makes a difference.
FYI, right now I'm trying:

Author: tgl <tgl>
Date: Sat Feb 2 22:26:17 2008 +0000
    Fix WaitOnLock() to ensure that the process's "waiting" flag is reset
    after erroring out of a wait. We can use a PG_TRY block for this, but
    add a comment explaining why it'd be a bad idea to use it for any
    other state cleanup. Back-patch to 8.2. Prior releases had the same
    issue, but only with respect to the process title, which is likely to
    get reset almost immediately anyway after the transaction aborts, so
    it seems not worth changing them. In 8.2 and HEAD, the
    pg_stat_activity "waiting" flag could remain set incorrectly for a
    long time.

which is just after 8.3 was stamped. I'll see if anything happens tomorrow when the masses start using it again. I will also try to find a way to reproduce it.
Oops, we actually use DBI->prepare_cached(), not DBI->prepare(), which to my understanding should be roughly equivalent to this (because of Apache::DBI):

prepare query ...;
begin;
execute query;
commit;
(next page load)
begin;
execute query;
commit;

I can turn that off and use only DBI->prepare() as a test. Or heck, just cut DBI->prepare() out entirely, quote everything, and send it through DBI->do() instead. That is, if you think that could be the culprit.
"Alex Hunsaker" <badalex@gmail.com> writes: > I can turn that off and only use DBI->prepare() as a test. Or heck > just cut DBI->prepare() out and just quote everything and send them > through using DBI->do() instead. That is if you think that could be > the culprit. The path of control that's dying cannot be entered at all unless you're using a prepared statement, so cutting out DBI->prepare should make the problem go away. Now personally, I am much more interested in *reproducing* the problem than merely dodging it. I can understand if you just need a workaround yesterday, but please see if you can get a reproducible test case ... regards, tom lane
On Wed, Mar 12, 2008 at 12:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Now personally, I am much more interested in *reproducing* the problem
> than merely dodging it. I can understand if you just need a workaround
> yesterday, but please see if you can get a reproducible test case ...

Yesterday would be nice ;). I'll keep trying to get a test case. In the meantime, here is a new backtrace I just got (lord knows how... it seems that as soon as I stop trying to make it crash and look away, that's when it does). This is 2 commits (the one I referenced in my last email) after 8.3 was stamped. It happens to be the same query as before...

Core was generated by `postgres: docsysweb CaduRx 192.168.134.1(54493) BIND '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x000000000045a55e in heapgettup (scan=0xbd8bd0, dir=<value optimized out>, nkeys=0, key=0x0) at heapam.c:431
#2 0x000000000045a96d in heap_getnext (scan=0xbd8bd0, direction=14717080) at heapam.c:1259
#3 0x0000000000525478 in SeqNext (node=<value optimized out>) at nodeSeqscan.c:92
#4 0x000000000051a74f in ExecScan (node=0xbd7ed0, accessMtd=0x5253c0 <SeqNext>) at execScan.c:103
#5 0x0000000000514380 in ExecProcNode (node=0xbd7ed0) at execProcnode.c:334
#6 0x00000000005135ba in ExecutorRun (queryDesc=<value optimized out>, direction=ForwardScanDirection, count=1) at execMain.c:1233
#7 0x000000000051c82d in postquel_getnext (es=0xe09038, fcache=0xd0e4c0) at functions.c:378
#8 0x000000000051ce6d in fmgr_sql (fcinfo=0x7fffe6d7af90) at functions.c:479
#9 0x00000000005163bd in ExecMakeFunctionResult (fcache=0xd0df50, econtext=0xd0e3e0, isNull=0x7fffe6d7b3df "", isDone=0x0) at execQual.c:1351
#10 0x00000000005190ba in ExecEvalExprSwitchContext (expression=0x2ae1c608c658, econtext=0xe09098, isNull=0x32d <Address 0x32d out of bounds>, isDone=0x0) at execQual.c:3726
#11 0x0000000000568d4d in evaluate_expr (expr=<value optimized out>, result_type=23, result_typmod=-1) at clauses.c:3273
#12 0x00000000005699b1 in simplify_function (funcid=2214548, result_type=23, result_typmod=-1, args=0xcfbfa0, allow_inline=1 '\001', context=0x7fffe6d7b530) at clauses.c:2887
#13 0x000000000056a1ca in eval_const_expressions_mutator (node=0xc34170, context=0x7fffe6d7b530) at clauses.c:1795
#14 0x000000000056ae15 in estimate_expression_value (root=<value optimized out>, node=0x2ae1c608c658) at clauses.c:1719
#15 0x00000000005f8906 in get_restriction_variable (root=0xc32160, args=<value optimized out>, varRelid=0, vardata=0x7fffe6d7b640, other=0x7fffe6d7b678, varonleft=0x7fffe6d7b68f "\001\236") at selfuncs.c:3570
#16 0x00000000005fa2e2 in eqsel (fcinfo=<value optimized out>) at selfuncs.c:169
#17 0x0000000000644f4c in OidFunctionCall4 (functionId=<value optimized out>, arg1=12788064, arg2=532, arg3=12796016, arg4=0) at fmgr.c:1615
#18 0x000000000056cd52 in restriction_selectivity (root=0xc32160, operator=532, args=0xc34070, varRelid=0) at plancat.c:805
#19 0x000000000054e447 in clause_selectivity (root=0xc32160, clause=0xc341c0, varRelid=0, jointype=JOIN_INNER) at clausesel.c:639
#20 0x000000000054dd8f in clauselist_selectivity (root=0xc32160, clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at clausesel.c:123
#21 0x000000000054e0f8 in clause_selectivity (root=0xc32160, clause=0xcf9f80, varRelid=0, jointype=JOIN_INNER) at clausesel.c:576
#22 0x000000000054e3da in clause_selectivity (root=0xc32160, clause=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at clausesel.c:597
#23 0x000000000054dd8f in clauselist_selectivity (root=0xc32160, clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at clausesel.c:123
#24 0x000000000054f64f in set_baserel_size_estimates (root=0x2ae1c608c658, rel=0xc35b28) at costsize.c:2262
#25 0x000000000054d32c in set_rel_pathlist (root=0xc32160, rel=0xc35b28, rti=2, rte=0xc1d110) at allpaths.c:215
#26 0x000000000054dbb2 in make_one_rel (root=0xc32160, joinlist=0xc35a48) at allpaths.c:150
#27 0x000000000055f3a0 in query_planner (root=0xc32160, tlist=<value optimized out>, tuple_fraction=0, limit_tuples=-1, cheapest_path=0x7fffe6d7c050, sorted_path=0x7fffe6d7c048, num_groups=0x7fffe6d7c058) at planmain.c:249
#28 0x000000000055fcd1 in grouping_planner (root=0xc32160, tuple_fraction=<value optimized out>) at planner.c:897
#29 0x0000000000560d48 in subquery_planner (glob=0xe19b20, parse=0xe19bb0, level=0, tuple_fraction=0, subroot=0x7fffe6d7c218) at planner.c:431
#30 0x0000000000561141 in standard_planner (parse=0xe19bb0, cursorOptions=0, boundParams=0x0) at planner.c:158
#31 0x00000000005a1971 in pg_plan_query (querytree=0xe19bb0, cursorOptions=0, boundParams=0x0) at postgres.c:681
#32 0x00000000005a1a23 in pg_plan_queries (querytrees=<value optimized out>, cursorOptions=0, boundParams=0x0, needSnapshot=0 '\0') at postgres.c:752
#33 0x0000000000633cf3 in do_planning (querytrees=0xc32130, cursorOptions=0) at plancache.c:560
#34 0x00000000006340cb in RevalidateCachedPlan (plansource=0xb58460, useResOwner=0 '\0') at plancache.c:484
#35 0x00000000005a3815 in PostgresMain (argc=4, argv=<value optimized out>, username=0x8f32f0 "docsysweb") at postgres.c:1605
#36 0x0000000000579a1f in ServerLoop () at postmaster.c:3207
#37 0x000000000057a4ec in PostmasterMain (argc=3, argv=0x8ee2d0) at postmaster.c:1029
#38 0x000000000053441e in main (argc=3, argv=<value optimized out>) at main.c:188
"Alex Hunsaker" <badalex@gmail.com> writes: > the mean time here is a new backtrace I just got (lord knows how... it > seems to be as soon as I stop trying to make it crash and look away, > thats when it does). Not sure that you are clear on what's happening here, but the train of events is something like - you prepare a statement - somebody modifies the schema of one of the tables used in the statement - you try to execute the statement, and updating the prepared plan crashes If you say "none of my stuff is changing any schemas", then I'd guess that the triggering event is a background autovacuum, which forces replan just like an intentional schema change would. Does stopping autovacuum make the problem go away? regards, tom lane
> Work load is a web application where each page begins a transaction;
> creates a temp table, does a few selects, inserts and updates and then
> commits.

Are you sure you are calling DBI->connect *after* the Apache children are created? Major problems like this can happen if not. The use of prepare_cached() may be adding to the problem as well, especially if you are using temp tables. In DBD::Pg, prepared statements are not actually prepared (in most cases) until just before the first execute, to account for late bindings and to be more efficient. Some related DBD::Pg attribs to look at are pg_server_prepare and pg_prepare_now.

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
"Greg Sabino Mullane" <greg@turnstep.com> writes: > Are you sure you are calling DBI->connect *after* the Apache children > are created? Major problems like this can happen if not. Something like that might explain a crash in the Apache server, but hardly one in the PG backend. It looks to me like he's found a dangling-pointer problem, but we need to reproduce it before we can determine a fix... regards, tom lane
On Wed, Mar 12, 2008 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Not sure that you are clear on what's happening here, but the train of
> events is something like
> - you prepare a statement
> - somebody modifies the schema of one of the tables used in the statement
> - you try to execute the statement, and updating the prepared plan crashes
>
> If you say "none of my stuff is changing any schemas", then I'd guess
> that the triggering event is a background autovacuum, which forces
> replan just like an intentional schema change would. Does stopping
> autovacuum make the problem go away?

Yep, turning off autovacuum seems to have fixed it. And once I manually "vacuum analyze workers;" it blows up almost instantly.
On Wed, Mar 12, 2008 at 9:22 AM, Greg Sabino Mullane <greg@turnstep.com> wrote:
> Are you sure you are calling DBI->connect *after* the Apache children
> are created?

Yes.

> Major problems like this can happen if not. The use of
> prepare_cached() may be adding to the problem as well, especially if
> you are using temp tables.

The tables it's failing on happen to not be temp tables. (I only have 1 temp table and only do 1 insert into it for the entire transaction.)

> In DBD::Pg, prepared statements are not
> actually prepared (in most cases) until just before the first execute,
> to account for late bindings and to be more efficient. Some related
> DBD::Pg attribs to look at are pg_server_prepare and pg_prepare_now.

Hrm, well I don't ever prepare them in mass; I only prepare them when I'm going to be calling execute right afterward. But I'll try turning autovacuum back on and setting pg_prepare_now to 1 and see what happens.
"Alex Hunsaker" <badalex@gmail.com> writes: > On Wed, Mar 12, 2008 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If you say "none of my stuff is changing any schemas", then I'd guess >> that the triggering event is a background autovacuum, which forces >> replan just like an intentional schema change would. Does stopping >> autovacuum make the problem go away? > Yep turning off autovacuum seems to have fixed it. And once I > manually vacuum analyze workers; it blows up almost instantly. Yeah, I was going to suggest that you ought to be able to extract a test case involving doing a manual vacuum in between prepare and execute. I suspect it wouldn't even need to involve more than one session. regards, tom lane
On Wed, Mar 12, 2008 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I was going to suggest that you ought to be able to extract
> a test case involving doing a manual vacuum in between prepare and
> execute. I suspect it wouldn't even need to involve more than one
> session.

Here is what I'm trying right now, with no success:

3 clients doing this:

while (1) {
    $db->begin_work();
    my $sth = $db->prepare_cached('select * from junk left join junk as j on j.junk = junk.junk where junk.junk like ? limit 1;');
    print "VAC!\n";
    sleep 10;
    print "EX!\n";
    $sth->execute('junk') || die "failed: $!";
    $sth->fetchall_arrayref();
    $db->commit();
    $db->{'AutoCommit'} = 0;
    $db->{'AutoCommit'} = 1;
}

where when it prints VAC I run:

update junk set junk = 'junkab';
VACUUM ANALYZE verbose junk;

(also tried deleting, and inserting a bunch of junk...)

3 other clients doing:

while (1) {
    $db->begin_work();
    my $sth = $db->prepare_cached('select * from junk left join junk as j on j.junk = junk.junk where junk.junk like ? limit 1;');
    $sth->execute('junk') || die "failed: $!";
    $sth->fetchall_arrayref();
    $db->rollback();
}

\d junk
Table "public.junk"
 Column | Type | Modifiers
--------+------+-----------
 junk   | text |
Alex Hunsaker wrote:
> Here is what im trying right now with no success:
>
> my $sth = $db->prepare_cached('select * from junk left join
> junk as j on j.junk = junk.junk where junk.junk like ? limit 1;');
> [...]

Hi, this ("like ?") generally does not work in a prepared statement; maybe your problem is there. Try it with "= ?" instead (I know it is not the same), and if it works, the problem is the "like".

Regards,
Fernando
"Alex Hunsaker" <badalex@gmail.com> writes: > Here is what im trying right now with no success: > my $sth = $db->prepare_cached('select * from junk left join > junk as j on j.junk = junk.junk where junk.junk like ? limit 1;'); You need to duplicate more of the original query structure to provoke the problem, likely. The crash appeared to involve evaluation of an immutable SQL function ... regards, tom lane
On Wed, Mar 12, 2008 at 10:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You need to duplicate more of the original query structure to provoke
> the problem, likely. The crash appeared to involve evaluation of an
> immutable SQL function ...

Will do. Just for the record, it's defined as:

create or replace function data_class(text) returns integer as
    'select data_class from data_classes where data_id = $1 and defunct = 0'
language 'sql' stable strict;
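[Editorial sketch: why planning runs this function at all. Per the backtrace frames (estimate_expression_value -> simplify_function -> evaluate_expr), the planner pre-evaluates a stable function called with constant arguments to estimate the selectivity of the comparison while replanning. A rough illustration with a made-up stand-in function; lookup_class and its body are hypothetical:]

```sql
-- A stable SQL function shaped like data_class(); the body is a stand-in.
create or replace function lookup_class(text) returns integer as
    'select 1'
language sql stable strict;

-- While planning this query, the planner may itself call
-- lookup_class('user_id') to fold the comparison to a constant for
-- selectivity estimation; that evaluation is the code path that died
-- in the first backtrace.
explain select * from worker_vis
 where auth_class = lookup_class('user_id');
```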
Hrm, still no luck. I created a snapshot of the database and moved it onto another server so I could play with it... I've tried using just prepare on the console with the query that fails:

prepare worker (bigint, bigint) as
select w.worker_id, w.worker_id as printerid, w.worker, w.alias,
       coalesce(w.alias, w.worker) as name, w.active, w.last_active,
       w.last_deactive,
       round(extract(epoch from now()) - extract(epoch from w.last_deactive)) as time_off
from workers as w
left join worker_vis as wv on wv.worker_id = w.worker_id
    and wv.defunct = 0
    and ( ((wv.auth_id = $1) and (wv.auth_class = data_class('user_id')))
       or ((wv.auth_id = $2) and (wv.auth_class = data_class('clinic_id'))) )
where wv.worker_vis_id is not null
  and w.defunct = 0
order by coalesce(w.alias, w.worker);

update workers set last_active = now();
vacuum analyze workers;
update worker_vis set worker_id = worker_id;
vacuum analyze worker_vis;
update data_classes set defunct = 0 where defunct = 0;
vacuum analyze data_classes;
execute wrk;

That works as expected. I also tried each of those updates/vacuums separately.

So now I'm trying the "bad" query in the simple perl script I posted before; I've tried just one instance, and multiple instances... I guess next I'll try running all the SQL each web session generates before it crashes... unless anyone has any other bright ideas for me to try. Perhaps my simple updates are not enough for analyze to invalidate the query plan? Should I be doing inserts/deletes, or just more updates?

Below are the table counts and the definition of data_classes. That should be everything the query uses, except for the actual data, which I'm more than willing to provide (privately) if anyone thinks they have a great idea on how to reproduce it.
SELECT count(1) from workers;
 count
-------
   716

SELECT count(1) from worker_vis;
 count
-------
   577

SELECT count(1) from data_classes;
 count
-------
    75

\d data_classes
                                  Table "public.data_classes"
    Column    |           Type           |                             Modifiers
--------------+--------------------------+-------------------------------------------------------------------
 data_class   | integer                  | not null default nextval('data_classes_data_class_seq'::regclass)
 data_id      | character varying(80)    |
 data_table   | text                     |
 date_created | timestamp with time zone | default now()
 defunct      | smallint                 | default 0
 description  | character varying(80)    |
Indexes:
    "data_classes_pkey" PRIMARY KEY, btree (data_class)
    "data_class_data_id_idx" UNIQUE, btree (data_id)
    "data_class_data_idx" btree (data_id) WHERE defunct = 0
"Alex Hunsaker" <badalex@gmail.com> writes: > Perhaps my simple updates are not enough for analyze to > invalidate the query plan? Should I be doing inserts/deletes or just > more updates? No, AFAICS the plan inval will happen after a vacuum regardless of whether anything actually changed. I'm thinking that there's some property of the query or the invoked function that's relevant, but it's not clear what. (We'll know soon enough once we can reproduce it easily...) regards, tom lane
On Wed, Mar 12, 2008 at 1:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Alex Hunsaker" <badalex@gmail.com> writes:
>> Perhaps my simple updates are not enough for analyze to
>> invalidate the query plan? Should I be doing inserts/deletes or just
>> more updates?
>
> No, AFAICS the plan inval will happen after a vacuum regardless of
> whether anything actually changed.  I'm thinking that there's some
> property of the query or the invoked function that's relevant, but
> it's not clear what.  (We'll know soon enough once we can reproduce
> it easily...)
>
> 			regards, tom lane

OK, I got it... I have not been able to reproduce it in pure pgsql yet, so maybe it's a DBD::Pg bug? (CC'd Greg Sabino Mullane)

This crashes every time for me, and just about every part is required. That is:
- if I take out the function call, it works
- if I take out the junk = ?, it works
- if I take out the nextval('junk_seq'), it works
- if I take out the begin_work, it works
...

#!/usr/bin/perl
use strict;
use warnings;
use DBI();
use DBD::Pg();

my $db = DBI->connect('dbi:Pg:dbname=test;',
    {'pg_server_prepare' => 1, 'pg_prepare_now' => 1})
  || die "could not connect: $!";

$db->do('create table junk (junk text, junk_id int);');
$db->do('create sequence junk_seq;');
$db->do("create or replace function junk_func(int) returns integer as
'select junk_id from junk where junk_id = \$1;' language 'sql' stable
strict;");

my $sth = $db->prepare('select * from junk where junk = ? and junk_id
= junk_func(1) limit 1;');

$db->do('vacuum junk;');
$db->begin_work();

$db->do('select nextval(\'junk_seq\');');

$sth->execute('test') || die "failed: $!";
$sth->fetchall_arrayref();

$db->disconnect();
err, that should be (forgot the username, password placeholders):

> my $db = DBI->connect('dbi:Pg:dbname=test;', '', '',
>     {'pg_server_prepare' => 1, 'pg_prepare_now' => 1})
>   || die "could not connect: $!";
>
> $db->do('create table junk (junk text, junk_id int);');
> $db->do('create sequence junk_seq;');
> $db->do("create or replace function junk_func(int) returns integer as
> 'select junk_id from junk where junk_id = \$1;' language 'sql' stable
> strict;");
>
> my $sth = $db->prepare('select * from junk where junk = ? and junk_id
> = junk_func(1) limit 1;');
>
> $db->do('vacuum junk;');
> $db->begin_work();
>
> $db->do('select nextval(\'junk_seq\');');
>
> $sth->execute('test') || die "failed: $!";
> $sth->fetchall_arrayref();
>
> $db->disconnect();
On Wed, Mar 12, 2008 at 1:59 PM, Alex Hunsaker <badalex@gmail.com> wrote:
> $db->do("create or replace function junk_func(int) returns integer as
> 'select junk_id from junk where junk_id = \$1;' language 'sql' stable
> strict;");

Another oddity:

works:
create or replace function junk_func() returns integer as
  'select 1;' language 'sql' volatile strict;

works:
create or replace function junk_func() returns integer as
  'select 1 limit 1;' language 'sql' stable strict;

segfaults:
create or replace function junk_func() returns integer as
  'select 1;' language 'sql' stable strict;
"Alex Hunsaker" <badalex@gmail.com> writes: > Ok I got it... have not been able to reproduce it in pure pgsql yet > so... maybe its a DBD::Pg bug? (CC'd Greg Sabino Mullane) Great, got it --- you need a protocol-level prepare to cause it (or so I expect) and pure psql won't do that. I was able to reproduce it with a quick test program using PQprepare though. Now to dig ... regards, tom lane
This patch should fix it ...

			regards, tom lane

Index: postgres.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tcop/postgres.c,v
retrieving revision 1.544
diff -c -r1.544 postgres.c
*** postgres.c	10 Mar 2008 12:55:13 -0000	1.544
--- postgres.c	12 Mar 2008 23:42:32 -0000
***************
*** 730,760 ****
  pg_plan_queries(List *querytrees, int cursorOptions, ParamListInfo boundParams,
  				bool needSnapshot)
  {
! 	List	   *stmt_list = NIL;
! 	ListCell   *query_list;
  
! 	foreach(query_list, querytrees)
  	{
! 		Query	   *query = (Query *) lfirst(query_list);
! 		Node	   *stmt;
  
! 		if (query->commandType == CMD_UTILITY)
  		{
! 			/* Utility commands have no plans. */
! 			stmt = query->utilityStmt;
! 		}
! 		else
! 		{
! 			if (needSnapshot)
  			{
! 				ActiveSnapshot = CopySnapshot(GetTransactionSnapshot());
! 				needSnapshot = false;
  			}
! 			stmt = (Node *) pg_plan_query(query, cursorOptions, boundParams);
  		}
! 		stmt_list = lappend(stmt_list, stmt);
  	}
  
  	return stmt_list;
  }
--- 730,778 ----
  pg_plan_queries(List *querytrees, int cursorOptions, ParamListInfo boundParams,
  				bool needSnapshot)
  {
! 	List	   * volatile stmt_list = NIL;
! 	Snapshot	saveActiveSnapshot = ActiveSnapshot;
  
! 	/* PG_TRY to ensure previous ActiveSnapshot is restored on error */
! 	PG_TRY();
  	{
! 		Snapshot	mySnapshot = NULL;
! 		ListCell   *query_list;
  
! 		foreach(query_list, querytrees)
  		{
! 			Query	   *query = (Query *) lfirst(query_list);
! 			Node	   *stmt;
! 
! 			if (query->commandType == CMD_UTILITY)
! 			{
! 				/* Utility commands have no plans. */
! 				stmt = query->utilityStmt;
! 			}
! 			else
  			{
! 				if (needSnapshot && mySnapshot == NULL)
! 				{
! 					mySnapshot = CopySnapshot(GetTransactionSnapshot());
! 					ActiveSnapshot = mySnapshot;
! 				}
! 				stmt = (Node *) pg_plan_query(query, cursorOptions,
! 											  boundParams);
  			}
! 
! 			stmt_list = lappend(stmt_list, stmt);
  		}
! 
! 		if (mySnapshot)
! 			FreeSnapshot(mySnapshot);
! 	}
! 	PG_CATCH();
! 	{
! 		ActiveSnapshot = saveActiveSnapshot;
! 		PG_RE_THROW();
  	}
+ 	PG_END_TRY();
  
+ 	ActiveSnapshot = saveActiveSnapshot;
  	return stmt_list;
  }
On Wed, Mar 12, 2008 at 6:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This patch should fix it ...
>
> 			regards, tom lane

Excellent, that seems to have fixed the problem. Thanks!