Thread: 8.3.0 backend segfaults


8.3.0 backend segfaults

From:
"Alex Hunsaker"
Date:
(Sorry if duplicates show up; this is the third time I've posted this in
the past 10 hours. I'm assuming it got dropped because of the
attachments.)

Problem: Apparently random segfaults, apparently query-agnostic; they seem
  to be more frequent when a pg_dump is running

The most frequent query it segfaults with is:
  select w.worker_id, w.worker_id as printerid, w.worker, w.alias,
  coalesce(w.alias, w.worker) as name, w.active, w.last_active,
  w.last_deactive, round(extract(epoch from now()) - extract(epoch from
  w.last_deactive)) as time_off from workers as w left join worker_vis
  as wv on wv.worker_id = w.worker_id and wv.defunct = 0 and (
  ((wv.auth_id = ?) and (wv.auth_class = data_class('user_id'))) or
  ((wv.auth_id = ?) and (wv.auth_class = data_class('clinic_id'))))
  where wv.worker_vis_id is not null and w.defunct = 0 order by
  coalesce(w.alias, w.worker);

  (sample arguments, they do not seem to make a difference)
    167214, 13
    167340, 16

  But I have seen other, simpler queries, such as (sorry, no backtraces for
  these... yet):
  insert into audit_transaction (action, open_user_id, page_load_id,
  user_id) values (?, ?, ?, ?); ARGS=suds, 509057, 15148365, 167217

If I run select * from workers; and select * from workers_active;
everything looks good, no segfaults.  I can also take a pg_dump if
virtually nothing is using the database.  Otherwise pg_dump dies
randomly with errors like:

Dumping the contents of table "clients_audit" failed: PQgetCopyData() failed.
Dumping the contents of table "file_data" failed: PQgetCopyData() failed.
Dumping the contents of table "workers_audit" failed: PQgetCopyData() failed.

(note these are probably 3 of the larger tables in the database; maybe
it has something to do with statement_timeout? I do have it set quite
low (3 min))

If I manually prepare and execute the above queries they work fine.
prepare worker (bigint, bigint) as select w.worker_id, w.worker_id as
  printerid, w.worker, w.alias, coalesce(w.alias, w.worker) as name,
  w.active, w.last_active, w.last_deactive, round(extract(epoch from
  now()) - extract(epoch from w.last_deactive)) as time_off from workers
  as w left join worker_vis as wv on wv.worker_id = w.worker_id and
  wv.defunct = 0 and ( ((wv.auth_id = ?) and (wv.auth_class =
  data_class('user_id'))) or ((wv.auth_id = ?) and (wv.auth_class =
  data_class('clinic_id')))) where wv.worker_vis_id is not null and
  w.defunct = 0 order by coalesce(w.alias, w.worker);
  execute worker (167214, 13);

Core dumps and binaries available if needed. I have about 6 core dumps
across 2 different servers, which leads me to believe it's either a
postgres bug or a corrupt database (the other server is a PITR slave
of the master, which I switched over to in order to see if I could
reproduce the master's segfault).

Nothing in dmesg, nothing in mcelog.
Raid controller is a 3ware 9550SX-12 with 6 70GB WD Raptor hard drives
in a raid 10, smart tests pass fine... everything looks good.

Work load is a web application where each page begins a transaction,
creates a temp table, does a few selects, inserts and updates, and then
commits.
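As a rough illustration, one such page-load transaction might look like this in DBI (a sketch only; the table names, columns, and connection details here are hypothetical stand-ins, not the application's actual code):

```perl
use strict;
use warnings;
use DBI;

# Hypothetical connection string; the real application connects via Apache::DBI.
my $dbh = DBI->connect('dbi:Pg:dbname=app', 'user', '', { RaiseError => 1 });

$dbh->begin_work;

# Per-page scratch table, as described above.
$dbh->do('CREATE TEMP TABLE page_scratch (k text, v text) ON COMMIT DROP');

# A few selects, inserts and updates...
my ($n) = $dbh->selectrow_array(
    'SELECT count(*) FROM workers WHERE defunct = 0');
$dbh->do('UPDATE workers SET last_active = now() WHERE worker_id = ?',
         undef, 42);

$dbh->commit;
```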

  Postgresql 8.2.5 was working without any problems just prior to the upgrade.

  \d workers
                                           Table "public.workers"
    Column     |           Type           |
  Modifiers
  ---------------+--------------------------+-----------------------------------------------------------------
  worker_id     | bigint                   | not null default
  nextval('workers_worker_id_seq'::regclass)
  date_created  | timestamp with time zone | not null default now()
  clientid      | bigint                   | not null
  worker        | text                     | not null
  alias         | text                     |
  job_type      | smallint                 | not null
  active        | smallint                 | not null default 0
  last_active   | timestamp with time zone |
  last_deactive | timestamp with time zone |
  defunct       | smallint                 | not null default 0
  audit_class   | integer                  |
  audit_date    | timestamp with time zone | not null default now()
  audit_desc    | text                     |
  audit_id      | bigint                   |
  audit_seq     | integer                  | not null default
  nextval('audit_basic_audit_seq_seq'::regclass)
  audit_table   | character varying(64)    |
  audit_tid     | bigint                   |
  audit_type    | character varying(32)    |
  audit_orig_id | integer                  |
  path          | text                     |
  Indexes:
    "workers_pkey" PRIMARY KEY, btree (worker_id) CLUSTER
  Triggers:
    workers_audit_aud BEFORE INSERT OR DELETE OR UPDATE ON workers FOR
  EACH ROW EXECUTE PROCEDURE audit_table_go()

\d worker_vis
                                         Table "public.worker_vis"
    Column     |           Type           |
Modifiers
---------------+--------------------------+-----------------------------------------------------------------
 worker_vis_id | bigint                   | not null default
nextval('worker_vis_ruleid_seq'::regclass)
 date_created  | timestamp with time zone | not null default now()
 worker_id     | bigint                   |
 auth_id       | bigint                   |
 auth_class    | smallint                 |
 defunct       | smallint                 | not null default 0
 audit_class   | integer                  |
 audit_date    | timestamp with time zone | not null default now()
 audit_desc    | text                     |
 audit_id      | bigint                   |
 audit_seq     | integer                  | not null default
nextval('audit_basic_audit_seq_seq'::regclass)
 audit_table   | character varying(64)    |
 audit_tid     | bigint                   |
 audit_type    | character varying(32)    |
 audit_orig_id | integer                  |
Indexes:
    "worker_vis_pkey" PRIMARY KEY, btree (worker_vis_id)
    "worker_vis_vis_idx" btree (worker_id, defunct, auth_id, auth_class) CLUSTER
Triggers:
    worker_vis_audit_aud BEFORE INSERT OR DELETE OR UPDATE ON
worker_vis FOR EACH ROW EXECUTE PROCEDURE audit_table_go()

  Version Information:

  Following REL8_3_STABLE: 2 versions tried, both crash with a similar
  backtrace (both died in CopySnapshot):

  (currently running)
  Author: teodor <teodor>
  Date:   Sun Mar 9 10:42:48 2008 +0000

    Revert changes of CompareTSQ: it affects existing btree indexes.

  (was running but also crashed)
  Author: tgl <tgl>
  Date:   Sun Mar 2 00:10:28 2008 +0000

    Venezuela Time now means UTC-4:30, not UTC-4:00.  Adjust our treatment
    of "VET" accordingly.  Per bug #3997 from Aaron Mizrachi.

  Memory: 8G ECC
  Swap: 16GB (unused even at segfault time, i.e. it does not seem to be
  running out of memory)
  OS: Linux  2.6.24.3

  select version();
  PostgreSQL 8.3.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc
  (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

  built with: --disable-nls --with-openssl --without-python
  --without-tcl --without-krb5 --with-perl --enable-debug
  --enable-integer-datetimes --with-ossp-uuid

  postgresql.conf:
  listen_addresses = '*'
  port = 5432
  max_connections = 500
  superuser_reserved_connections = 3
  ssl = on
  shared_buffers = 1000MB
  work_mem = 32MB
  maintenance_work_mem = 256MB
  max_stack_depth = 7MB
  max_fsm_pages = 304800
  max_fsm_relations = 3000
  vacuum_cost_delay = 30
  bgwriter_lru_maxpages = 500
  wal_sync_method = fdatasync
  wal_buffers = 256kB
  checkpoint_segments = 8
  checkpoint_timeout = 10min
  archive_command = 'rsync -az -e "ssh" %p
  psql-sync@10.10.10.1:standby/cur/wal/%f </dev/null'
  random_page_cost = 2.6
  effective_cache_size = 6GB
  default_statistics_target = 101
  constraint_exclusion = on
  log_destination = 'syslog'
  client_min_messages = warning
  log_min_messages = warning
  silent_mode = on
  log_line_prefix = '[%m] %h:%d '
  log_lock_waits = on
  log_temp_files = 20MB
  autovacuum_naptime = 3min
  autovacuum_vacuum_threshold = 200
  autovacuum_analyze_threshold = 200
  statement_timeout = '3min'
  datestyle = 'iso, mdy'
  lc_messages = 'C'
  lc_monetary = 'C'
  lc_numeric = 'C'
  lc_time = 'C'
  backslash_quote = off
  escape_string_warning = on
  sql_inheritance = off
  standard_conforming_strings = on

  cat /proc/cpuinfo
  processor       : 0
  vendor_id       : AuthenticAMD
  cpu family      : 15
  model           : 5
  model name      : AMD Opteron(tm) Processor 246
  stepping        : 10
  cpu MHz         : 2000.000
  cache size      : 1024 KB
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 1
  wp              : yes
  flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
  mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm
  3dnowext 3dnow rep_good
  bogomips        : 3991.79
  TLB size        : 1024 4K pages
  clflush size    : 64
  cache_alignment : 64
  address sizes   : 40 bits physical, 48 bits virtual
  power management: ts fid vid ttp

  processor       : 1
  vendor_id       : AuthenticAMD
  cpu family      : 15
  model           : 5
  model name      : AMD Opteron(tm) Processor 246
  stepping        : 10
  cpu MHz         : 2000.000
  cache size      : 1024 KB
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 1
  wp              : yes
  flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
  mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm
  3dnowext 3dnow rep_good
  bogomips        : 3991.79
  TLB size        : 1024 4K pages
  clflush size    : 64
  cache_alignment : 64
  address sizes   : 40 bits physical, 48 bits virtual
  power management: ts fid vid ttp

Core was generated by `postgres: docsysweb CaduRx 192.168.134.1(49502) BIND   '.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b6ef7457f74 in memcpy () from /lib/libc.so.6
(gdb) bt
#0  0x00002b6ef7457f74 in memcpy () from /lib/libc.so.6
#1  0x000000000065fc88 in CopySnapshot (snapshot=0xa4e6e0) at tqual.c:1311
#2  0x000000000051cf34 in fmgr_sql (fcinfo=0x7fffb3f51160) at functions.c:299
#3  0x00000000005164cd in ExecMakeFunctionResult (fcache=0x10b4710,
econtext=0x10b4ba0, isNull=0x7fffb3f515af "", isDone=0x0) at
execQual.c:1351
#4  0x00000000005191ca in ExecEvalExprSwitchContext
(expression=0x2b6f2c5dc078, econtext=0x10, isNull=0x24c8500 <Address
0x24c8500 out of bounds>, isDone=0x24c84c0)
    at execQual.c:3726
#5  0x00000000005690bd in evaluate_expr (expr=<value optimized out>,
result_type=23, result_typmod=-1) at clauses.c:3273
#6  0x0000000000569d21 in simplify_function (funcid=2214548,
result_type=23, result_typmod=-1, args=0xf8db40, allow_inline=1
'\001', context=0x7fffb3f51700) at clauses.c:2887
#7  0x000000000056a53a in eval_const_expressions_mutator
(node=0xf24d90, context=0x7fffb3f51700) at clauses.c:1795
#8  0x000000000056b185 in estimate_expression_value (root=<value
optimized out>, node=0x2b6f2c5dc078) at clauses.c:1719
#9  0x00000000005f8b86 in get_restriction_variable (root=0xf22d80,
args=<value optimized out>, varRelid=0, vardata=0x7fffb3f51810,
other=0x7fffb3f51848,
    varonleft=0x7fffb3f5185f "\001\236") at selfuncs.c:3570
#10 0x00000000005fa562 in eqsel (fcinfo=<value optimized out>) at selfuncs.c:169
#11 0x000000000064540c in OidFunctionCall4 (functionId=<value
optimized out>, arg1=15871360, arg2=532, arg3=15879312, arg4=0) at
fmgr.c:1615
#12 0x000000000056d0c2 in restriction_selectivity (root=0xf22d80,
operator=532, args=0xf24c90, varRelid=0) at plancat.c:805
#13 0x000000000054e797 in clause_selectivity (root=0xf22d80,
clause=0xf24de0, varRelid=0, jointype=JOIN_INNER) at clausesel.c:639
#14 0x000000000054e0df in clauselist_selectivity (root=0xf22d80,
clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:123
#15 0x000000000054e448 in clause_selectivity (root=0xf22d80,
clause=0xf8bb20, varRelid=0, jointype=JOIN_INNER) at clausesel.c:576
#16 0x000000000054e72a in clause_selectivity (root=0xf22d80,
clause=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:597
#17 0x000000000054e0df in clauselist_selectivity (root=0xf22d80,
clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:123
#18 0x000000000054f99f in set_baserel_size_estimates
(root=0x2b6f2c5dc078, rel=0xf26748) at costsize.c:2262
#19 0x000000000054d67c in set_rel_pathlist (root=0xf22d80,
rel=0xf26748, rti=2, rte=0xec7c30) at allpaths.c:215
#20 0x000000000054df02 in make_one_rel (root=0xf22d80,
joinlist=0xf26668) at allpaths.c:150
#21 0x000000000055f710 in query_planner (root=0xf22d80, tlist=<value
optimized out>, tuple_fraction=0, limit_tuples=-1,
cheapest_path=0x7fffb3f52220,
    sorted_path=0x7fffb3f52218, num_groups=0x7fffb3f52228) at planmain.c:249
#22 0x0000000000560041 in grouping_planner (root=0xf22d80,
tuple_fraction=<value optimized out>) at planner.c:897
#23 0x00000000005610b8 in subquery_planner (glob=0xee11a0,
parse=0xee1230, level=0, tuple_fraction=0, subroot=0x7fffb3f523e8) at
planner.c:431
#24 0x00000000005614b1 in standard_planner (parse=0xee1230,
cursorOptions=0, boundParams=0x0) at planner.c:158
#25 0x00000000005a1c61 in pg_plan_query (querytree=0xee1230,
cursorOptions=0, boundParams=0x0) at postgres.c:681
#26 0x00000000005a1d13 in pg_plan_queries (querytrees=<value optimized
out>, cursorOptions=0, boundParams=0x0, needSnapshot=0 '\0') at
postgres.c:752
#27 0x0000000000633f93 in do_planning (querytrees=0xf22d50,
cursorOptions=0) at plancache.c:560
#28 0x000000000063436b in RevalidateCachedPlan (plansource=0xcd02f0,
useResOwner=0 '\0') at plancache.c:484
#29 0x00000000005a3b05 in PostgresMain (argc=4, argv=<value optimized
out>, username=0x8f3020 "docsysweb") at postgres.c:1605
#30 0x0000000000579d8f in ServerLoop () at postmaster.c:3207
#31 0x000000000057a85c in PostmasterMain (argc=3, argv=0x8ee2d0) at
postmaster.c:1029
#32 0x00000000005345ce in main (argc=3, argv=<value optimized out>) at
main.c:188
(gdb) frame 28
#28 0x000000000063436b in RevalidateCachedPlan (plansource=0xcd02f0,
useResOwner=0 '\0') at plancache.c:484
484                             slist = do_planning(slist,
plansource->cursor_options);
(gdb) print plansource->query_string
$1 = 0xf08ea0 "select w.worker_id, w.worker_id as printerid, w.worker,
w.alias, coalesce(w.alias, w.worker) as name, w.active, w.last_active,
w.last_deactive, round(extract(epoch from now()) - extract(epoch from
w.l"...

Re: 8.3.0 backend segfaults

From:
Tom Lane
Date:
"Alex Hunsaker" <badalex@gmail.com> writes:
> Problem: Apparently random segfaults apparently query agnostic, seem
>   to be more frequent when a pg_dump is running

Hmm, seems from the backtrace that we're trying to do a replan with an
invalid ActiveSnapshot.  What sequence of operations is the connected
client performing, exactly?  (I'm not worried about the details of the
SQL per se, but of how it's preparing/issuing queries.)

            regards, tom lane

Re: 8.3.0 backend segfaults

From:
"Alex Hunsaker"
Date:
On Tue, Mar 11, 2008 at 10:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Alex Hunsaker" <badalex@gmail.com> writes:
>  > Problem: Apparently random segfaults apparently query agnostic, seem
>  >   to be more frequent when a pg_dump is running
>
>  Hmm, seems from the backtrace that we're trying to do a replan with an
>  invalid ActiveSnapshot.  What sequence of operations is the connected
>  client performing, exactly?  (I'm not worried about the details of the
>  SQL per se, but of how it's preparing/issuing queries.)
>
>                         regards, tom lane
>

It's through perl-dbi using ->prepare, ->execute... is that what you
mean?  Also, this is with mod_perl and Apache::DBI.

So a typical transaction looks like

connect to postgres;
begin;
do stuff here;
commit or rollback;
(next page load)
begin;
....

so potentially a database connection/session is reused on the next
page load if that makes a difference.

Re: 8.3.0 backend segfaults

From:
"Alex Hunsaker"
Date:
FYI, right now I'm trying:

Author: tgl <tgl>
Date:   Sat Feb 2 22:26:17 2008 +0000

    Fix WaitOnLock() to ensure that the process's "waiting" flag is reset after
    erroring out of a wait.  We can use a PG_TRY block for this, but
add a comment
    explaining why it'd be a bad idea to use it for any other state cleanup.

    Back-patch to 8.2.  Prior releases had the same issue, but only with respect
    to the process title, which is likely to get reset almost immediately anyway
    after the transaction aborts, so it seems not worth changing them.  In 8.2
    and HEAD, the pg_stat_activity "waiting" flag could remain set incorrectly
    for a long time.

Which was just after 8.3 was stamped. I'll see if anything happens
tomorrow when the masses start using it again.  I will also try to
find a way to reproduce it.

Re: 8.3.0 backend segfaults

From:
"Alex Hunsaker"
Date:
Oops, we actually use DBI->prepare_cached(), not DBI->prepare(), which to
my understanding should be roughly equivalent to (because of
Apache::DBI):

prepare query ....;
begin;
execute query;
commit;
(next page load)
begin;
execute query;
commit;

I can turn that off and only use DBI->prepare() as a test.  Or heck,
just cut DBI->prepare() out entirely, quote everything, and send it
through using DBI->do() instead.  That is, if you think that could be
the culprit.

Re: 8.3.0 backend segfaults

From:
Tom Lane
Date:
"Alex Hunsaker" <badalex@gmail.com> writes:
> I can turn that off and only use DBI->prepare() as a test.  Or heck
> just cut DBI->prepare() out and just quote everything and send them
> through using DBI->do() instead.  That is if you think that could be
> the culprit.

The path of control that's dying cannot be entered at all unless you're
using a prepared statement, so cutting out DBI->prepare should make the
problem go away.

Now personally, I am much more interested in *reproducing* the problem
than merely dodging it.  I can understand if you just need a workaround
yesterday, but please see if you can get a reproducible test case ...

            regards, tom lane

Re: 8.3.0 backend segfaults

From:
"Alex Hunsaker"
Date:
On Wed, Mar 12, 2008 at 12:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  Now personally, I am much more interested in *reproducing* the problem
>  than merely dodging it.  I can understand if you just need a workaround
>  yesterday, but please see if you can get a reproducible test case ...
>
>                         regards, tom lane
>

Yesterday would be nice ;). I'll keep trying to get a test case.  In
the mean time, here is a new backtrace I just got (lord knows how... it
seems to be as soon as I stop trying to make it crash and look away,
that's when it does).  This is 2 commits (the one I reference in my
later email) after 8.3 was stamped.  It happens to be the same query as
before...

Core was generated by `postgres: docsysweb CaduRx 192.168.134.1(54493) BIND   '.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000000000045a55e in heapgettup (scan=0xbd8bd0, dir=<value
optimized out>, nkeys=0, key=0x0) at heapam.c:431
#2  0x000000000045a96d in heap_getnext (scan=0xbd8bd0,
direction=14717080) at heapam.c:1259
#3  0x0000000000525478 in SeqNext (node=<value optimized out>) at
nodeSeqscan.c:92
#4  0x000000000051a74f in ExecScan (node=0xbd7ed0, accessMtd=0x5253c0
<SeqNext>) at execScan.c:103
#5  0x0000000000514380 in ExecProcNode (node=0xbd7ed0) at execProcnode.c:334
#6  0x00000000005135ba in ExecutorRun (queryDesc=<value optimized
out>, direction=ForwardScanDirection, count=1) at execMain.c:1233
#7  0x000000000051c82d in postquel_getnext (es=0xe09038,
fcache=0xd0e4c0) at functions.c:378
#8  0x000000000051ce6d in fmgr_sql (fcinfo=0x7fffe6d7af90) at functions.c:479
#9  0x00000000005163bd in ExecMakeFunctionResult (fcache=0xd0df50,
econtext=0xd0e3e0, isNull=0x7fffe6d7b3df "", isDone=0x0) at
execQual.c:1351
#10 0x00000000005190ba in ExecEvalExprSwitchContext
(expression=0x2ae1c608c658, econtext=0xe09098, isNull=0x32d <Address
0x32d out of bounds>, isDone=0x0) at execQual.c:3726
#11 0x0000000000568d4d in evaluate_expr (expr=<value optimized out>,
result_type=23, result_typmod=-1) at clauses.c:3273
#12 0x00000000005699b1 in simplify_function (funcid=2214548,
result_type=23, result_typmod=-1, args=0xcfbfa0, allow_inline=1
'\001', context=0x7fffe6d7b530) at clauses.c:2887
#13 0x000000000056a1ca in eval_const_expressions_mutator
(node=0xc34170, context=0x7fffe6d7b530) at clauses.c:1795
#14 0x000000000056ae15 in estimate_expression_value (root=<value
optimized out>, node=0x2ae1c608c658) at clauses.c:1719
#15 0x00000000005f8906 in get_restriction_variable (root=0xc32160,
args=<value optimized out>, varRelid=0, vardata=0x7fffe6d7b640,
other=0x7fffe6d7b678,
    varonleft=0x7fffe6d7b68f "\001\236") at selfuncs.c:3570
#16 0x00000000005fa2e2 in eqsel (fcinfo=<value optimized out>) at selfuncs.c:169
#17 0x0000000000644f4c in OidFunctionCall4 (functionId=<value
optimized out>, arg1=12788064, arg2=532, arg3=12796016, arg4=0) at
fmgr.c:1615
#18 0x000000000056cd52 in restriction_selectivity (root=0xc32160,
operator=532, args=0xc34070, varRelid=0) at plancat.c:805
#19 0x000000000054e447 in clause_selectivity (root=0xc32160,
clause=0xc341c0, varRelid=0, jointype=JOIN_INNER) at clausesel.c:639
#20 0x000000000054dd8f in clauselist_selectivity (root=0xc32160,
clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:123
#21 0x000000000054e0f8 in clause_selectivity (root=0xc32160,
clause=0xcf9f80, varRelid=0, jointype=JOIN_INNER) at clausesel.c:576
#22 0x000000000054e3da in clause_selectivity (root=0xc32160,
clause=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:597
#23 0x000000000054dd8f in clauselist_selectivity (root=0xc32160,
clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:123
#24 0x000000000054f64f in set_baserel_size_estimates
(root=0x2ae1c608c658, rel=0xc35b28) at costsize.c:2262
#25 0x000000000054d32c in set_rel_pathlist (root=0xc32160,
rel=0xc35b28, rti=2, rte=0xc1d110) at allpaths.c:215
#26 0x000000000054dbb2 in make_one_rel (root=0xc32160,
joinlist=0xc35a48) at allpaths.c:150
#27 0x000000000055f3a0 in query_planner (root=0xc32160, tlist=<value
optimized out>, tuple_fraction=0, limit_tuples=-1,
cheapest_path=0x7fffe6d7c050, sorted_path=0x7fffe6d7c048,
    num_groups=0x7fffe6d7c058) at planmain.c:249
#28 0x000000000055fcd1 in grouping_planner (root=0xc32160,
tuple_fraction=<value optimized out>) at planner.c:897
#29 0x0000000000560d48 in subquery_planner (glob=0xe19b20,
parse=0xe19bb0, level=0, tuple_fraction=0, subroot=0x7fffe6d7c218) at
planner.c:431
#30 0x0000000000561141 in standard_planner (parse=0xe19bb0,
cursorOptions=0, boundParams=0x0) at planner.c:158
#31 0x00000000005a1971 in pg_plan_query (querytree=0xe19bb0,
cursorOptions=0, boundParams=0x0) at postgres.c:681
#32 0x00000000005a1a23 in pg_plan_queries (querytrees=<value optimized
out>, cursorOptions=0, boundParams=0x0, needSnapshot=0 '\0') at
postgres.c:752
#33 0x0000000000633cf3 in do_planning (querytrees=0xc32130,
cursorOptions=0) at plancache.c:560
#34 0x00000000006340cb in RevalidateCachedPlan (plansource=0xb58460,
useResOwner=0 '\0') at plancache.c:484
#35 0x00000000005a3815 in PostgresMain (argc=4, argv=<value optimized
out>, username=0x8f32f0 "docsysweb") at postgres.c:1605
#36 0x0000000000579a1f in ServerLoop () at postmaster.c:3207
#37 0x000000000057a4ec in PostmasterMain (argc=3, argv=0x8ee2d0) at
postmaster.c:1029
#38 0x000000000053441e in main (argc=3, argv=<value optimized out>) at
main.c:188

Re: 8.3.0 backend segfaults

From:
Tom Lane
Date:
"Alex Hunsaker" <badalex@gmail.com> writes:
> the mean time here is a new backtrace I just got (lord knows how... it
> seems to be as soon as I stop trying to make it crash and look away,
> thats when it does).

Not sure that you are clear on what's happening here, but the train of
events is something like
    - you prepare a statement
    - somebody modifies the schema of one of the tables used in the
      statement
    - you try to execute the statement, and updating the prepared
      plan crashes

If you say "none of my stuff is changing any schemas", then I'd guess
that the triggering event is a background autovacuum, which forces
replan just like an intentional schema change would.  Does stopping
autovacuum make the problem go away?
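In DBI terms, the sequence described above can be sketched like this (illustration only; the table name and connection details are hypothetical, and this is not a confirmed reproduction recipe):

```perl
use strict;
use warnings;
use DBI;

my $dbh   = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });
my $other = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });

# 1. Prepare (and run once, so a cached plan exists server-side).
my $sth = $dbh->prepare('SELECT * FROM workers WHERE worker_id = ?');
$sth->execute(1);
$sth->finish;

# 2. Something invalidates the cached plan -- here a manual VACUUM ANALYZE
#    in another session, standing in for a background autovacuum.
$other->do('VACUUM ANALYZE workers');

# 3. Execute again: the backend must replan the cached statement, which is
#    the code path where the reported crash occurs.
$sth->execute(1);
```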

            regards, tom lane

Re: 8.3.0 backend segfaults

From:
"Greg Sabino Mullane"
Date:


> Work load is a web application where each page beings a transaction;
> creates a temp table, does a few selects, inserts and updates and the
> commits.

Are you sure you are calling DBI->connect *after* the Apache children
are created? Major problems like this can happen if not. The use of
prepare_cached() may be adding to the problem as well, especially if
you are using temp tables. In DBD::Pg, prepared statements are not
actually prepared (in most cases) until just before the first execute,
to account for late bindings and to be more efficient. Some related
DBD::Pg attribs to look at are pg_server_prepare and pg_prepare_now.
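For reference, those DBD::Pg attributes can be set per statement at prepare time; a minimal sketch (connection details and query are hypothetical):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });

# pg_server_prepare => 1 asks DBD::Pg to use a real server-side prepared
# statement; pg_prepare_now => 1 forces the PREPARE to happen immediately
# at prepare() time rather than being deferred until the first execute().
my $sth = $dbh->prepare(
    'SELECT * FROM workers WHERE worker_id = ?',
    { pg_server_prepare => 1, pg_prepare_now => 1 },
);
$sth->execute(42);
```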


--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
PGP Key: 0x14964AC8 200803121121
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

Re: 8.3.0 backend segfaults

From:
Tom Lane
Date:
"Greg Sabino Mullane" <greg@turnstep.com> writes:
> Are you sure you are calling DBI->connect *after* the Apache children
> are created? Major problems like this can happen if not.

Something like that might explain a crash in the Apache server, but
hardly one in the PG backend.

It looks to me like he's found a dangling-pointer problem, but we need
to reproduce it before we can determine a fix...

            regards, tom lane

Re: 8.3.0 backend segfaults

From:
"Alex Hunsaker"
Date:
On Wed, Mar 12, 2008 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Alex Hunsaker" <badalex@gmail.com> writes:
>
> > the mean time here is a new backtrace I just got (lord knows how... it
>  > seems to be as soon as I stop trying to make it crash and look away,
>  > thats when it does).
>
>  Not sure that you are clear on what's happening here, but the train of
>  events is something like
>         - you prepare a statement
>         - somebody modifies the schema of one of the tables used in the
>           statement
>         - you try to execute the statement, and updating the prepared
>           plan crashes
>
>  If you say "none of my stuff is changing any schemas", then I'd guess
>  that the triggering event is a background autovacuum, which forces
>  replan just like an intentional schema change would.  Does stopping
>  autovacuum make the problem go away?

Yep, turning off autovacuum seems to have fixed it.  And once I
manually vacuum analyze workers, it blows up almost instantly.

Re: 8.3.0 backend segfaults

From:
"Alex Hunsaker"
Date:
On Wed, Mar 12, 2008 at 9:22 AM, Greg Sabino Mullane <greg@turnstep.com> wrote:
>
>  Are you sure you are calling DBI->connect *after* the Apache children
>  are created?

Yes.

>  Major problems like this can happen if not. The use of
>  prepare_cached() may be adding to the problem as well, especially if
>  you are using temp tables.

The tables it's failing on happen to not be temp tables (I only have 1
temp table and only do 1 insert into it for the entire transaction).

> In DBD::Pg, prepared statements are not
>  actually prepared (in most cases) until just before the first execute,
>  to account for late bindings and to be more efficient. Some related
>  DBD::Pg attribs to look at are pg_server_prepare and pg_prepare_now.

Hrm, well I don't ever prepare them en masse; I only prepare them
when I'm going to call execute right afterwards.  But I'll try
turning autovacuum back on and setting pg_prepare_now to 1 and see what
happens.

Re: 8.3.0 backend segfaults

From:
Tom Lane
Date:
"Alex Hunsaker" <badalex@gmail.com> writes:
> On Wed, Mar 12, 2008 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If you say "none of my stuff is changing any schemas", then I'd guess
>> that the triggering event is a background autovacuum, which forces
>> replan just like an intentional schema change would.  Does stopping
>> autovacuum make the problem go away?

> Yep turning off autovacuum seems to have fixed it.  And once I
> manually vacuum analyze workers; it blows up almost instantly.

Yeah, I was going to suggest that you ought to be able to extract
a test case involving doing a manual vacuum in between prepare and
execute.  I suspect it wouldn't even need to involve more than one
session.

            regards, tom lane

Re: 8.3.0 backend segfaults

From:
"Alex Hunsaker"
Date:
On Wed, Mar 12, 2008 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Alex Hunsaker" <badalex@gmail.com> writes:
>  > On Wed, Mar 12, 2008 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> >> If you say "none of my stuff is changing any schemas", then I'd guess
>  >> that the triggering event is a background autovacuum, which forces
>  >> replan just like an intentional schema change would.  Does stopping
>  >> autovacuum make the problem go away?
>
>  > Yep turning off autovacuum seems to have fixed it.  And once I
>  > manually vacuum analyze workers; it blows up almost instantly.
>
>  Yeah, I was going to suggest that you ought to be able to extract
>  a test case involving doing a manual vacuum in between prepare and
>  execute.  I suspect it wouldn't even need to involve more than one
>  session.

Here is what I'm trying right now, with no success:

3 clients doing this:
while(1)
{
        $db->begin_work();
        my $sth = $db->prepare_cached('select * from junk left join
junk as j on j.junk = junk.junk where junk.junk like ? limit 1;');
        print "VAC!\n";
        sleep 10;
        print "EX!\n";
        $sth->execute('junk') || die "failed: $!";
        $sth->fetchall_arrayref();
        $db->commit();
        $db->{'AutoCommit'} = 0;
        $db->{'AutoCommit'} = 1;
}

where, when it prints VAC, I run:
update junk set junk = 'junkab';
VACUUM ANALYZE verbose junk;
(also tried deleting, and inserting a bunch of junk...)

3 other clients doing:
while(1)
{
        $db->begin_work();
        my $sth = $db->prepare_cached('select * from junk left join
junk as j on j.junk = junk.junk where junk.junk like ? limit 1;');
        $sth->execute('junk') || die "failed: $!";
        $sth->fetchall_arrayref();
        $db->rollback();
}

\d junk
    Table "public.junk"
 Column | Type | Modifiers
--------+------+-----------
 junk   | text |

Re: 8.3.0 backend segfaults

From:
Rodriguez Fernando
Date:
Alex Hunsaker wrote:
>
> On Wed, Mar 12, 2008 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> "Alex Hunsaker" <badalex@gmail.com> writes:
>>  > On Wed, Mar 12, 2008 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>>
>>>> If you say "none of my stuff is changing any schemas", then I'd guess
>>>>
>>  >> that the triggering event is a background autovacuum, which forces
>>  >> replan just like an intentional schema change would.  Does stopping
>>  >> autovacuum make the problem go away?
>>
>>  > Yep turning off autovacuum seems to have fixed it.  And once I
>>  > manually vacuum analyze workers; it blows up almost instantly.
>>
>>  Yeah, I was going to suggest that you ought to be able to extract
>>  a test case involving doing a manual vacuum in between prepare and
>>  execute.  I suspect it wouldn't even need to involve more than one
>>  session.
>>
>
> Here is what im trying right now with no success:
>
> 3 clients doing this:
> while(1)
> {
>         $db->begin_work();
>         my $sth = $db->prepare_cached('select * from junk left join
> junk as j on j.junk = junk.junk where junk.junk like ? limit 1;');
>         print "VAC!\n";
>         sleep 10;
>         print "EX!\n";
>         $sth->execute('junk') || die "failed: $!";
>         $sth->fetchall_arrayref();
>         $db->commit();
>         $db->{'AutoCommit'} = 0;
>         $db->{'AutoCommit'} = 1;
> }
>
> where when it prints VAC I :
> update junk set junk = 'junkab';
> VACUUM ANALYZE verbose junk;
> (also tried deleting, and inserting a bunch of junk...)
>
> 3 other clients doing:
> while(1)
> {
>         $db->begin_work();
>         my $sth = $db->prepare_cached('select * from junk left join
> junk as j on j.junk = junk.junk where junk.junk like ? limit 1;');
>         $sth->execute('junk') || die "failed: $!";
>         $sth->fetchall_arrayref();
>         $db->rollback();
> }
>
> \d junk
>     Table "public.junk"
>  Column | Type | Modifiers
> --------+------+-----------
>  junk   | text |
>
>
Hello, this (like ?) generally does not work in a prepared statement;
maybe your problem is there.
Try with = ? (I know it is not the same thing), and if that works, the
problem is the LIKE.

Regards, Fernando

Re: 8.3.0 backend segfaults

From
Tom Lane
Date:
"Alex Hunsaker" <badalex@gmail.com> writes:
> Here is what im trying right now with no success:

>         my $sth = $db->prepare_cached('select * from junk left join
> junk as j on j.junk = junk.junk where junk.junk like ? limit 1;');

You need to duplicate more of the original query structure to provoke
the problem, likely.  The crash appeared to involve evaluation of an
immutable SQL function ...

            regards, tom lane

Re: 8.3.0 backend segfaults

From
"Alex Hunsaker"
Date:
On Wed, Mar 12, 2008 at 10:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Alex Hunsaker" <badalex@gmail.com> writes:
>
> > Here is what im trying right now with no success:
>
>
> >         my $sth = $db->prepare_cached('select * from junk left join
>  > junk as j on j.junk = junk.junk where junk.junk like ? limit 1;');
>
>  You need to duplicate more of the original query structure to provoke
>  the problem, likely.  The crash appeared to involve evaluation of an
>  immutable SQL function ...

Will do.

Just for the record, it's defined as:
create or replace function data_class(text) returns integer as 'select
data_class from data_classes where data_id = $1 and defunct = 0'
language 'sql' stable strict;

Re: 8.3.0 backend segfaults

From
"Alex Hunsaker"
Date:
Hrm still no luck.

I created a snapshot of the database and moved it onto another server
so I could play with it...

I've tried using just PREPARE on the console with the query that fails:
prepare worker (bigint, bigint) as select w.worker_id, w.worker_id as
 printerid, w.worker, w.alias, coalesce(w.alias, w.worker) as name,
 w.active, w.last_active, w.last_deactive, round(extract(epoch from
 now()) - extract(epoch from w.last_deactive)) as time_off from workers
 as w left join worker_vis as wv on wv.worker_id = w.worker_id and
 wv.defunct = 0 and ( ((wv.auth_id = $1) and (wv.auth_class =
 data_class('user_id'))) or ((wv.auth_id = $2) and (wv.auth_class =
 data_class('clinic_id')))) where wv.worker_vis_id is not null and
 w.defunct = 0 order by coalesce(w.alias, w.worker);

update workers set last_active = now();
vacuum analyze workers;

update worker_vis set worker_id = worker_id;
vacuum analyze worker_vis;

update data_classes set defunct = 0 where defunct = 0;
vacuum analyze data_classes;

execute worker (167214, 13);

That works as expected.  I also tried each of those updates/vacuums separately.

So now I'm trying the "bad" query in the simple Perl script I posted
before; I've tried just one instance, and multiple instances... I
guess next I'll try running all the SQL each web session generates
before it crashes... unless anyone has any other bright ideas for me
to try.  Perhaps my simple updates are not enough for analyze to
invalidate the query plan?  Should I be doing inserts/deletes or just
more updates?

Below are the table counts and the definition of data_classes.  That
should be everything the query uses, except for the actual data, which
I'm more than willing to provide (privately) if anyone thinks they
have a great idea on how to reproduce it.

 SELECT count(1) from workers;
 count
-------
   716
 SELECT count(1) from worker_vis;
 count
-------
   577

 SELECT count(1) from data_classes;
 count
-------
    75

 \d data_classes
                                 Table "public.data_classes"
    Column    |           Type           |                             Modifiers
--------------+--------------------------+-------------------------------------------------------------------
 data_class   | integer                  | not null default nextval('data_classes_data_class_seq'::regclass)
 data_id      | character varying(80)    |
 data_table   | text                     |
 date_created | timestamp with time zone | default now()
 defunct      | smallint                 | default 0
 description  | character varying(80)    |
Indexes:
    "data_classes_pkey" PRIMARY KEY, btree (data_class)
    "data_class_data_id_idx" UNIQUE, btree (data_id)
    "data_class_data_idx" btree (data_id) WHERE defunct = 0

Re: 8.3.0 backend segfaults

From
Tom Lane
Date:
"Alex Hunsaker" <badalex@gmail.com> writes:
> Perhaps my simple updates are not enough for analyze to
> invalidate the query plan?  Should I be doing inserts/deletes or just
> more updates?

No, AFAICS the plan inval will happen after a vacuum regardless of whether
anything actually changed.  I'm thinking that there's some property
of the query or the invoked function that's relevant, but it's not
clear what.  (We'll know soon enough once we can reproduce it easily...)

            regards, tom lane

Re: 8.3.0 backend segfaults

From
"Alex Hunsaker"
Date:
On Wed, Mar 12, 2008 at 1:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Alex Hunsaker" <badalex@gmail.com> writes:
>
> > Perhaps my simple updates are not enough for analyze to
>  > invalidate the query plan?  Should I be doing inserts/deletes or just
>  > more updates?
>
>  No, AFAICS the plan inval will happen after a vacuum regardless of whether
>  anything actually changed.  I'm thinking that there's some property
>  of the query or the invoked function that's relevant, but it's not
>  clear what.  (We'll know soon enough once we can reproduce it easily...)
>
>                         regards, tom lane
>

Ok I got it... I have not been able to reproduce it in pure psql yet,
so... maybe it's a DBD::Pg bug? (CC'd Greg Sabino Mullane)

This crashes every time for me, and just about every part is required.  That is:
If I take out the function call, it works.
If I take out the junk = ?, it works.
If I take out the nextval('junk_seq'), it works.
If I take out the begin_work, it works.

....

#!/usr/bin/perl
use strict;
use warnings;

use DBI();
use DBD::Pg();

my $db = DBI->connect('dbi:Pg:dbname=test;', {'pg_server_prepare'=>1,
'pg_prepare_now'=>1}) || die "could not connect: $!";

$db->do('create table junk (junk text, junk_id int);');
$db->do('create sequence junk_seq;');
$db->do("create or replace function junk_func(int) returns integer as
'select junk_id from junk where junk_id = \$1;' language 'sql' stable
strict;");

my $sth = $db->prepare('select * from junk where junk = ? and junk_id
= junk_func(1) limit 1;');

$db->do('vacuum junk;');
$db->begin_work();

$db->do('select nextval(\'junk_seq\');');

$sth->execute('test') || die "failed: $!";
$sth->fetchall_arrayref();

$db->disconnect();

Re: 8.3.0 backend segfaults

From
"Alex Hunsaker"
Date:
Err, that should be (I forgot the username/password placeholders):

> my $db = DBI->connect('dbi:Pg:dbname=test;', '', '', {'pg_server_prepare'=>1, 'pg_prepare_now'=>1}) || die "could not connect: $!";
>  $db->do('create table junk (junk text, junk_id int);');
>  $db->do('create sequence junk_seq;');
>  $db->do("create or replace function junk_func(int) returns integer as
>  'select junk_id from junk where junk_id = \$1;' language 'sql' stable
>  strict;");
>
>  my $sth = $db->prepare('select * from junk where junk = ? and junk_id
>  = junk_func(1) limit 1;');
>
>  $db->do('vacuum junk;');
>  $db->begin_work();
>
>  $db->do('select nextval(\'junk_seq\');');
>
>  $sth->execute('test') || die "failed: $!";
>  $sth->fetchall_arrayref();
>
>  $db->disconnect();
>

Re: 8.3.0 backend segfaults

From
"Alex Hunsaker"
Date:
On Wed, Mar 12, 2008 at 1:59 PM, Alex Hunsaker <badalex@gmail.com> wrote:
>  $db->do("create or replace function junk_func(int) returns integer as
>  'select junk_id from junk where junk_id = \$1;' language 'sql' stable
>  strict;");
>

Another oddity:

works: create or replace function junk_func() returns integer as
'select 1;' language 'sql' volatile strict;
works: create or replace function junk_func() returns integer as
'select 1 limit 1;' language 'sql' stable strict;

segfaults: create or replace function junk_func() returns integer as
'select 1;' language 'sql' stable strict;

Re: 8.3.0 backend segfaults

From
Tom Lane
Date:
"Alex Hunsaker" <badalex@gmail.com> writes:
> Ok I got it... have not been able to reproduce it in pure pgsql yet
> so... maybe its a DBD::Pg bug? (CC'd Greg Sabino Mullane)

Great, got it --- you need a protocol-level prepare to cause it (or so
I expect) and pure psql won't do that.  I was able to reproduce it
with a quick test program using PQprepare though.  Now to dig ...

            regards, tom lane
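(Editorial note: Tom's "quick test program using PQprepare" is not included in
the thread.  Purely as an illustration, a minimal libpq sketch of such a
reproducer might look like the following.  This is a hypothetical
reconstruction: the connection string, and the junk table, junk_seq sequence,
and junk_func() from Alex's script, are all assumed to already exist on a
running, unpatched 8.3.0 server.)

```c
/* Hypothetical PQprepare-based reproducer -- NOT the actual program from the
 * thread.  Needs a live server with Alex's junk schema loaded.
 * Build: cc repro.c -lpq -o repro
 */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn     *conn = PQconnectdb("dbname=test");
    const char *val = "test";

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Protocol-level prepare -- psql's SQL-level PREPARE does not exercise
     * this code path, which is why pure psql couldn't trigger the crash. */
    PQclear(PQprepare(conn, "crash",
                      "select * from junk where junk = $1 "
                      "and junk_id = junk_func(1) limit 1",
                      1, NULL));

    PQclear(PQexec(conn, "vacuum junk"));   /* invalidates the cached plan */
    PQclear(PQexec(conn, "begin"));
    PQclear(PQexec(conn, "select nextval('junk_seq')"));

    /* Execution forces a replan of the invalidated statement; the unpatched
     * 8.3.0 backend segfaulted here. */
    PQclear(PQexecPrepared(conn, "crash", 1, &val, NULL, NULL, 0));

    PQfinish(conn);
    return 0;
}
```

No automated check is attached since this only demonstrates anything against a
live server; with the patch below applied, the final execute completes
normally.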

Re: 8.3.0 backend segfaults

From
Tom Lane
Date:
This patch should fix it ...

            regards, tom lane

Index: postgres.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tcop/postgres.c,v
retrieving revision 1.544
diff -c -r1.544 postgres.c
*** postgres.c    10 Mar 2008 12:55:13 -0000    1.544
--- postgres.c    12 Mar 2008 23:42:32 -0000
***************
*** 730,760 ****
  pg_plan_queries(List *querytrees, int cursorOptions, ParamListInfo boundParams,
                  bool needSnapshot)
  {
!     List       *stmt_list = NIL;
!     ListCell   *query_list;

!     foreach(query_list, querytrees)
      {
!         Query       *query = (Query *) lfirst(query_list);
!         Node       *stmt;

!         if (query->commandType == CMD_UTILITY)
          {
!             /* Utility commands have no plans. */
!             stmt = query->utilityStmt;
!         }
!         else
!         {
!             if (needSnapshot)
              {
!                 ActiveSnapshot = CopySnapshot(GetTransactionSnapshot());
!                 needSnapshot = false;
              }
!             stmt = (Node *) pg_plan_query(query, cursorOptions, boundParams);
          }

!         stmt_list = lappend(stmt_list, stmt);
      }

      return stmt_list;
  }
--- 730,778 ----
  pg_plan_queries(List *querytrees, int cursorOptions, ParamListInfo boundParams,
                  bool needSnapshot)
  {
!     List       * volatile stmt_list = NIL;
!     Snapshot    saveActiveSnapshot = ActiveSnapshot;

!     /* PG_TRY to ensure previous ActiveSnapshot is restored on error */
!     PG_TRY();
      {
!         Snapshot    mySnapshot = NULL;
!         ListCell   *query_list;

!         foreach(query_list, querytrees)
          {
!             Query       *query = (Query *) lfirst(query_list);
!             Node       *stmt;
!
!             if (query->commandType == CMD_UTILITY)
!             {
!                 /* Utility commands have no plans. */
!                 stmt = query->utilityStmt;
!             }
!             else
              {
!                 if (needSnapshot && mySnapshot == NULL)
!                 {
!                     mySnapshot = CopySnapshot(GetTransactionSnapshot());
!                     ActiveSnapshot = mySnapshot;
!                 }
!                 stmt = (Node *) pg_plan_query(query, cursorOptions,
!                                               boundParams);
              }
!
!             stmt_list = lappend(stmt_list, stmt);
          }

!         if (mySnapshot)
!             FreeSnapshot(mySnapshot);
!     }
!     PG_CATCH();
!     {
!         ActiveSnapshot = saveActiveSnapshot;
!         PG_RE_THROW();
      }
+     PG_END_TRY();
+     ActiveSnapshot = saveActiveSnapshot;

      return stmt_list;
  }

Re: 8.3.0 backend segfaults

From
"Alex Hunsaker"
Date:
On Wed, Mar 12, 2008 at 6:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This patch should fix it ...
>
>                         regards, tom lane

Excellent, that seems to have fixed the problem.  Thanks!