8.3.0 backend segfaults

Поиск
Список
Период
Сортировка
От Alex Hunsaker
Тема 8.3.0 backend segfaults
Дата
Msg-id 34d269d40803112134l6e6b88eet3638312ffd628766@mail.gmail.com
обсуждение исходный текст
Ответы Re: 8.3.0 backend segfaults  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: 8.3.0 backend segfaults  ("Greg Sabino Mullane" <greg@turnstep.com>)
Список pgsql-bugs
(Sorry if duplicates show up this is the third time ive posted this in
the past 10 hours, Im assuming it got dropped because of the
attachments)

Problem: Apparently random segfaults apparently query agnostic, seem
  to be more frequent when a pg_dump is running

The most frequent query it segfaults with is:
  select w.worker_id, w.worker_id as printerid, w.worker, w.alias,
  coalesce(w.alias, w.worker) as name, w.active, w.last_active,
  w.last_deactive, round(extract(epoch from now()) - extract(epoch from
  w.last_deactive)) as time_off from workers as w left join worker_vis
  as wv on wv.worker_id = w.worker_id and wv.defunct = 0 and (
  ((wv.auth_id = ?) and (wv.auth_class = data_class('user_id'))) or
  ((wv.auth_id = ?) and (wv.auth_class = data_class('clinic_id'))))
  where wv.worker_vis_id is not null and w.defunct = 0 order by
  coalesce(w.alias, w.worker);

  (sample arguments, they do not seem to make a difference)
    167214, 13
    167340, 16

  But have seen other simpler queries such as (sorry no backtraces for
  these... yet)
  insert into audit_transaction (action, open_user_id, page_load_id,
  user_id) values (?, ?, ?, ?); ARGS=suds, 509057, 15148365, 167217

If I select * from workers; and select * from workers_active;
everything looks good, no segfaults.  I can also take a pg_dump if
virtually nothing is using the database.  Otherwise pg_dump dies
randomly with an errors like:

Dumping the contents of table "clients_audit" failed: PQgetCopyData() failed.
Dumping the contents of table "file_data" failed: PQgetCopyData() failed.
Dumping the contents of table "workers_audit" failed: PQgetCopyData() failed.

(note these are probably 3 of the larger tables in the database, maybe
it has something to do with statement_timeout, i do have it set quite
low (3 min) ?)

If I manually prepare and execute the above queries they work fine.
prepare worker (bigint, bigint) as select w.worker_id, w.worker_id as
  printerid, w.worker, w.alias, coalesce(w.alias, w.worker) as name,
  w.active, w.last_active, w.last_deactive, round(extract(epoch from
  now()) - extract(epoch from w.last_deactive)) as time_off from workers
  as w left join worker_vis as wv on wv.worker_id = w.worker_id and
  wv.defunct = 0 and ( ((wv.auth_id = ?) and (wv.auth_class =
  data_class('user_id'))) or ((wv.auth_id = ?) and (wv.auth_class =
  data_class('clinic_id')))) where wv.worker_vis_id is not null and
  w.defunct = 0 order by coalesce(w.alias, w.worker);
  execute worker (167214, 13);

Core dumps and binaries available if needed. I have about 6 core dumps
across 2 different servers.  Which leads me to believe its either
postgres bug or a corrupt database (the other server is a pitr slave
for the master which i switched over to see if i could reproduce the
segfault on the master)

Nothing in dmesg, nothing in mcelog.
Raid controller is a 3ware 9550SX-12 with 6 70GB WD Raptor hard drives
in a raid 10, smart tests pass fine... everything looks good.

Work load is a web application where each page beings a transaction;
creates a temp table, does a few selects, inserts and updates and the
commits.

  Postgresql 8.2.5 was working without any problems just prior to the upgrade.

  \d workers
                                           Table "public.workers"
    Column     |           Type           |
  Modifiers
  ---------------+--------------------------+-----------------------------------------------------------------
  worker_id     | bigint                   | not null default
  nextval('workers_worker_id_seq'::regclass)
  date_created  | timestamp with time zone | not null default now()
  clientid      | bigint                   | not null
  worker        | text                     | not null
  alias         | text                     |
  job_type      | smallint                 | not null
  active        | smallint                 | not null default 0
  last_active   | timestamp with time zone |
  last_deactive | timestamp with time zone |
  defunct       | smallint                 | not null default 0
  audit_class   | integer                  |
  audit_date    | timestamp with time zone | not null default now()
  audit_desc    | text                     |
  audit_id      | bigint                   |
  audit_seq     | integer                  | not null default
  nextval('audit_basic_audit_seq_seq'::regclass)
  audit_table   | character varying(64)    |
  audit_tid     | bigint                   |
  audit_type    | character varying(32)    |
  audit_orig_id | integer                  |
  path          | text                     |
  Indexes:
    "workers_pkey" PRIMARY KEY, btree (worker_id) CLUSTER
  Triggers:
    workers_audit_aud BEFORE INSERT OR DELETE OR UPDATE ON workers FOR
  EACH ROW EXECUTE PROCEDURE audit_table_go()

\d worker_vis
                                         Table "public.worker_vis"
    Column     |           Type           |
Modifiers
---------------+--------------------------+-----------------------------------------------------------------
 worker_vis_id | bigint                   | not null default
nextval('worker_vis_ruleid_seq'::regclass)
 date_created  | timestamp with time zone | not null default now()
 worker_id     | bigint                   |
 auth_id       | bigint                   |
 auth_class    | smallint                 |
 defunct       | smallint                 | not null default 0
 audit_class   | integer                  |
 audit_date    | timestamp with time zone | not null default now()
 audit_desc    | text                     |
 audit_id      | bigint                   |
 audit_seq     | integer                  | not null default
nextval('audit_basic_audit_seq_seq'::regclass)
 audit_table   | character varying(64)    |
 audit_tid     | bigint                   |
 audit_type    | character varying(32)    |
 audit_orig_id | integer                  |
Indexes:
    "worker_vis_pkey" PRIMARY KEY, btree (worker_vis_id)
    "worker_vis_vis_idx" btree (worker_id, defunct, auth_id, auth_class) CLUSTER
Triggers:
    worker_vis_audit_aud BEFORE INSERT OR DELETE OR UPDATE ON
worker_vis FOR EACH ROW EXECUTE PROCEDURE audit_table_go()

  Version Information:

  Following REL8.3_STABLE 2 versions tried both crash with a similar
  backtrace (both died in CopySnapshot):

  (currently running)
  Author: teodor <teodor>
  Date:   Sun Mar 9 10:42:48 2008 +0000

    Revert changes of CompareTSQ: it affects existing btree indexes.

  (was running but also crashed)
  Author: tgl <tgl>
  Date:   Sun Mar 2 00:10:28 2008 +0000

    Venezuela Time now means UTC-4:30, not UTC-4:00.  Adjust our treatment
    of "VET" accordingly.  Per bug #3997 from Aaron Mizrachi.

  Memory: 8G ECC
  Swap: 16GB (unused even at segfault time aka it does not seem to be
  running out of memmory)
  OS: Linux  2.6.24.3

  select version();
  PostgreSQL 8.3.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc
  (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

  built with: --disable-nls --with-openssl --without-python
  --without-tcl --without-krb5 --with-perl --enable-debug
  --enable-integer-datetimes --with-ossp-uuid

  postgresql.conf:
  listen_addresses = '*'
  port = 5432
  max_connections = 500
  superuser_reserved_connections = 3
  ssl = on
  shared_buffers = 1000MB
  work_mem = 32MB
  maintenance_work_mem = 256MB
  max_stack_depth = 7MB
  max_fsm_pages = 304800
  max_fsm_relations = 3000
  vacuum_cost_delay = 30
  bgwriter_lru_maxpages = 500
  wal_sync_method = fdatasync
  wal_buffers = 256kB
  checkpoint_segments = 8
  checkpoint_timeout = 10min
  archive_command = 'rsync -az -e "ssh" %p
  psql-sync@10.10.10.1:standby/cur/wal/%f </dev/null'
  random_page_cost = 2.6
  effective_cache_size = 6GB
  default_statistics_target = 101
  constraint_exclusion = on
  log_destination = 'syslog'
  client_min_messages = warning
  log_min_messages = warning
  silent_mode = on
  log_line_prefix = '[%m] %h:%d '
  log_lock_waits = on
  log_temp_files = 20MB
  autovacuum_naptime = 3min
  autovacuum_vacuum_threshold = 200
  autovacuum_analyze_threshold = 200
  statement_timeout = '3min'
  datestyle = 'iso, mdy'
  lc_messages = 'C'
  lc_monetary = 'C'
  lc_numeric = 'C'
  lc_time = 'C'
  backslash_quote = off
  escape_string_warning = on
  sql_inheritance = off
  standard_conforming_strings = on

  cat /proc/cpuinfo
  processor       : 0
  vendor_id       : AuthenticAMD
  cpu family      : 15
  model           : 5
  model name      : AMD Opteron(tm) Processor 246
  stepping        : 10
  cpu MHz         : 2000.000
  cache size      : 1024 KB
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 1
  wp              : yes
  flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
  mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm
  3dnowext 3dnow rep_good
  bogomips        : 3991.79
  TLB size        : 1024 4K pages
  clflush size    : 64
  cache_alignment : 64
  address sizes   : 40 bits physical, 48 bits virtual
  power management: ts fid vid ttp

  processor       : 1
  vendor_id       : AuthenticAMD
  cpu family      : 15
  model           : 5
  model name      : AMD Opteron(tm) Processor 246
  stepping        : 10
  cpu MHz         : 2000.000
  cache size      : 1024 KB
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 1
  wp              : yes
  flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
  mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm
  3dnowext 3dnow rep_good
  bogomips        : 3991.79
  TLB size        : 1024 4K pages
  clflush size    : 64
  cache_alignment : 64
  address sizes   : 40 bits physical, 48 bits virtual
  power management: ts fid vid ttp

Core was generated by `postgres: docsysweb CaduRx 192.168.134.1(49502) BIND   '.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b6ef7457f74 in memcpy () from /lib/libc.so.6
(gdb) bt
#0  0x00002b6ef7457f74 in memcpy () from /lib/libc.so.6
#1  0x000000000065fc88 in CopySnapshot (snapshot=0xa4e6e0) at tqual.c:1311
#2  0x000000000051cf34 in fmgr_sql (fcinfo=0x7fffb3f51160) at functions.c:299
#3  0x00000000005164cd in ExecMakeFunctionResult (fcache=0x10b4710,
econtext=0x10b4ba0, isNull=0x7fffb3f515af "", isDone=0x0) at
execQual.c:1351
#4  0x00000000005191ca in ExecEvalExprSwitchContext
(expression=0x2b6f2c5dc078, econtext=0x10, isNull=0x24c8500 <Address
0x24c8500 out of bounds>, isDone=0x24c84c0)
    at execQual.c:3726
#5  0x00000000005690bd in evaluate_expr (expr=<value optimized out>,
result_type=23, result_typmod=-1) at clauses.c:3273
#6  0x0000000000569d21 in simplify_function (funcid=2214548,
result_type=23, result_typmod=-1, args=0xf8db40, allow_inline=1
'\001', context=0x7fffb3f51700) at clauses.c:2887
#7  0x000000000056a53a in eval_const_expressions_mutator
(node=0xf24d90, context=0x7fffb3f51700) at clauses.c:1795
#8  0x000000000056b185 in estimate_expression_value (root=<value
optimized out>, node=0x2b6f2c5dc078) at clauses.c:1719
#9  0x00000000005f8b86 in get_restriction_variable (root=0xf22d80,
args=<value optimized out>, varRelid=0, vardata=0x7fffb3f51810,
other=0x7fffb3f51848,
    varonleft=0x7fffb3f5185f "\001\236") at selfuncs.c:3570
#10 0x00000000005fa562 in eqsel (fcinfo=<value optimized out>) at selfuncs.c:169
#11 0x000000000064540c in OidFunctionCall4 (functionId=<value
optimized out>, arg1=15871360, arg2=532, arg3=15879312, arg4=0) at
fmgr.c:1615
#12 0x000000000056d0c2 in restriction_selectivity (root=0xf22d80,
operator=532, args=0xf24c90, varRelid=0) at plancat.c:805
#13 0x000000000054e797 in clause_selectivity (root=0xf22d80,
clause=0xf24de0, varRelid=0, jointype=JOIN_INNER) at clausesel.c:639
#14 0x000000000054e0df in clauselist_selectivity (root=0xf22d80,
clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:123
#15 0x000000000054e448 in clause_selectivity (root=0xf22d80,
clause=0xf8bb20, varRelid=0, jointype=JOIN_INNER) at clausesel.c:576
#16 0x000000000054e72a in clause_selectivity (root=0xf22d80,
clause=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:597
#17 0x000000000054e0df in clauselist_selectivity (root=0xf22d80,
clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:123
#18 0x000000000054f99f in set_baserel_size_estimates
(root=0x2b6f2c5dc078, rel=0xf26748) at costsize.c:2262
#19 0x000000000054d67c in set_rel_pathlist (root=0xf22d80,
rel=0xf26748, rti=2, rte=0xec7c30) at allpaths.c:215
#20 0x000000000054df02 in make_one_rel (root=0xf22d80,
joinlist=0xf26668) at allpaths.c:150
#21 0x000000000055f710 in query_planner (root=0xf22d80, tlist=<value
optimized out>, tuple_fraction=0, limit_tuples=-1,
cheapest_path=0x7fffb3f52220,
    sorted_path=0x7fffb3f52218, num_groups=0x7fffb3f52228) at planmain.c:249
#22 0x0000000000560041 in grouping_planner (root=0xf22d80,
tuple_fraction=<value optimized out>) at planner.c:897
#23 0x00000000005610b8 in subquery_planner (glob=0xee11a0,
parse=0xee1230, level=0, tuple_fraction=0, subroot=0x7fffb3f523e8) at
planner.c:431
#24 0x00000000005614b1 in standard_planner (parse=0xee1230,
cursorOptions=0, boundParams=0x0) at planner.c:158
#25 0x00000000005a1c61 in pg_plan_query (querytree=0xee1230,
cursorOptions=0, boundParams=0x0) at postgres.c:681
#26 0x00000000005a1d13 in pg_plan_queries (querytrees=<value optimized
out>, cursorOptions=0, boundParams=0x0, needSnapshot=0 '\0') at
postgres.c:752
#27 0x0000000000633f93 in do_planning (querytrees=0xf22d50,
cursorOptions=0) at plancache.c:560
#28 0x000000000063436b in RevalidateCachedPlan (plansource=0xcd02f0,
useResOwner=0 '\0') at plancache.c:484
#29 0x00000000005a3b05 in PostgresMain (argc=4, argv=<value optimized
out>, username=0x8f3020 "docsysweb") at postgres.c:1605
#30 0x0000000000579d8f in ServerLoop () at postmaster.c:3207
#31 0x000000000057a85c in PostmasterMain (argc=3, argv=0x8ee2d0) at
postmaster.c:1029
#32 0x00000000005345ce in main (argc=3, argv=<value optimized out>) at
main.c:188
gdb) frame 28
#28 0x000000000063436b in RevalidateCachedPlan (plansource=0xcd02f0,
useResOwner=0 '\0') at plancache.c:484
484                             slist = do_planning(slist,
plansource->cursor_options);
(gdb) print plansource->query_string
$1 = 0xf08ea0 "select w.worker_id, w.worker_id as printerid, w.worker,
w.alias, coalesce(w.alias, w.worker) as name, w.active, w.last_active,
w.last_deactive, round(extract(epoch from now()) - extract(epoch from
w.l"...

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: BUG #4027: backslash escaping not disabled in plpgsql
Следующее
От: Tom Lane
Дата:
Сообщение: Re: 8.3.0 backend segfaults