Re: [HACKERS] SIGSEGV in BRIN autosummarize

Поиск
Список
Период
Сортировка
От Justin Pryzby
Тема Re: [HACKERS] SIGSEGV in BRIN autosummarize
Дата
Msg-id 20171017063311.GB13880@telsasoft.com
обсуждение исходный текст
Ответ на Re: [HACKERS] SIGSEGV in BRIN autosummarize  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: [HACKERS] SIGSEGV in BRIN autosummarize  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Список pgsql-hackers
On Sun, Oct 15, 2017 at 02:44:58PM +0200, Tomas Vondra wrote:
> Thanks, but I'm not sure that'll help, at this point. We already know
> what happened (corrupted memory), we don't know "how". And core files
> are mostly just "snapshots" so are not very useful in answering that :-(

Is there anything I should be saving for these or hints how else to debug?  I
just got a segfault while looking for logs from an segfault from 2hr
earlier...but it was a duplicate from the first stacktrace...

< 2017-10-16 23:21:23.172 -04  >LOG:  server process (PID 31543) was terminated by signal 11: Segmentation fault
< 2017-10-16 23:21:23.172 -04  >DETAIL:  Failed process was running: autovacuum: BRIN summarize public.gtt 162175

Oct 16 23:21:22 database kernel: postmaster[31543] general protection ip:4bd467 sp:7ffe08a94890 error:0 in
postgres[400000+692000]
...
Oct 17 01:58:36 database kernel: postmaster[8646]: segfault at 8 ip 000000000084a177 sp 00007ffe08a94a88 error 4 in
postgres[400000+692000]
Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 (/usr/pgsql-10/bin/postgres) to
/var/spool/abrt/ccpp-2017-10-17-01:58:38-8646(7692288 bytes)
 
Oct 17 01:58:38 database abrtd: Directory 'ccpp-2017-10-17-01:58:38-8646' creation detected

Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 (/usr/pgsql-10/bin/postgres) to
/var/spool/abrt/ccpp-2017-10-17-01:58:38-8646(7692288 bytes)
 
Oct 17 01:58:38 database abrtd: Directory 'ccpp-2017-10-17-01:58:38-8646' creation detected
Oct 17 01:59:18 database abrtd: Generating core_backtrace
Oct 17 01:59:18 database abrtd: Duplicate: core backtrace
Oct 17 01:59:18 database abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2017-10-14-18:05:35-26500
Oct 17 01:59:18 database abrtd: Deleting problem directory ccpp-2017-10-17-01:58:38-8646 (dup of
ccpp-2017-10-14-18:05:35-26500)

postgres=# SELECT * FROM postgres_log_2017_10_16_2300 WHERE pid=31543;
-[ RECORD 1
]----------+-------------------------------------------------------------------------------------------------
log_time               | 2017-10-16 23:21:22.021-04
pid                    | 31543
session_id             | 59e57677.7b37
session_line           | 2
session_start_time     | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54376
transaction_id         | 0
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | automatic analyze of table "gtt.public.cdrs_eric_egsnpdprecord_2017_10_15"
-[ RECORD 2
]----------+-------------------------------------------------------------------------------------------------
log_time               | 2017-10-16 23:21:22.021-04
pid                    | 31543
session_id             | 59e57677.7b37
session_line           | 3
session_start_time     | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54377
transaction_id         | 0
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | processing work entry for relation
"gtt.public.cdrs_eric_msc_voice_2017_10_16_startofcharge_idx"
-[ RECORD 3
]----------+-------------------------------------------------------------------------------------------------
log_time               | 2017-10-16 23:21:21.003-04
pid                    | 31543
session_id             | 59e57677.7b37
session_line           | 1
session_start_time     | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54374
transaction_id         | 0
error_severity         | ERROR
sql_state_code         | 57014
message                | canceling autovacuum task
context                | automatic analyze of table "gtt.public.cdrs_eric_ggsnpdprecord_2017_10_15"

Note, the most recent LOCK TABLE I could find beofre 23:21:22 was actually
~10min earlier at 2017-10-16 23:12:16.519-04.

Core was generated by `postgres: autovacuum worker process   gtt             '.
Program terminated with signal 11, Segmentation fault.
#0  index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
178             LockRelId       relid = relation->rd_lockInfo.lockRelId;
Missing separate debuginfos, use: debuginfo-install audit-libs-2.4.5-6.el6.x86_64
cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
libcom_err-1.41.12-23.el6.x86_64libgcc-4.4.7-18.el6.x86_64 libicu-4.2.1-14.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
libstdc++-4.4.7-18.el6.x86_64nspr-4.13.1-1.el6.x86_64 nss-3.28.4-1.el6_9.x86_64
nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64nss-util-3.28.4-1.el6_9.x86_64 openldap-2.4.40-16.el6.x86_64
openssl-1.0.1e-57.el6.x86_64pam-1.1.1-24.el6.x86_64 zlib-1.2.3-29.el6.x86_64
 

(gdb) p relation
$1 = (struct RelationData *) 0x324647603246466
(gdb) p relation->rd_lockInfo
Cannot access memory at address 0x3246476032464aa

(gdb) bt
#0  index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
#1  0x00000000004bc3c2 in systable_endscan (sysscan=0x145b3c8) at genam.c:504
#2  0x000000000080f266 in SearchCatCache (cache=0x1427168, v1=<value optimized out>, v2=<value optimized out>,
v3=<valueoptimized out>, v4=<value optimized out>) at catcache.c:1190
 
#3  0x000000000081cd7b in get_rel_name (relid=<value optimized out>) at lsyscache.c:1730
#4  0x00000000006a5299 in perform_work_item (workitem=0x7fd7c7f94860) at autovacuum.c:2609
#5  0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533
#6  0x00000000006a7dc8 in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at
autovacuum.c:1694
#7  0x00000000006a7e96 in StartAutoVacWorker () at autovacuum.c:1499
#8  0x00000000006b595a in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:5414
#9  sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:5111
#10 <signal handler called>
#11 0x0000003bd74e1603 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
#12 0x00000000006b6c96 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1717
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1361
#14 0x0000000000638450 in main (argc=3, argv=0x13cd3a0) at main.c:228

#0  index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178       relid = {relId = 21345368, dbId = 0}
#1  0x00000000004bc3c2 in systable_endscan (sysscan=0x145b3c8) at genam.c:504
No locals.
#2  0x000000000080f266 in SearchCatCache (cache=0x1427168, v1=<value optimized out>, v2=<value optimized out>,
v3=<valueoptimized out>, v4=<value optimized out>) at catcache.c:1190       cur_skey = {{sk_flags = 0, sk_attno = 1,
sk_strategy= 3, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0x7a8d50 <oideq>, fn_oid = 184, fn_nargs = 2,
fn_strict= 1 '\001',             fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x13cf3d8,
fn_expr= 0x0}, sk_argument = 618293329}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0,
sk_collation= 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats
=0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0},         {sk_flags = 0, sk_attno = 0,
sk_strategy= 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0
'\000',fn_retset = 0 '\000', fn_stats = 0 '\000',             fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0},
sk_argument= 0}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr =
0,fn_oid = 0, fn_nargs = 0,             fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra =
0x0,fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}}       hashValue = 884019       hashIndex = 51       iter = {cur =
<valueoptimized out>, end = 0x14276a8}       bucket = 0x14276a8       ct = <value optimized out>       relation =
0x7fd6b8788e58      scandesc = 0x145b3c8       ntp = <value optimized out>
 
#3  0x000000000081cd7b in get_rel_name (relid=<value optimized out>) at lsyscache.c:1730       tp = <value optimized
out>
#4  0x00000000006a5299 in perform_work_item (workitem=0x7fd7c7f94860) at autovacuum.c:2609       cur_datname = 0x0
cur_nspname = 0x0       cur_relname = 0x0       __func__ = "perform_work_item"
 
#5  0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533       workitem = 0x7fd7c7f94860       classRel =
0x7fd6b8788e58      tuple = <value optimized out>       relScan = <value optimized out>       dbForm = <value optimized
out>      table_oids = <value optimized out>       orphan_oids = 0x0       ctl = {num_partitions = 0, ssize = 0, dsize
=0, max_dsize = 0, ffactor = 0, keysize = 4, entrysize = 80, hash = 0, match = 0, keycopy = 0, alloc = 0, hcxt = 0x0,
hctl= 0x0}       table_toast_map = 0x14d5228       cell = 0x0       shared = 0x1499eb8       dbentry = 0x149a140
bstrategy= 0x156ef38       key = {sk_flags = 0, sk_attno = 16, sk_strategy = 3, sk_subtype = 0, sk_collation = 100,
sk_func= {fn_addr = 0x750430 <chareq>, fn_oid = 61, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000',
fn_stats= 2 '\002', fn_extra = 0x0, fn_mcxt = 0x1400838, fn_expr = 0x0}, sk_argument = 116}       pg_class_desc =
0x14c3748      effective_multixact_freeze_max_age = 400000000       did_vacuum = <value optimized out>
found_concurrent_worker= <value optimized out>       i = <value optimized out>       __func__ = "do_autovacuum"
 
#6  0x00000000006a7dc8 in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at
autovacuum.c:1694      dbname =
"gtt\000\000\000\000\000\325I\001\327;\000\000\000\067{\000\000\000\000\000\000E\352\200\327;\000\000\000\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\374\024\316",
'\000'<repeats 12 times>       local_sigjmp_buf = {{__jmpbuf = {0, -6625136415178345871, 20928464, 1508210280, 31539,
1508209622,6625596434918027889, -6625136601089035663}, __mask_was_saved = 1, __saved_mask = {__val = {
18446744066192964103,13184024, 1, 13182664, 0, 31539, 1508209622, 0, 257010229664, 1, 0, 140729043734800, 257014425560,
4,0, 4}}}}       dbid = 16400       __func__ = "AutoVacWorkerMain"
 
#7  0x00000000006a7e96 in StartAutoVacWorker () at autovacuum.c:1499       worker_pid = 0       __func__ =
"StartAutoVacWorker"
#8  0x00000000006b595a in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:5414
bn= 0x13f57d0
 
#9  sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:5111       save_errno = 4
__func__= "sigusr1_handler"
 
#10 <signal handler called>
No symbol table info available.
#11 0x0000003bd74e1603 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
---Type <return> to continue, or q <return> to quit---
No locals.
#12 0x00000000006b6c96 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1717
timeout = {tv_sec = 59, tv_usec = 981940}       rmask = {fds_bits = {120, 0 <repeats 15 times>}}       selres = <value
optimizedout>       now = <value optimized out>       readmask = {fds_bits = {120, 0 <repeats 15 times>}}
nSockets= 7       last_lockfile_recheck_time = 1508210280       last_touch_time = 1508209622
 
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1361       opt = <value
optimizedout>       status = <value optimized out>       userDoption = <value optimized out>       listen_addr_saved =
<valueoptimized out>       i = <value optimized out>       output_config_variable = <value optimized out>
__func__= "PostmasterMain"
 
#14 0x0000000000638450 in main (argc=3, argv=0x13cd3a0) at main.c:228


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Rafia Sabih
Дата:
Сообщение: Re: [HACKERS] [POC] Faster processing at Gather node
Следующее
От: Masahiko Sawada
Дата:
Сообщение: Re: [HACKERS] [PATCH] pageinspect function to decode infomasks