Re: [HACKERS] asynchronous execution

Поиск
Список
Период
Сортировка
От Amit Langote
Тема Re: [HACKERS] asynchronous execution
Дата
Msg-id 9058d70b-a6b0-8b3c-091a-fe77ed0df580@lab.ntt.co.jp
обсуждение исходный текст
Ответ на Re: [HACKERS] asynchronous execution  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Ответы Re: [HACKERS] asynchronous execution  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Список pgsql-hackers
Horiguchi-san,

On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

With the latest set of patches, I observe a crash due to an Assert failure:

#0  0x0000003969632625 in *__GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003969633e05 in *__GI_abort () at abort.c:92
#2  0x000000000098b22c in ExceptionalCondition (conditionName=0xb30e02
"!(added)", errorType=0xb30d77 "FailedAssertion", fileName=0xb30d50
"execAsync.c",   lineNumber=345) at assert.c:54
#3  0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345
#4  0x0000000000687ed5 in ExecAsyncEventLoop (estate=0x13c01b8,
requestor=0x13c1640, timeout=-1) at execAsync.c:186
#5  0x00000000006a5170 in ExecAppend (node=0x13c1640) at nodeAppend.c:257
#6  0x0000000000692b9b in ExecProcNode (node=0x13c1640) at execProcnode.c:411
#7  0x00000000006bf4d7 in ExecResult (node=0x13c1170) at nodeResult.c:113
#8  0x0000000000692b5c in ExecProcNode (node=0x13c1170) at execProcnode.c:399
#9  0x00000000006a596b in fetch_input_tuple (aggstate=0x13c06a0) at
nodeAgg.c:587
#10 0x00000000006a8530 in agg_fill_hash_table (aggstate=0x13c06a0) at
nodeAgg.c:2272
#11 0x00000000006a7e76 in ExecAgg (node=0x13c06a0) at nodeAgg.c:1910
#12 0x0000000000692d69 in ExecProcNode (node=0x13c06a0) at execProcnode.c:514
#13 0x00000000006c1a42 in ExecSort (node=0x13c03d0) at nodeSort.c:103
#14 0x0000000000692d3f in ExecProcNode (node=0x13c03d0) at execProcnode.c:506
#15 0x000000000068e733 in ExecutePlan (estate=0x13c01b8,
planstate=0x13c03d0, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001',   numberTuples=0, direction=ForwardScanDirection, dest=0x7fa368ee1da8)
at execMain.c:1609
#16 0x000000000068c751 in standard_ExecutorRun (queryDesc=0x135c568,
direction=ForwardScanDirection, count=0) at execMain.c:341
#17 0x000000000068c5dc in ExecutorRun (queryDesc=0x135c568,
<snip>

I was running a query whose plan looked like:

explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;                     QUERY PLAN
------------------------------------------------------Sort  Sort Key: ((ptab.tableoid)::regclass)  ->  HashAggregate
   Group Key: (ptab.tableoid)::regclass, ptab.a        ->  Result              ->  Append                    ->
ForeignScan on ptab_00001                    ->  Foreign Scan on ptab_00002                    ->  Foreign Scan on
ptab_00003                   ->  Foreign Scan on ptab_00004                    ->  Foreign Scan on ptab_00005
        ->  Foreign Scan on ptab_00006                    ->  Foreign Scan on ptab_00007                    ->  Foreign
Scanon ptab_00008                    ->  Foreign Scan on ptab_00009                    ->  Foreign Scan on ptab_00010
          <snip>
 

The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).

There is a crash in one more case, which seems related to how WaitEventSet
objects are manipulated during resource-owner-mediated cleanup of a failed
query, such as after the FDW returned an error like below:

ERROR:  relation "public.ptab_00010" does not exist
CONTEXT:  Remote SQL command: SELECT a, b FROM public.ptab_00010

The backtrace in this looks like below:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
301                    lastidx = resarr->lastidx;
(gdb)
(gdb) bt
#0  0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
#1  0x00000000009c6578 in ResourceOwnerForgetWES
(owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
#2  0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
#3  0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')   at resowner.c:566
#4  0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
'\001') at resowner.c:485
#5  0x0000000000524172 in AbortTransaction () at xact.c:2588
#6  0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
#7  0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
#8  0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
#9  0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
#10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
#11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
postmaster.c:1330
#12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228

There is a segfault when accessing the events variable, whose members seem
to be pfreed:

(gdb) f 2
#2  0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
600            ResourceOwnerForgetWES(set->resowner, set);
(gdb) p *set
$5 = { nevents = 2139062143, nevents_space = 2139062143, resowner = 0x7f7f7f7f7f7f7f7f, events = 0x7f7f7f7f7f7f7f7f,
latch= 0x7f7f7f7f7f7f7f7f, latch_pos = 2139062143, epoll_fd = 2139062143, epoll_ret_events = 0x7f7f7f7f7f7f7f7f
 
}

Thanks,
Amit





В списке pgsql-hackers по дате отправления:

Предыдущее
От: Josh Soref
Дата:
Сообщение: Re: [HACKERS] Possible spelling fixes
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)