Обсуждение: How to handle waitingForLock in LockWaitCancel()

Поиск
Список
Период
Сортировка

How to handle waitingForLock in LockWaitCancel()

От
Hiroshi Inoue
Дата:
Hi,

I sometimes encountered SEGV errors in my test case
when I canceled the execution.
Probably it's due to the almost simultaneous arrival
of multiple signals and the following patch seems to
fix the bug. However I'm afraid that the change should
cause another bug.

Comments ?

Regards,
Hiroshi Inoue


Index: proc.c
===================================================================
RCS file:
/home/projects/pgsql/cvsroot/pgsql/src/backend/storage/lmgr/proc.c,v
retrieving revision 1.98
diff -c -c -r1.98 proc.c
*** proc.c      2001/01/26 18:23:12     1.98
--- proc.c      2001/03/05 02:28:09
***************
*** 327,334 ****       if (!waitingForLock)               return false;

-       waitingForLock = false;
-       /* Turn off the deadlock timer, if it's still running (see
ProcSleep) */ #ifndef __BEOS__       {
--- 327,332 ----
***************
*** 345,350 ****
--- 343,349 ----
       /* Unlink myself from the wait queue, if on it (might not be
anymore!) *
/       LockLockTable();
+       waitingForLock = false;       if (MyProc->links.next != INVALID_OFFSET)
RemoveFromWaitQueue(MyProc);      UnlockLockTable();
 


Re: How to handle waitingForLock in LockWaitCancel()

От
Tom Lane
Дата:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> I sometimes encountered SEGV errors in my test case
> when I canceled the execution.

Can you provide backtraces from these SEGVs?

> Probably it's due to the almost simultaneous arrival
> of multiple signals and the following patch seems to
> fix the bug. However I'm afraid that the change should
> cause another bug.

I do not like that change at *all*.  In the first place, how could it
stop whatever is causing the SEGV?  The waitingForLock flag is not
examined anywhere else, so unless things are already broken this cannot
have any effect.  In the second place, postponing the reset of the
flag has the potential for an infinite loop, because this routine is
called during error exit.  Suppose LockLockTable causes an elog(ERROR)?

I think we need to look harder to find the cause of the SEGVs you are
seeing.
        regards, tom lane


Re: How to handle waitingForLock in LockWaitCancel()

От
Hiroshi Inoue
Дата:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > I sometimes encountered SEGV errors in my test case
> > when I canceled the execution.
> 
> Can you provide backtraces from these SEGVs?
> 

ISTM the backtrace isn't sufficient to figure out
how the error occured. As far as I see the error
was caused by other backends which terimnated
safely with the similar sequence of signals.
I have a (about 7MB) postmaster log with the
trace_locks option on. I could send it to you
if you want. Note that I applied the following
patch to debug my suspicion. In addition I added
an SegvException Handler to postgres.c.

diff -c -r1.35 proc.c
*** storage/lmgr/proc.c 2001/01/29 10:00:58     1.35
--- storage/lmgr/proc.c 2001/03/05 07:56:56
***************
*** 327,332 ****
--- 327,334 ----       if (!waitingForLock)               return false; 
+       if (MyProc->links.next != INVALID_OFFSET)
+               fprintf(stderr, "LockWaitCancel pid=%d must
RemoveFromWaitQueue\n", MyProc->pid);       waitingForLock = false;        /* Turn off the deadlock timer, if it's
stillrunning (see
 
ProcSleep) */
And the backtrace is as follows.

#0  0x40130db7 in __libc_pause ()
#1  0x811981c in SegvExceptionHandler (postgres_signal_arg=11)   at postgres.c:966
#2  <signal handler called>
#3  0x8482 in ?? ()
#4  0x811547e in ProcLockWakeup (lockMethodTable=0x8204260,
lock=0x409d6bdc)   at proc.c:771
#5  0x8114500 in LockReleaseAll (lockmethod=1, proc=0x409d9d20,    allxids=1 '\001', xid=0) at lock.c:1438
#6  0x81150b4 in ProcKill () at proc.c:446
#7  0x810df3b in shmem_exit (code=0) at ipc.c:187
#8  0x810dead in proc_exit (code=0) at ipc.c:146
#9  0x815891e in elog (lev=1, fmt=0x81aee71 "The system is shutting
down")   at elog.c:465
#10 0x8119927 in ProcessInterrupts () at postgres.c:1039
#11 0x810c526 in s_lock (lock=0x40195016 "\001", file=0x81ac4ea
"spin.c",    line=156) at s_lock.c:140
#12 0x810fec7 in SpinAcquire (lockid=6) at spin.c:156
#13 0x8114f74 in LockWaitCancel () at proc.c:349
#14 0x81197d8 in die (postgres_signal_arg=15) at postgres.c:947
#15 <signal handler called>
#16 0x8482 in ?? ()
#17 0x400ffcd4 in _IO_old_file_xsputn (f=0x81ce3a8, data=0xbfffc288,
n=49)   at oldfileops.c:294
#18 0x400f1acf in buffered_vfprintf (s=0x81ce3a8,    format=0x81adba0 "LockWaitCancel pid=%d must
RemoveFromWaitQueue\n",   args=0xbfffe8c8) at vfprintf.c:1767
 
#19 0x400ed5cc in _IO_vfprintf (s=0x81ce3a8,    format=0x81adba0 "LockWaitCancel pid=%d must RemoveFromWaitQueue\n",
ap=0xbfffe8c8)at vfprintf.c:1029
 
#20 0x400f4f03 in fprintf (stream=0x81ce3a8,    format=0x81adba0 "LockWaitCancel pid=%d must RemoveFromWaitQueue\n")
atfprintf.c:32
 
#21 0x8114f25 in LockWaitCancel () at proc.c:331
#22 0x811988c in QueryCancelHandler (postgres_signal_arg=2) at
postgres.c:993
#23 <signal handler called>
#24 0x8482 in ?? ()
#25 0x810e22b in IpcSemaphoreLock (semId=32768, sem=8, interruptOK=1
'\001')   at ipc.c:423
#26 0x811530f in ProcSleep (lockMethodTable=0x8204260, lockmode=4,    lock=0x409d726c, holder=0x409d7e08) at
proc.c:669
#27 0x8112eb3 in WaitOnLock (lockmethod=1, lockmode=4, lock=0x409d726c,    holder=0x409d7e08) at lock.c:995
#28 0x81124d8 in LockAcquire (lockmethod=1, locktag=0xbfffeb5c,
xid=373273,    lockmode=4) at lock.c:779
#29 0x8111372 in XactLockTableWait (xid=373282) at lmgr.c:309
#30 0x80cf5ef in EvalPlanQual (estate=0x8260a0c, rti=1, tid=0xbfffebe0)   at execMain.c:1751
#31 0x80ceebd in ExecReplace (slot=0x8260bec, tupleid=0xbfffec3c,    estate=0x8260a0c) at execMain.c:1449
#32 0x80ceb2e in ExecutePlan (estate=0x8260a0c, plan=0x8260964,    operation=CMD_UPDATE, numberTuples=0,
direction=ForwardScanDirection,    destfunc=0x82611c0) at execMain.c:1119
#33 0x80cdf73 in ExecutorRun (queryDesc=0x82609f0, estate=0x8260a0c,    feature=3, count=0) at execMain.c:233
#34 0x811ac2b in ProcessQuery (parsetree=0x825379c, plan=0x8260964,    dest=Remote) at pquery.c:297
#35 0x811964b in pg_exec_query_string (   query_string=0x8253400 "update branches set bbalance = bbalance +
687 where bid = 1", dest=Remote, parse_context=0x8237640) at
postgres.c:810
#36 0x811a712 in PostgresMain (argc=9, argv=0xbfffee9c, real_argc=12,    real_argv=0xbffff774, username=0x82104b9
"reindex")at
 
postgres.c:1900
#37 0x80ffb03 in DoBackend (port=0x8210250) at postmaster.c:2080
#38 0x80ff6fa in BackendStartup (port=0x8210250) at postmaster.c:1863
#39 0x80fea96 in ServerLoop () at postmaster.c:963
#40 0x80fe527 in PostmasterMain (argc=12, argv=0xbffff774) at
postmaster.c:662
#41 0x80df8e9 in main (argc=12, argv=0xbffff774) at main.c:149

Regards,
Hiroshi Inoue


Re: How to handle waitingForLock in LockWaitCancel()

От
Tom Lane
Дата:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> [ backtrace snipped ]

Hmm, this is definitely not operating as intended: LockWaitCancel is
getting interrupted, because ProcessInterrupts may be called when it's
trying to acquire the lockmanager spinlock, and ProcessInterrupts will
see the ProcDiePending flag already set.  I think the correct fix (or
at least part of it) is in postgres.c's die():
       /*        * If it's safe to interrupt, and we're waiting for input or a lock,        * service the interrupt
immediately       */       if (ImmediateInterruptOK && InterruptHoldoffCount == 0 &&           CritSectionCount == 0)
   {
 
+           /* bump holdoff count to make ProcessInterrupts() a no-op */
+           /* until we are done getting ready for it */
+           InterruptHoldoffCount++;           DisableNotifyInterrupt();           /* Make sure HandleDeadLock won't
runwhile shutting down... */           LockWaitCancel();
 
+           InterruptHoldoffCount--;           ProcessInterrupts();       }

QueryCancelHandler probably needs similar additions.

I suspect you will find that these crashes occur during the window just
after the semop() call in IpcSemaphoreLock() --- see the comment
beginning at line 399 of ipc.c.  You could probably make the crash
easier to reproduce by inserting a delay there, if you want to test
more.
        regards, tom lane


Re: How to handle waitingForLock in LockWaitCancel()

От
Hiroshi Inoue
Дата:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > [ backtrace snipped ]
> 
> Hmm, this is definitely not operating as intended: LockWaitCancel is
> getting interrupted, because ProcessInterrupts may be called when it's
> trying to acquire the lockmanager spinlock, and ProcessInterrupts will
> see the ProcDiePending flag already set.  I think the correct fix (or
> at least part of it) is in postgres.c's die():
> 
>         /*
>          * If it's safe to interrupt, and we're waiting for input or a lock,
>          * service the interrupt immediately
>          */
>         if (ImmediateInterruptOK && InterruptHoldoffCount == 0 &&
>             CritSectionCount == 0)
>         {
> +           /* bump holdoff count to make ProcessInterrupts() a no-op */
> +           /* until we are done getting ready for it */
> +           InterruptHoldoffCount++;
>             DisableNotifyInterrupt();
>             /* Make sure HandleDeadLock won't run while shutting down... */
>             LockWaitCancel();
> +           InterruptHoldoffCount--;
>             ProcessInterrupts();
>         }
> 
> QueryCancelHandler probably needs similar additions.
> 

Agreed. Adding similar code to QueryCancelHandler seems
sufficient.

> I suspect you will find that these crashes occur during the window just
> after

Sorry what does 'just after' mean ?
Isn't it during the semop() ?

> the semop() call in IpcSemaphoreLock() --- see the comment

Regards,
Hiroshi Inoue


Re: How to handle waitingForLock in LockWaitCancel()

От
Tom Lane
Дата:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>> I suspect you will find that these crashes occur during the window just
>> after

> Sorry what does 'just after' mean ?
> Isn't it during the semop() ?

>> the semop() call in IpcSemaphoreLock() --- see the comment

If an interrupt during the semop led to a crash, it would be easy to
reproduce.  I suspect that the crash condition arises only when the
interrupt occurs in a narrow time window ... such as the few
instructions just before or just after the semop call.  It's just a
hunch though.
        regards, tom lane


Re: How to handle waitingForLock in LockWaitCancel()

От
Hiroshi Inoue
Дата:
I Inoue wrote:
> 
> Tom Lane wrote:
> >
> > Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > > [ backtrace snipped ]
> >
> > Hmm, this is definitely not operating as intended: LockWaitCancel is
> > getting interrupted, because ProcessInterrupts may be called when it's
> > trying to acquire the lockmanager spinlock, and ProcessInterrupts will
> > see the ProcDiePending flag already set.  I think the correct fix (or
> > at least part of it) is in postgres.c's die():
> >
> >         /*
> >          * If it's safe to interrupt, and we're waiting for input or a lock,
> >          * service the interrupt immediately
> >          */
> >         if (ImmediateInterruptOK && InterruptHoldoffCount == 0 &&
> >             CritSectionCount == 0)
> >         {
> > +           /* bump holdoff count to make ProcessInterrupts() a no-op */
> > +           /* until we are done getting ready for it */
> > +           InterruptHoldoffCount++;
> >             DisableNotifyInterrupt();
> >             /* Make sure HandleDeadLock won't run while shutting down... */
> >             LockWaitCancel();
> > +           InterruptHoldoffCount--;
> >             ProcessInterrupts();
> >         }
> >
> > QueryCancelHandler probably needs similar additions.
> >
> 
> Agreed. Adding similar code to QueryCancelHandler seems
> sufficient.
> 

Is it OK to commit the change before 7.1 release ?
I want to do it before forgetting this issue.
(I've completely forgotten the CheckPoint hang problem
I reported once until I see your report today).

Regards,
Hiroshi Inoue


Re: How to handle waitingForLock in LockWaitCancel()

От
Tom Lane
Дата:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Is it OK to commit the change before 7.1 release ?
> I want to do it before forgetting this issue.

If that fixes the problem for you, then commit it.  I was waiting to
hear back whether you still saw a crash or not...
        regards, tom lane


Re: How to handle waitingForLock in LockWaitCancel()

От
Hiroshi Inoue
Дата:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> > Is it OK to commit the change before 7.1 release ?
> > I want to do it before forgetting this issue.
> 
> If that fixes the problem for you, then commit it.  I was waiting to
> hear back whether you still saw a crash or not...
> 

I see no crash in my test case.
I would commit the change.

regards,
Hiroshi Inoue