Thread: Possible problem with shm_mq spin lock

Possible problem with shm_mq spin lock

From
Haribabu Kommi
Date:
Hi Hackers,

I think there may be a problem with the spin lock in the shm_mq structure.
This lock is used to protect the shm_mq structure.

If the process receives a SIGQUIT signal while it is executing any code
under the spin lock, the result is a deadlock.

SIGQUIT->proc_exit->shm_mq_detach->try to acquire spin lock. The spin
lock is already held by the same process.

The problem is very difficult to reproduce because the code under
the lock is minimal.
Please let me know if I missed anything.

Regards,
Hari Babu
Fujitsu Australia
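
The scenario described above can be sketched with a toy test-and-set lock. This is only an illustration under stated assumptions: toy_lock, toy_try_acquire, cleanup_gets_lock and cleanup_succeeds are hypothetical names, not PostgreSQL code, and the "signal handler" is simulated by an ordinary function call. A plain spin lock has no notion of an owner, so cleanup code that runs in the middle of the critical section finds the lock taken and, with a real spin loop, would wait forever. The sketch makes a single attempt instead of spinning so that it terminates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy test-and-set lock; a hypothetical stand-in, not PostgreSQL's slock_t. */
static atomic_flag toy_lock = ATOMIC_FLAG_INIT;

/* One acquisition attempt: true if we got the lock, false if it was held. */
static bool toy_try_acquire(void)
{
    return !atomic_flag_test_and_set(&toy_lock);
}

static void toy_release(void)
{
    atomic_flag_clear(&toy_lock);
}

/*
 * Stand-in for cleanup code that needs the lock (shm_mq_detach in the thread).
 * A real spin lock would loop here; one attempt is enough to show the outcome.
 */
static bool cleanup_gets_lock(void)
{
    if (!toy_try_acquire())
        return false;
    toy_release();
    return true;
}

/*
 * Run the cleanup either after the critical section has finished or, as a
 * signal handler would, in the middle of it.  Returns whether the cleanup
 * managed to take the lock.
 */
static bool cleanup_succeeds(bool interrupt_critical_section)
{
    bool got = toy_try_acquire();   /* enter the critical section */
    bool ok;

    assert(got);
    if (interrupt_critical_section)
    {
        ok = cleanup_gets_lock();   /* "handler" runs while we hold the lock */
        toy_release();
    }
    else
    {
        toy_release();              /* leave the critical section first */
        ok = cleanup_gets_lock();
    }
    return ok;
}
```

Calling cleanup_succeeds(false) models the normal path, where the cleanup takes the lock without trouble; cleanup_succeeds(true) models the interrupted path, where the single attempt fails because the same process still holds the lock.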



Re: Possible problem with shm_mq spin lock

From
Andres Freund
Date:
Hi,

On 2014-10-26 08:52:42 +1100, Haribabu Kommi wrote:
> I think there may be a problem with the spin lock in the shm_mq structure.
> This lock is used to protect the shm_mq structure.
> 
> If the process receives a SIGQUIT signal while it is executing any code
> under the spin lock, the result is a deadlock.
> 
> SIGQUIT->proc_exit->shm_mq_detach->try to acquire spin lock. The spin
> lock is already held by the same process.
> 
> The problem is very difficult to reproduce because the code under
> the lock is minimal.
> Please let me know if I missed anything.

I think you missed the following bit in postgres.c:

/*
 * quickdie() occurs when signalled SIGQUIT by the postmaster.
 *
 * Some backend has bought the farm,
 * so we need to stop what we're doing and exit.
 */
void
quickdie(SIGNAL_ARGS)
{
...
        /*
         * We DO NOT want to run proc_exit() callbacks -- we're here because
         * shared memory may be corrupted, so we don't want to try to clean up our
         * transaction.  Just nail the windows shut and get out of town.  Now that
         * there's an atexit callback to prevent third-party code from breaking
         * things by calling exit() directly, we have to reset the callbacks
         * explicitly to make this work as intended.
         */
        on_exit_reset();

..

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
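
The idea in the quoted comment, clearing the registered exit callbacks before exiting so that none of them run, can be modelled in a few lines. This is a minimal sketch with hypothetical names (register_exit_callback, reset_exit_callbacks, run_exit_callbacks, simulate_exit); it does not reproduce PostgreSQL's ipc.c, and the emergency path is simulated by a flag rather than by a real SIGQUIT handler.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLBACKS 8

typedef void (*exit_callback)(void);

/* Hypothetical callback registry, loosely modelled on an on-exit list. */
static exit_callback callbacks[MAX_CALLBACKS];
static size_t n_callbacks = 0;
static int cleanup_runs = 0;

static void register_exit_callback(exit_callback cb)
{
    assert(n_callbacks < MAX_CALLBACKS);
    callbacks[n_callbacks++] = cb;
}

/* Forget every registered callback without running any of them. */
static void reset_exit_callbacks(void)
{
    n_callbacks = 0;
}

/* Run the remaining callbacks in reverse order of registration. */
static void run_exit_callbacks(void)
{
    while (n_callbacks > 0)
        callbacks[--n_callbacks]();
}

/* Stand-in for cleanup that touches shared state. */
static void detach_queue(void)
{
    cleanup_runs++;
}

/*
 * Simulate process exit.  On the emergency path the list is reset first,
 * so the cleanup never runs.  Returns how many times the cleanup ran.
 */
static int simulate_exit(int emergency)
{
    cleanup_runs = 0;
    n_callbacks = 0;
    register_exit_callback(detach_queue);
    if (emergency)
        reset_exit_callbacks();
    run_exit_callbacks();
    return cleanup_runs;
}
```

In this model simulate_exit(0) runs the cleanup once, while simulate_exit(1) runs nothing, which is the behaviour the comment asks for when shared memory may be corrupted.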



Re: Possible problem with shm_mq spin lock

From
Haribabu Kommi
Date:
On Sun, Oct 26, 2014 at 10:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On 2014-10-26 08:52:42 +1100, Haribabu Kommi wrote:
>> I think there may be a problem with the spin lock in the shm_mq structure.
>> This lock is used to protect the shm_mq structure.
>>
>> If the process receives a SIGQUIT signal while it is executing any code
>> under the spin lock, the result is a deadlock.
>>
>> SIGQUIT->proc_exit->shm_mq_detach->try to acquire spin lock. The spin
>> lock is already held by the same process.
>>
>> The problem is very difficult to reproduce because the code under
>> the lock is minimal.
>> Please let me know if I missed anything.
>
> I think you missed the following bit in postgres.c:
>
> /*
>  * quickdie() occurs when signalled SIGQUIT by the postmaster.
>  *
>  * Some backend has bought the farm,
>  * so we need to stop what we're doing and exit.
>  */
> void
> quickdie(SIGNAL_ARGS)
> {
> ...
>         /*
>          * We DO NOT want to run proc_exit() callbacks -- we're here because
>          * shared memory may be corrupted, so we don't want to try to clean up our
>          * transaction.  Just nail the windows shut and get out of town.  Now that
>          * there's an atexit callback to prevent third-party code from breaking
>          * things by calling exit() directly, we have to reset the callbacks
>          * explicitly to make this work as intended.
>          */
>         on_exit_reset();

Thanks for the details. Sorry, it is not proc_exit; it is the exit
callback functions that can cause the problem.

The following call stack shows where the problem can happen if the signal
handler is invoked after the worker has taken the spin lock.

Breakpoint 1, 0x000000000072dd83 in shm_mq_detach ()
(gdb) bt
#0  0x000000000072dd83 in shm_mq_detach ()
#1  0x000000000072e7db in shm_mq_detach_callback ()
#2  0x0000000000726d71 in dsm_detach ()
#3  0x0000000000726c43 in dsm_backend_shutdown ()
#4  0x0000000000727450 in shmem_exit ()
#5  0x00000000007272fc in proc_exit_prepare ()
#6  0x0000000000727501 in atexit_callback ()
#7  0x00000030ff435da2 in exit () from /lib64/libc.so.6
#8  0x00000000006ddaec in bgworker_quickdie ()
#9  <signal handler called>
#10 0x000000000072ce9a in shm_mq_sendv ()


Regards,
Hari Babu
Fujitsu Australia



Re: Possible problem with shm_mq spin lock

From
Tom Lane
Date:
Haribabu Kommi <kommi.haribabu@gmail.com> writes:
> Thanks for the details. Sorry, it is not proc_exit; it is the exit
> callback functions that can cause the problem.

> The following call stack shows where the problem can happen if the signal
> handler is invoked after the worker has taken the spin lock.

> Breakpoint 1, 0x000000000072dd83 in shm_mq_detach ()
> (gdb) bt
> #0  0x000000000072dd83 in shm_mq_detach ()
> #1  0x000000000072e7db in shm_mq_detach_callback ()
> #2  0x0000000000726d71 in dsm_detach ()
> #3  0x0000000000726c43 in dsm_backend_shutdown ()
> #4  0x0000000000727450 in shmem_exit ()
> #5  0x00000000007272fc in proc_exit_prepare ()
> #6  0x0000000000727501 in atexit_callback ()
> #7  0x00000030ff435da2 in exit () from /lib64/libc.so.6
> #8  0x00000000006ddaec in bgworker_quickdie ()

Or in other words, Robert broke it.  This control path should absolutely
not occur: the entire point of the on_exit_reset call in quickdie() is to
prevent any callbacks from being executed when we get to shmem_exit().
DSM-related functions DO NOT get an exemption.
        regards, tom lane



Re: Possible problem with shm_mq spin lock

From
Haribabu Kommi
Date:
On Sun, Oct 26, 2014 at 12:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Haribabu Kommi <kommi.haribabu@gmail.com> writes:
>> Thanks for the details. Sorry, it is not proc_exit; it is the exit
>> callback functions that can cause the problem.
>
>> The following call stack shows where the problem can happen if the signal
>> handler is invoked after the worker has taken the spin lock.
>
>> Breakpoint 1, 0x000000000072dd83 in shm_mq_detach ()
>> (gdb) bt
>> #0  0x000000000072dd83 in shm_mq_detach ()
>> #1  0x000000000072e7db in shm_mq_detach_callback ()
>> #2  0x0000000000726d71 in dsm_detach ()
>> #3  0x0000000000726c43 in dsm_backend_shutdown ()
>> #4  0x0000000000727450 in shmem_exit ()
>> #5  0x00000000007272fc in proc_exit_prepare ()
>> #6  0x0000000000727501 in atexit_callback ()
>> #7  0x00000030ff435da2 in exit () from /lib64/libc.so.6
>> #8  0x00000000006ddaec in bgworker_quickdie ()
>
> Or in other words, Robert broke it.  This control path should absolutely
> not occur: the entire point of the on_exit_reset call in quickdie() is to
> prevent any callbacks from being executed when we get to shmem_exit().
> DSM-related functions DO NOT get an exemption.

The reset_on_dsm_detach() function is called to remove the DSM-related
callbacks.
It is my mistake and I am really sorry: the code I am using is the wrong
one. Sorry for the noise.

Regards,
Hari Babu
Fujitsu Australia
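
The point of the last two messages, that a reset has to cover the DSM detach callbacks as well as the ordinary exit callbacks, can be illustrated with two separate lists. This is a toy model under stated assumptions: toy_reset, toy_shutdown and detach_calls_after_reset are hypothetical names, and the include_detach_list flag exists only to contrast a reset that skips the second list with one that clears it.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX 4

typedef void (*toy_callback)(void);

/* Two independent registries: ordinary exit callbacks and detach callbacks. */
static toy_callback exit_list[TOY_MAX];
static toy_callback detach_list[TOY_MAX];
static int n_exit = 0;
static int n_detach = 0;
static int exit_calls = 0;
static int detach_calls = 0;

static void toy_exit_cleanup(void)  { exit_calls++; }
static void toy_detach_queue(void)  { detach_calls++; }

/* Clear the exit list, and the detach list too when asked. */
static void toy_reset(bool include_detach_list)
{
    n_exit = 0;
    if (include_detach_list)
        n_detach = 0;
}

/* Shutdown runs whatever is still registered in either list. */
static void toy_shutdown(void)
{
    while (n_detach > 0)
        detach_list[--n_detach]();
    while (n_exit > 0)
        exit_list[--n_exit]();
}

/* Register one callback in each list, reset, shut down, report detach runs. */
static int detach_calls_after_reset(bool include_detach_list)
{
    n_exit = n_detach = 0;
    exit_calls = detach_calls = 0;
    assert(n_exit < TOY_MAX && n_detach < TOY_MAX);
    exit_list[n_exit++] = toy_exit_cleanup;
    detach_list[n_detach++] = toy_detach_queue;
    toy_reset(include_detach_list);
    toy_shutdown();
    return detach_calls;
}
```

With include_detach_list set to false, the detach callback still runs once during shutdown, which mirrors the call stack reported earlier in the thread; with it set to true, neither list runs anything.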



Re: Possible problem with shm_mq spin lock

From
Robert Haas
Date:
On Sat, Oct 25, 2014 at 9:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Haribabu Kommi <kommi.haribabu@gmail.com> writes:
>> Thanks for the details. Sorry, it is not proc_exit; it is the exit
>> callback functions that can cause the problem.
>
>> The following call stack shows where the problem can happen if the signal
>> handler is invoked after the worker has taken the spin lock.
>
>> Breakpoint 1, 0x000000000072dd83 in shm_mq_detach ()
>> (gdb) bt
>> #0  0x000000000072dd83 in shm_mq_detach ()
>> #1  0x000000000072e7db in shm_mq_detach_callback ()
>> #2  0x0000000000726d71 in dsm_detach ()
>> #3  0x0000000000726c43 in dsm_backend_shutdown ()
>> #4  0x0000000000727450 in shmem_exit ()
>> #5  0x00000000007272fc in proc_exit_prepare ()
>> #6  0x0000000000727501 in atexit_callback ()
>> #7  0x00000030ff435da2 in exit () from /lib64/libc.so.6
>> #8  0x00000000006ddaec in bgworker_quickdie ()
>
> Or in other words, Robert broke it.  This control path should absolutely
> not occur: the entire point of the on_exit_reset call in quickdie() is to
> prevent any callbacks from being executed when we get to shmem_exit().
> DSM-related functions DO NOT get an exemption.

All true.  However, Robert also fixed it, in commit
cb9a0c7987466b130fbced01ab5d5481cf3a16df, when you complained about it
previously.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company