Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.
Дата
Msg-id 5245EA75.9090403@vmware.com
обсуждение исходный текст
Ответ на Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.  (Merlin Moncure <mmoncure@gmail.com>)
Ответы Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.  (Peter Geoghegan <pg@heroku.com>)
Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.  (Andres Freund <andres@2ndquadrant.com>)
Список pgsql-hackers
On 27.09.2013 22:00, Merlin Moncure wrote:
> On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure<mmoncure@gmail.com>  wrote:
>> On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure<mmoncure@gmail.com>  wrote:
>>> On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund<andres@2ndquadrant.com>  wrote:
>>>> On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:
>>>>> On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund<andres@2ndquadrant.com>  wrote:
>>>>>> On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:
>>>>>>> + bool
>>>>>>> + RecoveryMightBeInProgress(void)
>>>>>>> + {
>>>>>>> +     /*
>>>>>>> +      * We check shared state each time only until we leave recovery mode. We
>>>>>>> +      * can't re-enter recovery, so there's no need to keep checking after the
>>>>>>> +      * shared variable has once been seen false.
>>>>>>> +      */
>>>>>>> +     if (!LocalRecoveryInProgress)
>>>>>>> +             return false;
>>>>>>> +     else
>>>>>>> +     {
>>>>>>> +             /* use volatile pointer to prevent code rearrangement */
>>>>>>> +             volatile XLogCtlData *xlogctl = XLogCtl;
>>>>>>> +
>>>>>>> +             /* Intentionally query xlogctl without spinlocking! */
>>>>>>> +             LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
>>>>>>> +
>>>>>>> +             return LocalRecoveryInProgress;
>>>>>>> +     }
>>>>>>> + }
>>>>>>
>>>>>> I don't think it's acceptable to *set* LocalRecoveryInProgress
>>>>>> here. That should only be done in the normal routine.
>>>>>
>>>>> quite right -- that was a major error -- you could bypass the
>>>>> initialization call to the xlog with some bad luck.
>>>>
>>>> I've seen this in profiles since, so I'd appreciate pushing this
>>>> forward.
>>>
>>> roger that -- will push ahead when I get into the office...
>>
>> attached is new version fixing some comment typos.
>
> Attached is simplified patch that replaces the spinlock with a read
> barrier based on a suggestion made by Andres offlist.  The approach
> has different performance characteristics -- a barrier call is being
> issued instead of a non-blocking read.   I don't have a performance
> test case in hand to prove that's better so I'm going with Andre's
> approach because it's simpler.

Can you think of a concrete example of what might go wrong if we simply 
removed the spinlock, and did an unprotected read through the volatile 
pointer? Is there some particular call sequence that might get optimized 
in an unfortunate way? I'm not suggesting that we do that - better safe 
than sorry even if we can't think of any particular scenario - but it 
would help me understand this.

I don't think a read-barrier is enough here. See README.barrier:

> When a memory barrier is being used to separate two
> loads, use pg_read_barrier(); when it is separating two stores, use
> pg_write_barrier(); when it is a separating a load and a store (in either
> order), use pg_memory_barrier().

The function RecoveryInProgress() function does just one load, to read 
the variable, and wouldn't even need a barrier by itself. The other load 
or store that needs to be protected by the barrier happens in the 
caller, before or after the function, and we can't say for sure if it's 
a load or a store. So, let's use pg_memory_barrier().

>  Aside: can this truly the only caller
> of pg_read_barrier()?

Yep. I only added the first caller of the barriers altogether in the 
xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, 
but it wasn't used until now, in 9.4.

- Heikki



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Heikki Linnakangas
Дата:
Сообщение: Re: Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.