Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

From: Andres Freund
Subject: Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Date:
Msg-id: 20171019000334.r7f6p4ypkki3x5tq@alap3.anarazel.de
In reply to: Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA  (Sokolov Yura <funny.falcon@postgrespro.ru>)
Responses: Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA  (Sokolov Yura <funny.falcon@postgrespro.ru>)
List: pgsql-hackers
Hi,

On 2017-09-08 22:35:39 +0300, Sokolov Yura wrote:
> From 722a8bed17f82738135264212dde7170b8c0f397 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny.falcon@postgrespro.ru>
> Date: Mon, 29 May 2017 09:25:41 +0000
> Subject: [PATCH 1/6] More correct use of spinlock inside LWLockWaitListLock.
> 
> SpinDelayStatus should be function-global to count iterations correctly
> and to produce more accurate delays.
> 
> Also, if the spin loop didn't actually spin, do not count it in the
> spins_per_delay correction. That wasn't necessary before because
> delayStatus was used only in contended cases.

I'm not convinced this is beneficial. It adds a bit of overhead to the
case where LW_FLAG_LOCKED can be set immediately. OTOH, it's not super
likely to make a large difference if the lock has to be taken anyway...


> +
> +/* Shortcut so callers need not declare an lwlock_stats variable themselves */
> +static void
> +add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus)
> +{
> +    lwlock_stats *lwstats;
> +
> +    lwstats = get_lwlock_stats_entry(lock);
> +    lwstats->spin_delay_count += delayStatus->delays;
> +}
>  #endif                            /* LWLOCK_STATS */

This seems like a pretty independent change.


>  /*
>   * Internal function that tries to atomically acquire the lwlock in the passed
> - * in mode.
> + * in mode. If it cannot grab the lock, it does not put the proc into the
> + * wait queue.
>   *
> - * This function will not block waiting for a lock to become free - that's the
> - * callers job.
> + * It is used only in LWLockConditionalAcquire.
>   *
> - * Returns true if the lock isn't free and we need to wait.
> + * Returns true if the lock isn't free.
>   */
>  static bool
> -LWLockAttemptLock(LWLock *lock, LWLockMode mode)
> +LWLockAttemptLockOnce(LWLock *lock, LWLockMode mode)

This has now become a fairly special-cased function; I'm not convinced
it makes much sense with the current naming and functionality.


> +/*
> + * Internal function that tries to atomically acquire the lwlock in the passed
> + * in mode, or puts itself into the wait queue with waitmode.
> + * This function will not block waiting for a lock to become free - that's the
> + * callers job.
> + *
> + * Returns true if the lock isn't free and we are in a wait queue.
> + */
> +static inline bool
> +LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)
> +{
> +    uint32        old_state;
> +    SpinDelayStatus delayStatus;
> +    bool        lock_free;
> +    uint32        mask,
> +                add;
> +
> +    AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
> +
> +    if (mode == LW_EXCLUSIVE)
> +    {
> +        mask = LW_LOCK_MASK;
> +        add = LW_VAL_EXCLUSIVE;
> +    }
> +    else
> +    {
> +        mask = LW_VAL_EXCLUSIVE;
> +        add = LW_VAL_SHARED;
> +    }
> +
> +    init_local_spin_delay(&delayStatus);

The way you moved this around has the disadvantage that we now do this -
a number of writes - even in the very common case where the lwlock can
be acquired directly.
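
To make that concrete, here is a rough sketch (my own, not part of the
patch) of how those writes could be deferred until the first contended
iteration, so the common direct-acquisition path never touches the
SpinDelayStatus; it assumes the SpinDelayStatus/init_local_spin_delay
definitions from s_lock.h:

	SpinDelayStatus delayStatus;
	bool		spin_inited = false;

	/* ... unchanged CAS loop above ... */
	else
	{
		/*
		 * Pay for the SpinDelayStatus stores only once we actually fail
		 * to make progress; uncontended acquisitions never reach this
		 * branch and so never write to delayStatus.
		 */
		if (!spin_inited)
		{
			init_local_spin_delay(&delayStatus);
			spin_inited = true;
		}
		perform_spin_delay(&delayStatus);
		old_state = pg_atomic_read_u32(&lock->state);
	}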


> +    /*
> +     * Read once outside the loop. Later iterations will get the newer value
> +     * either via compare & exchange or with read after perform_spin_delay.
> +     */
> +    old_state = pg_atomic_read_u32(&lock->state);
> +    /* loop until we've determined whether we could acquire the lock or not */
> +    while (true)
> +    {
> +        uint32        desired_state;
> +
> +        desired_state = old_state;
> +
> +        lock_free = (old_state & mask) == 0;
> +        if (lock_free)
>          {
> -            if (lock_free)
> +            desired_state += add;
> +            if (pg_atomic_compare_exchange_u32(&lock->state,
> +                                               &old_state, desired_state))
>              {
>                  /* Great! Got the lock. */
>  #ifdef LOCK_DEBUG
>                  if (mode == LW_EXCLUSIVE)
>                      lock->owner = MyProc;
>  #endif
> -                return false;
> +                break;
> +            }
> +        }
> +        else if ((old_state & LW_FLAG_LOCKED) == 0)
> +        {
> +            desired_state |= LW_FLAG_LOCKED | LW_FLAG_HAS_WAITERS;
> +            if (pg_atomic_compare_exchange_u32(&lock->state,
> +                                               &old_state, desired_state))
> +            {
> +                LWLockQueueSelfLocked(lock, waitmode);
> +                break;
>              }
> -            else
> -                return true;    /* somebody else has the lock */
> +        }
> +        else
> +        {
> +            perform_spin_delay(&delayStatus);
> +            old_state = pg_atomic_read_u32(&lock->state);
>          }
>      }
> -    pg_unreachable();
> +#ifdef LWLOCK_STATS
> +    add_lwlock_stats_spin_stat(lock, &delayStatus);
> +#endif
> +
> +    /*
> +     * We intentionally do not call finish_spin_delay here, because the loop
> +     * above usually finishes by queuing onto the wait list on contention, and
> +     * so doesn't reach spins_per_delay and doesn't sleep inside
> +     * perform_spin_delay. Also, different LWLocks have very different
> +     * contention patterns, and it is wrong to update the spinlock statistics
> +     * based on LWLock contention.
> +     */

Huh? This seems entirely unconvincing. Without adjusting this here we'll
just spin the same way on every acquisition. Especially for the case where
somebody else holds LW_FLAG_LOCKED, that's quite bad.
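
A minimal sketch of the kind of adjustment meant here (mine, not from the
patch): call finish_spin_delay() after the loop, but only when this
acquisition actually spun, which also addresses the worry about
uncontended acquisitions skewing spins_per_delay:

	/*
	 * Let the spin-delay machinery adapt spins_per_delay, but only when we
	 * actually had to spin; uncontended acquisitions leave the estimate
	 * untouched.
	 */
	if (delayStatus.spins > 0 || delayStatus.delays > 0)
		finish_spin_delay(&delayStatus);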


> From e5b13550fc48d62b0b855bedd7fcd5848b806b09 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny.falcon@postgrespro.ru>
> Date: Tue, 30 May 2017 18:54:25 +0300
> Subject: [PATCH 5/6] Fix implementation description in a lwlock.c .
> 
> ---
>  src/backend/storage/lmgr/lwlock.c | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
> index 334c2a2d96..0a41c2c4e2 100644
> --- a/src/backend/storage/lmgr/lwlock.c
> +++ b/src/backend/storage/lmgr/lwlock.c
> @@ -62,16 +62,15 @@
>   * notice that we have to wait. Unfortunately by the time we have finished
>   * queuing, the former locker very well might have already finished it's
>   * work. That's problematic because we're now stuck waiting inside the OS.
> -
> - * To mitigate those races we use a two phased attempt at locking:
> - *     Phase 1: Try to do it atomically, if we succeed, nice
> - *     Phase 2: Add ourselves to the waitqueue of the lock
> - *     Phase 3: Try to grab the lock again, if we succeed, remove ourselves from
> - *              the queue
> - *     Phase 4: Sleep till wake-up, goto Phase 1
>   *
> - * This protects us against the problem from above as nobody can release too
> - *      quick, before we're queued, since after Phase 2 we're already queued.
> + * This race is avoided by taking a lock for the wait list using a CAS with
> + * the old state value, so it only succeeds if the lock is still held. The
> + * necessary flag is set using the same CAS.
> + *
> + * Inside LWLockConflictsWithVar the wait list lock is reused to protect the
> + * variable value. First the lock is acquired to check the variable value,
> + * then flags are set with a second CAS before queuing onto the wait list, in
> + * order to ensure that the lock has not been released yet.
>   * -------------------------------------------------------------------------
>   */

I think this needs more extensive surgery.



> From cc74a849a64e331930a2285e15445d7f08b54169 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny.falcon@postgrespro.ru>
> Date: Fri, 2 Jun 2017 11:34:23 +0000
> Subject: [PATCH 6/6] Make SpinDelayStatus a bit lighter.
> 
> It saves a couple of instructions on the fast path of the spin loop, and so
> makes the fast path of LWLock a bit faster (and other places where a
> spinlock is used).

> Also it makes the call to perform_spin_delay a bit slower, which could
> positively affect spin behavior, especially if no `pause` instruction is
> present.

Whaa? That seems pretty absurd reasoning.




One big advantage of this is that we should be able to make LW_SHARED
acquisition an xadd. I'd done that previously, when the separate
spinlock still existed, and it was quite beneficial for performance on
larger systems. But it was some fiddly code - should be easier now.
Requires some careful management when noticing that an xadd acquired a
shared lock even though there's a conflicting exclusive lock, but it would
speed up the fastpath.
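
For what it's worth, a rough sketch of that xadd-based shared acquisition
(hypothetical code, mine; the function name is made up, and it assumes the
waiter and release paths tolerate the transient over-count between the add
and the rollback):

static bool
LWLockAttemptLockSharedXadd(LWLock *lock)
{
	uint32		old_state;

	/* optimistically add a shared-lock reference with a single atomic op */
	old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
	if ((old_state & LW_VAL_EXCLUSIVE) == 0)
		return true;		/* no conflicting exclusive holder: got it */

	/*
	 * Someone holds the lock exclusively; undo our optimistic reference
	 * and fall back to the ordinary queue-and-wait slow path.
	 */
	pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
	return false;
}

(The return convention is inverted relative to LWLockAttemptLock, purely
to keep the sketch readable.)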

Greetings,

Andres Freund


