Re: Reduce ProcArrayLock contention

From: Amit Kapila
Subject: Re: Reduce ProcArrayLock contention
Date:
Msg-id: CAA4eK1JVwEpE8e+qz9tbF9HgFmJtj4qqDR3Vzu3VsDPP71H0QQ@mail.gmail.com
In reply to: Re: Reduce ProcArrayLock contention  (Pavan Deolasee <pavan.deolasee@gmail.com>)
Responses: Re: Reduce ProcArrayLock contention  (Robert Haas <robertmhaas@gmail.com>)
           Re: Reduce ProcArrayLock contention  (Pavan Deolasee <pavan.deolasee@gmail.com>)
List: pgsql-hackers
On Fri, Jul 24, 2015 at 4:26 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>
>
> On Mon, Jun 29, 2015 at 8:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>
>>
>> pgbench setup
>> ------------------------
>> scale factor - 300
>> Data is on magnetic disk and WAL on ssd.
>> pgbench -M prepared tpc-b
>>
>> Head : commit 51d0fe5d
>> Patch -1 : group_xid_clearing_at_trans_end_rel_v1
>>
>>
>> Client Count/TPS      1      8     16     32     64    128
>> HEAD                814   6092  10899  19926  23636  17812
>> Patch-1            1086   6483  11093  19908  31220  28237
>>
>> The graph for the data is attached.
>>
>
> Numbers look impressive and definitely show that the idea is worth pursuing. I tried the patch on my laptop. Unfortunately, at least for 4 and 8 clients, I did not see any improvement. 
>

I can't help much with this because I think we need a somewhat
bigger machine to test the impact of the patch.

> In fact, averages over 2 runs showed a slight 2-4% decline in the tps. Having said that, there is no reason to disbelieve your numbers, and on much more powerful machines, we might see the gains.
>
> BTW, I ran the tests with: pgbench -s 10 -c 4 -T 300
>

I am not sure this result is worth investigating: in write tests
(especially short-duration ones), such fluctuations can occur,
and I think until we see complete results for multiple client
counts (1, 4, 8 .. 64 or 128, which is possible on some high-end
machine), it is difficult to draw any conclusion.
  
>
>> Points about performance data
>> ---------------------------------------------
>> 1.  Gives a good performance improvement at 64 clients or more,
>> and a somewhat moderate improvement at lower client counts.  The
>> reason is that the contention around ProcArrayLock is mainly
>> seen at higher client counts.  I have checked that at higher client
>> counts it starts behaving as if lockless (which means performance
>> with the patch is equivalent to what we get if we just comment out
>> ProcArrayLock in ProcArrayEndTransaction()).
>
>
> Well, I am not entirely sure if that's a correct way of looking at it. Sure, you would see less contention on the ProcArrayLock because the fact is that there are far fewer backends trying to acquire it.
>

I was stating that fact for the case even without my patch:
basically, I tried commenting out the ProcArrayLock acquisition
in ProcArrayEndTransaction().

> But those who don't get the lock will sleep and hence the contention is moved somewhere else, at least partially.  
>

Sure, if contention is reduced at one place, it will move
to the next lock.
  

>>
>> 4. The gains are visible when the data fits in shared_buffers as for other
>> workloads I/O starts dominating.
>
>
> That seems to be perfectly expected.
>  
>>
>> 5. I have seen that the effect of the patch is much more visible if we keep
>> autovacuum = off (doing a manual vacuum after each run) and keep
>> wal_writer_delay at a lower value (say 20ms).
>
>
> Do you know why that happens? Is it because the contention moves somewhere else with autovacuum on?
>

No; autovacuum generates I/O, due to which there is sometimes
more variation in write tests.
  
> Regarding the design itself, I have an idea that maybe we can create a general-purpose infrastructure to use this technique. 
>

I think this could be beneficial if we can come up with
some clean interface.

> If its useful here, I'm sure there are other places where this can be applied with similar effect.
>

I also think so.

> For example, how about adding an API such as LWLockDispatchWork(lock, mode, function_ptr, data_ptr)? Here the data_ptr points to somewhere in shared memory that the function_ptr can work on once the lock is available. If the lock is available in the requested mode then the function_ptr is
> executed with the given data_ptr and the function returns.
>
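For reference, my reading of the proposal is something like the sketch below. This is purely hypothetical pseudocode: LWLockDispatchWork, LWLockWorkFn, and the per-lock work queue are assumptions taken from your description, not anything that exists in lwlock.c today (only LWLockConditionalAcquire and LWLockRelease are real functions):

```c
/* Hypothetical sketch only -- these symbols do not exist in lwlock.c. */
typedef void (*LWLockWorkFn) (void *data);

void
LWLockDispatchWork(LWLock *lock, LWLockMode mode,
                   LWLockWorkFn function_ptr, void *data_ptr)
{
    if (LWLockConditionalAcquire(lock, mode))
    {
        /* Lock free in the requested mode: do the work ourselves. */
        function_ptr(data_ptr);
        LWLockRelease(lock);
    }
    else
    {
        /*
         * Lock busy: enqueue (function_ptr, data_ptr) on a per-lock
         * work queue and sleep; the backend releasing the lock would
         * drain the queue, execute each entry, and then wake the
         * sleeping processes.
         */
    }
}
```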

I can do something like that if others also agree with this new
API in the LWLock series, but personally I don't think lwlock.c is
the right place to expose an API for this work.  Broadly, the work
we are doing can be thought of as the sub-tasks below.

1. Advertise each backend's xid.
2. Push all backends except one onto a global list.
3. Wait till someone wakes us and check if the xid is cleared;
   repeat until the xid is clear.
4. Acquire the lock.
5. Pop all the backends, clear each one's xid, and use their
   published xids to advance the global latestCompletedXid.
6. Release the lock.
7. Wake all the processes waiting for their xid to be cleared,
   marking each backend's xid as clear before waking it.

So among these, only step 2 can be common across different
algorithms; the others need work specific to each optimization.

Does anyone else see a better way to provide a generic API, so
that it can be used in other places if required in the future?



> If the lock is not available then the work is dispatched to some Q (tracked on a per-lock basis?) and the process goes to sleep. Whenever the lock becomes available in the requested mode, the work is executed by some other backend and the primary process is woken up. This will most likely
> happen in the LWLockRelease() path, when the last holder is about to give up the lock so that it becomes available in the requested "mode". 
>

I am not able to follow what you want to achieve with this.
Why is 'Q' better than the current process for performing the
work specific to the whole group, and does 'Q' also wait on the
current lock? If yes, how?

I think this will overcomplicate things without any real
benefit, at least for this optimization.

>
> Regarding the patch, the compare-and-exchange function calls that you've used would work only for 64-bit machines, right? You would need to use equivalent 32-bit calls on a 32-bit machine.
>

I thought the internal atomics API would automatically take care
of it; for example, for MSVC it uses _InterlockedCompareExchange64.
If that doesn't work on 32-bit systems or is not defined, then
we would have to use the 32-bit version, but I am not certain
about that.


Note - This patch requires some updates to src/backend/access/transam/README.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
