Re: Reduce ProcArrayLock contention

From: Amit Kapila
Subject: Re: Reduce ProcArrayLock contention
Date:
Msg-id: CAA4eK1JVwEpE8e+qz9tbF9HgFmJtj4qqDR3Vzu3VsDPP71H0QQ@mail.gmail.com
In reply to: Re: Reduce ProcArrayLock contention  (Pavan Deolasee <pavan.deolasee@gmail.com>)
Responses: Re: Reduce ProcArrayLock contention  (Robert Haas <robertmhaas@gmail.com>)
           Re: Reduce ProcArrayLock contention  (Pavan Deolasee <pavan.deolasee@gmail.com>)
List: pgsql-hackers
On Fri, Jul 24, 2015 at 4:26 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>
>
> On Mon, Jun 29, 2015 at 8:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>
>>
>> pgbench setup
>> ------------------------
>> scale factor - 300
>> Data is on magnetic disk and WAL on ssd.
>> pgbench -M prepared tpc-b
>>
>> Head : commit 51d0fe5d
>> Patch -1 : group_xid_clearing_at_trans_end_rel_v1
>>
>>
>> Client Count/TPS      1      8     16     32     64    128
>> HEAD                814   6092  10899  19926  23636  17812
>> Patch-1            1086   6483  11093  19908  31220  28237
>>
>> The graph for the data is attached.
>>
>
> Numbers look impressive and definitely show that the idea is worth pursuing. I tried the patch on my laptop. Unfortunately, at least for 4 and 8 clients, I did not see any improvement. 
>

I can't help much with this because I think we need a somewhat
bigger machine to test the impact of the patch.

> In fact, averages over 2 runs showed a slight 2-4% decline in the tps. Having said that, there is no reason to disbelieve your numbers, and on much more powerful machines, we might see the gains.
>
> BTW, I ran the tests with: pgbench -s 10 -c 4 -T 300
>

I am not sure this result is worth investigating: in write tests
(especially short-duration ones), such fluctuations can occur,
and I think until we see complete results for multiple client
counts (1, 4, 8 .. 64 or 128, which is possible on some high-end
machine), it is difficult to draw any conclusion.
  
>
>> Points about performance data
>> ---------------------------------------------
>> 1.  Gives a good performance improvement at 64 clients or more,
>> and a somewhat moderate improvement at lower client counts.  The
>> reason is that the contention around ProcArrayLock is mainly
>> seen at higher client counts.  I have checked that at higher client
>> counts it starts behaving as if lockless (which means performance
>> with the patch is equivalent to what we get if we just comment out
>> ProcArrayLock in ProcArrayEndTransaction()).
>
>
> Well, I am not entirely sure if that's a correct way of looking at it. Sure, you would see less contention on the ProcArrayLock because the fact is that there are far fewer backends trying to acquire it.
>

I was stating that fact for the case even without my patch:
basically, I tried commenting out the ProcArrayLock acquisition
in ProcArrayEndTransaction().

> But those who don't get the lock will sleep and hence the contention is moved somewhere else, at least partially.  
>

Sure, if contention is reduced at one place, it will move
to the next lock.
  

>>
>> 4. The gains are visible when the data fits in shared_buffers as for other
>> workloads I/O starts dominating.
>
>
> That seems to be perfectly expected.
>  
>>
>> 5. I have seen that the effect of the patch is much more visible if we keep
>> autovacuum = off (doing a manual vacuum after each run) and keep
>> wal_writer_delay at a lower value (say 20ms).
>
>
> Do you know why that happens? Is it because the contention moves somewhere else with autovacuum on?
>

No; autovacuum generates I/O, due to which there is sometimes
more variation in write tests.
  
> Regarding the design itself, I have an idea that maybe we can create a general-purpose infrastructure to use this technique. 
>

I think this could be beneficial if we can come up with
some clean interface.

> If its useful here, I'm sure there are other places where this can be applied with similar effect.
>

I also think so.

> For example, how about adding an API such as LWLockDispatchWork(lock, mode, function_ptr, data_ptr)? Here the data_ptr points to somewhere in shared memory that the function_ptr can work on once the lock is available. If the lock is available in the requested mode then the function_ptr is
> executed with the given data_ptr and the function returns.
>
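For reference, my reading of the proposal is something like the sketch below. This is purely hypothetical pseudocode: LWLockDispatchWork, LWLockWorkFn, and the per-lock work queue are assumptions taken from your description, not anything that exists in lwlock.c today (only LWLockConditionalAcquire and LWLockRelease are real functions):

```c
/* Hypothetical sketch only -- these symbols do not exist in lwlock.c. */
typedef void (*LWLockWorkFn) (void *data);

void
LWLockDispatchWork(LWLock *lock, LWLockMode mode,
                   LWLockWorkFn function_ptr, void *data_ptr)
{
    if (LWLockConditionalAcquire(lock, mode))
    {
        /* Lock free in the requested mode: do the work ourselves. */
        function_ptr(data_ptr);
        LWLockRelease(lock);
    }
    else
    {
        /*
         * Lock busy: enqueue (function_ptr, data_ptr) on a per-lock
         * work queue and sleep; the backend releasing the lock would
         * drain the queue, execute each entry, and then wake the
         * sleeping processes.
         */
    }
}
```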

I can do something like that if others also agree with this new
API in the LWLock series, but personally I don't think lwlock.c is
the right place to expose an API for this work.  Broadly, the work
we are doing can be thought of as the sub-tasks below.

1. Advertise each backend's xid.
2. Push all backends except one onto a global list.
3. Wait till someone wakes us and check if the xid is cleared;
   repeat until the xid is clear.
4. Acquire the lock.
5. Pop all the backends, clear each one's xid, and use their
   published xids to advance the global latestCompletedXid.
6. Release the lock.
7. Wake all the processes waiting for their xid to be cleared,
   marking each backend's xid as clear before waking it.

So among these, only step 2 can be common across different
algorithms; the others need work specific to each optimization.

Does anyone else see a better way to provide a generic API, so
that it can be used in other places if required in the future?



> If the lock is not available then the work is dispatched to some Q (tracked on a per-lock basis?) and the process goes to sleep. Whenever the lock becomes available in the requested mode, the work is executed by some other backend and the primary process is woken up. This will most likely
> happen in the LWLockRelease() path, when the last holder is about to give up the lock so that it becomes available in the requested "mode". 
>

I am not able to follow what you want to achieve with this.
Why is 'Q' better than the current process for performing the
work specific to the whole group, and does 'Q' also wait on the
current lock? If yes, how?

I think this will overcomplicate things without any real
benefit, at least for this optimization.

>
> Regarding the patch, the compare-and-exchange function calls that you've used would work only for 64-bit machines, right? You would need to use equivalent 32-bit calls on a 32-bit machine.
>

I thought the internal atomics API would automatically take care
of it; for example, for MSVC it uses _InterlockedCompareExchange64.
If that doesn't work on 32-bit systems or is not defined, then
we would have to use the 32-bit version, but I am not certain
about that.


Note - This patch requires some updates to src/backend/access/transam/README.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
