Thread: LWLocks in DSM memory

LWLocks in DSM memory

From: Thomas Munro
Hi,

As already noted[1], LWLocks don't currently work in DSM segments,
because they use dlist for the list of waiters.  Even though all of
the waiter nodes are in PGPROC and therefore have stable addresses,
the dlist code internally constructs a circular list including
pointers to a special sentinel node inside the dlist_head object, and
those pointers may be invalid in other backends.
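
For context, the circular layout looks roughly like this (a simplified
sketch, not the exact ilist.h definitions):

    typedef struct dlist_node
    {
        struct dlist_node *prev;
        struct dlist_node *next;
    } dlist_node;

    typedef struct dlist_head
    {
        dlist_node  head;   /* sentinel node embedded in the head */
    } dlist_head;

    static inline void
    dlist_init(dlist_head *head)
    {
        /*
         * &head->head is an address in this backend's mapping.  If the
         * dlist_head lives in a DSM segment mapped at a different address
         * in another backend, a waiter whose prev/next points at the
         * sentinel can't be unlinked correctly from that other backend.
         */
        head->head.next = head->head.prev = &head->head;
    }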

One solution could be to provide a non-circular variant of the dlist
interface that uses NULL list termination.  I've attached a quick
sketch of something like that which seems to work correctly.  It is
only lightly tested so far and probably buggy, but shows the general
idea.
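
For illustration, the general shape of a NULL-terminated push could be
something like the following (names made up for the sketch; the attached
patch is the real thing):

    typedef struct dlist_head_nc
    {
        dlist_node *first;      /* NULL if the list is empty */
        dlist_node *last;       /* NULL if the list is empty */
    } dlist_head_nc;

    static inline void
    dlist_push_head_nc(dlist_head_nc *head, dlist_node *node)
    {
        node->prev = NULL;
        node->next = head->first;
        if (head->first != NULL)    /* the branch the circular form avoids */
            head->first->prev = node;
        else
            head->last = node;
        head->first = node;
    }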

Any thoughts on this approach, or better ways to solve this problem?
How valuable is the branch-free property of the circular push and
delete operations?

[1] https://www.postgresql.org/message-id/CA+Tgmobjia49CCJ0ZazbWaVv7nKgYt+1Zo5CwxkH9Aahgn2vPg@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: LWLocks in DSM memory

From: Thomas Munro
On Sun, Jul 24, 2016 at 1:10 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> One solution could be to provide a non-circular variant of the dlist
> interface that uses NULL list termination.  I've attached a quick
> sketch of something like that which seems to work correctly.  It is
> only lightly tested so far and probably buggy, but shows the general
> idea.

On reflection, it wouldn't make much sense to put "noncircular" in the
names of interfaces, as that is an internal detail.  Maybe
"relocatable" or "position independent" or something like that, since
that's a user-visible property of the dlist_head.  Here is a version
of the patch that uses _relocatable.

> Any thoughts on this approach, or better ways to solve this problem?
> How valuable is the branch-free property of the circular push and
> delete operations?

I investigated this a bit.  If I do this on my laptop (clang, no
asserts, -O2),  it takes 3895 milliseconds, or 4.8ns per push/delete
operation:

        dlist_init(&head);
        for (i = 0; i < 100000000; ++i)
        {
            dlist_push_head(&head, &nodes[0]);
            dlist_push_tail(&head, &nodes[1]);
            dlist_push_head(&head, &nodes[2]);
            dlist_push_tail(&head, &nodes[3]);
            dlist_delete(&nodes[2]);
            dlist_delete(&nodes[3]);
            dlist_delete(&nodes[0]);
            dlist_delete(&nodes[1]);
        }

The relocatable version takes 5907 milliseconds, or 7.4ns per
push/delete operation:

        dlist_init_relocatable(&head);
        for (i = 0; i < 100000000; ++i)
        {
            dlist_push_head_relocatable(&head, &nodes[0]);
            dlist_push_tail_relocatable(&head, &nodes[1]);
            dlist_push_head_relocatable(&head, &nodes[2]);
            dlist_push_tail_relocatable(&head, &nodes[3]);
            dlist_delete_relocatable(&head, &nodes[2]);
            dlist_delete_relocatable(&head, &nodes[3]);
            dlist_delete_relocatable(&head, &nodes[0]);
            dlist_delete_relocatable(&head, &nodes[1]);
        }

Those operations are ~50% slower.  So getting rid of dlist's clever
branch-free code generally seems like a bad idea.

Next I wondered if it would be a bad idea to use slower relocatable
dlist heads for all LWLocks.  Obviously LWLocks are used all over the
place and it would be bad to slow them down, but I wondered if maybe
dlist manipulation might not be a very significant part of their work.
So I put a LWLock and a counter in shared memory, and had N background
workers run a tight loop that locks, increments and unlocks
concurrently until the counter reaches 1 billion.  This creates
different degrees of contention and wait list sizes.  The worker loop
looks like this:

    while (!finished)
    {
        LWLockAcquire(&control->lock, LW_EXCLUSIVE);
        ++control->counter;
        if (control->counter >= control->goal)
            finished = true;
        LWLockRelease(&control->lock);
    }

I measured the following times for unpatched master, on my 4 core laptop:

16 workers = 73.067s, 74.869s, 75.338s
8 workers  = 65.846s, 67.622s, 68.039s
4 workers  = 68.763s, 68.980s, 69.035s <-- curiously slower here
3 workers  = 59.701s, 59.991s, 60.133s
2 workers  = 53.620s, 55.300s, 55.790s
1 worker   = 21.578s, 21.535s, 21.598s

With the attached patch I got:

16 workers = 75.341s, 77.445s, 77.635s <- +3.4%
8 workers  = 67.462s, 68.622s, 68.851s <- +1.4%
4 workers  = 64.983s, 65.021s, 65.496s <- -5.7%
3 workers  = 60.247s, 60.425s, 60.492s <- +0.7%
2 workers  = 57.544s, 57.626s, 58.133s <- +2.3%
1 worker   = 21.403s, 21.486s, 21.661s <- -0.2%

Somewhat noisy data and different effects at different numbers of workers.

I can post the source for those tests if anyone is interested.  If you
have any other ideas for access patterns to test, or clever ways to
keep push and delete branch-free while also avoiding internal pointers
back to dlist_head, I'm all ears.  Otherwise, if a change affecting
all LWLocks turns out to be unacceptable, maybe we would need to have
a different LWLock interface for relocatable LWLocks to make them
suitable for use in DSM segments.  Any thoughts?

--
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: LWLocks in DSM memory

From: Thomas Munro
On Mon, Jul 25, 2016 at 3:02 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I measured the following times for unpatched master, on my 4 core laptop:
>
> 16 workers = 73.067s, 74.869s, 75.338s
> 8 workers  = 65.846s, 67.622s, 68.039s
> 4 workers  = 68.763s, 68.980s, 69.035s <-- curiously slower here
> 3 workers  = 59.701s, 59.991s, 60.133s
> 2 workers  = 53.620s, 55.300s, 55.790s
> 1 worker   = 21.578s, 21.535s, 21.598s
>
> With the attached patch I got:
>
> 16 workers = 75.341s, 77.445s, 77.635s <- +3.4%
> 8 workers  = 67.462s, 68.622s, 68.851s <- +1.4%
> 4 workers  = 64.983s, 65.021s, 65.496s <- -5.7%
> 3 workers  = 60.247s, 60.425s, 60.492s <- +0.7%
> 2 workers  = 57.544s, 57.626s, 58.133s <- +2.3%
> 1 worker   = 21.403s, 21.486s, 21.661s <- -0.2%

Correction, that +2.3% for 2 workers should be +4.2%.  And to clarify,
I ran the test 3 times as shown and those percentage changes are based
on the middle times.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: LWLocks in DSM memory

From: Robert Haas
On Sun, Jul 24, 2016 at 11:02 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sun, Jul 24, 2016 at 1:10 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> One solution could be to provide a non-circular variant of the dlist
>> interface that uses NULL list termination.  I've attached a quick
>> sketch of something like that which seems to work correctly.  It is
>> only lightly tested so far and probably buggy, but shows the general
>> idea.
>
> On reflection, it wouldn't make much sense to put "noncircular" in the
> names of interfaces, as that is an internal detail.  Maybe
> "relocatable" or "position independent" or something like that, since
> that's a user-visible property of the dlist_head.  Here is a version
> of the patch that uses _relocatable.
>
>> Any thoughts on this approach, or better ways to solve this problem?
>> How valuable is the branch-free property of the circular push and
>> delete operations?
>
> I investigated this a bit.  If I do this on my laptop (clang, no
> asserts, -O2),  it takes 3895 milliseconds, or 4.8ns per push/delete
> operation:
>
>         dlist_init(&head);
>         for (i = 0; i < 100000000; ++i)
>         {
>             dlist_push_head(&head, &nodes[0]);
>             dlist_push_tail(&head, &nodes[1]);
>             dlist_push_head(&head, &nodes[2]);
>             dlist_push_tail(&head, &nodes[3]);
>             dlist_delete(&nodes[2]);
>             dlist_delete(&nodes[3]);
>             dlist_delete(&nodes[0]);
>             dlist_delete(&nodes[1]);
>         }
>
> The relocatable version takes 5907 milliseconds, or 7.4ns per
> push/delete operation:
>
>         dlist_init_relocatable(&head);
>         for (i = 0; i < 100000000; ++i)
>         {
>             dlist_push_head_relocatable(&head, &nodes[0]);
>             dlist_push_tail_relocatable(&head, &nodes[1]);
>             dlist_push_head_relocatable(&head, &nodes[2]);
>             dlist_push_tail_relocatable(&head, &nodes[3]);
>             dlist_delete_relocatable(&head, &nodes[2]);
>             dlist_delete_relocatable(&head, &nodes[3]);
>             dlist_delete_relocatable(&head, &nodes[0]);
>             dlist_delete_relocatable(&head, &nodes[1]);
>         }
>
> Those operations are ~50% slower.  So getting rid of dlist's clever
> branch-free code generally seems like a bad idea.
>
> Next I wondered if it would be a bad idea to use slower relocatable
> dlist heads for all LWLocks.  Obviously LWLocks are used all over the
> place and it would be bad to slow them down, but I wondered if maybe
> dlist manipulation might not be a very significant part of their work.
> So I put a LWLock and a counter in shared memory, and had N background
> workers run a tight loop that locks, increments and unlocks
> concurrently until the counter reaches 1 billion.  This creates
> different degrees of contention and wait list sizes.  The worker loop
> looks like this:
>
>     while (!finished)
>     {
>         LWLockAcquire(&control->lock, LW_EXCLUSIVE);
>         ++control->counter;
>         if (control->counter >= control->goal)
>             finished = true;
>         LWLockRelease(&control->lock);
>     }
>
> I measured the following times for unpatched master, on my 4 core laptop:
>
> 16 workers = 73.067s, 74.869s, 75.338s
> 8 workers  = 65.846s, 67.622s, 68.039s
> 4 workers  = 68.763s, 68.980s, 69.035s <-- curiously slower here
> 3 workers  = 59.701s, 59.991s, 60.133s
> 2 workers  = 53.620s, 55.300s, 55.790s
> 1 worker   = 21.578s, 21.535s, 21.598s
>
> With the attached patch I got:
>
> 16 workers = 75.341s, 77.445s, 77.635s <- +3.4%
> 8 workers  = 67.462s, 68.622s, 68.851s <- +1.4%
> 4 workers  = 64.983s, 65.021s, 65.496s <- -5.7%
> 3 workers  = 60.247s, 60.425s, 60.492s <- +0.7%
> 2 workers  = 57.544s, 57.626s, 58.133s <- +2.3%
> 1 worker   = 21.403s, 21.486s, 21.661s <- -0.2%
>
> Somewhat noisy data and different effects at different numbers of workers.
>
> I can post the source for those tests if anyone is interested.  If you
> have any other ideas for access patterns to test, or clever ways to
> keep push and delete branch-free while also avoiding internal pointers
> back to dlist_head, I'm all ears.  Otherwise, if a change affecting
> all LWLocks turns out to be unacceptable, maybe we would need to have
> a different LWLock interface for relocatable LWLocks to make them
> suitable for use in DSM segments.  Any thoughts?

Thanks for looking into this.  Any theory on why this is slower in the
abstract but not slower, perhaps even faster, for LWLocks?  I wonder
if it has to do with how likely it is for a list to be completely
empty.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: LWLocks in DSM memory

From: Andres Freund
Hi,

On 2016-07-25 15:02:41 +1200, Thomas Munro wrote:
> > Any thoughts on this approach, or better ways to solve this problem?
> > How valuable is the branch-free property of the circular push and
> > delete operations?

I think it's generally valuable, but I suspect that it doesn't really
matter much for lwlocks, given that this is already the slow path, where
we enter the kernel and such. I suspect that the performance difference
here is more due to code movement than anything.

> Next I wondered if it would be a bad idea to use slower relocatable
> dlist heads for all LWLocks.  Obviously LWLocks are used all over the
> place and it would be bad to slow them down, but I wondered if maybe
> dlist manipulation might not be a very significant part of their work.

I'm pretty sure that's the case.


> So I put a LWLock and a counter in shared memory, and had N background
> workers run a tight loop that locks, increments and unlocks
> concurrently until the counter reaches 1 billion.  This creates
> different degrees of contention and wait list sizes.  The worker loop
> looks like this:
> 
>     while (!finished)
>     {
>         LWLockAcquire(&control->lock, LW_EXCLUSIVE);
>         ++control->counter;
>         if (control->counter >= control->goal)
>             finished = true;
>         LWLockRelease(&control->lock);
>     }

You really need shared lwlocks here as well, because exclusive lwlocks
will only ever trigger a single list manipulation (basically a pop from
the wait queue), further minimizing the list manipulation overhead.
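
For example, something along these lines would pile shared waiters up
behind an exclusive holder and then wake them in bulk (purely a sketch;
worker_is_writer is a made-up flag, not part of the posted test):

    while (!finished)
    {
        if (worker_is_writer)       /* e.g. exactly one worker sets this */
        {
            LWLockAcquire(&control->lock, LW_EXCLUSIVE);
            ++control->counter;
            if (control->counter >= control->goal)
                finished = true;
            LWLockRelease(&control->lock);
        }
        else
        {
            /* readers queue up behind the writer and are woken together */
            LWLockAcquire(&control->lock, LW_SHARED);
            if (control->counter >= control->goal)
                finished = true;
            LWLockRelease(&control->lock);
        }
    }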


I think the better fix here would actually be to get rid of the
pointer-based list, and a) replace the queue with integer offsets, b) make
the queue lock-free.  But I understand that you might not want to tackle
that :P

Greetings,

Andres Freund



Re: LWLocks in DSM memory

From: Thomas Munro
On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund <andres@anarazel.de> wrote:
> I think the better fix here would actually be to get rid of the
> pointer-based list, and a) replace the queue with integer offsets ...

Here is a prototype of that idea.  It replaces that dlist with a
proclist, a new specialised doubly-linked list for pgprocno values,
using INVALID_PGPROCNO for termination.  It works with proclist_node
objects inside PGPROC at an arbitrary offset which must be provided
when you initialise the proclist.
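
In rough outline the idea looks like this (a sketch rather than the
attached patch; GetPGProcByNumber is assumed here as the way to reach
the PGPROC array):

    typedef struct proclist_node
    {
        int     next;        /* pgprocno of next node, or INVALID_PGPROCNO */
        int     prev;        /* pgprocno of previous node, or INVALID_PGPROCNO */
    } proclist_node;

    typedef struct proclist_head
    {
        int     head;        /* pgprocno of first node, or INVALID_PGPROCNO */
        int     tail;        /* pgprocno of last node, or INVALID_PGPROCNO */
        size_t  node_offset; /* where the proclist_node lives inside PGPROC */
    } proclist_head;

    /* Translate a pgprocno into this backend's address for its list node. */
    static inline proclist_node *
    proclist_node_get(proclist_head *list, int procno)
    {
        char   *entry = (char *) GetPGProcByNumber(procno);

        return (proclist_node *) (entry + list->node_offset);
    }

Because the links are indexes rather than pointers, they mean the same
thing in every backend regardless of where a segment is mapped.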

It could be made more general than just the PGPROC array by taking the
base address and stride of the array, perhaps even allowing for a
different base address in each backend for arrays that live in DSM,
but I happened to see https://xkcd.com/974/ today and didn't pursue
that thought.

> ... b) make
> the queue lock-free.  But I understand that you might not want to tackle
> that :P

This may be complicated by the not-strictly-queue-like access pattern.
I haven't looked into it yet, but I can see that it requires some
serious study (Harris or Zhang techniques, maybe with extra tagging to
avoid ABA problems on concurrent removal of the same node?, and if so
that might require a wider CAS than we have?).

I wonder if a proclist with an optional lock-free interface would also
be interesting for syncrep queues and others.

--
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: LWLocks in DSM memory

From: Robert Haas
On Thu, Jul 28, 2016 at 12:12 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund <andres@anarazel.de> wrote:
>> I think the better fix here would actually be to get rid of the
>> pointer-based list, and a) replace the queue with integer offsets ...
>
> Here is a prototype of that idea.  It replaces that dlist with a
> proclist, a new specialised doubly-linked list for pgprocno values,
> using INVALID_PGPROCNO for termination.  It works with proclist_node
> objects inside PGPROC at an arbitrary offset which must be provided
> when you initialise the proclist.

Aside from the fact that this allows LWLocks inside DSM segments,
which I definitely want to support, this seems to have the nice
property of reducing the size of an LWLock by 8 bytes.  We need to
consider what to do about LWLOCK_MINIMAL_SIZE.  We could (a) decide to
still pad all arrays of LWLocks to 32 bytes per LWLock so as to reduce
false sharing, and rename this constant not to imply minimality; or
(b) alter the macro definition to allow for 16 bytes as a possible
result.
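
For (b), assuming the existing macro simply rounds sizeof(LWLock) up to
32 or 64, the change might look roughly like this (illustrative only,
not the actual definition in lwlock.h):

    #define LWLOCK_MINIMAL_SIZE \
        (sizeof(LWLock) <= 16 ? 16 : (sizeof(LWLock) <= 32 ? 32 : 64))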

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: LWLocks in DSM memory

From: Thomas Munro
On Fri, Jul 29, 2016 at 2:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 28, 2016 at 12:12 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Tue, Jul 26, 2016 at 6:00 AM, Andres Freund <andres@anarazel.de> wrote:
>>> I think the better fix here would actually be to get rid of the
>>> pointer-based list, and a) replace the queue with integer offsets ...
>>
>> Here is a prototype of that idea.  It replaces that dlist with a
>> proclist, a new specialised doubly-linked list for pgprocno values,
>> using INVALID_PGPROCNO for termination.  It works with proclist_node
>> objects inside PGPROC at an arbitrary offset which must be provided
>> when you initialise the proclist.
>
> Aside from the fact that this allows LWLocks inside DSM segments,
> which I definitely want to support, this seems to have the nice
> property of reducing the size of an LWLock by 8 bytes.  We need to
> consider what to do about LWLOCK_MINIMAL_SIZE.  We could (a) decide to
> still pad all arrays of LWLocks to 32 bytes per LWLock so as to reduce
> false sharing, and rename this constant not to imply minimality; or
> (b) alter the macro definition to allow for 16 bytes as a possible
> result.

That version didn't actually make LWLock any smaller, because of the
additional offset stored in proclist_head when initialising the list.
Here is a new version that requires callers to provide it when
accessing the list, bringing sizeof(LWLock) down to 16 on LP64/LLP64
systems.  In the spirit of the dlist_container macro, I added macros
so you can name the link member rather than providing its offset:
proclist_push_tail(&list, procno, lwWaitLink).
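
Roughly, such a macro can be spelled like this (illustrative; the name
of the underlying offset-taking function is a guess, not necessarily
what the patch uses):

    /* Convert a member name into an offset, in the spirit of dlist_container. */
    #define proclist_push_tail(list, procno, link_member) \
        proclist_push_tail_offset((list), (procno), offsetof(PGPROC, link_member))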

--
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: LWLocks in DSM memory

From: Robert Haas
On Thu, Jul 28, 2016 at 6:51 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> That version didn't actually make LWLock any smaller, because of the
> additional offset stored in proclist_head when initialising the list.
> Here is a new version that requires callers to provide it when
> accessing the list, bringing sizeof(LWLock) down to 16 on LP64/LLP64
> systems.  In the spirit of the dlist_container macro, I added macros
> so you can name the link member rather than providing its offset:
> proclist_push_tail(&list, procno, lwWaitLink).

I have reviewed this patch.  I like the design and overall I would say
that the patch looks quite elegant.  The only problem I found is that
proclist_types.h doesn't seem to need to #include <stddef.h>.  I tried
taking that out, and, at least on my machine, it still compiles just
fine.

Of course, performance is a concern, as with any change that touches
deep internals.  Andres and Thomas seem to be in agreement that this
patch should cause a performance loss but that it should be
undetectable in practice, since only the slow path where we're
entering the kernel anyway is affected.  I agree with that conclusion.
I spent a little bit of time trying to think of a case where this
would have a measurable impact, and I'm not coming up with anything.
We've done a lot of work over the last few releases to try to
eliminate LWLock contention, and the only way you could even
theoretically see an impact from this is if you can come up with a
workload with furious LWLock contention so that there is lots and lots
of linked-list manipulation going on inside lwlock.c.  But I don't
really know of a workload that behaves that way, and even if I did, I
think the number of cycles we're talking about here is too small to be
measurable.  Having to put processes to sleep is going to be multiple
orders of magnitude more costly than any possible details of how we
handle the linked list manipulation.

I also agree with the goal of the patch: the reason I developed the
idea of LWLock tranches was to be able to put LWLocks inside DSM
segments, and that worked fine up until Andres
changed lwlock.c to use dlists.  This change will get us back to being
able to use LWLocks inside of DSM segments.  Of course, we have no
code like that today; if we did, it would be broken.  But Thomas will
be submitting code that does exactly that in the near future, and I
suspect other people will want to do it, too. As a fringe benefit, I
have another patch I'm working on that can make use of this proclist
infrastructure.

Therefore, I plan to commit this patch, removing the #include
<stddef.h> unless someone convinces me we need it, shortly after
development for v10 opens, unless there are objections before then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: LWLocks in DSM memory

From: Robert Haas
On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Therefore, I plan to commit this patch, removing the #include
> <stddef.h> unless someone convinces me we need it, shortly after
> development for v10 opens, unless there are objections before then.

Hearing no objections, done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: LWLocks in DSM memory

From: Andres Freund
On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Therefore, I plan to commit this patch, removing the #include
> > <stddef.h> unless someone convinces me we need it, shortly after
> > development for v10 opens, unless there are objections before then.
> 
> Hearing no objections, done.

I'd have objected, if I hadn't been on vacation.  While I intuitively
*do* think that the increased wait-list overhead won't be relevant, I
also know that my intuition has frequently been wrong around the lwlock
code.  This needs some benchmarks on a 4+ socket machine,
first. Something exercising the slow path obviously. E.g. a pgbench with
a small number of writers, and a large number of writers.

Regards,

Andres



Re: LWLocks in DSM memory

From: Robert Haas
On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
>> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > Therefore, I plan to commit this patch, removing the #include
>> > <stddef.h> unless someone convinces me we need it, shortly after
>> > development for v10 opens, unless there are objections before then.
>>
>> Hearing no objections, done.
>
> I'd have objected, if I hadn't been on vacation.  While I intuitively
> *do* think that the increased wait-list overhead won't be relevant, I
> also know that my intuition has frequently been wrong around the lwlock
> code.  This needs some benchmarks on a 4+ socket machine,
> first. Something exercising the slow path obviously. E.g. a pgbench with
> a small number of writers, and a large number of writers.

I have to admit that I totally blanked about you being on vacation.
Thanks for mentioning the workload you think might be adversely
affected, but to be honest, even if there's some workload where this
causes a small regression, I'm not really sure what you think we
should do instead.  Should we have a separate copy of lwlock.c just
for parallel query and other stuff that uses DSM?  Won't that slow
down every error-handling path in the system, if they all have to
release two kinds of lwlocks rather than one?  And bloat the binary?
Or are you going to argue that parallel query doesn't really need
LWLocks?  I'm sure that's not true.  We got by without it for this
release, but that's because the only truly parallel operation as yet
is Parallel Seq Scan whose requirements are simple enough to be
handled with a spinlock.

Anyway, I guess we should wait for the benchmark results and then see,
but if we're not going to do this then we need some reasonable
alternative.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: LWLocks in DSM memory

From: Robert Haas
On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
>> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > Therefore, I plan to commit this patch, removing the #include
>> > <stddef.h> unless someone convinces me we need it, shortly after
>> > development for v10 opens, unless there are objections before then.
>>
>> Hearing no objections, done.
>
> I'd have objected, if I hadn't been on vacation.  While I intuitively
> *do* think that the increased wait-list overhead won't be relevant, I
> also know that my intuition has frequently been wrong around the lwlock
> code.  This needs some benchmarks on a 4+ socket machine,
> first. Something exercising the slow path obviously. E.g. a pgbench with
> a small number of writers, and a large number of writers.

Amit just pointed out to me that you wrote "a small number of writers,
and a large number of writers".  I assume one of those is supposed to
say "readers"?  Probably the second one?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: LWLocks in DSM memory

From: Andres Freund
On 2016-08-17 08:31:36 -0400, Robert Haas wrote:
> On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
> >> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> > Therefore, I plan to commit this patch, removing the #include
> >> > <stddef.h> unless someone convinces me we need it, shortly after
> >> > development for v10 opens, unless there are objections before then.
> >>
> >> Hearing no objections, done.
> >
> > I'd have objected, if I hadn't been on vacation.  While I intuitively
> > *do* think that the increased wait-list overhead won't be relevant, I
> > also know that my intuition has frequently been wrong around the lwlock
> > code.  This needs some benchmarks on a 4+ socket machine,
> > first. Something exercising the slow path obviously. E.g. a pgbench with
> > a small number of writers, and a large number of writers.
> 
> Amit just pointed out to me that you wrote "a small number of writers,
> and a large number of writers".  I assume one of those is supposed to
> say "readers"?  Probably the second one?

Yes. I want a long wait list, modified in bulk - which should be the
case with the above.



Re: LWLocks in DSM memory

From: Andres Freund
On 2016-08-16 23:09:07 -0400, Robert Haas wrote:
> On Tue, Aug 16, 2016 at 5:03 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-08-15 18:15:23 -0400, Robert Haas wrote:
> >> On Thu, Aug 11, 2016 at 2:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> > Therefore, I plan to commit this patch, removing the #include
> >> > <stddef.h> unless someone convinces me we need it, shortly after
> >> > development for v10 opens, unless there are objections before then.
> >>
> >> Hearing no objections, done.
> >
> > I'd have objected, if I hadn't been on vacation.  While I intuitively
> > *do* think that the increased wait-list overhead won't be relevant, I
> > also know that my intuition has frequently been wrong around the lwlock
> > code.  This needs some benchmarks on a 4+ socket machine,
> > first. Something exercising the slow path obviously. E.g. a pgbench with
> > a small number of writers, and a large number of writers.
> 
> I have to admit that I totally blanked about you being on vacation.
> Thanks for mentioning the workload you think might be adversely
> affected, but to be honest, even if there's some workload where this
> causes a small regression, I'm not really sure what you think we
> should do instead.

Well, you convincingly argued against that approach in a nearby thread
;). Joking aside: I do think that we should make such a change
knowingly. It might also be possible to address it somehow.

I really hope there's no slowdown.


> Should we have a separate copy of lwlock.c just
> for parallel query and other stuff that uses DSM?

No, that'd be horrible.


> Or are you going to argue that parallel query doesn't really need
> LWLocks?

Definitely not.


- Andres



Re: LWLocks in DSM memory

From: Mithun Cy
On Wed, Aug 17, 2016 at 9:12 PM, Andres Freund <andres@anarazel.de> wrote:
>Yes. I want a long wait list, modified in bulk - which should be the
>case with the above.

I ran some pgbench tests, and I do not see much difference in performance; the small variance can be attributed to variance in the probability of drawing a particular built-in script.

Server configuration:
                ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

pgbench configuration: initialized with scale_factor 300
./pgbench -c $threads -j $threads -T 1800 -M prepared -b simple-update@1 -b  select-only@20 postgres


Simple-update : select-only = 5:95


clients                        8               64              128
Current Code        38279.663784    258196.067342    283150.495921
After Patch revert   37316.09022    268285.506338    280077.913954
% diff             -2.5171944284     3.9076656356    -1.0851409449




Simple-update : select-only = 20:80


clients                        8               64              128
Current Code        23169.021469    134791.996882    154057.101004
After Patch revert   23892.91539    135091.645551     150402.80543
% diff              3.1244043775     0.2223044958    -2.3720396854


This was done on our 8-socket Intel machine.

--
Thanks and Regards
Mithun C Y

Re: LWLocks in DSM memory

From: Amit Kapila
On Fri, Aug 19, 2016 at 4:29 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> On Wed, Aug 17, 2016 at 9:12 PM, Andres Freund <andres@anarazel.de> wrote:
> > Yes. I want a long wait list, modified in bulk - which should be the
> > case with the above.
>
> I ran some pgbench tests, and I do not see much difference in performance; the small variance can be attributed to variance in the probability of drawing a particular built-in script.


Can you specify the m/c details as Andres wants tests to be conducted on some high socket m/c?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: LWLocks in DSM memory

From: Mithun Cy

On Fri, Aug 19, 2016 at 4:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>Can you specify the m/c details as Andres wants tests to be conducted on some high socket m/c?

As I specified in the last line of my mail, it is an 8-socket Intel machine.

available: 8 nodes (0-7)
node 0 cpus: 0 65 66 67 68 69 70 71 96 97 98 99 100 101 102 103
node 0 size: 65498 MB
node 0 free: 50308 MB
node 1 cpus: 72 73 74 75 76 77 78 79 104 105 106 107 108 109 110 111
node 1 size: 65536 MB
node 1 free: 44594 MB
node 2 cpus: 80 81 82 83 84 85 86 87 112 113 114 115 116 117 118 119
node 2 size: 65536 MB
node 2 free: 61814 MB
node 3 cpus: 88 89 90 91 92 93 94 95 120 121 122 123 124 125 126 127
node 3 size: 65536 MB
node 3 free: 55258 MB
node 4 cpus: 1 2 3 4 5 6 7 8 33 34 35 36 37 38 39 40
node 4 size: 65536 MB
node 4 free: 46705 MB
node 5 cpus: 9 10 11 12 13 14 15 16 41 42 43 44 45 46 47 48
node 5 size: 65536 MB
node 5 free: 40948 MB
node 6 cpus: 17 18 19 20 21 22 23 24 49 50 51 52 53 54 55 56
node 6 size: 65536 MB
node 6 free: 47468 MB
node 7 cpus: 25 26 27 28 29 30 31 32 57 58 59 60 61 62 63 64
node 7 size: 65536 MB
node 7 free: 17285 MB
 


--
Thanks and Regards
Mithun C Y