Thread: Avoiding repeated snapshot computation

Avoiding repeated snapshot computation

From: Pavan Deolasee
Date:
On some recent benchmarks and profile data, I saw GetSnapshotData
figures at the very top or near the top. For a smaller number of
clients, it can account for 10-20% of the time, but with more clients
I have seen it take up as much as 40% of the sample time.
Unfortunately, the machine on which I was running these tests is
currently not available and so I don't have the exact numbers, but the
general observation holds. Our recent work on separating the hot
members of PGPROC into a separate array would definitely reduce data
cache misses and reduce the GetSnapshotData time, but it probably
still accounts for a large enough critical section for a highly
contended lock.

I think now that we have reduced the run time of the function itself,
we should now try to reduce the number of times the function is
called. Robert proposed a way to reduce the number of calls per
transaction. I think we can go one more step further and reduce the
number of calls across transactions.

One major problem today could be the way LWLock works. If the lock is
currently held in SHARED mode by some backend and some other backend
now requests it in SHARED mode, it will immediately get it. That's
probably the right thing to do, because you don't want the reader to
really wait when the lock is readily available. But in the case of
GetSnapshotData(), every reader is doing exactly the same thing; they
are computing a snapshot based on the same shared state and would
compute exactly the same snapshot (if we ignore the fact that we don't
include the caller's XID in the xip array, but that's a minor detail).
And because of the way LWLock works, more and more readers get in to
compute the snapshot before the exclusive waiters get a window to
sneak in, as more and more processes slowly start sleeping for
exclusive access. To depict it, the four transactions below make
overlapping calls to GetSnapshotData(), and hence the total critical
section starts when the first caller enters it and ends when the last
caller exits.

Txn1 ------[          SHARED        ]---------------------
Txn2 --------[          SHARED        ]-------------------
Txn3 -----------------[            SHARED        ]-------------
Txn4 -------------------------------------------[           SHARED   ]---------
     |<---------------Total Time ------------------------------------>|

A couple of ideas come to mind to solve this issue.

A snapshot once computed will remain valid for every call irrespective
of its origin until at least one transaction ends. So we can store the
last computed snapshot in some shared area and reuse it for all
subsequent GetSnapshotData calls. The shared snapshot will get
invalidated when some transaction ends by calling
ProcArrayEndTransaction(). I tried this approach and saw a 15%
improvement for 32-80 clients on the 32 core HP IA box with pgbench -s
100 -N tests. Not bad, but I think this can be improved further.

What we can do is when a transaction comes to compute its snapshot, it
checks if some other transaction is already computing a snapshot for
itself. If so, it just sleeps on the lock. When the other process
finishes computing the snapshot, it saves the snapshot in a shared
area and wakes up all processes waiting for the snapshot. All those
processes then just copy the snapshot from the shared area and they
are done. This will not only reduce the total CPU consumption by
avoiding repetitive work, but would also reduce the total time for
which ProcArrayLock is held in SHARED mode by avoiding a pipeline of
GetSnapshotData calls. I am currently trying the shared work queue
mechanism to implement this, but I am sure we can do this in some
other way too.
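
To make the idea concrete, here is a very rough, untested sketch of what
the shared area could look like; SharedSnapshotSlot, sharedSnapshot,
MAX_CACHED_XIDS and TryCopySharedSnapshot are made-up names, not actual
PostgreSQL code:

/* Hypothetical sketch only; none of these names exist in PostgreSQL. */
#define MAX_CACHED_XIDS 1024            /* would be sized to the max xip array */

typedef struct SharedSnapshotSlot
{
    bool            valid;              /* cleared by ProcArrayEndTransaction()
                                         * while ProcArrayLock is held exclusively */
    TransactionId   xmin;
    TransactionId   xmax;
    uint32          xcnt;               /* number of entries used in xip[] */
    TransactionId   xip[MAX_CACHED_XIDS];
} SharedSnapshotSlot;

static SharedSnapshotSlot *sharedSnapshot;      /* points into shared memory */

/*
 * Fast path inside GetSnapshotData(), entered while holding ProcArrayLock
 * in LW_SHARED mode.  Returns true if the published snapshot could simply
 * be copied; otherwise the caller scans the ProcArray as today and stores
 * the result into *sharedSnapshot before releasing the lock.
 */
static bool
TryCopySharedSnapshot(Snapshot snapshot)
{
    if (!sharedSnapshot->valid)
        return false;

    snapshot->xmin = sharedSnapshot->xmin;
    snapshot->xmax = sharedSnapshot->xmax;
    snapshot->xcnt = sharedSnapshot->xcnt;
    memcpy(snapshot->xip, sharedSnapshot->xip,
           sharedSnapshot->xcnt * sizeof(TransactionId));
    return true;
}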

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: Avoiding repeated snapshot computation

From: Robert Haas
Date:
On Sat, Nov 26, 2011 at 10:52 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> What we can do is when a transaction comes to compute its snapshot, it
> checks if some other transaction is already computing a snapshot for
> itself. If so, it just sleeps on the lock. When the other process
> finishes computing the snapshot, it saves the snapshot in a shared
> area and wakes up all processes waiting for the snapshot. All those
> processes then just copy the snapshot from the shared area and they
> are done. This will not only reduce the total CPU consumption by
> avoiding repetitive work, but would also reduce the total time for
> which ProcArrayLock is held in SHARED mode by avoiding a pipeline of
> GetSnapshotData calls. I am currently trying the shared work queue
> mechanism to implement this, but I am sure we can do this in some
> other way too.

I don't quite understand how this is going to work.  Transaction A
ends, invalidating the shared snapshot.  Now transactions B, C, and D
come along wishing to take snapshots.  B begins taking a snapshot, so
C and D wait (blech!) for that to be complete.  Then, they start
copying the snapshot.  Meanwhile, transaction E ends, invalidating the
shared snapshot, and then transaction F comes along, wanting to take a
snapshot.  If F constructs a new shared snapshot, then doesn't that
overwrite the same memory area C and D were busy copying?  It seems
like F needs some way to know that C and D are done with the old
shared snapshot before it starts writing a new one.  Furthermore, it's
hard to understand how this could be a net improvement in the general
case, because now both B and F are copying everything twice (once to
the shared area and once to backend-local memory) instead of just once
(to backend-local memory) and C and D are sleeping (yikes!).  That
could maybe be a gain at very high concurrency when spinlock
contention is eating us alive, but it doesn't seem like a good idea in
general.  Maybe I'm missing something; did you mean to attach a patch?

I had a different idea for avoiding repeated snapshot computation:
save the snapshot in backend-private memory.  When a process takes
ProcArrayLock in exclusive mode, it increments a 64-bit counter.
When a process takes ProcArrayLock in shared mode, it can check
whether the counter has changed; if not, it can reuse the snapshot it
took before.  I think there might be a way to tinker with the snapshot
management so as to do this without any more copying than we do
presently, since CurrentSnapshotData and SecondarySnapshotData are
basically treated as scratch-pads anyhow.
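
Something along these lines, as a sketch only (procArrayGeneration and the
surrounding caching logic are hypothetical, not existing code):

/* Hypothetical: bumped each time ProcArrayLock is acquired in exclusive mode. */
extern volatile uint64 procArrayGeneration;

/* Backend-local cache state. */
static uint64   snapshotGeneration = 0;
static bool     snapshotCached = false;

Snapshot
GetSnapshotDataReuse(Snapshot snapshot)
{
    LWLockAcquire(ProcArrayLock, LW_SHARED);

    if (snapshotCached && procArrayGeneration == snapshotGeneration)
    {
        /* Nothing committed or aborted since last time: reuse as-is. */
        LWLockRelease(ProcArrayLock);
        return snapshot;
    }

    /* ... scan the ProcArray and fill in the snapshot as today ... */

    snapshotGeneration = procArrayGeneration;
    snapshotCached = true;
    LWLockRelease(ProcArrayLock);
    return snapshot;
}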

Another idea would be to try to avoid taking ProcArrayLock at all.
For example, suppose we again have a 64-bit counter in shared memory.
A transaction wishing to do ProcArrayEndTransaction() takes
ProcArrayLock in exclusive mode, increments the counter, memory
barrier, clears its XID, memory barrier, increments the counter,
releases ProcArrayLock.  A transaction wishing to take a snapshot
first reads the counter.  If the value read from the counter is even,
then we continue like this: memory barrier, take snapshot, memory
barrier, recheck counter.  If we find that the counter is unchanged,
then nobody did ProcArrayEndTransaction() while we were taking the
snapshot, and we're good to go.  If the counter changed then we take
ProcArrayLock in shared mode and retake the snapshot.  (We also take
ProcArrayLock in shared mode if the initial value we read was odd.)
This seems like it would all but eliminate contention on the spinlock
protecting ProcArrayLock, but I'm not sure how much overhead it would
add in other cases.  In particular, if we have to retake the snapshot
more than very occasionally, it's not going to be good.
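
Roughly like this, as a sketch (snapshotGeneration is a hypothetical shared
counter, and the barriers are written in the style of the
pg_read_barrier()/pg_write_barrier() macros):

/* Hypothetical shared counter, incremented twice per ProcArrayEndTransaction(). */
extern volatile uint64 snapshotGeneration;

/* Writer side, called while holding ProcArrayLock in exclusive mode. */
static void
AdvertiseTransactionEnd(void)
{
    snapshotGeneration++;               /* counter goes odd: update in progress */
    pg_write_barrier();
    /* ... clear our XID / xmin in the proc array here ... */
    pg_write_barrier();
    snapshotGeneration++;               /* counter even again: update complete */
}

/*
 * Reader side: try to take the snapshot without ProcArrayLock.  Returns
 * false if a writer was active when we started, or finished while we were
 * scanning; the caller then falls back to taking ProcArrayLock in shared
 * mode and retaking the snapshot.
 */
static bool
TryLockFreeSnapshot(Snapshot snapshot)
{
    uint64      before = snapshotGeneration;

    if (before & 1)
        return false;                   /* odd: somebody is mid-update */

    pg_read_barrier();
    /* ... scan the proc array and fill in the snapshot ... */
    pg_read_barrier();

    return snapshotGeneration == before;
}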

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Avoiding repeated snapshot computation

From: Pavan Deolasee
Date:
On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Nov 26, 2011 at 10:52 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>> What we can do is when a transaction comes to compute its snapshot, it
>> checks if some other transaction is already computing a snapshot for
>> itself. If so, it just sleeps on the lock. When the other process
>> finishes computing the snapshot, it saves the snapshot in a shared
>> area and wakes up all processes waiting for the snapshot. All those
>> processes then just copy the snapshot from the shared area and they
>> are done. This will not only reduce the total CPU consumption by
>> avoiding repetitive work, but would also reduce the total time for
>> which ProcArrayLock is held in SHARED mode by avoiding a pipeline of
>> GetSnapshotData calls. I am currently trying the shared work queue
>> mechanism to implement this, but I am sure we can do this in some
>> other way too.
>
> I don't quite understand how this is going to work.

I will try, keeping it simple.

> Transaction A
> ends, invalidating the shared snapshot.  Now transactions B, C, and D
> come along wishing to take snapshots.  B begins taking a snapshot, so
> C and D wait (blech!)

Yeah, C and D wait. But that's better than concurrently doing the same
computation.

> for that to be complete.  Then, they start
> copying the snapshot.

And they are holding the ProcArrayLock in shared mode.

> Meanwhile, transaction E ends, invalidating the
> shared snapshot,

E can't end until B, C and D are done with copying the snapshot.

> and then transaction F comes along, wanting to take a
> snapshot.  If F constructs a new shared snapshot, then doesn't that
> overwrite the same memory area C and D were busy copying?  It seems
> like F needs some way to know that C and D are done with the old
> shared snapshot before it starts writing a new one.

Right. And that can easily be achieved by holding a shared lock on the
ProcArray. The snapshot is invalidated while holding the exclusive lock
and set/copied while holding the shared lock. I am assuming setting a
boolean (valid/invalid) can safely be done with a shared lock. But I may
be wrong.

>  Furthermore, it's
> hard to understand how this could be a net improvement in the general
> case, because now both B and F are copying everything twice (once to
> the shared area and once to backend-local memory) instead of just once
> (to backend-local memory) and C and D are sleeping (yikes!).

Yes, B and F pay a price of double copy. But I think it can be a net
saving because C and D (and many more hopefully) don't need to
recompute the snapshot again by going over a potentially large
ProcArray. So, as I demonstrated in the illustration, the total time
for which ProcArrayLock is held will be significantly less with a large
number of clients.

>  That
> could maybe be a gain at very high concurrency when spinlock
> contention is eating us alive, but it doesn't seem like a good idea in
> general.  Maybe I'm missing something; did you mean to attach a patch?
>

I have a patch that I am attaching. It contains some unrelated changes
(don't know why; maybe I took a diff against some other branch), but you
will get the idea. This needs improvement though, because it just
checks if a valid shared snapshot is available and copies that. If
not, it goes ahead and computes one for itself. We need something more
intelligent to know that a snapshot computation is in progress and that
we should wait instead of building our own. This patch had shown a
15-20% improvement on the HP box that we are using for the benchmarks.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Attachments

Re: Avoiding repeated snapshot computation

From: Tom Lane
Date:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Furthermore, it's
>> hard to understand how this could be a net improvement in the general
>> case, because now both B and F are copying everything twice (once to
>> the shared area and once to backend-local memory) instead of just once
>> (to backend-local memory) and C and D are sleeping (yikes!).

> Yes, B and F pay a price of double copy. But I think it can be a net
> saving because C and D (and many more hopefully) don't need to
> recompute the snapshot again by going over a potentially large
> ProcArray.

Like Robert, I'm finding it hard to believe that this isn't going to be
a net pessimization in as many or more cases as it'd be a win.  Also,
didn't we just get rid of most of the issue of "going over a large
ProcArray" with the move of hot members to a separate array?

In the end, getting a snapshot is exactly about copying information out
of shared memory.  Perhaps it would be more productive to think about
how we could further modify the ProcArray representation to reduce the
impedance mismatch between it and the snapshot representation, rather
than introducing a second shared-memory copy of the same information.
        regards, tom lane


Re: Avoiding repeated snapshot computation

From: Andres Freund
Date:
Hi,

On Saturday, November 26, 2011 04:52:50 PM Pavan Deolasee wrote:
> I think now that we have reduced the run time of the function itself,
> we should now try to reduce the number of times the function is
> called. Robert proposed a way to reduce the number of calls per
> transaction. I think we can go one more step further and reduce the
> number for across the transactions.
You could also try if it makes a difference reducing SnapshotData to one 
instead of two cachelines. The data itself fits into one without problems.
Trivial patch attached.
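
To illustrate the kind of reordering meant here, with made-up fields (this is
not the actual SnapshotData layout): on a 64-bit ABI every pointer forces
8-byte alignment, so interleaving 4-byte counters with pointers leaves 4-byte
holes, while grouping the pointers together packs the same members more
tightly:

/* Illustrative only; the field list is invented. */
struct snapshot_padded
{
    uint32          xcnt;               /* 4 bytes + 4 bytes of padding */
    TransactionId  *xip;                /* 8 bytes */
    int32           subxcnt;            /* 4 bytes + 4 bytes of padding */
    TransactionId  *subxip;             /* 8 bytes */
    uint32          curcid;             /* 4 bytes + 4 bytes trailing padding */
};                                      /* 40 bytes */

struct snapshot_packed
{
    TransactionId  *xip;                /* 8 bytes */
    TransactionId  *subxip;             /* 8 bytes */
    uint32          xcnt;               /* 4 bytes */
    int32           subxcnt;            /* 4 bytes */
    uint32          curcid;             /* 4 bytes + 4 bytes trailing padding */
};                                      /* 32 bytes */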

Generally I think we should check this for the more commonly used
structures; we have many with too much padding.

Andres

Re: Avoiding repeated snapshot computation

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> You could also try if it makes a difference reducing SnapshotData to one 
> instead of two cachelines. The data itself fits into one without problems.
> Trivial patch attached.

On what grounds do you argue that this patch gets SnapshotData into one
cacheline today (and on what hardware), or that it will continue to do
so in the future?  If we're this close to the edge then any future
addition to the struct will destroy the point.  I'd just as soon keep
the fields in a logical order.
        regards, tom lane


Re: Avoiding repeated snapshot computation

From: Andres Freund
Date:
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > You could also try if it makes a difference reducing SnapshotData to one
> > instead of two cachelines. The data itself fits into one without
> > problems. Trivial patch attached.
> On what grounds do you argue that this patch gets SnapshotData into one
> cacheline today (and on what hardware), or that it will continue to do
> so in the future?  If we're this close to the edge then any future
> addition to the struct will destroy the point.  I'd just as soon keep
> the fields in a logical order.
To my knowledge there is no current and supported (i.e. I don't count Itanium) 
hardware that doesn't have 64bit cachelines except in the embedded market.
I also don't see a movement towards 128bit cpus or anything else requiring 
bigger pointers.
If platforms with 128bit cachelines or such would gain popularity - nothing 
would be lost by that change.
Which change are you "afraid" of?

Now the holes I talked about obviously were on a 64bit machine. But there are
a few situations where improving the layout so it fits with 8-byte alignment
for pointers makes the situation worse for 32bit. Looking a bit carefully at
the changes one makes should catch those problems.

Also it's not *that* close to overflowing again. It was 72 bytes before, now
it's 56 on a 64bit x86 machine with the Linux ABI.

I agree that logical order is important. I don't propose changing all structs
in pg mechanically.

It's sad that there is no sensible method to let the compiler safely reorder
struct members across compilers (and compiler versions) and platforms...

Andres


Re: Avoiding repeated snapshot computation

From: Andres Freund
Date:
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> I'd just as soon keep the fields in a logical order.
Btw, I don't think the new order is necessarily worse than the old one.

Andres


Re: Avoiding repeated snapshot computation

From: Robert Haas
Date:
On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
> On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
>> I'd just as soon keep the fields in a logical order.
> Btw, I don't think the new order is necessarily worse than the old one.

You forgot to attach the benchmark results.

My impression is that cache lines on modern hardware are 64 or 128
*bytes*, in which case this wouldn't matter a bit.

But testing is even better than guessing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Avoiding repeated snapshot computation

From: Andres Freund
Date:
On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
> > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> >> I'd just as soon keep the fields in a logical order.
> > 
> > Btw, I don't think the new order is necessarily worse than the old one.
> 
> You forgot to attach the benchmark results.
> 
> My impression is that cache lines on modern hardware are 64 or 128
> *bytes*, in which case this wouldn't matter a bit.
All current x86 cpus use 64bytes. The 2nd 128bit reference was a typo. Sorry 
for that.
And why is 72=>56 *bytes* (I even got that one right) not relevant for 64byte 
cachelines?

And yea. I didn't add benchmark results. I don't think I *have* to do that 
when making suggestions to somebody trying to improve something specific. I 
also currently don't have hardware where I can sensibly run at a high enough 
concurrency to see that GetSnapshotData takes ~40% of runtime.
Additional cacheline references around synchronized access can hurt to my 
knowledge...

Andres


Re: Avoiding repeated snapshot computation

From: Andres Freund
Date:
On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
> > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> >> I'd just as soon keep the fields in a logical order.
> > 
> > Btw, I don't think the new order is necessarily worse than the old one.
> 
> You forgot to attach the benchmark results.
> 
> My impression is that cache lines on modern hardware are 64 or 128
> *bytes*, in which case this wouldn't matter a bit.
> 
> But testing is even better than guessing.
Being prodded like that I ran a very quick benchmark on my workstation. 
Unfortunately that means I cannot work during the time which is why I kept it 
rather short...

That machine has 2 E5520@2.27GHz which means 2(sockets) * 4(cores) * 
2(threads) and 24GB of ram.

Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20 pgbench


pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60

origsnap:       92825.743958 93145.110901 93389.915474 93175.482351
reordersnap:    93560.183272 93613.333494 93495.263012 93523.368489

pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 60

origsnap:       81846.743329 81545.175672 81702.755226 81576.576435 
81228.154119 81546.047708 81421.436262
reordersnap:    81823.479196 81787.784508 81820.242145 81790.263415 
81762.421592 81496.333144 81732.088876

At that point I noticed I had accidentally run with a nearly stock config... 
An even shorter run with a more appropriate config yielded:


pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 20

origsnap:       102234.664020 102003.449741 102119.509053 101722.410387 
101973.651318 102056.440561
reordersnap:    103444.877879 103385.888808 103302.318923 103372.659486 
103330.157612 103313.833821



Looks like a win to me. Even on this comparatively small machine.

Andres


Re: Avoiding repeated snapshot computation

From: "Kevin Grittner"
Date:
Andres Freund  wrote:
> All current x86 cpus use 64bytes.
That matches what I found in recent research on the topic.
-Kevin



Re: Avoiding repeated snapshot computation

From: Gurjeet Singh
Date:
On Sat, Nov 26, 2011 at 6:51 PM, Andres Freund <andres@anarazel.de> wrote:
On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
> > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> >> I'd just as soon keep the fields in a logical order.
> >
> > Btw, I don't think the new order is necessarily worse than the old one.
>
> You forgot to attach the benchmark results.
>
> My impression is that cache lines on modern hardware are 64 or 128
> *bytes*, in which case this wouldn't matter a bit.
>
> But testing is even better than guessing.
Being prodded like that I ran a very quick benchmark on my workstation.
Unfortunately that means I cannot work during the time which is why I kept it
rather short...

That machine has 2 E5520@2.27GHz which means 2(sockets) * 4(cores) *
2(threads) and 24GB of ram.

Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20 pgbench


pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60

origsnap:       92825.743958 93145.110901 93389.915474 93175.482351
reordersnap:    93560.183272 93613.333494 93495.263012 93523.368489

pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 60

origsnap:       81846.743329 81545.175672 81702.755226 81576.576435
81228.154119 81546.047708 81421.436262
reordersnap:    81823.479196 81787.784508 81820.242145 81790.263415
81762.421592 81496.333144 81732.088876

At that point I noticed I had accidentally run with a nearly stock config...
An even shorter run with a more appropriate config yielded:


pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 20

origsnap:       102234.664020 102003.449741 102119.509053 101722.410387
101973.651318 102056.440561
reordersnap:    103444.877879 103385.888808 103302.318923 103372.659486
103330.157612 103313.833821



Looks like a win to me. Even on this comparatively small machine.

This may not be necessary, but can you please share the modified config you used for the last run.

I tabulated your last results to make it more readable, and added columns to show the improvement.
 
origsnap         reordersnap      diff           %age improvement
------------------------------------------------------------------
102234.66402     103444.877879    1210.213859    1.1837607827
102003.449741    103385.888808    1382.439067    1.3552865815
102119.509053    103302.318923    1182.80987     1.1582604352
101722.410387    103372.659486    1650.249099    1.6223063263
101973.651318    103330.157612    1356.506294    1.3302517626
102056.440561    103313.833821    1257.39326     1.2320567454

That looks like a win to me too. We're getting a little over 1% improvement for free!

Maybe submitting this patch to the commitfest might help get some serious consideration from a reviewer.

Regards,
--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: Avoiding repeated snapshot computation

From: Andres Freund
Date:
Hi Gurjeet,

On Monday, November 28, 2011 08:55:28 PM Gurjeet Singh wrote:
> On Sat, Nov 26, 2011 at 6:51 PM, Andres Freund <andres@anarazel.de> wrote:
> > On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> > > On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de>
> > 
> > wrote:
> > > > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> > > >> I'd just as soon keep the fields in a logical order.
> > > > 
> > > > Btw, I don't think the new order is necessarily worse than the old
> > > > one.
> > > 
> > > You forgot to attach the benchmark results.
> > > 
> > > My impression is that cache lines on modern hardware are 64 or 128
> > > *bytes*, in which case this wouldn't matter a bit.
> > > 
> > > But testing is even better than guessing.
> > 
> > Being prodded like that I ran a very quick benchmark on my workstation.
> > Unfortunately that means I cannot work during the time which is why I
> > kept it
> > rather short...
> > 
> > That machine has 2 E5520@2.27GHz which means 2(sockets) * 4(cores) *
> > 2(threads) and 24GB of ram.
> > 
> > Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20
> > pgbench
> > 
> > 
> > pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60
> > ...
> > Looks like a win to me. Even on this comparatively small machine.
> This may not be necessary, but can you please share the modified config you
> used for the last run.
Can't right now because I don't have access from here, but I will do so. I
don't think it's particularly interesting though.

> I tabulated your last results to make it more readable, and added columns
> to show the improvement.
> 
> origsnap         reordersnap      diff           %age improvement
> ------------------------------------------------------------------
> 102234.66402     103444.877879    1210.213859    1.1837607827
> 102003.449741    103385.888808    1382.439067    1.3552865815
> 102119.509053    103302.318923    1182.80987     1.1582604352
> 101722.410387    103372.659486    1650.249099    1.6223063263
> 101973.651318    103330.157612    1356.506294    1.3302517626
> 102056.440561    103313.833821    1257.39326     1.2320567454
> 
> That looks like a win to me too. We're getting a little over 1% improvement
> for free!
At least if we can convince Tom that such a change would be ok ;)

I would like to see somebody running a benchmark on a machine with higher 
concurrency...

> Maybe submitting this patch to the commitfest might help get some serious
> consideration from a reviewer.
It wasn't actually meant as a patch, just a comment, but well ;). Will
add it.

Andres


Re: Avoiding repeated snapshot computation

From: "Kevin Grittner"
Date:
Andres Freund  wrote:
> I would like to see somebody running a benchmark on a machine with
> higher concurrency...
Yeah, me too.  I don't find it at all hard to believe that we will
see significant performance benefit by changing the PGPROC structure
so that all parts of it can be accessed through one cache line rather
than two.  The fact that we saw improvement when changing it down to
two, even though there is extra indirection in some places, seems
like pretty solid evidence on the face of it.

I went in to the office on Sunday to try to get a few hours of
benchmarks for this on our new monster machine, but the DBA
responsible for putting it into production was on it first, loading
it with production data.  At this point it will get really hard to
schedule any more of the 20-hour runs that were the basis of most of
my recent numbers, but I may be able to shut down production use for
a two or three hour window on a weekend to run an abbreviated set,
focusing on the higher concurrency levels.  Ideally that would be on
top of the other pending performance patches.

Based on my earlier rounds of benchmarking, I expect that this
approach will show the greatest benefit at the higher concurrency
levels, where cache lines are most stressed.
-Kevin


Re: Avoiding repeated snapshot computation

From: Pavan Deolasee
Date:
On Tue, Nov 29, 2011 at 8:30 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Andres Freund  wrote:
>
>> I would like to see somebody running a benchmark on a machine with
>> higher concurrency...
>
> Yeah, me too.  I don't find it at all hard to believe that we will
> see significant performance benefit by changing the PGPROC structure
> so that all parts of it can be accessed through one cache line rather
> than two.  The fact that we saw improvement when changing it down to
> two, even though there is extra indirection in some places, seems
> like pretty solid evidence on the face of it.
>

I think there is a fundamental difference between the PGPROC patch that
we did and the rearranging of SnapshotData members that Andres has
proposed. The PGPROC structure is a shared memory area, very heavily
accessed by GetSnapshotData (and some others). There was a linear
increase in the accesses as the number of clients goes up. The SnapshotData
structure is mostly in backend-local memory (unless we do
something like what I suggested in the OP) and hence fitting it in a
single cache line may or may not have much impact. We don't even
guarantee that it starts at a cacheline boundary, which means in most
cases it will still be spread across multiple cache lines.

Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: Avoiding repeated snapshot computation

From: Pavan Deolasee
Date:
On Sun, Nov 27, 2011 at 12:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pavan Deolasee <pavan.deolasee@gmail.com> writes:
>> On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Furthermore, it's
>>> hard to understand how this could be a net improvement in the general
>>> case, because now both B and F are copying everything twice (once to
>>> the shared area and once to backend-local memory) instead of just once
>>> (to backend-local memory) and C and D are sleeping (yikes!).
>
>> Yes, B and F pay a price of double copy. But I think it can be a net
>> saving because C and D (and many more hopefully) don't need to
>> recompute the snapshot again by going over a potentially large
>> ProcArray.
>
> Like Robert, I'm finding it hard to believe that this isn't going to be
> a net pessimization in as many or more cases as it'd be a win.  Also,
> didn't we just get rid of most of the issue of "going over a large
> ProcArray" with the move of hot members to a separate array?
>

Yeah, separating the PGPROC members has helped a lot. But that does
not reduce the number of GetSnapshotData calls. It only makes each
call less expensive. As I said, I had seen 15-20% improvement with not
even a slightly tuned patch, so I am optimistic about the potential of
the approach.

> In the end, getting a snapshot is exactly about copying information out
> of shared memory.  Perhaps it would be more productive to think about
> how we could further modify the ProcArray representation to reduce the
> impedance mismatch between it and the snapshot representation, rather
> than introducing a second shared-memory copy of the same information.
>

I think that's a good idea. We need a representation that needs minimum
processing to derive the snapshot. One idea is to maintain a bitmap of
currently running transactions in shared memory. The size of the
bitmap has to be significantly larger than the ProcArray itself
because there could be gaps. But even then, since it's just a bit per
XID, we can have a bitmap to represent a window of, say, 16K or 32K
consecutive XIDs. We can also have a summary bitmap to quickly find if
there is at least one running transaction in any given chunk. If there
are long-running transactions which don't fit in the bitmap, we can
track them separately in an array. A snapshot is then just a copy of
this bitmap along with some additional information.
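
To make that concrete, a very rough sketch of the shared data structure (all
names are made up; none of this exists today):

/* Hypothetical sketch only. */
#define XID_WINDOW_SIZE     32768               /* consecutive XIDs tracked */
#define BITS_PER_WORD       64
#define WINDOW_WORDS        (XID_WINDOW_SIZE / BITS_PER_WORD)

typedef struct XidWindow
{
    TransactionId   baseXid;                    /* XID mapped to bit 0 */
    uint64          running[WINDOW_WORDS];      /* 1 = XID still in progress */
    uint64          summary[WINDOW_WORDS / 64]; /* 1 = that 64-XID chunk is non-empty */
    TransactionId   overflow[64];               /* long-running XIDs outside the window */
    int             noverflow;
} XidWindow;

/* Mark an XID as in progress (assumes xid falls inside the window). */
static void
XidWindowSet(XidWindow *w, TransactionId xid)
{
    uint32      off = xid - w->baseXid;

    w->running[off / BITS_PER_WORD] |= UINT64CONST(1) << (off % BITS_PER_WORD);
    w->summary[off / (BITS_PER_WORD * BITS_PER_WORD)] |=
        UINT64CONST(1) << ((off / BITS_PER_WORD) % BITS_PER_WORD);
}

Taking a snapshot would then amount to copying the bitmaps and the overflow
array, plus xmin/xmax.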

Will try to think more about it and see if I can test this idea. But
any comments welcome.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: Avoiding repeated snapshot computation

From: Bruce Momjian
Date:
Pavan Deolasee wrote:
> On Sun, Nov 27, 2011 at 12:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> >> On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >>> Furthermore, it's
> >>> hard to understand how this could be a net improvement in the general
> >>> case, because now both B and F are copying everything twice (once to
> >>> the shared area and one to backend-local memory) instead of just once
> >>> (to backend-local memory) and C and D are sleeping (yikes!).
> >
> >> Yes, B and F pay a price of double copy. But I think it can be a net
> >> saving because C and D (and many more hopefully) don't need to
> >> recompute the snapshot again by going over a potentially large
> >> ProcArray.
> >
> > Like Robert, I'm finding it hard to believe that this isn't going to be
> > a net pessimization in as many or more cases as it'd be a win.  Also,
> > didn't we just get rid of most of the issue of "going over a large
> > ProcArray" with the move of hot members to a separate array?
> >
> 
> Yeah, separating the PGPROC members has helped a lot. But that does
> not reduce the number of GetSnapshotData calls. It only makes each
> call less expensive. As I said, I had seen 15-20% improvement with not
> even a slightly tuned patch, so I am optimistic about the potential of
> the approach.

Agreed.  I think there is great potential here.  We have been stumped at
how to reduce the overhead of this for years, and it seems you are now
on a promising path.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Avoiding repeated snapshot computation

From: Andres Freund
Date:
On Tuesday, November 29, 2011 05:51:40 AM Pavan Deolasee wrote:
> On Tue, Nov 29, 2011 at 8:30 AM, Kevin Grittner
> 
> <Kevin.Grittner@wicourts.gov> wrote:
> > Andres Freund  wrote:
> >> I would like to see somebody running a benchmark on a machine with
> >> higher concurrency...
> > 
> > Yeah, me too.  I don't find it at all hard to believe that we will
> > see significant performance benefit by changing the PGPROC structure
> > so that all parts of it can be accessed through one cache line rather
> > than two.  The fact that we saw improvement when changing it down to
> > two, even though there is extra indirection in some places, seems
> > like pretty solid evidence on the face of it.
> 
> I think there is a fundamental difference between the PGPROC patch that
> we did and the rearranging of SnapshotData members that Andres has
> proposed. The PGPROC structure is a shared memory area, very heavily
> accessed by GetSnapshotData (and some others). There was a linear
> increase in the accesses as the number of clients goes up. The SnapshotData
> structure is mostly in backend-local memory (unless we do
> something like what I suggested in the OP) and hence fitting it in a
> single cache line may or may not have much impact. We don't even
> guarantee that it starts at a cacheline boundary, which means in most
> cases it will still be spread across multiple cache lines.
Well, I could measure a ~1.3% benefit on a modestly concurrent machine (see
the earlier reformatted mail from Gurjeet) and the benefits were definitely
smaller without concurrency. But I agree - the benefits won't be even remotely
as big as with the PGPROC splitup patch.
You're also right about the alignment not being predictable enough.

Perhaps pg should start using/introducing a force_cacheline_align macro which
can easily be implemented on most compilers? Especially for SnapshotData and
friends, which are allocated statically, that should be good enough.

Something along the lines of

#if defined(__GNUC__)
#define force_align_to(alignment) __attribute__((aligned(alignment)))
#elif defined(_MSC_VER)
#define force_align_to(alignment) __declspec(align(alignment))
#else
#define force_align_to(alignment)
#endif

#define CACHELINE_ALIGNMENT 64
#define force_cacheline_align force_align_to(CACHELINE_ALIGNMENT)

Looks complete enough for now.
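
Hypothetical usage, just to show where such a macro could go (GCC-style
trailing attribute placement; MSVC's __declspec form would have to appear
earlier in the declaration):

/* Illustration only: pin the statically allocated snapshots to a cacheline. */
static SnapshotData CurrentSnapshotData force_cacheline_align;
static SnapshotData SecondarySnapshotData force_cacheline_align;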


Back to PGPROC:

Looking at the current PGPROC layout on x86-64 with linux abi (generated by 
pahole):

struct PGPROC {
        SHM_QUEUE                  links;                /*     0    16 */
        PGSemaphoreData            sem;                  /*    16     8 */
        int                        waitStatus;           /*    24     4 */
        Latch                      procLatch;            /*    28    12 */
        LocalTransactionId         lxid;                 /*    40     4 */
        int                        pid;                  /*    44     4 */
        int                        pgprocno;             /*    48     4 */
        BackendId                  backendId;            /*    52     4 */
        Oid                        databaseId;           /*    56     4 */
        Oid                        roleId;               /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        bool                       recoveryConflictPending; /*    64     1 */
        bool                       lwWaiting;            /*    65     1 */
        bool                       lwExclusive;          /*    66     1 */

        /* XXX 5 bytes hole, try to pack */

        struct PGPROC *            lwWaitLink;           /*    72     8 */
        LOCK *                     waitLock;             /*    80     8 */
        PROCLOCK *                 waitProcLock;         /*    88     8 */
        LOCKMODE                   waitLockMode;         /*    96     4 */
        LOCKMASK                   heldLocks;            /*   100     4 */
        XLogRecPtr                 waitLSN;              /*   104     8 */
        int                        syncRepState;         /*   112     4 */

        /* XXX 4 bytes hole, try to pack */

        SHM_QUEUE                  syncRepLinks;         /*   120    16 */
        /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
        SHM_QUEUE                  myProcLocks[16];      /*   136   256 */
        /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
        struct XidCache            subxids;              /*   392   256 */
        /* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
        LWLockId                   backendLock;          /*   648     4 */

        /* XXX 4 bytes hole, try to pack */

        uint64                     fpLockBits;           /*   656     8 */
        Oid                        fpRelId[16];          /*   664    64 */
        /* --- cacheline 11 boundary (704 bytes) was 24 bytes ago --- */
        bool                       fpVXIDLock;           /*   728     1 */

        /* XXX 3 bytes hole, try to pack */

        LocalTransactionId         fpLocalTransactionId; /*   732     4 */

        /* size: 736, cachelines: 12, members: 28 */
        /* sum members: 720, holes: 4, sum holes: 16 */
        /* last cacheline: 32 bytes */
};

It seems possible that reordering it to avoid sharing unrelated stuff on the
same cacheline might be beneficial. E.g. moving the fastpath lock stuff away
from the more common lwlock infrastructure might be a good idea. Currently
those lines will bounce continually even though the fastpath stuff will mostly
be accessed locally.
Subxids and syncRepLinks look like good candidates too.


Anyway, sorry for the topic drift.

Andres


Re: Avoiding repeated snapshot computation

From: Andres Freund
Date:
Hi,

On Monday, November 28, 2011 08:55:28 PM Gurjeet Singh wrote:
> This may not be necessary, but can you please share the modified config you
> used for the last run.
I just copied the .conf I had for some independent development.

max_connections = 100
listen_addresses = ''
port=5432
shared_buffers = 2GB
temp_buffers = 64MB
work_mem = 96MB
maintenance_work_mem = 1GB
effective_cache_size = 20GB

log_line_prefix = '%d %a %c %v %x %m '            # special values:
log_statement = 'ddl'

#decreases variance
track_activities = off
track_counts = off
autovacuum = off

update_process_title = off
logging_collector = off
log_destination = stderr
log_min_messages = error
log_checkpoints = on
log_autovacuum_min_duration=0
synchronous_commit = off

checkpoint_segments = 30
checkpoint_timeout = 30min
checkpoint_completion_target = 0.8
log_error_verbosity = verbose

But about the only thing really important is the shared_buffers difference 
(commenting it out nearly yielded the former speed)

Andres


Re: Avoiding repeated snapshot computation

From: Ants Aasma
Date:
On Tue, Nov 29, 2011 at 7:12 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> I think that's a good idea. We need a representation that needs minimum
> processing to derive the snapshot.

I was looking over the generated code for GetSnapshotData to see if there
is any low hanging fruit for micro-optimization. The assembly mostly looks
pretty tight, but there are 3 function calls to TransactionIdPrecedes and
TransactionIdFollowsOrEquals. All the parameters are known to be normal
xids, so there are duplicated checks for that and a lot of movs for the calling
convention. I wonder if replacing them with special case macros would be
a good idea. In that case the whole check will compile down to one cmp
instruction. I'm running a set of benchmarks now on my laptop, but I guess
the difference will mostly become noticeable on beefier hardware when
ProcArray lock is heavily contended. Attached is a patch, if anyone wishes
to give it a go.
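
For reference, the special case boils down to something like this (sketch;
the macro name is made up, the real thing is in the attachment):

/*
 * Hypothetical sketch: for two XIDs that are both known to be normal
 * (assigned, non-special), TransactionIdPrecedes() reduces to a single
 * wraparound-aware comparison.
 */
#define NormalTransactionIdPrecedes(id1, id2) \
    (AssertMacro(TransactionIdIsNormal(id1) && TransactionIdIsNormal(id2)), \
     (int32) ((id1) - (id2)) < 0)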

--
Ants Aasma

Attachments

Re: Avoiding repeated snapshot computation

From: Robert Haas
Date:
On Sat, Nov 26, 2011 at 5:49 PM, Andres Freund <andres@anarazel.de> wrote:
> On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
>> On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
>> >> I'd just as soon keep the fields in a logical order.
>> >
>> > Btw, I don't think the new order is necessarily worse than the old one.
>>
>> You forgot to attach the benchmark results.
>>
>> My impression is that cache lines on modern hardware are 64 or 128
>> *bytes*, in which case this wouldn't matter a bit.
> All current x86 cpus use 64bytes. The 2nd 128bit reference was a typo. Sorry
> for that.
> And why is 72=>56 *bytes* (I even got that one right) not relevant for 64byte
> cachelines?

OK, so I got around to looking at this again today; sorry for the
delay.  I agree that 72 -> 56 bytes is very relevant for 64-byte cache
lines.  I had not realized the structure was as big as that.

> And yea. I didn't add benchmark results. I don't think I *have* to do that
> when making suggestions to somebody trying to improve something specific. I
> also currently don't have hardware where I can sensibly run at a high enough
> concurrency to see that GetSnapshotData takes ~40% of runtime.
> Additional cacheline references around synchronized access can hurt to my
> knowledge...

I tested this on Nate Boley's 32-core box today with 32 clients
doing a select-only pgbench test.  Results of individual 5-minute
runs:

results.master.32:tps = 171701.947919 (including connections establishing)
results.master.32:tps = 227777.430112 (including connections establishing)
results.master.32:tps = 218257.478461 (including connections establishing)
results.master.32:tps = 226425.964855 (including connections establishing)
results.master.32:tps = 218687.662373 (including connections establishing)
results.master.32:tps = 219819.451605 (including connections establishing)
results.master.32:tps = 216800.131634 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 181997.531357 (including
connections establishing)
results.snapshotdata-one-cacheline.32:tps = 171505.145440 (including
connections establishing)
results.snapshotdata-one-cacheline.32:tps = 226970.348605 (including
connections establishing)
results.snapshotdata-one-cacheline.32:tps = 169725.115763 (including
connections establishing)
results.snapshotdata-one-cacheline.32:tps = 219548.690522 (including
connections establishing)
results.snapshotdata-one-cacheline.32:tps = 175440.705722 (including
connections establishing)
results.snapshotdata-one-cacheline.32:tps = 217154.173823 (including
connections establishing)

There's no statistically significant difference between those two data
sets; if anything, the results with the patch look like they might be
worse, although I believe that's an artifact - some runs seem to
mysteriously come out in the 170-180 range instead of the 215-225
range, with nothing in between.  But even if you only average the good
runs out of each set there's no material difference.

Having said that, I am coming around to the view that we should apply
this patch anyway, just because it reduces memory consumption.  Since
the size change crosses a power-of-two boundary, I believe it will
actually cut down the size of a palloc'd chunk for a SnapshotData
object from 128 bytes to 64 bytes.  I doubt we can measure the benefit
of that even on a microbenchmark unless someone has a better idea for
making PostgreSQL take ridiculous numbers of snapshots than I do.  It
still seems like a good idea, though: a penny saved is a penny earned.

In response to Tom's objections downthread, I don't think that the
new ordering is significantly more confusing than the old one.
xcnt/subxcnt/xip/subxip doesn't seem measurably less clear than
xcnt/xip/subxcnt/subxip, and we're not normally in the habit of
letting such concerns get in the way of squeezing out alignment
padding.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Avoiding repeated snapshot computation

From: Bruce Momjian
Date:
Did we ever make a decision on this patch?

---------------------------------------------------------------------------

On Sat, Nov 26, 2011 at 09:22:50PM +0530, Pavan Deolasee wrote:
> On some recent benchmarks and profile data, I saw GetSnapshotData
> figures at the very top or near the top. For a smaller number of
> clients, it can account for 10-20% of the time, but with more clients
> I have seen it take up as much as 40% of the sample time.
> Unfortunately, the machine on which I was running these tests is
> currently not available and so I don't have the exact numbers, but the
> general observation holds. Our recent work on separating the hot
> members of PGPROC into a separate array would definitely reduce data
> cache misses and reduce the GetSnapshotData time, but it probably
> still accounts for a large enough critical section for a highly
> contended lock.
> 
> I think now that we have reduced the run time of the function itself,
> we should now try to reduce the number of times the function is
> called. Robert proposed a way to reduce the number of calls per
> transaction. I think we can go one more step further and reduce the
> number of calls across transactions.
> 
> One major problem today could be the way LWLock works. If the lock is
> currently held in SHARED mode by some backend and some other backend
> now requests it in SHARED mode, it will immediately get it. That's
> probably the right thing to do, because you don't want the reader to
> really wait when the lock is readily available. But in the case of
> GetSnapshotData(), every reader is doing exactly the same thing; they
> are computing a snapshot based on the same shared state and would
> compute exactly the same snapshot (if we ignore the fact that we don't
> include the caller's XID in the xip array, but that's a minor detail).
> And because of the way LWLock works, more and more readers get in to
> compute the snapshot before the exclusive waiters get a window to
> sneak in, as more and more processes slowly start sleeping for
> exclusive access. To depict it, the four transactions below make
> overlapping calls to GetSnapshotData(), and hence the total critical
> section starts when the first caller enters it and ends when the last
> caller exits.
> 
> Txn1 ------[          SHARED        ]---------------------
> Txn2 --------[          SHARED        ]-------------------
> Txn3 -----------------[            SHARED        ]-------------
> Txn4 -------------------------------------------[           SHARED   ]---------
>      |<---------------Total Time ------------------------------------>|
> 
> A couple of ideas come to mind to solve this issue.
> 
> A snapshot once computed will remain valid for every call irrespective
> of its origin until at least one transaction ends. So we can store the
> last computed snapshot in some shared area and reuse it for all
> subsequent GetSnapshotData calls. The shared snapshot will get
> invalidated when some transaction ends by calling
> ProcArrayEndTransaction(). I tried this approach and saw a 15%
> improvement for 32-80 clients on the 32 core HP IA box with pgbench -s
> 100 -N tests. Not bad, but I think this can be improved further.
> 
> What we can do is when a transaction comes to compute its snapshot, it
> checks if some other transaction is already computing a snapshot for
> itself. If so, it just sleeps on the lock. When the other process
> finishes computing the snapshot, it saves the snapshot in a shared
> area and wakes up all processes waiting for the snapshot. All those
> processes then just copy the snapshot from the shared area and they
> are done. This will not only reduce the total CPU consumption by
> avoiding repetitive work, but would also reduce the total time for
> which ProcArrayLock is held in SHARED mode by avoiding a pipeline of
> GetSnapshotData calls. I am currently trying the shared work queue
> mechanism to implement this, but I am sure we can do this in some
> other way too.
> 
> Thanks,
> Pavan
> 
> -- 
> Pavan Deolasee
> EnterpriseDB     http://www.enterprisedb.com
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Avoiding repeated snapshot computation

From: Robert Haas
Date:
On Thu, Aug 16, 2012 at 9:02 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Did we ever make a decision on this patch?

I committed it as 1fc3d18faa8f4476944bc6854be0f7f6adf4aec8.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company