Discussion: Question about LWLockAcquire's use of semaphores instead of spinlocks
Question about LWLockAcquire's use of semaphores instead of spinlocks
From
"Robert E. Bruccoleri"
Date:
On SGI multiprocessor machines, I suspect that a spinlock implementation of LWLockAcquire would give better performance than using IPC semaphores. Is there any specific reason that a spinlock could not be used in this context?

+-----------------------------+------------------------------------+
| Robert E. Bruccoleri, Ph.D. | email: bruc@acm.org                |
| P.O. Box 314                | URL:  http://www.congen.com/~bruc  |
| Pennington, NJ 08534        |                                    |
+-----------------------------+------------------------------------+
"Robert E. Bruccoleri" <bruc@stone.congenomics.com> writes:
> On SGI multiprocessor machines, I suspect that a spinlock
> implementation of LWLockAcquire would give better performance than
> using IPC semaphores. Is there any specific reason that a spinlock
> could not be used in this context?

Are you confusing LWLockAcquire with TAS spinlocks?

If you're saying that we don't have an implementation of TAS for SGI hardware, then feel free to contribute one. If you are wanting to replace LWLocks with spinlocks, then you are sadly mistaken, IMHO.

			regards, tom lane
Re: Question about LWLockAcquire's use of semaphores instead of spinlocks
From
"Robert E. Bruccoleri"
Date:
Tom Lane writes:
> "Robert E. Bruccoleri" <bruc@stone.congenomics.com> writes:
>> On SGI multiprocessor machines, I suspect that a spinlock
>> implementation of LWLockAcquire would give better performance than
>> using IPC semaphores. Is there any specific reason that a spinlock
>> could not be used in this context?
>
> Are you confusing LWLockAcquire with TAS spinlocks?

No.

> If you're saying that we don't have an implementation of TAS for
> SGI hardware, then feel free to contribute one. If you are wanting to
> replace LWLocks with spinlocks, then you are sadly mistaken, IMHO.

This touches on my question. Why am I mistaken? I don't understand.

BTW, about 5 years ago, I rewrote the TAS spinlocks for the SGI platform to make them work correctly. The current implementation is fine.
Re: Question about LWLockAcquire's use of semaphores instead of spinlocks
From
"Luis Alberto Amigo Navarro"
Date:
Hi Bob:

We have been working with an sproc version of postgres, and it has improved performance on a NUMA3 Origin 3000, because IRIX then implements round-robin memory placement by default instead of the first-touch placement it used with fork. We have also been wondering about replacing IPC shmem with a shared arena to help improve performance on IRIX. I don't know whether people here in postgres are interested in platform-specific ports, but it could help you improve your performance.

Regards

----- Original Message -----
From: "Robert E. Bruccoleri" <bruc@stone.congenomics.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: <bruc@acm.org>; <pgsql-hackers@postgresql.org>
Sent: Sunday, July 28, 2002 5:45 AM
Subject: Re: [HACKERS] Question about LWLockAcquire's use of semaphores instead of spinlocks

> This touches on my question. Why am I mistaken? I don't understand.
>
> BTW, about 5 years ago, I rewrote the TAS spinlocks for the
> SGI platform to make it work correctly. The current implementation
> is fine.
"Robert E. Bruccoleri" <bruc@stone.congenomics.com> writes:
> Tom Lane writes:
>> If you're saying that we don't have an implementation of TAS for
>> SGI hardware, then feel free to contribute one. If you are wanting to
>> replace LWLocks with spinlocks, then you are sadly mistaken, IMHO.

> This touches on my question. Why am I mistaken? I don't understand.

Because we just got done replacing spinlocks with LWLocks ;-). I don't believe that reverting that change will improve matters. It will certainly hurt on SMP machines, and I believe that it would at best be a breakeven proposition on uniprocessors. See the discussions last fall that led up to development of the LWLock mechanism.

The problem with TAS spinlocks is that they are suitable only for implementing locks that will be held for *very short* periods, ie, actual contention is rare. Over time we had allowed that mechanism to be abused for locking fairly large and complex shared-memory data structures (eg, the lock manager, the buffer manager). The next step up, a lock-manager lock, is very expensive and certainly can't be used by the lock manager itself anyway. LWLocks are an intermediate mechanism that is marginally more expensive than a spinlock but behaves much more gracefully in the presence of contention. LWLocks also allow us to distinguish shared and exclusive lock modes, thus further reducing contention in some cases.

BTW, now that I reread the title of your message, I wonder if you haven't just misunderstood what's happening in lwlock.c. There is no IPC semaphore call in the fast (no-contention) path of control. A semaphore call occurs only when we are forced to wait, ie, yield the processor. Substituting a spinlock for that cannot improve matters; it would essentially result in wasting the remainder of our timeslice in a busy-loop, rather than yielding the CPU at once to some other process that can get some useful work done.

			regards, tom lane
Re: Question about LWLockAcquire's use of semaphores instead of spinlocks
From
"Robert E. Bruccoleri"
Date:
Dear Tom,

Thank you for the explanation. I did not understand what was going on in lwlock.c.

My systems are all SGI Origins having between 8 and 32 processors, and I've been running PostgreSQL on them for about 5 years. These machines do provide a number of good mechanisms for high performance shared memory parallelism that I don't think are found elsewhere. I wish that I had the time to understand and tune PostgreSQL to run really well on them.

I have a question for you and other developers with regard to my SGI needs. If I made a functional Origin 2000 system available to you with hardware support, would the group be willing to tailor the SGI port for better performance?

Sincerely,
Bob

> BTW, now that I reread the title of your message, I wonder if you
> haven't just misunderstood what's happening in lwlock.c. There is no
> IPC semaphore call in the fast (no-contention) path of control. A
> semaphore call occurs only when we are forced to wait, ie, yield the
> processor.
Re: Question about LWLockAcquire's use of semaphores instead of spinlocks
From
"Robert E. Bruccoleri"
Date:
Dear Luis,

I would be very interested. Replacing the IPC shared memory with an arena makes a lot of sense. --Bob

> Hi Bob:
> We're been wondering about replacing IPC shmem with a shared arena to help
> performance improve on IRIX.
Re: Question about LWLockAcquire's use of semaphores instead of spinlocks
From
"Luis Alberto Amigo Navarro"
Date:
----- Original Message -----
From: "Robert E. Bruccoleri" <bruc@stone.congenomics.com>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: <bruc@acm.org>; <tgl@sss.pgh.pa.us>; <pgsql-hackers@postgresql.org>
Sent: Monday, July 29, 2002 2:48 AM
Subject: Re: [HACKERS] Question about LWLockAcquire's use of semaphores instead of spinlocks

> Dear Luis,
> I would be very interested. Replacing the IPC shared memory
> with an arena make a lot of sense. --Bob

On the old PowerChallenge postgres works really fine, but on the new NUMA architectures it works badly: as we have seen, forked backends don't allow IRIX to manage memory as would be desired. Keeping the First Touch placement algorithm means that almost all useful data is placed on the first node the process runs on. Trying to use more than one node with this scheme results in false sharing; the secondary cache hit ratio drops below 85%, because latency to a second node is about 6 times higher than to the first node, and it is even worse with more than 4 nodes. The net effect is that you end up working with essentially a single node (4 CPUs on an Origin 3000).

Implementing the Round-Robin placement algorithm means that memory pages are distributed one per node, so all nodes have the same chance to work with some pages locally and some pages remotely. The more nodes, the more advantage you can take of round-robin. You can enable round-robin by setting the environment variable _DSM_ROUND_ROBIN=TRUE before starting postgres; it works fine with fork(), and it is not necessary to use sprocs.

Changing IPC shared memory to a shared arena could improve performance because the arena is the native shared segment on IRIX. It's something we're willing to do, but for now it is only a project.

Hope it helps
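[Editor's note] A usage sketch of the setting Luis describes, assuming (as the message states) that IRIX's DSM library reads `_DSM_ROUND_ROBIN` from the environment at startup; on non-IRIX systems the variable is simply ignored, and the `pg_ctl` line is an illustrative placeholder:

```shell
# Enable IRIX round-robin page placement across NUMA nodes (instead of
# the default first-touch) before launching the postmaster.
export _DSM_ROUND_ROBIN=TRUE
echo "_DSM_ROUND_ROBIN=$_DSM_ROUND_ROBIN"
# pg_ctl -D "$PGDATA" start    # then start PostgreSQL as usual
```

Because the variable is inherited across fork(), every backend forked by the postmaster sees the same placement policy.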
Robert E. Bruccoleri wrote:
> Dear Tom,
> Thank you for the explanation. I did not understand what was
> going on in lwlock.c.

Yes, as Tom said, using the pre-7.2 code on SMP machines, if one backend had a spinlock, the other backend would TAS loop trying to get the lock until its timeslice ended or the other backend released the lock. Now, we TAS, then sleep on a semaphore and get woken up when the first backend releases the lock. We worked hard on that logic, I can tell you, and it was a huge discussion topic in the Fall of 2001.

> My systems are all SGI Origins having between 8 and 32
> processors, and I've been running PostgreSQL on them for about 5
> years. These machines do provide a number of good mechanisms for high
> performance shared memory parallelism that I don't think are found
> elsewhere. I wish that I had the time to understand and tune
> PostgreSQL to run really well on them.
> I have a question for you and other developers with regard to
> my SGI needs. If I made a functional Origin 2000 system available to
> you with hardware support, would the group be willing to tailor the
> SGI port for better performance?

We would have to understand how the SGI code is better than our existing code on SMP machines.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> We would have to understand how the SGI code is better than our existing
> code on SMP machines.

There is a big problem with postgres on SGI NUMA architectures. On UMA systems postgres works fine, but NUMA Origins need native shared memory management. It scales fine on the old Challenges, but scales very poorly on NUMA architectures, giving good speed-up only within a single node. With more than one node, throughput drops greatly. Implementing round-robin memory placement makes it a bit better, and changing from forks to native sprocs (medium-weight processes) makes it better still, but not good enough. If you want postgres to run well on these machines, I think (it's not tested yet) it would be necessary to implement native shared arenas instead of IPC shared memory, in order to let IRIX do proper load balancing.

I take advantage of this message to say that there are a couple of things we should add to an IRIX FAQ about using 32-bit vs. 64-bit objects: it is a known issue that 32-bit builds on IRIX cannot use more than 1.2 GB of shared memory, because the system is unable to find a single segment of that size.

Regards
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> writes:
> if you want postgres to run fine on this machines I think (it's not tested
> yet) it would be neccesary to implement native shared arenas instead of IPC
> shared memory in order to let IRIX make a fine load-balance.

In CVS tip, the direct dependencies on SysV shared-memory calls have been split into a separate file, src/backend/port/sysv_shmem.c. It would not be difficult to make a crude port to some other shared-memory API, if you want to do some performance testing.

A not-so-crude port would perhaps be more difficult. One thing that we depend on is being able to detect whether old backends are still running in a particular database cluster (this could happen if the postmaster itself crashes, leaving orphaned backends behind). Currently this is done by recording the shmem key in the postmaster's lockfile, and then during restart looking to see if any process is still attached to that shmem segment. So we are relying on the fact that SysV shmem segments (a) are not anonymous, and (b) accept a syscall to find out whether any other processes are attached to them. If the shared-memory API you want to use doesn't support similar capabilities, then there's a problem. You might be able to think of a different way to provide the same kind of interlock, though.

> I take advantage of this message to say that there is a cuple of things that
> we have to insert on FAQ-IRIX about using 32 bits or 64 bits objects,

Send a patch ;-)

			regards, tom lane
----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>
To: <bruc@acm.org>
Cc: "Tom Lane" <tgl@sss.pgh.pa.us>; <pgsql-hackers@postgresql.org>

> We would have to understand how the SGI code is better than our existing
> code on SMP machines.

I've been searching for data from SGI's Origin presentation to illustrate what I am saying. This graph only covers memory bandwidth, but bear in mind that as the distance between nodes increases, memory access latency also increases: