Re: spinlocks on HP-UX

From: Tom Lane
Subject: Re: spinlocks on HP-UX
Date:
Msg-id: 22039.1314573597@sss.pgh.pa.us
In response to: Re: spinlocks on HP-UX  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: spinlocks on HP-UX  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
I wrote:
> Yeah, I figured out that was probably what you meant a little while
> later.  I found a 64-CPU IA64 machine in Red Hat's test labs and am
> currently trying to replicate your results; report to follow.

OK, these results are on a 64-processor SGI IA64 machine (AFAICT, 64
independent sockets, no hyperthreading or any funny business); 124GB
in 32 NUMA nodes; running RHEL5.7, gcc 4.1.2.  I built today's git
head with --enable-debug (but not --enable-cassert) and ran with all
default configuration settings except shared_buffers = 8GB and
max_connections = 200.  The test database is initialized at -s 100.
I did not change the database between runs, but restarted the postmaster
and then did this to warm the caches a tad:

pgbench -c 1 -j 1 -S -T 30 bench

Per-run pgbench parameters are as shown below --- note in particular
that I assigned one pgbench thread per 8 backends.

The numbers are fairly variable even with 5-minute runs; I did each
series twice so you could get a feeling for how much.

Today's git head:

pgbench -c 1 -j 1 -S -T 300 bench    tps = 5835.213934 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8499.223161 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 15197.126952 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 30803.255561 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 65795.356797 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 81644.914241 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 40059.202836 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 21309.615001 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench    tps = 5787.310115 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8747.104236 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 14655.369995 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 28287.254924 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 61614.715187 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 79754.640518 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 40334.994324 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 23285.271257 (including ...

With modified TAS macro (see patch 1 below):

pgbench -c 1 -j 1 -S -T 300 bench    tps = 6171.454468 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8709.003728 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 14902.731035 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 29789.744482 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 59991.549128 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 117369.287466 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 112583.144495 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 110231.305282 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench    tps = 5670.097936 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8230.786940 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 14785.952481 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 29335.875139 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 59605.433837 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 108884.294519 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 110387.439978 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 109046.121191 (including ...

With unlocked test in s_lock.c delay loop only (patch 2 below):

pgbench -c 1 -j 1 -S -T 300 bench    tps = 5426.491088 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8787.939425 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 15720.801359 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 33711.102718 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 61829.180234 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 109781.655020 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 107132.848280 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 106533.630986 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench    tps = 5705.283316 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8442.798662 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 14423.723837 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 29112.751995 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 62258.984033 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 107741.988800 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 107138.968981 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 106110.215138 (including ...

So this pretty well confirms Robert's results, in particular that all of
the win from an unlocked test comes from using it in the delay loop.
Given the lack of evidence that a general change in TAS() is beneficial,
I'm inclined to vote against it, on the grounds that the extra test is
surely a loss at some level when there is no contention.
(IOW, +1 for inventing a second macro to use in the delay loop only.)
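
For concreteness, the second macro could be as thin as a wrapper that
does the unlocked test before falling through to TAS().  (The name
TAS_SPIN below is hypothetical; this is a sketch, not a committed API.)

/*
 * Hypothetical sketch: leave TAS() unchanged for the first, usually
 * uncontended, acquisition attempt; use this variant only for retries
 * inside the s_lock() delay loop.  The plain read of *lock avoids
 * issuing a locked instruction while the lock is visibly held.
 */
#define TAS_SPIN(lock)   (*(lock) ? 1 : TAS(lock))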

We ought to do similar tests on other architectures.  I found some
lots-o-processors x86_64 machines at Red Hat, but they don't seem to
own any PPC systems with more than 8 processors.  Anybody have big
iron with other non-Intel chips?
        regards, tom lane


Patch 1: change TAS globally, non-HPUX code:

*** src/include/storage/s_lock.h.orig    Sat Jan  1 13:27:24 2011
--- src/include/storage/s_lock.h    Sun Aug 28 13:32:47 2011
***************
*** 228,233 ****
--- 228,240 ----
  {
      long int    ret;
 
+     /*
+      * Use a non-locking test before the locking instruction proper.  This
+      * appears to be a very significant win on many-core IA64.
+      */
+     if (*lock)
+         return 1;
+ 
      __asm__ __volatile__(
          "    xchg4     %0=%1,%2    \n"
  :        "=r"(ret), "+m"(*lock)
***************
*** 243,248 ****
--- 250,262 ----
  {
      int         ret;
 
+     /*
+      * Use a non-locking test before the locking instruction proper.  This
+      * appears to be a very significant win on many-core IA64.
+      */
+     if (*lock)
+         return 1;
+ 
      ret = _InterlockedExchange(lock,1);    /* this is a xchg asm macro */
 
      return ret;

Patch 2: change s_lock only (same as Robert's quick hack):

*** src/backend/storage/lmgr/s_lock.c.orig    Sat Jan  1 13:27:09 2011
--- src/backend/storage/lmgr/s_lock.c    Sun Aug 28 14:02:29 2011
***************
*** 96,102 ****
      int         delays = 0;
      int         cur_delay = 0;
 
!     while (TAS(lock))
      {
          /* CPU-specific delay each time through the loop */
          SPIN_DELAY();
--- 96,102 ----
      int         delays = 0;
      int         cur_delay = 0;
 
!     while (*lock ? 1 : TAS(lock))
      {
          /* CPU-specific delay each time through the loop */
          SPIN_DELAY();
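
For anyone wanting to experiment outside the backend, here is a minimal
standalone sketch of the same test-and-test-and-set pattern, written
against GCC's __sync builtins rather than our TAS()/SPIN_DELAY() macros
(the names spin_acquire/spin_release are made up for illustration; this
shows the technique, not the backend code):

#include <stdio.h>

typedef volatile int slock;

static void
spin_acquire(slock *lock)
{
    /* __sync_lock_test_and_set is the locked exchange (cf. xchg4) */
    while (__sync_lock_test_and_set(lock, 1))
    {
        /*
         * Contended path: spin on a plain load.  All waiters then hold
         * the cache line in shared state, instead of each waiter's
         * locked exchange bouncing it around in exclusive mode.
         */
        while (*lock)
            ;               /* a real loop would also delay/back off */
    }
}

static void
spin_release(slock *lock)
{
    __sync_lock_release(lock);  /* store 0 with release semantics */
}

int
main(void)
{
    slock   lk = 0;

    spin_acquire(&lk);
    printf("lock acquired\n");
    spin_release(&lk);
    return 0;
}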

