Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From: Andres Freund
Subject: Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)
Date:
Msg-id: 20150708125512.GL10242@alap3.anarazel.de
In reply to: Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)  (Andres Freund <andres@anarazel.de>)
Responses: Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)  (Andres Freund <andres@anarazel.de>)
Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs
On 2015-07-08 11:12:38 +0200, Andres Freund wrote:
> On 2015-07-07 21:13:04 -0400, Tom Lane wrote:
> > There is some discussion going on about improving the scalability of
> > snapshot acquisition, but nothing will happen in that line before 9.6
> > at the earliest.
>
> 9.5 should be less bad at it than 9.4, at least if it's mostly read-only
> ProcArrayLock acquisitions which sounds like it should be the case here.

test 3:
master:
1 clients: 3112.7
2 clients: 6806.7
4 clients: 13441.2
8 clients: 15765.4
16 clients: 21102.2

9.4:
1 clients: 2524.2
2 clients: 5903.2
4 clients: 11756.8
8 clients: 14583.3
16 clients: 19309.2

So there's an interesting "dip" between 4 and 8 clients. A perf profile
doesn't show any actual lock contention on master. That's not too
surprising; there shouldn't be any exclusive lock acquisitions here.
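The dip can be quantified by computing scaling efficiency relative to the
single-client throughput; a minimal sketch using the master numbers above:

```python
# Scaling efficiency of the "test 3" master numbers above,
# relative to perfect linear scaling from the 1-client result.
tps = {1: 3112.7, 2: 6806.7, 4: 13441.2, 8: 15765.4, 16: 21102.2}

for clients, rate in tps.items():
    # Values slightly above 1.0 at low client counts are plausible
    # with turbo boost raising single-core clocks.
    efficiency = rate / (clients * tps[1])
    print(f"{clients:2d} clients: {rate:8.1f} tps, {efficiency:.2f}x linear")

# Going from 4 to 8 clients only buys ~17% more throughput:
step = tps[8] / tps[4]
print(f"4 -> 8 clients speedup: {step:.2f}x")
```

Scaling is near-linear up to 4 clients (2 -> 4 is a 1.97x step), then falls
off a cliff at 8.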

One interesting thing to check in exactly such cases is Intel's turbo
boost. Disabling it (echo 0 >
/sys/devices/system/cpu/cpufreq/boost) gives these results:
test 3:
master:
1 clients: 2926.6
2 clients: 6634.3
4 clients: 13905.2
8 clients: 15718.9

So that's not it in this case.
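For reference, the sysfs knob depends on the cpufreq driver in use; a
sketch of both variants (paths assume a Linux kernel with the respective
driver loaded, and writing them needs root):

```shell
# With the acpi-cpufreq driver (as used above), 0 disables boost:
echo 0 > /sys/devices/system/cpu/cpufreq/boost

# With the intel_pstate driver the knob is inverted
# (writing 1 means "no turbo"):
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
```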

Comparing perf stats between the 4 and 8 client runs shows the following (removing boring data):

4 clients:
      90859.517328      task-clock (msec)         #    3.428 CPUs utilized
   109,655,985,749      stalled-cycles-frontend   #   54.27% frontend cycles idle     (27.79%)
    62,906,918,008      stalled-cycles-backend    #   31.14% backend  cycles idle     (27.78%)
   219,063,494,214      instructions              #    1.08  insns per cycle
                                                  #    0.50  stalled cycles per insn  (33.32%)
    41,664,400,828      branches                  #  458.558 M/sec                    (33.32%)
       374,426,805      branch-misses             #    0.90% of all branches          (33.32%)
    62,504,845,665      L1-dcache-loads           #  687.928 M/sec                    (27.78%)
     1,224,842,848      L1-dcache-load-misses     #    1.96% of all L1-dcache hits    (27.81%)
       321,981,924      LLC-loads                 #    3.544 M/sec                    (22.33%)
        23,219,438      LLC-load-misses           #    7.21% of all LL-cache hits     (5.52%)

      26.507528305 seconds time elapsed

8 clients:
     165168.247631      task-clock (msec)         #    6.824 CPUs utilized
   247,231,674,170      stalled-cycles-frontend   #   67.04% frontend cycles idle     (27.84%)
   101,354,900,788      stalled-cycles-backend    #   27.48% backend  cycles idle     (27.83%)
   285,829,642,649      instructions              #    0.78  insns per cycle
                                                  #    0.86  stalled cycles per insn  (33.39%)
    54,503,992,461      branches                  #  329.991 M/sec                    (33.39%)
       761,911,056      branch-misses             #    1.40% of all branches          (33.38%)
    81,373,091,784      L1-dcache-loads           #  492.668 M/sec                    (27.74%)
     4,419,307,036      L1-dcache-load-misses     #    5.43% of all L1-dcache hits    (27.72%)
       510,940,577      LLC-loads                 #    3.093 M/sec                    (21.86%)
        26,963,120      LLC-load-misses           #    5.28% of all LL-cache hits     (5.37%)

      24.205675255 seconds time elapsed


It's quite visible that all caches show considerably worse
characteristics in the 8 client case, and that "instructions per cycle"
has gone down considerably, presumably because more frontend cycles were
idle, which in turn is probably caused by the higher cache miss
ratios. L1 going from 1.96% misses to 5.43% misses is quite a drastic
difference.
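The derived ratios can be checked back against the raw counters; a small
sketch using the 8-client numbers above:

```python
# Re-derive perf's ratios from the raw 8-client counters above.
stalled_frontend = 247_231_674_170   # stalled-cycles-frontend
instructions     = 285_829_642_649   # instructions

# perf reported 67.04% frontend cycles idle, which lets us
# back out the total cycle count:
cycles = stalled_frontend / 0.6704

ipc = instructions / cycles
stalled_per_insn = stalled_frontend / instructions

print(f"insns per cycle:         {ipc:.2f}")               # ~0.78
print(f"stalled cycles per insn: {stalled_per_insn:.2f}")  # ~0.86
```

Both match the perf output, so the drop from 1.08 to 0.78 insns per cycle
really is dominated by frontend stalls.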

Now, looking at where cache misses happen:
4 clients:
+    7.64%  postgres         postgres                       [.] AllocSetAlloc
+    3.90%  postgres         postgres                       [.] LWLockAcquire
+    3.40%  postgres         plpgsql.so                     [.] plpgsql_exec_function
+    2.64%  postgres         postgres                       [.] GetCachedPlan
+    2.20%  postgres         postgres                       [.] slot_deform_tuple
+    2.16%  postgres         libc-2.19.so                   [.] _int_free
+    2.08%  postgres         libc-2.19.so                   [.] __memcpy_sse2_unaligned

8 clients:
+    6.34%  postgres       postgres                      [.] AllocSetAlloc
+    4.89%  postgres       plpgsql.so                    [.] plpgsql_exec_function
+    2.63%  postgres       libc-2.19.so                  [.] _int_free
+    2.60%  postgres       libc-2.19.so                  [.] __memcpy_sse2_unaligned
+    2.50%  postgres       postgres                      [.] ExecLimit
+    2.47%  postgres       postgres                      [.] LWLockAcquire
+    2.18%  postgres       postgres                      [.] ExecProject

So the characteristics interestingly change quite a bit between 4 and 8
clients. I reproduced this a number of times to make sure it's not just a
transient issue.

The rise in memcpy is mainly from:
      + 80.27% SearchCatCache
      + 10.56% appendBinaryStringInfo
      + 6.51% socket_putmessage
      + 0.78% pgstat_report_activity

So at least on the hardware available to me right now this isn't caused
by actual lock contention.
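For completeness, numbers like the above can be collected with something
along these lines (a sketch: the exact "test 3" script isn't shown here,
so the pgbench invocation and test3.sql file name are assumptions):

```shell
# System-wide counter stats while the benchmark runs; "-d" adds
# the L1-dcache/LLC events shown above:
perf stat -d -a -- pgbench -n -c 8 -j 8 -T 20 -f test3.sql

# Sample cache misses with call graphs ("-g") to attribute them,
# e.g. the memcpy breakdown under SearchCatCache:
perf record -e cache-misses -g -a -- pgbench -n -c 8 -j 8 -T 20 -f test3.sql
perf report
```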


Hm. I've a patch addressing the SearchCatCache memcpy() cost
somewhere...

Andres
