Re: Speed up Clog Access by increasing CLOG buffers

From Tomas Vondra
Subject Re: Speed up Clog Access by increasing CLOG buffers
Date
Msg-id a87bfbfb-6511-b559-bab6-5966b7aabb8e@2ndquadrant.com
In reply to Re: Speed up Clog Access by increasing CLOG buffers  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Speed up Clog Access by increasing CLOG buffers  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi,

On 09/19/2016 09:10 PM, Robert Haas wrote:
>
> It's possible that the effect of this patch depends on the number of
> sockets. EDB test machine cthulhu has 8 sockets, and power2 has 4
> sockets. I assume Dilip's tests were run on one of those two,
> although he doesn't seem to have mentioned which one. Your system is
> probably 2 or 4 sockets, which might make a difference. Results
> might also depend on CPU architecture; power2 is, unsurprisingly, a
> POWER system, whereas I assume you are testing x86. Maybe somebody
> who has access should test on hydra.pg.osuosl.org, which is a
> community POWER resource. (Send me a private email if you are a known
> community member who wants access for benchmarking purposes.)
>

Yes, I'm using x86 machines:

1) large but slightly old
- 4 sockets, e5-4620 (a somewhat older CPU, 32 cores in total)
- kernel 3.2.80

2) smaller but fresh
- 2 sockets, e5-2620 v4 (the newest generation of Xeons, 16 cores in total)
- kernel 4.8.0
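
(If anyone wants to check whether their machine is comparable, the
socket/core/thread topology and kernel version are easy to get on
Linux, e.g. like this:

    # sockets, cores per socket, threads per core, CPU model
    lscpu | grep -E '^(Socket|Core|Thread|Model name)'
    # kernel version
    uname -r
)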

> Personally, I find the results so far posted on this thread
> thoroughly unimpressive. I acknowledge that Dilip's results appear
> to show that in a best-case scenario these patches produce a rather
> large gain. However, that gain seems to happen in a completely
> contrived scenario: astronomical client counts, unlogged tables, and
> a test script that maximizes pressure on CLogControlLock. If you
> have to work that hard to find a big win, and tests under more
> reasonable conditions show no benefit, it's not clear to me that it's
> really worth the time we're all spending benchmarking and reviewing
> this, or the risk of bugs, or the damage to the SLRU abstraction
> layer. I think there's a very good chance that we're better off
> moving on to projects that have a better chance of helping in the
> real world.

I'm posting results for two types of workloads - the traditional r/w
pgbench and Dilip's transaction - each with synchronous_commit on and off.

Full results (including script driving the benchmark) are available
here, if needed:

     https://bitbucket.org/tvondra/group-clog-benchmark/src

It'd be good if someone could try to reproduce this on a comparable
machine, to rule out my stupidity.
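
In case that helps, the runs look roughly like the sketch below. The
scale factor, run duration, database name and the contents of the custom
script are placeholders only - the actual values and the actual version
of Dilip's script are in the repository linked above - so treat this as
an illustration, not the real harness:

#!/bin/bash
# Rough sketch only -- the real driver is in the repository linked above.
SCALE=300        # placeholder (100,000 accounts per unit of scale)
DURATION=300     # seconds per run, placeholder
RUNS=5           # 5 runs per combination on the smaller box, 10 on the bigger one

# initialize with unlogged tables, as used in these benchmarks
pgbench -i --unlogged-tables -s $SCALE bench

# stand-in for Dilip's script: a transaction with a couple of subtransactions,
# so every SAVEPOINT consumes an XID whose status has to be written to the clog
cat > dilip.sql <<'EOF'
\set aid random(1, 30000000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
SAVEPOINT s1;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
SAVEPOINT s2;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
END;
EOF

for sync in on off; do
  for clients in 1 4 8 16 32 64; do
    for run in $(seq 1 $RUNS); do
      # regular read/write pgbench
      PGOPTIONS="-c synchronous_commit=$sync" \
        pgbench -M prepared -c $clients -j $clients -T $DURATION bench
      # Dilip's workload (stand-in script above)
      PGOPTIONS="-c synchronous_commit=$sync" \
        pgbench -M prepared -c $clients -j $clients -T $DURATION -f dilip.sql bench
    done
  done
done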


2 x e5-2620 v4 (16 cores, 32 with HT)
=====================================

On the "smaller" machine the results look like this - I have only tested
up to 64 clients, as higher values seem rather uninteresting on a
machine with only 16 physical cores.

These are averages of 5 runs, where the min/max for each group are
within ~5% in most cases (see the "spread" sheet). The "e5-2620" sheet
also shows the numbers as % compared to master.


  dilip / sync=off      1        4        8       16       32       64
----------------------------------------------------------------------
  master             4756    17672    35542    57303    74596    82138
  granular-locking   4745    17728    35078    56105    72983    77858
  no-content-lock    4646    17650    34887    55794    73273    79000
  group-update       4582    17757    35383    56974    74387    81794

  dilip / sync=on       1        4        8       16       32       64
----------------------------------------------------------------------
  master             4819    17583    35636    57437    74620    82036
  granular-locking   4568    17816    35122    56168    73192    78462
  no-content-lock    4540    17662    34747    55560    73508    79320
  group-update       4495    17612    35474    57095    74409    81874

  pgbench / sync=off    1        4        8       16       32       64
----------------------------------------------------------------------
  master             3791    14368    27806    43369    54472    62956
  granular-locking   3822    14462    27597    43173    56391    64669
  no-content-lock    3725    14212    27471    43041    55431    63589
  group-update       3895    14453    27574    43405    56783    62406

  pgbench / sync=on     1        4        8       16       32       64
----------------------------------------------------------------------
  master             3907    14289    27802    43717    56902    62916
  granular-locking   3770    14503    27636    44107    55205    63903
  no-content-lock    3772    14111    27388    43054    56424    64386
  group-update       3844    14334    27452    43621    55896    62498

There's pretty much no improvement at all - most of the results are
within 1-2% of master, in both directions. Hardly a win.

Actually, with 1 client there seems to be a ~5% regression, but it might
also be noise; verifying it would require further testing.


4 x e5-4620 v1 (32 cores, 64 with HT)
=====================================

These are averages of 10 runs, and there are a few strange things here.

Firstly, for Dilip's workload the results get much (much) worse between
64 and 128 clients, for some reason. I suspect this might be due to the
fairly old kernel (3.2.80), so I'll reboot the machine with a 4.5.x kernel
and try again.

Secondly, the min/max differences get much larger than the ~5% on the
smaller machine - with 128 clients, (max-min)/average is often >100%.
See the "spread" or "spread2" sheets in the attached file.

But for some reason this only affects Dilip's workload, and apparently
the patches make it measurably worse (the spread is ~75% on master, ~120%
with the patches). If you look at the tps of individual runs, there are
usually 9 runs with almost the same performance, and then one or two much
faster ones. Again, the pgbench workload does not seem to have this issue.

I have no idea what's causing this - it might be related to the kernel,
but I'm not sure why it should affect the patches differently. Let's see
how the new kernel affects this.

  dilip / sync=off       16       32       64      128     192
--------------------------------------------------------------
  master              26198    37901    37211    14441    8315
  granular-locking    25829    38395    40626    14299    8160
  no-content-lock     25872    38994    41053    14058    8169
  group-update        26503    38911    42993    19474    8325

  dilip / sync=on        16       32       64      128     192
--------------------------------------------------------------
  master              26138    37790    38492    13653    8337
  granular-locking    25661    38586    40692    14535    8311
  no-content-lock     25653    39059    41169    14370    8373
  group-update        26472    39170    42126    18923    8366

  pgbench / sync=off     16       32       64      128     192
--------------------------------------------------------------
  master              23001    35762    41202    31789    8005
  granular-locking    23218    36130    42535    45850    8701
  no-content-lock     23322    36553    42772    47394    8204
  group-update        23129    36177    41788    46419    8163

  pgbench / sync=on      16       32       64      128     192
--------------------------------------------------------------
  master              22904    36077    41295    35574    8297
  granular-locking    23323    36254    42446    43909    8959
  no-content-lock     23304    36670    42606    48440    8813
  group-update        23127    36696    41859    46693    8345


So there is some improvement due to the patches with 128 clients (+30% in
some cases), but it's rather useless, as 64 clients give you either
comparable performance (pgbench workload) or much better performance
(Dilip's workload).

Also, there's pretty much no difference between synchronous_commit on/off,
probably thanks to running on unlogged tables.

I'll repeat the test on the 4-socket machine with a newer kernel, but
that's probably the last benchmark I'll do for this patch for now. I
agree with Robert that the cases the patch is supposed to improve are a
bit contrived because of the very high client counts.

IMHO to continue with the patch (or even with testing it), we really
need a credible / practical example of a real-world workload that
benefits from the patches. The closest we have to that is Amit's
suggestion that someone hit the commit lock when running HammerDB, but we
have absolutely no idea what parameters they were using, except that
they were running with synchronous_commit=off. Pgbench shows no such
improvement (at least for me) with reasonable parameters.
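
FWIW, on 9.6 it's at least easy to check whether a given workload really
spends its time waiting on CLogControlLock, thanks to the wait_event
columns in pg_stat_activity. Sampling something like this while the
workload is running should show plenty of LWLockNamed / CLogControlLock
rows if the clog lock is the bottleneck (the database name is just a
placeholder):

while true; do
  psql -d bench -c "
SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 WHERE wait_event IS NOT NULL
 GROUP BY 1, 2
 ORDER BY 3 DESC;"
  sleep 1
done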


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments
