Re: Speed up Clog Access by increasing CLOG buffers

From: Tomas Vondra
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date:
Msg-id: 4aafed4c-8002-9a9c-0d15-73d19ab7e317@2ndquadrant.com
In response to: Re: Speed up Clog Access by increasing CLOG buffers  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Speed up Clog Access by increasing CLOG buffers  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On 10/12/2016 08:55 PM, Robert Haas wrote:
> On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I think at higher client count from client count 96 onwards contention
>> on CLogControlLock is clearly visible and which is completely solved
>> with group lock patch.
>>
>> And at lower client count 32,64  contention on CLogControlLock is not
>> significant hence we can not see any gain with group lock patch.
>> (though we can see some contention on CLogControlLock is reduced at 64
>> clients.)
> 
> I agree with these conclusions.  I had a chance to talk with Andres
> this morning at Postgres Vision and based on that conversation I'd
> like to suggest a couple of additional tests:
> 
> 1. Repeat this test on x86.  In particular, I think you should test on
> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
> 
> 2. Repeat this test with a mixed read-write workload, like -b
> tpcb-like@1 -b select-only@9
> 

FWIW, I'm already running similar benchmarks on an x86 machine with 72
cores (144 with HT). It's "just" a 4-socket system, but the results I
got so far seem quite interesting. The tooling and results (pushed
incrementally) are available here:
   https://bitbucket.org/tvondra/hp05-results/overview

The tooling is completely automated, and it also collects various stats,
for example the wait event statistics. So perhaps we could simply run it
on cthulhu and get comparable results, and also more thorough data sets
than just the snippets posted to the list?
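
To give an idea of what the wait event sampling looks like, it conceptually
boils down to a loop like this (a simplified sketch, not the actual script
from the repository - the "bench" database and the output file name are
made up):

    # sample the wait events once per second while the benchmark runs
    while true; do
        psql bench -A -t -c "SELECT now(), wait_event_type, wait_event, count(*)
                               FROM pg_stat_activity
                              WHERE state = 'active'
                              GROUP BY 1, 2, 3" >> wait-events.log
        sleep 1
    done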

There's also a bunch of reports for the 5 already completed runs:

 - dilip-300-logged-sync
 - dilip-300-unlogged-sync
 - pgbench-300-logged-sync-skip
 - pgbench-300-unlogged-sync-noskip
 - pgbench-300-unlogged-sync-skip

The name identifies the workload type, the scale, and whether the tables
are WAL-logged (for pgbench, "skip" means "-N" while "noskip" means
regular pgbench).
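
For the pgbench runs that roughly translates to invocations like these
(illustrative only - the exact flags live in the scripts in the
repository, and the "bench" database name is made up):

    # scale 300; the "unlogged" cases add --unlogged-tables at init time
    pgbench -i -s 300 bench

    # "skip" runs use -N (skips the updates of branches/tellers)
    pgbench -N -c 144 -j 144 -T 300 bench

    # "noskip" runs are regular tpcb-like pgbench
    pgbench -c 144 -j 144 -T 300 bench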

For example, "reports/wait-events-count-patches.txt" compares the wait
event stats with the different patches applied (and with master):


https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/wait-events-count-patches.txt?at=master&fileviewer=file-view-default

and average tps (from 3 runs, 5 minutes each):


https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/tps-avg-patches.txt?at=master&fileviewer=file-view-default

There are certainly interesting bits. For example, while the "logged"
case is dominated by WALWriteLock for most client counts, for large
client counts that's no longer true.

Consider for example dilip-300-logged-sync results with 216 clients:

     wait_event      | master  | gran_lock | no_cont_lock | group_upd
---------------------+---------+-----------+--------------+-----------
 CLogControlLock     |  624566 |    474261 |       458599 |    225338
 WALWriteLock        |  431106 |    623142 |       619596 |    699224
                     |  331542 |    358220 |       371393 |    537076
 buffer_content      |  261308 |    134764 |       138664 |    102057
 ClientRead          |   59826 |    100883 |       103609 |    118379
 transactionid       |   26966 |     23155 |        23815 |     31700
 ProcArrayLock       |    3967 |      3852 |         4070 |      4576
 wal_insert          |    3948 |     10430 |         9513 |     12079
 clog                |    1710 |      4006 |         2443 |       925
 XidGenLock          |    1689 |      3785 |         4229 |      3539
 tuple               |     965 |       617 |          655 |       840
 lock_manager        |     300 |       571 |          619 |       802
 WALBufMappingLock   |     168 |       140 |          158 |       147
 SubtransControlLock |      60 |       115 |          124 |       105

Clearly, CLOG is an issue here, and it's (slightly) improved by all the
patches (group_update performing the best). And with 288 clients (which
is 2x the number of virtual cores in the machine, so not entirely crazy)
you get this:

     wait_event      | master  | gran_lock | no_cont_lock | group_upd
---------------------+---------+-----------+--------------+-----------
 CLogControlLock     |  901670 |    736822 |       728823 |    398111
 buffer_content      |  492637 |    318129 |       319251 |    270416
 WALWriteLock        |  414371 |    593804 |       589809 |    656613
                     |  380344 |    452936 |       470178 |    745790
 ClientRead          |   60261 |    111367 |       111391 |    126151
 transactionid       |   43627 |     34585 |        35464 |     48679
 wal_insert          |    5423 |     29323 |        25898 |     30191
 ProcArrayLock       |    4379 |      3918 |         4006 |      4582
 clog                |    2952 |      9135 |         5304 |      2514
 XidGenLock          |    2182 |      9488 |         8894 |      8595
 tuple               |    2176 |      1288 |         1409 |      1821
 lock_manager        |     323 |       797 |          827 |      1006
 WALBufMappingLock   |     124 |       124 |          146 |       206
 SubtransControlLock |      85 |       146 |          170 |       120

So even buffer_content gets ahead of WALWriteLock. I wonder whether this
might be a consequence of having only 128 buffers for clog pages, causing
contention on this system (surely, systems with 144 cores were not that
common when the 128 limit was introduced).
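
For reference, the cap comes from CLOGShmemBuffers() in clog.c, which -
if I'm reading the 9.6 sources correctly - sizes the CLOG SLRU like this:

    $ grep -A 3 'CLOGShmemBuffers(void)' src/backend/access/transam/clog.c
    CLOGShmemBuffers(void)
    {
        return Min(128, Max(4, NBuffers / 512));
    }

So once shared_buffers exceeds 512MB (65536 buffers / 512 = 128), the clog
is stuck at 128 buffers no matter how many backends are hammering it.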

So the patch has a positive impact even with WAL, as illustrated by the
tps improvements (for large client counts):

 clients | master | gran_locking | no_content_lock | group_update
---------+--------+--------------+-----------------+--------------
      36 |  39725 |        39627 |           41203 |        39763
      72 |  70533 |        65795 |           65602 |        66195
     108 |  81664 |        87415 |           86896 |        87199
     144 |  68950 |        98054 |           98266 |       102834
     180 | 105741 |       109827 |          109201 |       113911
     216 |  62789 |        92193 |           90586 |        98995
     252 |  94243 |       102368 |          100663 |       107515
     288 |  57895 |        83608 |           82556 |        91738

I find the tps fluctuation intriguing, and I'd like to see that fixed
before committing any of the patches.

For pgbench-300-logged-sync-skip (the other WAL-logged test that has
already completed), the CLOG contention is also reduced significantly,
but the tps did not improve as significantly.

For the unlogged case (dilip-300-unlogged-sync), the results are fairly
similar - CLogControlLock and buffer_content dominate the wait event
profiles (WALWriteLock is missing, of course), and the average tps
fluctuates in almost exactly the same way.

Interestingly, there's no such fluctuation for the pgbench tests. For
example, for pgbench-300-unlogged-sync-skip (i.e. pgbench -N) the results
look like this:

 clients | master | gran_locking | no_content_lock | group_update
---------+--------+--------------+-----------------+--------------
      36 | 147265 |       148663 |          148985 |       146559
      72 | 162645 |       209070 |          207841 |       204588
     108 | 135785 |       219982 |          218111 |       217588
     144 | 113979 |       228683 |          228953 |       226934
     180 |  96930 |       230161 |          230316 |       227156
     216 |  89068 |       224241 |          226524 |       225805
     252 |  78203 |       222507 |          225636 |       224810
     288 |  63999 |       204524 |          225469 |       220098

That's a fairly significant improvement, and the behavior is very
smooth. Sadly, with WAL logging (pgbench-300-logged-sync-skip) the tps
drops back to the master level, mostly thanks to WALWriteLock.

Another interesting aspect of the patches is their impact on the
variability of results - for example, looking at dilip-300-unlogged-sync,
the overall average tps (for the three runs combined) and the tps for
each of the three runs look like this:

 clients | avg_tps |  tps_1 |  tps_2 |  tps_3
---------+---------+--------+--------+--------
      36 |  117332 | 115042 | 116125 | 120841
      72 |   90917 |  72451 | 119915 |  80319
     108 |   96070 | 106105 |  73606 | 108580
     144 |   81422 |  71094 | 102109 |  71063
     180 |   88537 |  98871 |  67756 |  99021
     216 |   75962 |  65584 |  96365 |  66010
     252 |   59941 |  57771 |  64756 |  57289
     288 |   80851 |  93005 |  56454 |  93313

Notice the variability between the runs - the difference between the min
and the max is often more than 40% of the slowest run (e.g. 72451 vs.
119915 tps at 72 clients). Now compare that to the results with the
"group_update" patch applied:

 clients | avg_tps |  tps_1 |  tps_2 |  tps_3
---------+---------+--------+--------+--------
      36 |  116273 | 117031 | 116005 | 115786
      72 |  145273 | 147166 | 144839 | 143821
     108 |   89892 |  89957 |  89585 |  90133
     144 |  130176 | 130310 | 130565 | 129655
     180 |   81944 |  81927 |  81951 |  81953
     216 |  124415 | 124367 | 123228 | 125651
     252 |   76723 |  76467 |  77266 |  76436
     288 |  120072 | 121205 | 119731 | 119283

In this case there's pretty much no cross-run variability - the
differences are usually within 2%, so basically random noise. (There's of
course still the variability depending on client count, but that was
already mentioned.)


There's certainly much more interesting stuff in the results, but I
don't have time for a more thorough analysis now - I only intended to do
some "quick benchmarking" on the patch, yet I've already spent days on
this and have other things to do.

I'll take care of collecting data for the remaining cases on this
machine (and possibly running the same tests on the other one, if I
manage to get access to it again). But I'll leave further analysis of
the collected data up to the patch authors, or some volunteers.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


