Re: Speed up Clog Access by increasing CLOG buffers

From: Tomas Vondra
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date:
Msg-id: acf46406-cf6d-41fe-4118-4e1c960b4790@2ndquadrant.com
In response to: Re: Speed up Clog Access by increasing CLOG buffers  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: Speed up Clog Access by increasing CLOG buffers  (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers

On 10/20/2016 09:36 AM, Dilip Kumar wrote:
> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I agree with these conclusions.  I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86.  In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> I have done my test on cthulhu. The basic difference is that on POWER
> we saw ClogControlLock on top at 96 and more clients with 300 scale
> factor, but on cthulhu at 300 scale factor the transactionid lock is
> always on top. So I repeated my test with 1000 scale factor as well on
> cthulhu.
>
> All configuration is the same as in my last test.
>
> Test with 1000 scale factor
> -------------------------------------
>
> Test1: number of clients: 192
>
> Head:
> tps = 21206.108856 (including connections establishing)
> tps = 21206.245441 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
>  310489  LWLockNamed     | CLogControlLock
>  296152                  |
>   35537  Lock            | transactionid
>   15821  LWLockTranche   | buffer_mapping
>   10342  LWLockTranche   | buffer_content
>    8427  LWLockTranche   | clog
>    3961
>    3165  Lock            | extend
>    2861  Lock            | tuple
>    2781  LWLockNamed     | ProcArrayLock
>    1104  LWLockNamed     | XidGenLock
>     745  LWLockTranche   | lock_manager
>     371  LWLockNamed     | CheckpointerCommLock
>      70  LWLockTranche   | wal_insert
>       5  BufferPin       | BufferPin
>       3  LWLockTranche   | proc
>
> Patch:
> tps = 28725.038933 (including connections establishing)
> tps = 28725.367102 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
>  540061                  |
>   57810  LWLockNamed     | CLogControlLock
>   36264  LWLockTranche   | buffer_mapping
>   29976  Lock            | transactionid
>    4770  Lock            | extend
>    4735  LWLockTranche   | clog
>    4479  LWLockNamed     | ProcArrayLock
>    4006
>    3955  LWLockTranche   | buffer_content
>    2505  LWLockTranche   | lock_manager
>    2179  Lock            | tuple
>    1977  LWLockNamed     | XidGenLock
>     905  LWLockNamed     | CheckpointerCommLock
>     222  LWLockTranche   | wal_insert
>       8  LWLockTranche   | proc
>
> Test2: number of clients: 96
>
> Head:
> tps = 25447.861572 (including connections establishing)
> tps = 25448.012739 (excluding connections establishing)
>  261611                  |
>   69604  LWLockNamed     | CLogControlLock
>    6119  Lock            | transactionid
>    4008
>    2874  LWLockTranche   | buffer_mapping
>    2578  LWLockTranche   | buffer_content
>    2355  LWLockNamed     | ProcArrayLock
>    1245  Lock            | extend
>    1168  LWLockTranche   | clog
>     232  Lock            | tuple
>     217  LWLockNamed     | CheckpointerCommLock
>     160  LWLockNamed     | XidGenLock
>     158  LWLockTranche   | lock_manager
>      78  LWLockTranche   | wal_insert
>       5  BufferPin       | BufferPin
>
> Patch:
> tps = 32708.368938 (including connections establishing)
> tps = 32708.765989 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_96_ul.txt
>  326601                  |
>    7471  LWLockNamed     | CLogControlLock
>    5387  Lock            | transactionid
>    4018
>    3331  LWLockTranche   | buffer_mapping
>    3144  LWLockNamed     | ProcArrayLock
>    1372  Lock            | extend
>     722  LWLockTranche   | buffer_content
>     393  LWLockNamed     | XidGenLock
>     237  LWLockTranche   | lock_manager
>     234  Lock            | tuple
>     194  LWLockTranche   | clog
>      96  Lock            | relation
>      88  LWLockTranche   | wal_insert
>      34  LWLockNamed     | CheckpointerCommLock
>
> Test3: number of clients: 64
>
> Head:
>
> tps = 28264.194438 (including connections establishing)
> tps = 28264.336270 (excluding connections establishing)
>
>  218264                  |
>   10314  LWLockNamed     | CLogControlLock
>    4019
>    2067  Lock            | transactionid
>    1950  LWLockTranche   | buffer_mapping
>    1879  LWLockNamed     | ProcArrayLock
>     592  Lock            | extend
>     565  LWLockTranche   | buffer_content
>     222  LWLockNamed     | XidGenLock
>     143  LWLockTranche   | clog
>     131  LWLockNamed     | CheckpointerCommLock
>      63  LWLockTranche   | lock_manager
>      52  Lock            | tuple
>      35  LWLockTranche   | wal_insert
>
> Patch:
> tps = 27906.376194 (including connections establishing)
> tps = 27906.531392 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_64_ul.txt
>  228108                  |
>    4039
>    2294  Lock            | transactionid
>    2116  LWLockTranche   | buffer_mapping
>    1757  LWLockNamed     | ProcArrayLock
>    1553  LWLockNamed     | CLogControlLock
>     800  Lock            | extend
>     403  LWLockTranche   | buffer_content
>      92  LWLockNamed     | XidGenLock
>      74  LWLockTranche   | lock_manager
>      42  Lock            | tuple
>      35  LWLockTranche   | wal_insert
>      34  LWLockTranche   | clog
>      14  LWLockNamed     | CheckpointerCommLock
>
> Test4: number of clients: 32
>
> Head:
> tps = 27587.999912 (including connections establishing)
> tps = 27588.119611 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
>  117762                  |
>    4031
>     614  LWLockNamed     | ProcArrayLock
>     379  LWLockNamed     | CLogControlLock
>     344  Lock            | transactionid
>     183  Lock            | extend
>     102  LWLockTranche   | buffer_mapping
>      71  LWLockTranche   | buffer_content
>      39  LWLockNamed     | XidGenLock
>      25  LWLockTranche   | lock_manager
>       3  LWLockTranche   | wal_insert
>       3  LWLockTranche   | clog
>       2  LWLockNamed     | CheckpointerCommLock
>       2  Lock            | tuple
>
> Patch:
> tps = 28291.428848 (including connections establishing)
> tps = 28291.586435 (excluding connections establishing)
> [dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
>  116596                  |
>    4041
>     757  LWLockNamed     | ProcArrayLock
>     407  LWLockNamed     | CLogControlLock
>     358  Lock            | transactionid
>     183  Lock            | extend
>     142  LWLockTranche   | buffer_mapping
>      77  LWLockTranche   | buffer_content
>      68  LWLockNamed     | XidGenLock
>      35  LWLockTranche   | lock_manager
>      15  LWLockTranche   | wal_insert
>       7  LWLockTranche   | clog
>       7  Lock            | tuple
>       4  LWLockNamed     | CheckpointerCommLock
>       1 Tuples only is on.
>
> Summary:
> - At 96 clients and more we can see ClogControlLock at the top.
> - With the patch, contention on ClogControlLock is reduced significantly.
> I think this behaviour is the same as what we saw on POWER.
>
> With 300 scale factor:
> - Contention on ClogControlLock is significant only at 192 clients
> (the transactionid lock is still on top), and it is completely removed
> with the group lock patch.
>
> For 300 scale factor, I am posting data only for the 192 client count
> (if anyone is interested in other data I can post it).
>

In the results you've posted on 10/12, you've mentioned a regression 
with 32 clients, where you got 52k tps on master but only 48k tps with 
the patch (so ~10% difference). I have no idea what scale was used for 
those tests, and I see no such regression in the current results (but 
you only report results for some of the client counts).

Also, which of the proposed patches have you been testing?

Can you collect and share a more complete set of data, perhaps based on 
the scripts I use to do tests on the large machine with 36/72 cores, 
available at https://bitbucket.org/tvondra/hp05-results ?

I've taken some time to build simple web-based reports from the
results collected so far (also included in the git repository), and 
pushed them here:
    http://tvondra.bitbucket.org

For each of the completed runs, there's a report comparing tps for 
different client counts with master and the three patches (average tps, 
median and stddev), and it's possible to download a more thorough text 
report with wait event stats, comparison of individual runs etc.
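
As an aside, per-event counts like the ones quoted above (and the wait
event stats in the text reports) can be gathered by periodically sampling
pg_stat_activity during a run. A minimal sketch of one way to do that -
the exact sampling script and interval behind the quoted numbers are not
something I'm claiming here:

    # sample wait events twice a second while the benchmark runs,
    # then count how often each (type, event) pair was observed
    for i in $(seq 1 600); do
        psql -t -c "SELECT wait_event_type, wait_event
                      FROM pg_stat_activity
                     WHERE state = 'active'
                       AND pid <> pg_backend_pid()" >> wait_events.txt
        sleep 0.5
    done
    sort wait_events.txt | uniq -c | sort -rn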

If you want to cooperate on this, I'm available - i.e. I can help you 
get the tooling running, customize it etc.


Regarding the results collected on the "big machine" so far, I do have a 
few observations:

pgbench / scale 300 (fits into 16GB shared buffers)
---------------------------------------------------
* in general, those results seem fine

* the results generally fall into 3 categories (I'll show results for
"pgbench -N" but regular pgbench behaves similarly; a sketch of the
corresponding pgbench invocations follows the list):

(a) logged, sync_commit=on - no impact

    http://tvondra.bitbucket.org/#pgbench-300-logged-sync-skip

(b) logged, sync_commit=off - improvement

    http://tvondra.bitbucket.org/#pgbench-300-logged-async-skip

    The throughput gets improved by ~20% with 72 clients, and then it
    levels off (but does not drop, unlike on master). With high client
    counts the difference is up to 300%, but people who care about
    throughput won't run with such client counts anyway.

    And not only does this improve throughput, it also significantly
    reduces variability of the performance (i.e. measure throughput
    each second and compute the STDDEV of that). You can imagine this
    as a much "smoother" chart of tps over time.

(c) unlogged, sync_commit=* - improvement

    http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

    This is actually quite similar to (b).
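
For reference, a minimal sketch of pgbench invocations matching these
three configurations (scale 300 as above; the client count, duration and
use of PGOPTIONS are illustrative assumptions, not the exact commands
behind these runs):

    # initialize at scale 300 (add --unlogged-tables for case (c))
    pgbench -i -s 300 bench

    # (a) logged tables, synchronous_commit=on (the default)
    pgbench -N -c 72 -j 72 -T 300 bench

    # (b) logged tables, synchronous_commit=off for the benchmark sessions
    PGOPTIONS='-c synchronous_commit=off' pgbench -N -c 72 -j 72 -T 300 bench

    # (c) unlogged tables (database initialized with --unlogged-tables)
    pgbench -N -c 72 -j 72 -T 300 bench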


dilip / scale 300 (fits into 16GB shared buffers)
-------------------------------------------------

* those results seem less OK

* I haven't found any significant regressions (in the sense of a
significant performance drop compared to master), but the behavior in
some cases seems fairly strange (and it's repeatable)

* consider for example these results:

    http://tvondra.bitbucket.org/#dilip-300-unlogged-async
    http://tvondra.bitbucket.org/#dilip-300-logged-async

* the saw-like pattern is rather suspicious, and I don't think I've seen 
anything like that before - I guess there's some feedback loop, and we'd
better find it before committing any of the patches, because this is
something I don't want to see on any production machine (and I bet 
neither do you)

* After looking into the wait event details in the full text report at

    http://tvondra.bitbucket.org/by-test/dilip-300-unlogged-async.txt

  (section "wait events for dilip-300-unlogged-async (runs combined)"),
  I see that for pg-9.6-group-update, the summary for 72, 108 and 144
  clients (low - high - low) looks like this:

   clients | wait_event_type |   wait_event    | wait_count | wait_pct
  ---------+-----------------+-----------------+------------+----------
        72 |                 |                 |     374845 |    62.87
        72 | Client          | ClientRead      |     136320 |    22.86
        72 | LWLockNamed     | CLogControlLock |      52804 |     8.86
        72 | LWLockTranche   | buffer_content  |      15337 |     2.57
        72 | LWLockNamed     | XidGenLock      |       7352 |     1.23
        72 | LWLockNamed     | ProcArrayLock   |       6630 |     1.11

       108 |                 |                 |     407179 |    46.01
       108 | LWLockNamed     | CLogControlLock |     300452 |    33.95
       108 | LWLockTranche   | buffer_content  |      87597 |     9.90
       108 | Client          | ClientRead      |      80901 |     9.14
       108 | LWLockNamed     | ProcArrayLock   |       3290 |     0.37

       144 |                 |                 |     623057 |    53.44
       144 | LWLockNamed     | CLogControlLock |     175072 |    15.02
       144 | Client          | ClientRead      |     163451 |    14.02
       144 | LWLockTranche   | buffer_content  |     147963 |    12.69
       144 | LWLockNamed     | XidGenLock      |      38361 |     3.29
       144 | Lock            | transactionid   |       8821 |     0.76

  That is, there's a sudden jump in CLogControlLock from 22% to 33% and
  then back to 15% (and for 180 clients it jumps back to ~35%). That's
  pretty strange, and all the patches behave exactly the same.
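
As an aside, the wait_pct column is simply each event's share of all
samples collected at a given client count. A sketch of that aggregation,
assuming the raw samples were loaded into a hypothetical
wait_event_samples(clients, wait_event_type, wait_event) table (the
actual report scripts may do this differently):

    psql -c "
      SELECT clients, wait_event_type, wait_event,
             count(*) AS wait_count,
             round(100.0 * count(*) /
                   sum(count(*)) OVER (PARTITION BY clients), 2) AS wait_pct
        FROM wait_event_samples
       GROUP BY clients, wait_event_type, wait_event
       ORDER BY clients, wait_count DESC;"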


scale 3000 (45GB), shared_buffers=16GB
---------------------------------------

For the small scale, the whole data set fits into 16GB shared buffers, 
so there were pretty much no writes except for WAL and CLOG. For scale 
3000 that's no longer true - the backends will compete for buffers and 
will constantly write dirty buffers to page cache.

I hadn't realized this initially and the kernel was using the default
vm.dirty_* limits, i.e. 10% and 20%. As the machine has 3TB of RAM, this
resulted in rather excessive thresholds (or "insane" if you want), so the
kernel regularly accumulated up to ~15GB of dirty data and then wrote it
out in a very short period of time. Even though the machine has fairly
powerful storage (4GB write cache on the controller, 10 x 12Gbps SAS SSDs),
this led to pretty bad latency spikes / drops in throughput.

I've only done two runs with this configuration before realizing what was
happening; the results are illustrated here:

* http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-high-dirty-bytes
* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-high-dirty-bytes

I'm not sure how important those results are (if throughput and smooth
behavior matter, tuning the kernel thresholds is a must), but what I
find interesting is that while the patches manage to improve throughput 
by 10-20%, they also (quite significantly) increase variability of the 
results (jitter in the tps over time). It's particularly visible on the 
pgbench results. I'm not sure that's a good tradeoff.
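
To be concrete about what "tuning the kernel thresholds" means here, a
minimal sketch of switching to bytes-based dirty limits - the 256MB
background value is the one I mention below, while the vm.dirty_bytes
value is only an illustrative assumption:

    # setting the *_bytes variants overrides the *_ratio defaults
    sysctl -w vm.dirty_background_bytes=268435456   # 256MB - start write-out early
    sysctl -w vm.dirty_bytes=1073741824             # 1GB hard limit (assumed value)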

After fixing the kernel page cache thresholds (by setting
vm.dirty_background_bytes to 256MB to perform a smooth write-out), the
effect differs depending on the workload:

(a) dilip

    http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync

    - eliminates any impact of all the patches

(b) pgbench (-N)

    http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

    - By far the most severe regression observed during the testing.
      With 36 clients the throughput drops by ~40%, which I think is
      pretty bad. Also the results are much more variable with the
      patches (compared to master).


scale 3000 (45GB), shared_buffers=64GB
---------------------------------------

I've also done some tests with increased shared buffers, so that even 
the large data set fits into them. Again, the results slightly depend on 
the workload:

(a) dilip

    * http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-64
    * http://tvondra.bitbucket.org/#dilip-3000-unlogged-async-64

    Pretty much no impact on throughput or variability. Unlike on the
    small data set, the patches don't even eliminate the performance
    drop above 72 clients - the performance closely matches master.

(b) pgbench

    * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-64
    * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-noskip-64

    There's a small benefit (~20% on the same client count), and the
    performance drop only happens after 72 clients. The patches also
    significantly increase variability of the results, particularly for
    large client counts.
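
For completeness, switching between the 16GB and 64GB shared_buffers
configurations compared above is just a config change plus a restart; one
way to do it (a sketch, not necessarily how these runs were set up):

    # bump shared_buffers so the scale-3000 data set (~45GB) fits entirely
    psql -c "ALTER SYSTEM SET shared_buffers = '64GB'"
    pg_ctl -D "$PGDATA" restart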


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


